added documentation
rk1165 committed Oct 17, 2024
1 parent 22cacf2 commit 2595f8a
Showing 16 changed files with 352 additions and 59 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -2,13 +2,13 @@ run:
go run ./cmd/web/

init:
sqlite3 feeds.db < ddl.sql
sqlite3 feeds.db < ./sql/ddl.sql
go mod tidy

build:
go build -o feedcreator ./cmd/web/

clean:
sqlite3 feeds.db < clean.sql
sqlite3 feeds.db < ./sql/clean.sql

.PHONY: run init build clean
95 changes: 45 additions & 50 deletions README.md
@@ -3,79 +3,74 @@
- This project aims to turn any website into an RSS feed, which we can then monitor using RSS readers.
- This [link](https://www.xul.fr/en-xml-rss.html) explains what RSS feeds are pretty well.

### Local Startup

- Ensure that you have `python3` installed, then run the following commands:
- `python3 -m venv .venv` : this creates a virtual environment
- `source ./.venv/bin/activate` : activates the virtual environment where we will install our dependencies.
- `python3 -m pip install -r requirements.txt` : installs the dependencies.
- `python3 app.py` : starts the app on port `8000`, which you can access at `127.0.0.1:8000`
- Once the app has started, you can find a few feeds that we have tested it on, for reference.

### How to use the app?

- To create an RSS feed we mainly need two things: `title` and `link`. There's an optional third thing, `description`,
which can be skipped.
- A website consists of HTML pages which have elements like `<li>`, `<a>`, `<article>`, `<div>`, etc.
- `li`, `a`, `article` are called **tags**. The elements **may** also have a `class` attribute associated with them.
- `class` attributes are used to apply `css` to a bunch of elements together. They also uniquely identify
elements in the webpage.
- To create an RSS feed we need to identify such **common** elements on a webpage. For instance, items which appear
in listing formats.
- Once we have identified the element we need to find two sub-elements in that item pertaining to `title` and `link` for
our feeds.
- Those two sub-elements can also have a class to identify them uniquely.
- With these three things we can create the main component of our RSS feed, `<item>`.
which can be skipped. A website consists of HTML pages which have elements like `<li>`, `<a>`, `<article>`, `<div>`,
etc.
- `li`, `a`, `article` are called **tags**. The elements **may** also have a `class` attribute associated with them.
- `class` attributes are used to apply `css` to a bunch of elements together. They also uniquely identify
elements on the webpage.
- To create an RSS feed we need to identify such **common** elements on a webpage which will contain the `title` and
`link`. Mostly, these are items appearing in a list format. Those two sub-elements can also have a class to identify
them uniquely. With these three things we can create the main component of our RSS feed, `<item>` (see the sketch
below).
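
For illustration, here is a minimal Go sketch of how such an `<item>` with a title, link, and optional description
could be modelled and serialized with the standard `encoding/xml` package. The `Item` type and field values here are
hypothetical, not the project's actual code:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Item is a hypothetical struct mirroring the RSS <item> element:
// a title and link, plus an optional description.
type Item struct {
	XMLName     xml.Name `xml:"item"`
	Title       string   `xml:"title"`
	Link        string   `xml:"link"`
	Description string   `xml:"description,omitempty"`
}

func main() {
	item := Item{
		Title: "Example story",
		Link:  "https://example.com/story",
	}
	out, err := xml.MarshalIndent(item, "", "  ")
	if err != nil {
		panic(err)
	}
	// Prints an <item> element with <title> and <link> children.
	fmt.Println(string(out))
}
```
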
- Consider the HackerNews front page below:
![hackernews](static/img/hn.png)
![hackernews](docs/img/hn.png)
- This has 4 items in list format. Over the day these items get updated, and we can use RSS feeds to track them.
- A single item looks something like this:
![item](static/img/item.png)
![item](docs/img/item.png)
- To identify the `element` associated with the list item, we can right-click on the list title and select `inspect`.
- The result is shown below:
![extractors](static/img/extractors.png)
![extractors](docs/img/extractors.png)
- The item here would be the `span` with the `class` attribute `titleline`
- The title and link elements will both be `a`, without any class attributes.
- Once we have identified these items we need to fill the following form:
![form](static/img/form.png)
- *Feed title* is the title of the feed
- *Feed name* is something to uniquely identify the feed one is tracking.
- *Website URL* is the URL for which to create the feed.
- *Description* is self-explanatory.
- Now we need to fill in the extractor elements. From the above example:
- For the item extractor values we have to use `span` as the tag and `titleline` as the class.
- For the title extractor values we have to use just the `a` tag and keep the class column blank.
- For the link extractor values we also have to use just the `a` tag and keep the class column blank.
- The title and link elements will both be `a`, without any class attributes. The title will be the text content of
the link.
- Once we have identified these items we need to fill the following form:

![form](docs/img/form.png)
- *Title* is the title under which you would like to track the feed. Here it could be something like
`HackerNews Feed`.
- *URL* is the URL of the page for which to create the feed. Here, `https://news.ycombinator.com/newest`.
- *Description* is optional: a short description of the feed.
- *Name* is something to uniquely identify the feed one is tracking. It could be a short name like `hn`.
- Now we need to fill in the extractor parameters (see the scraping sketch after this list). From the above example:
- For the item selector we have to use `span.titleline`
- For the title selector we have to use just `a`.
- For the link selector we have to use just `a`.
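
For reference, below is a minimal sketch of how these selectors might be applied with the
[colly](https://github.com/gocolly/colly) library that this project uses for scraping. The collector setup and the
printing are illustrative assumptions, not the project's actual code:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Item selector: every `span` with class `titleline` is one feed item.
	c.OnHTML("span.titleline", func(e *colly.HTMLElement) {
		// Title selector: the text of the `a` element inside the item.
		title := e.ChildText("a")
		// Link selector: the href attribute of the same `a` element.
		link := e.ChildAttr("a", "href")
		fmt.Printf("title=%q link=%q\n", title, link)
	})

	// Visit the page the feed is created for.
	if err := c.Visit("https://news.ycombinator.com/newest"); err != nil {
		fmt.Println("visit failed:", err)
	}
}
```
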
- After filling the form it should look like this:
![filled](static/img/filled.png)

![filled](docs/img/filled.png)

- Once we submit it, we get a page with the feed URL and other details, which looks like this:
![output](static/img/output.png)
- Copy the `hackernews.xml` link and add it to RSS Feed readers.
![output](docs/img/output.png)
- Copy the `hn.xml` link (`http://localhost:8080/static/rss/hn.xml`) and add it to any RSS feed reader.

### QuickStart

- `make init` to set up the database and download the dependencies.
- `make run` to start the application locally.
- `make build` to build the application locally.
- `make clean` to purge the contents of the database.

### Design & Implementation

- We are using the `requests` library to get the webpage.
- Then `BeautifulSoup` to extract the relevant elements and create the `<item>` for the RSS feed.
- We are saving the feeds in the `static/feeds` directory and there is a `feeds.db` database where we save feed
metadata.
- There is an `updater.py` file which runs at a fixed interval and, using the metadata, rescans the **urls** to update
the feed items.
- It also deletes any item older than **3** days.
- For some websites the page is not loaded completely until the JavaScript executes.
- For such pages we used `selenium.webdriver` to execute it and wait a second for the page to load completely.
- This project uses the [colly](https://github.com/gocolly/colly) library for scraping the webpage.
- The feeds are saved under the `ui/static/rss` directory and there is a `feeds.db` database where we save feed
metadata.
- Two functions are scheduled at configurable intervals in `main.go` (see the scheduling sketch after this list).
Using the metadata, they rescan all **urls** to update or clean the feed items.
- For some websites the web page is not loaded completely until the JavaScript executes.
- For such pages we used `` to execute it and wait a second for the page to load completely.
- There is a fair chance that the page is still not loaded completely, in which case we won't be able to track it.
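
As a sketch of the scheduling mentioned above: `main.go` calls `internal.ScheduleFunc(60*time.Second, app.cleanFeeds)`
and `internal.ScheduleFunc(90*time.Second, app.updateFeeds)`. Assuming the helper simply runs the function on a ticker
in a background goroutine (the real implementation may differ), it could look like:

```go
package internal

import "time"

// ScheduleFunc runs f once every interval in a background goroutine.
// This is a sketch of what the helper might look like; the project's
// actual implementation may handle shutdown or errors differently.
func ScheduleFunc(interval time.Duration, f func()) {
	ticker := time.NewTicker(interval)
	go func() {
		for range ticker.C {
			f()
		}
	}()
}
```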

### Hosting

- We have tried self-hosting it on DigitalOcean. If you want to do the same, there are scripts in the `scripts` folder.
- You first need to configure `doctl`. Here is a [link](https://docs.digitalocean.com/reference/doctl/how-to/install/)
explaining how to do it.
- There is a GitHub workflow which creates a Docker image for self-hosting.

### Comments

- Currently, we rescan all the feeds at a fixed interval. We can optimize it to scan each site at its own interval.
- We are also not passing request headers like `If-Modified-Since` or checking response headers like `Last-Modified`
(see the conditional-request sketch after this list).
- We can add an LLM button to find the extractors given a web page.
- This might not work well with small screens, as it has been tested only on a laptop.
- For websites which only load completely after running JS we can create a separate, slightly longer-running process to
load them.
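
To illustrate the conditional-request idea from the comments above, here is a minimal sketch using only the Go
standard library. The URL and the last-scan timestamp are illustrative assumptions, not the project's code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://news.ycombinator.com/newest", nil)
	if err != nil {
		panic(err)
	}
	// Ask the server to skip the body if nothing changed since the last scan.
	lastScan := time.Now().Add(-90 * time.Second) // hypothetical last-scan time
	req.Header.Set("If-Modified-Since", lastScan.UTC().Format(http.TimeFormat))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		fmt.Println("feed source unchanged, skipping rescan")
		return
	}
	// Otherwise the page changed (or the server ignored the header):
	// re-parse the body and note Last-Modified for the next scan.
	fmt.Println("Last-Modified:", resp.Header.Get("Last-Modified"))
}
```
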
5 changes: 3 additions & 2 deletions cmd/web/main.go
@@ -6,6 +6,7 @@ import (
"github.com/go-playground/form/v4"
"github.com/gorilla/sessions"
_ "github.com/mattn/go-sqlite3"
"github.com/rk1165/feedcreator/internal"
"github.com/rk1165/feedcreator/internal/models"
"github.com/rk1165/feedcreator/pkg/logger"
"html/template"
@@ -59,8 +60,8 @@ func main() {
WriteTimeout: 10 * time.Second,
}

//internal.ScheduleFunc(60*time.Second, app.cleanFeeds)
//internal.ScheduleFunc(90*time.Second, app.updateFeeds)
internal.ScheduleFunc(60*time.Second, app.cleanFeeds)
internal.ScheduleFunc(90*time.Second, app.updateFeeds)

logger.InfoLog.Printf("Starting server on %s", *addr)
err = server.ListenAndServe()
Binary file added docs/img/extractors.png
Binary file added docs/img/filled.png
Binary file added docs/img/form.png
Binary file added docs/img/hn.png
Binary file added docs/img/item.png
Binary file added docs/img/output.png
Binary file modified feeds.db
Binary file not shown.
6 changes: 1 addition & 5 deletions internal/models/feeds.go
@@ -107,7 +107,7 @@ func (m *FeedModel) All() ([]*Feed, error) {
}
defer rows.Close()

feeds := []*Feed{}
var feeds []*Feed

for rows.Next() {
feed := &Feed{}
@@ -126,10 +126,6 @@ func (m *FeedModel) All() ([]*Feed, error) {
return feeds, nil
}

//func (m *FeedModel) Update() (*Feed, error) {
// return nil, nil
//}

func (m *FeedModel) Delete(id int) error {
stmt := `DELETE FROM feed WHERE id = ?`
rows, err := m.DB.Exec(stmt, id)
File renamed without changes.
File renamed without changes.
File renamed without changes.