added documentation
rk1165 committed Oct 17, 2024
1 parent 22cacf2 commit 2595f8a
Showing 16 changed files with 352 additions and 59 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -2,13 +2,13 @@ run:
go run ./cmd/web/

init:
sqlite3 feeds.db < ddl.sql
sqlite3 feeds.db < ./sql/ddl.sql
go mod tidy

build:
go build -o feedcreator ./cmd/web/

clean:
sqlite3 feeds.db < clean.sql
sqlite3 feeds.db < ./sql/clean.sql

.PHONY: run init build clean
95 changes: 45 additions & 50 deletions README.md
@@ -3,79 +3,74 @@
- This project aims to turn any website into an RSS feed, which we can then monitor using RSS readers.
- This [link](https://www.xul.fr/en-xml-rss.html) explains what RSS feeds are pretty well.

### Local Startup

- Ensure that you have `python3` installed, then run the following commands:
- `python3 -m venv .venv` : this creates a virtual environment
- `source ./.venv/bin/activate` : activates the virtual environment where we will install our dependencies.
- `python3 -m pip install -r requirements.txt` : installs the dependencies.
- `python3 app.py` : starts the app on port `8000`, which you can access at `127.0.0.1:8000`
- Once the app has started, you can find a few feeds that we have tested it on, for reference.

### How to use the app?

- To create an RSS feed we mainly need two things: `title` and `link`. There's an optional third thing, `description`,
which can be skipped.
- A website consists of HTML pages which have elements like `<li>`, `<a>`, `<article>`, `<div>`, etc.
- `li`, `a`, `article` are called **tags**. The elements **may** also have a `class` attribute associated with them.
- `class` attributes are used to apply `css` to a bunch of elements together. They also uniquely identify
elements in the webpage.
- To create an RSS feed we need to identify such **common** elements on a webpage. For instance, items which appear
in listing formats.
- Once we have identified the element we need to find two sub-elements in that item pertaining to `title` and `link` for
our feeds.
- Those two sub-elements can also have a class to identify them uniquely.
- With these three things we can create the main component of our RSS feed, `<item>`.
which can be skipped. A website consists of HTML pages which have elements like `<li>`, `<a>`, `<article>`, `<div>`,
etc.
- `li`, `a`, `article` are called **tags**. The elements **may** also have a `class` attribute associated with them.
- `class` attributes are used to apply `css` to a bunch of elements together. They also uniquely identify
elements on the webpage.
- To create an RSS feed we need to identify such **common** elements on a webpage which will contain the `title` and
`link`. Mostly, these are items appearing in a list format. Those two sub-elements can also have a class to identify
them uniquely. With these three things we can create the main component of our RSS feed, `<item>` (see the sketch
below).
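
For illustration, here is a minimal Go sketch of how such an `<item>` with a title, link, and optional description
could be modelled and serialized with the standard `encoding/xml` package. The `Item` type and field values here are
hypothetical, not the project's actual code:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Item is a hypothetical struct mirroring the RSS <item> element:
// a title and link, plus an optional description.
type Item struct {
	XMLName     xml.Name `xml:"item"`
	Title       string   `xml:"title"`
	Link        string   `xml:"link"`
	Description string   `xml:"description,omitempty"`
}

func main() {
	item := Item{
		Title: "Example story",
		Link:  "https://example.com/story",
	}
	out, err := xml.MarshalIndent(item, "", "  ")
	if err != nil {
		panic(err)
	}
	// Prints an <item> element with <title> and <link> children.
	fmt.Println(string(out))
}
```
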
- Consider the HackerNews front page below:
![hackernews](static/img/hn.png)
![hackernews](docs/img/hn.png)
- This has 4 items in list format. Over the day these items get updated, and we can use RSS feeds to track them.
- A single item looks something like this:
![item](static/img/item.png)
![item](docs/img/item.png)
- To identify the `element` associated with the list item, we can right-click on the list title and select `inspect`.
- The result is shown below:
![extractors](static/img/extractors.png)
![extractors](docs/img/extractors.png)
- The item here would be the `span` with the `class` attribute `titleline`
- The title and link elements will both be `a`, without any class attributes.
- Once we have identified these items we need to fill the following form:
![form](static/img/form.png)
- *Feed title* is the title of the feed
- *Feed name* is something to uniquely identify the feed one is tracking.
- *Website URL* is the URL for which to create the feed.
- *Description* is self-explanatory.
- Now we need to fill in the extractor elements. From the above example:
- For the item extractor values we have to use `span` as the tag and `titleline` as the class.
- For the title extractor values we have to use just the `a` tag and keep the class column blank.
- For the link extractor values we also have to use just the `a` tag and keep the class column blank.
- The title and link elements will both be `a`, without any class attributes. The title will be the text content of
the link.
- Once we have identified these items we need to fill the following form:

![form](docs/img/form.png)
- *Title* is the title under which you would like to track the feed. Here it could be something like
`HackerNews Feed`.
- *URL* is the URL of the page for which to create the feed. Here, `https://news.ycombinator.com/newest`.
- *Description* is optional: a short description of the feed.
- *Name* is something to uniquely identify the feed one is tracking. It could be a short name like `hn`.
- Now we need to fill in the extractor parameters (see the scraping sketch after this list). From the above example:
- For the item selector we have to use `span.titleline`
- For the title selector we have to use just `a`.
- For the link selector we have to use just `a`.
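
For reference, below is a minimal sketch of how these selectors might be applied with the
[colly](https://github.com/gocolly/colly) library that this project uses for scraping. The collector setup and the
printing are illustrative assumptions, not the project's actual code:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Item selector: every `span` with class `titleline` is one feed item.
	c.OnHTML("span.titleline", func(e *colly.HTMLElement) {
		// Title selector: the text of the `a` element inside the item.
		title := e.ChildText("a")
		// Link selector: the href attribute of the same `a` element.
		link := e.ChildAttr("a", "href")
		fmt.Printf("title=%q link=%q\n", title, link)
	})

	// Visit the page the feed is created for.
	if err := c.Visit("https://news.ycombinator.com/newest"); err != nil {
		fmt.Println("visit failed:", err)
	}
}
```
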
- After filling the form it should look like this:
![filled](static/img/filled.png)

![filled](docs/img/filled.png)

- Once we submit it, we get a page with the feed URL and other details, which looks like this:
![output](static/img/output.png)
- Copy the `hackernews.xml` link and add it to RSS Feed readers.
![output](docs/img/output.png)
- Copy the `hn.xml` link (`http://localhost:8080/static/rss/hn.xml`) and add it to any RSS feed reader.

### QuickStart

- `make init` to set up the database and download the dependencies.
- `make run` to start the application locally.
- `make build` to build the application locally.
- `make clean` to purge the contents of the database.

### Design & Implementation

- We are using the `requests` library to get the webpage.
- Then `BeautifulSoup` to extract the relevant elements and create the `<item>` for the RSS feed.
- We are saving the feeds in the `static/feeds` directory and there is a `feeds.db` database where we save feed
metadata.
- There is an `updater.py` file which runs at a fixed interval and, using the metadata, rescans the **urls** to update
the feed items.
- It also deletes any item older than **3** days.
- For some websites the page is not loaded completely until the JavaScript executes.
- For such pages we used `selenium.webdriver` to execute it and wait a second for the page to load completely.
- This project uses the [colly](https://github.com/gocolly/colly) library for scraping the webpage.
- The feeds are saved under the `ui/static/rss` directory and there is a `feeds.db` database where we save feed
metadata.
- Two functions are scheduled at configurable intervals in `main.go` (see the scheduling sketch after this list).
Using the metadata, they rescan all **urls** to update or clean the feed items.
- For some websites the web page is not loaded completely until the JavaScript executes.
- For such pages we used `` to execute it and wait a second for the page to load completely.
- There is a fair chance that the page is still not loaded completely, in which case we won't be able to track it.
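
As a sketch of the scheduling mentioned above: `main.go` calls `internal.ScheduleFunc(60*time.Second, app.cleanFeeds)`
and `internal.ScheduleFunc(90*time.Second, app.updateFeeds)`. Assuming the helper simply runs the function on a ticker
in a background goroutine (the real implementation may differ), it could look like:

```go
package internal

import "time"

// ScheduleFunc runs f once every interval in a background goroutine.
// This is a sketch of what the helper might look like; the project's
// actual implementation may handle shutdown or errors differently.
func ScheduleFunc(interval time.Duration, f func()) {
	ticker := time.NewTicker(interval)
	go func() {
		for range ticker.C {
			f()
		}
	}()
}
```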

### Hosting

- We have tried self-hosting it on DigitalOcean. If you want to do the same, there are scripts in the `scripts` folder.
- You first need to configure `doctl`. Here is a [link](https://docs.digitalocean.com/reference/doctl/how-to/install/)
explaining how to do it.
- There is a GitHub workflow which creates a Docker image for self-hosting.

### Comments

- Currently, we rescan all the feeds at a fixed interval. We can optimize it to scan each site at its own interval.
- We are also not passing request headers like `If-Modified-Since` or checking response headers like `Last-Modified`
(see the conditional-request sketch after this list).
- We can add an LLM button to find the extractors given a web page.
- This might not work well with small screens, as it has been tested only on a laptop.
- For websites which only load completely after running JS we can create a separate, slightly longer-running process to
load them.
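
To illustrate the conditional-request idea from the comments above, here is a minimal sketch using only the Go
standard library. The URL and the last-scan timestamp are illustrative assumptions, not the project's code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://news.ycombinator.com/newest", nil)
	if err != nil {
		panic(err)
	}
	// Ask the server to skip the body if nothing changed since the last scan.
	lastScan := time.Now().Add(-90 * time.Second) // hypothetical last-scan time
	req.Header.Set("If-Modified-Since", lastScan.UTC().Format(http.TimeFormat))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		fmt.Println("feed source unchanged, skipping rescan")
		return
	}
	// Otherwise the page changed (or the server ignored the header):
	// re-parse the body and note Last-Modified for the next scan.
	fmt.Println("Last-Modified:", resp.Header.Get("Last-Modified"))
}
```
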
5 changes: 3 additions & 2 deletions cmd/web/main.go
@@ -6,6 +6,7 @@ import (
"github.com/go-playground/form/v4"
"github.com/gorilla/sessions"
_ "github.com/mattn/go-sqlite3"
"github.com/rk1165/feedcreator/internal"
"github.com/rk1165/feedcreator/internal/models"
"github.com/rk1165/feedcreator/pkg/logger"
"html/template"
@@ -59,8 +60,8 @@ func main() {
WriteTimeout: 10 * time.Second,
}

//internal.ScheduleFunc(60*time.Second, app.cleanFeeds)
//internal.ScheduleFunc(90*time.Second, app.updateFeeds)
internal.ScheduleFunc(60*time.Second, app.cleanFeeds)
internal.ScheduleFunc(90*time.Second, app.updateFeeds)

logger.InfoLog.Printf("Starting server on %s", *addr)
err = server.ListenAndServe()
Binary file added docs/img/extractors.png
Binary file added docs/img/filled.png
Binary file added docs/img/form.png
Binary file added docs/img/hn.png
Binary file added docs/img/item.png
Binary file added docs/img/output.png
Binary file modified feeds.db
Binary file not shown.
6 changes: 1 addition & 5 deletions internal/models/feeds.go
@@ -107,7 +107,7 @@ func (m *FeedModel) All() ([]*Feed, error) {
}
defer rows.Close()

feeds := []*Feed{}
var feeds []*Feed

for rows.Next() {
feed := &Feed{}
@@ -126,10 +126,6 @@ func (m *FeedModel) All() ([]*Feed, error) {
return feeds, nil
}

//func (m *FeedModel) Update() (*Feed, error) {
// return nil, nil
//}

func (m *FeedModel) Delete(id int) error {
stmt := `DELETE FROM feed WHERE id = ?`
rows, err := m.DB.Exec(stmt, id)
File renamed without changes.
File renamed without changes.
File renamed without changes.