Generalize viewer database configuration #102

Merged (1 commit) on Sep 9, 2024
225 changes: 82 additions & 143 deletions README.md
@@ -1,43 +1,25 @@
# website-indexer 🪱

This repository crawls a website and stores its content in a SQLite database file.
Crawl a website and search its content.

Use the SQLite command-line interface to
[make basic queries](#searching-the-crawl-database)
about website content including:
This project consists of two components:
a **crawler** application to crawl the contents of a website and store its content in a database; and a **viewer** web application that allows for searching of that crawled content.

- URLs
- Page titles
- Full text search
- HTML search
- Link URLs
- Design components (CSS class names)
- Crawler errors (404s and more)
- Redirects
Both components require
[Python 3.12](https://www.python.org/)
to run and are built using the
[Django](https://www.djangoproject.com/)
web application framework.
The crawler piece is built on top of the Archive Team's
[ludios_wpull](https://github.com/ArchiveTeam/ludios_wpull)
web crawler.

This repository also contains a Django-based
[web application](#running-the-viewer-application)
to explore crawled website content in your browser.
Make queries through an easy-to-use web form, review page details,
and export results as CSV or JSON reports.
## Getting started

## Crawling a website

### Using a Python virtual environment

Create a Python virtual environment and install required packages:

```
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
```

Crawl a website:

```sh
./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```
This project can be run
[using Docker](#using-docker)
or a local
[Python virtual environment](#using-a-python-virtual-environment).

### Using Docker

@@ -47,109 +29,52 @@ To build the Docker image:
docker build -t website-indexer:main .
```

Crawl a website:
#### Viewing a sample crawl using Docker

To then run the viewer application using sample data:

```
docker run -it \
-p 8000:8000 \
-v `pwd`:/data website-indexer:main \
python manage.py crawl https://www.consumerfinance.gov /data/crawl.sqlite3
```

## Searching the crawl database

You can use the
[SQLite command-line client](https://www.sqlite.org/cli.html)
to make queries against the crawl database,
or a graphical client such as [DB4S](https://github.com/sqlitebrowser/sqlitebrowser) if you prefer.

To run the command-line client:

```
sqlite3 crawl.sqlite3
website-indexer:main
```

The following examples describe some common use cases.

### Dump database statistics

To list the total number of URLs and crawl timestamps:

```sql
sqlite> SELECT COUNT(*), MIN(timestamp), MAX(timestamp) FROM crawler_page;
23049|2022-07-20 02:50:02|2022-07-20 08:35:23
```
The web application using sample data will be accessible at http://localhost:8000/.

Note that page data is stored in a table named `crawler_page`.
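
To see which columns are available on that table before writing queries, the SQLite client's built-in `.schema` command can be used (a minimal sketch; the exact column list depends on the crawler version that produced the database):

```sql
sqlite> .tables
sqlite> .schema crawler_page
```
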
#### Crawling a website and viewing the crawl results using Docker

### List pages that link to a certain URL
To crawl a website using the Docker image,
storing the result in a local SQLite database named `crawl.sqlite3`:

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_links ON (crawler_page.id = crawler_page_links.page_id)
INNER JOIN crawler_link ON (crawler_page_links.link_id = crawler_link.id)
WHERE href LIKE "/plain-writing/"
ORDER BY url ASC;
```

To dump results to a CSV instead of the terminal:

```sql
sqlite> .mode csv
sqlite> .output filename.csv
sqlite> ... run your query here
sqlite> .output stdout
sqlite> .mode list
docker run -it \
-v `pwd`:/data \
website-indexer:main \
python manage.py crawl https://www.consumerfinance.gov /data/crawl.sqlite3
```

To search with wildcards, use the `%` character:
To then run the viewer web application to view that crawler database:

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_links ON (crawler_page.id = crawler_page_links.page_id)
INNER JOIN crawler_link ON (crawler_page_links.link_id = crawler_link.id)
WHERE href LIKE "/about-us/blog/"
ORDER BY url ASC;
```

### List pages that contain a specific design component

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_components ON (crawler_page.id = crawler_page_components.page_id)
INNER JOIN crawler_component ON (crawler_page_components.component_id = crawler_component.id)
WHERE crawler_component.class_name LIKE "o-featured-content-module"
ORDER BY url ASC;
docker run -it \
-p 8000:8000 \
-v `pwd`:/data \
-e DATABASE_URL=sqlite:////data/crawl.sqlite3 \
website-indexer:main
```

See the [CFPB Design System](https://cfpb.github.io/design-system/)
for a list of common components used on CFPB websites.
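
To get an overview of which component class names appear in a crawl and how often, a query along these lines can be used (a sketch reusing the table names from the query above):

```sql
sqlite> SELECT class_name, COUNT(DISTINCT page_id) AS page_count
FROM crawler_component
INNER JOIN crawler_page_components ON (crawler_component.id = crawler_page_components.component_id)
GROUP BY class_name
ORDER BY page_count DESC;
```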

### List pages with titles containing a specific string
The web application with the crawl results will be accessible at http://localhost:8000/.

```sql
SELECT url FROM crawler_page WHERE title LIKE "%housing%" ORDER BY url ASC;
```
### Using a Python virtual environment

### List pages with body text containing a certain string
Create a Python virtual environment and install required packages:

```sql
sqlite> SELECT url FROM crawler_page WHERE text LIKE "%diamond%" ORDER BY url ASC;
```

### List pages with HTML containing a certain string

```sql
sqlite> SELECT url FROM crawler_page WHERE html LIKE "%<br>%" ORDER BY url ASC;
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
```

## Running the viewer application

### Using a Python virtual environment

From the repo's root, compile frontend assets:

```
@@ -164,53 +89,61 @@ yarn
yarn watch
```

Create a Python virtual environment and install required packages:
#### Viewing a sample crawl using a Python virtual environment

Run the viewer application using sample data:

```
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
./manage.py runserver
```

Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:
The web application using sample data will be accessible at http://localhost:8000/.

```
export CRAWL_DATABASE=crawl.sqlite3
#### Crawling a website and viewing the crawl results using a Python virtual environment

To crawl a website and store the result in a local SQLite database named `crawl.sqlite3`:

```sh
./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```

Finally, run the Django webserver:
To then run the viewer web application to view that crawler database:

```
./manage.py runserver
DATABASE_URL=sqlite:///crawl.sqlite3 ./manage.py runserver
```

The viewer application will be available locally at http://localhost:8000.
The web application with the crawl results will be accessible at http://localhost:8000/.

### Using Docker
## Configuration

To build the Docker image:
### Database configuration

```
docker build -t website-indexer:main .
```
The `DATABASE_URL` environment variable can be used to specify the database
used for crawl results by the viewer application.
This project makes use of the
[dj-database-url](https://github.com/jazzband/dj-database-url)
project to convert that variable into a Django database specification.

To run the image using sample data:
For example, to use a SQLite file at `/path/to/db.sqlite`:

```
docker run -it -p 8000:8000 website-indexer:main
export DATABASE_URL=sqlite:////path/to/db.sqlite
```

To run the image using a local database dump:
(Note the use of four slashes when referring to an absolute path;
only three are needed when referring to a relative path.)
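
For example, the same kind of database could be referenced either way (the paths here are placeholders):

```
# Absolute path: four slashes after "sqlite:"
export DATABASE_URL=sqlite:////home/user/crawls/crawl.sqlite3

# Relative path (resolved against the current working directory): three slashes
export DATABASE_URL=sqlite:///crawl.sqlite3
```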

To point to a PostgreSQL database instead:

```
docker run \
-it \
-p 8000:8000 \
-v /path/to/local/dump:/data \
-e CRAWL_DATABASE=/data/crawl.sqlite3 \
website-indexer:main
export DATABASE_URL=postgres://username:password@localhost/dbname
```

Please see
[the dj-database-url documentation](https://github.com/jazzband/dj-database-url)
for additional examples.
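
As a rough illustration of what dj-database-url does with such a URL (a minimal sketch, not code from this repository), the library turns the URL into an ordinary Django database settings dictionary:

```python
import dj_database_url

# settings.py uses dj_database_url.config(), which reads the DATABASE_URL
# environment variable; parse() does the same conversion for an explicit URL.
config = dj_database_url.parse("postgres://username:password@localhost/dbname")

# config is a plain dict along the lines of:
# {"ENGINE": "django.db.backends.postgresql", "NAME": "dbname",
#  "USER": "username", "PASSWORD": "password", "HOST": "localhost", "PORT": ""}
```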

## Development

### Testing
@@ -227,8 +160,8 @@ To run the tests:
./manage.py test --keepdb
```

The `--keepdb` parameter is used because tests are run using a fixed,
pre-existing test database.
The `--keepdb` parameter is used because tests are run using
[a fixed, pre-existing test database](#sample-test-data).

### Code formatting

@@ -284,12 +217,18 @@ Then, in another terminal, start a crawl against the locally running site:
./manage.py crawl http://localhost:8000/ --recreate ./sample/src/sample.sqlite3
```

This will overwrite the test database with a fresh crawl.
(This uses a local Python virtual environment; see
[above](#crawling-a-website-and-viewing-the-crawl-results-using-docker)
for instructions on using Docker instead.)

This command will overwrite the sample database with a fresh crawl.

## Deployment

_For information on how this project is deployed at the CFPB,
employees and contractors should refer to the internal "CFGOV/crawler-deploy" repository._
employees and contractors should refer to the internal
[CFGOV/crawler-deploy](https://github.local/CFGOV/crawler-deploy/) 🔒
repository._

This repository includes a [Fabric](https://www.fabfile.org/) script
that can be used to configure a RHEL8 Linux server to run this project
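
For example, assuming Fabric is installed locally and the target server is reachable over SSH (the host and user below are placeholders), the `deploy` task defined in `fabfile.py` might be invoked along these lines:

```sh
fab -H deploy-user@crawler-server.example.com deploy
```
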
2 changes: 1 addition & 1 deletion fabfile.py
@@ -160,7 +160,7 @@ def deploy(conn):
--timeout 600 \\
wsgi
ExecReload=/bin/kill -s HUP $MAINPID
Environment=CRAWL_DATABASE={CRAWL_DATABASE}
Environment=CRAWL_DATABASE=sqlite:///{CRAWL_DATABASE}

[Install]
WantedBy=multi-user.target
1 change: 1 addition & 0 deletions requirements/base.txt
@@ -2,6 +2,7 @@ beautifulsoup4==4.12.3
click==8.1.7
cssselect==1.2.0
Django==4.2.15
dj-database-url==2.2.0
django-click==2.4.0
django-debug-toolbar==4.4.6
django-filter==24.3
41 changes: 13 additions & 28 deletions settings.py
@@ -1,17 +1,10 @@
"""
Django settings for viewer project.

For more information on this file, see
https://docs.djangoproject.com/en/3.2/topics/settings/

For the full list of settings and their values, see
https://docs.djangoproject.com/en/3.2/ref/settings/
"""

import os
import sys
from pathlib import Path

import dj_database_url


# Build paths inside the project like this: BASE_DIR / 'subdir'.
BASE_DIR = Path(__file__).resolve().parent

@@ -72,28 +65,20 @@
WSGI_APPLICATION = "wsgi.application"


# Database
# https://docs.djangoproject.com/en/3.2/ref/settings/#databases

_sample_db_path = str(BASE_DIR / "sample" / "sample.sqlite3")
_env_db_path = os.getenv("CRAWL_DATABASE")

if _env_db_path and os.path.exists(_env_db_path) and "test" not in sys.argv:
CRAWL_DATABASE = _env_db_path
else:
CRAWL_DATABASE = _sample_db_path

_sqlite_db_path = f"file:{CRAWL_DATABASE}?mode=ro"
# The default database is configured to use a sample SQLite file.
# Override this by setting DATABASE_URL in the environment.
# See https://github.com/jazzband/dj-database-url for URL formatting.
_sample_db_path = f"{BASE_DIR}/sample/sample.sqlite3"

DATABASES = {
"default": {
"ENGINE": "django.db.backends.sqlite3",
"NAME": _sqlite_db_path,
"TEST": {
"NAME": _sqlite_db_path,
"default": dj_database_url.config(
default=f"sqlite:///{_sample_db_path}",
# Python tests also use the same sample SQLite file.
test_options={
"NAME": _sample_db_path,
"MIGRATE": False,
},
},
),
}

# Internationalization