Generalize viewer database configuration #102

Merged (1 commit) on Sep 9, 2024
225 changes: 82 additions & 143 deletions README.md
@@ -1,43 +1,25 @@
# website-indexer 🪱

This repository crawls a website and stores its content in a SQLite database file.
Crawl a website and search its content.

Use the SQLite command-line interface to
[make basic queries](#searching-the-crawl-database)
about website content including:
This project consists of two components:
a **crawler** application to crawl the contents of a website and store its content in a database; and a **viewer** web application that allows for searching of that crawled content.

- URLs
- Page titles
- Full text search
- HTML search
- Link URLs
- Design components (CSS class names)
- Crawler errors (404s and more)
- Redirects
Both components require
[Python 3.12](https://www.python.org/)
to run and are built using the
[Django](https://www.djangoproject.com/)
web application framework.
The crawler piece is built on top of the Archive Team's
[ludios_wpull](https://github.com/ArchiveTeam/ludios_wpull)
web crawler.

This repository also contains a Django-based
[web application](#running-the-viewer-application)
to explore crawled website content in your browser.
Make queries through an easy-to-use web form, review page details,
and export results as CSV or JSON reports.
## Getting started

## Crawling a website

### Using a Python virtual environment

Create a Python virtual environment and install required packages:

```
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
```

Crawl a website:

```sh
./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```
This project can be run
[using Docker](#using-docker)
or a local
[Python virtual environment](#using-a-python-virtual-environment).

### Using Docker

@@ -47,109 +29,52 @@ To build the Docker image:
docker build -t website-indexer:main .
```

Crawl a website:
#### Viewing a sample crawl using Docker

To then run the viewer application using sample data:

```
docker run -it \
-p 8000:8000 \
-v `pwd`:/data website-indexer:main \
python manage.py crawl https://www.consumerfinance.gov /data/crawl.sqlite3
```

## Searching the crawl database

You can use the
[SQLite command-line client](https://www.sqlite.org/cli.html)
to make queries against the crawl database,
or a graphical client such as [DB4S](https://github.com/sqlitebrowser/sqlitebrowser) if you prefer.

To run the command-line client:

```
sqlite3 crawl.sqlite3
website-indexer:main
```

The following examples describe some common use cases.

### Dump database statistics

To list the total number of URLs and crawl timestamps:

```sql
sqlite> SELECT COUNT(*), MIN(timestamp), MAX(timestamp) FROM crawler_page;
23049|2022-07-20 02:50:02|2022-07-20 08:35:23
```
The web application using sample data will be accessible at http://localhost:8000/.

Note that page data is stored in a table named `crawler_page`.
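
To see which columns are available on that table before writing queries, the SQLite client's built-in `.schema` command can be used (a minimal sketch; the exact column list depends on the crawler version that produced the database):

```sql
sqlite> .tables
sqlite> .schema crawler_page
```
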
#### Crawling a website and viewing the crawl results using Docker

### List pages that link to a certain URL
To crawl a website using the Docker image,
storing the result in a local SQLite database named `crawl.sqlite3`:

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_links ON (crawler_page.id = crawler_page_links.page_id)
INNER JOIN crawler_link ON (crawler_page_links.link_id = crawler_link.id)
WHERE href LIKE "/plain-writing/"
ORDER BY url ASC;
```

To dump results to a CSV instead of the terminal:

```sql
sqlite> .mode csv
sqlite> .output filename.csv
sqlite> ... run your query here
sqlite> .output stdout
sqlite> .mode list
docker run -it \
-v `pwd`:/data \
website-indexer:main \
python manage.py crawl https://www.consumerfinance.gov /data/crawl.sqlite3
```

To search with wildcards, use the `%` character:
To then run the viewer web application to view that crawler database:

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_links ON (crawler_page.id = crawler_page_links.page_id)
INNER JOIN crawler_link ON (crawler_page_links.link_id = crawler_link.id)
WHERE href LIKE "/about-us/blog/"
ORDER BY url ASC;
```

### List pages that contain a specific design component

```sql
sqlite> SELECT DISTINCT url
FROM crawler_page
INNER JOIN crawler_page_components ON (crawler_page.id = crawler_page_components.page_id)
INNER JOIN crawler_component ON (crawler_page_components.component_id = crawler_component.id)
WHERE crawler_component.class_name LIKE "o-featured-content-module"
ORDER BY url ASC;
docker run -it \
-p 8000:8000 \
-v `pwd`:/data \
-e DATABASE_URL=sqlite:////data/crawl.sqlite3 \
website-indexer:main
```

See the [CFPB Design System](https://cfpb.github.io/design-system/)
for a list of common components used on CFPB websites.
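
To get an overview of which component class names appear in a crawl and how often, a query along these lines can be used (a sketch reusing the table names from the query above):

```sql
sqlite> SELECT class_name, COUNT(DISTINCT page_id) AS page_count
FROM crawler_component
INNER JOIN crawler_page_components ON (crawler_component.id = crawler_page_components.component_id)
GROUP BY class_name
ORDER BY page_count DESC;
```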

### List pages with titles containing a specific string
The web application with the crawl results will be accessible at http://localhost:8000/.

```sql
SELECT url FROM crawler_page WHERE title LIKE "%housing%" ORDER BY url ASC;
```
### Using a Python virtual environment

### List pages with body text containing a certain string
Create a Python virtual environment and install required packages:

```sql
sqlite> SELECT url FROM crawler_page WHERE text LIKE "%diamond%" ORDER BY url ASC;
```

### List pages with HTML containing a certain string

```sql
sqlite> SELECT url FROM crawler_page WHERE html LIKE "%<br>%" ORDER BY url ASC;
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
```

## Running the viewer application

### Using a Python virtual environment

From the repo's root, compile frontend assets:

```
@@ -164,53 +89,61 @@ yarn
yarn watch
```

Create a Python virtual environment and install required packages:
#### Viewing a sample crawl using a Python virtual environment

Run the viewer application using sample data:

```
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
./manage.py runserver
```

Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:
The web application using sample data will be accessible at http://localhost:8000/.

```
export CRAWL_DATABASE=crawl.sqlite3
#### Crawling a website and viewing the crawl results using a Python virtual environment

To crawl a website and store the result in a local SQLite database named `crawl.sqlite3`:

```sh
./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```

Finally, run the Django webserver:
To then run the viewer web application to view that crawler database:

```
./manage.py runserver
DATABASE_URL=sqlite:///crawl.sqlite3 ./manage.py runserver
```

The viewer application will be available locally at http://localhost:8000.
The web application with the crawl results will be accessible at http://localhost:8000/.

### Using Docker
## Configuration

To build the Docker image:
### Database configuration

```
docker build -t website-indexer:main .
```
The `DATABASE_URL` environment variable can be used to specify the database
used for crawl results by the viewer application.
This project makes use of the
[dj-database-url](https://github.com/jazzband/dj-database-url)
project to convert that variable into a Django database specification.

To run the image using sample data:
For example, to use a SQLite file at `/path/to/db.sqlite`:

```
docker run -it -p 8000:8000 website-indexer:main
export DATABASE_URL=sqlite:////path/to/db.sqlite
```

To run the image using a local database dump:
(Note the use of four slashes when referring to an absolute path;
only three are needed when referring to a relative path.)
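
For example, the same kind of database could be referenced either way (the paths here are placeholders):

```
# Absolute path: four slashes after "sqlite:"
export DATABASE_URL=sqlite:////home/user/crawls/crawl.sqlite3

# Relative path (resolved against the current working directory): three slashes
export DATABASE_URL=sqlite:///crawl.sqlite3
```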

To point to a PostgreSQL database instead:

```
docker run \
-it \
-p 8000:8000 \
-v /path/to/local/dump:/data \
-e CRAWL_DATABASE=/data/crawl.sqlite3 \
website-indexer:main
export DATABASE_URL=postgres://username:password@localhost/dbname
```

Please see
[the dj-database-url documentation](https://github.com/jazzband/dj-database-url)
for additional examples.
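
As a rough illustration of what dj-database-url does with such a URL (a minimal sketch, not code from this repository), the library turns the URL into an ordinary Django database settings dictionary:

```python
import dj_database_url

# settings.py uses dj_database_url.config(), which reads the DATABASE_URL
# environment variable; parse() does the same conversion for an explicit URL.
config = dj_database_url.parse("postgres://username:password@localhost/dbname")

# config is a plain dict along the lines of:
# {"ENGINE": "django.db.backends.postgresql", "NAME": "dbname",
#  "USER": "username", "PASSWORD": "password", "HOST": "localhost", "PORT": ""}
```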

## Development

### Testing
@@ -227,8 +160,8 @@ To run the tests:
./manage.py test --keepdb
```

The `--keepdb` parameter is used because tests are run using a fixed,
pre-existing test database.
The `--keepdb` parameter is used because tests are run using
[a fixed, pre-existing test database](#sample-test-data).

### Code formatting

@@ -284,12 +217,18 @@ Then, in another terminal, start a crawl against the locally running site:
./manage.py crawl http://localhost:8000/ --recreate ./sample/src/sample.sqlite3
```

This will overwrite the test database with a fresh crawl.
(This uses a local Python virtual environment; see
[above](#crawling-a-website-and-viewing-the-crawl-results-using-docker)
for instructions on using Docker instead.)

This command will overwrite the sample database with a fresh crawl.

## Deployment

_For information on how this project is deployed at the CFPB,
employees and contractors should refer to the internal "CFGOV/crawler-deploy" repository._
employees and contractors should refer to the internal
[CFGOV/crawler-deploy](https://github.local/CFGOV/crawler-deploy/) 🔒
repository._

This repository includes a [Fabric](https://www.fabfile.org/) script
that can be used to configure a RHEL8 Linux server to run this project
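
For example, assuming Fabric is installed locally and the target server is reachable over SSH (the host and user below are placeholders), the `deploy` task defined in `fabfile.py` might be invoked along these lines:

```sh
fab -H deploy-user@crawler-server.example.com deploy
```
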
2 changes: 1 addition & 1 deletion fabfile.py
@@ -160,7 +160,7 @@ def deploy(conn):
--timeout 600 \\
wsgi
ExecReload=/bin/kill -s HUP $MAINPID
Environment=CRAWL_DATABASE={CRAWL_DATABASE}
Environment=CRAWL_DATABASE=sqlite:///{CRAWL_DATABASE}

[Install]
WantedBy=multi-user.target
1 change: 1 addition & 0 deletions requirements/base.txt
@@ -2,6 +2,7 @@ beautifulsoup4==4.12.3
click==8.1.7
cssselect==1.2.0
Django==4.2.15
dj-database-url==2.2.0
django-click==2.4.0
django-debug-toolbar==4.4.6
django-filter==24.3
41 changes: 13 additions & 28 deletions settings.py
@@ -1,17 +1,10 @@
"""
Django settings for viewer project.

For more information on this file, see
https://docs.djangoproject.com/en/3.2/topics/settings/

For the full list of settings and their values, see
https://docs.djangoproject.com/en/3.2/ref/settings/
"""

import os
import sys
from pathlib import Path

import dj_database_url


# Build paths inside the project like this: BASE_DIR / 'subdir'.
BASE_DIR = Path(__file__).resolve().parent

@@ -72,28 +65,20 @@
WSGI_APPLICATION = "wsgi.application"


# Database
# https://docs.djangoproject.com/en/3.2/ref/settings/#databases

_sample_db_path = str(BASE_DIR / "sample" / "sample.sqlite3")
_env_db_path = os.getenv("CRAWL_DATABASE")

if _env_db_path and os.path.exists(_env_db_path) and "test" not in sys.argv:
CRAWL_DATABASE = _env_db_path
else:
CRAWL_DATABASE = _sample_db_path

_sqlite_db_path = f"file:{CRAWL_DATABASE}?mode=ro"
# The default database is configured to use a sample SQLite file.
# Override this by setting DATABASE_URL in the environment.
# See https://github.com/jazzband/dj-database-url for URL formatting.
_sample_db_path = f"{BASE_DIR}/sample/sample.sqlite3"

DATABASES = {
"default": {
"ENGINE": "django.db.backends.sqlite3",
"NAME": _sqlite_db_path,
"TEST": {
"NAME": _sqlite_db_path,
"default": dj_database_url.config(
default=f"sqlite:///{_sample_db_path}",
# Python tests also use the same sample SQLite file.
test_options={
"NAME": _sample_db_path,
"MIGRATE": False,
},
},
),
}

# Internationalization