Deprecate wget/WARC-based crawler approach
PR 81 implemented a new crawler approach based on wpull. This change
deprecates the old approach based on wget crawling into an intermediate
WARC file.
chosak committed Jul 2, 2024
1 parent 8bd065d commit cbfb66c
Showing 8 changed files with 1 addition and 522 deletions.
6 changes: 0 additions & 6 deletions .gitignore
@@ -31,12 +31,6 @@ _site/
 *.sqlite3
 !sample/sample.sqlite3
 
-# Crawl files, except for the sample crawl #
-############################################
-*.warc
-*.warc.gz
-!sample/crawl.warc.gz
-
 # OS generated files #
 ######################
 .DS_Store
104 changes: 0 additions & 104 deletions crawler/management/commands/warc_to_csv.py

This file was deleted.

76 changes: 0 additions & 76 deletions crawler/management/commands/warc_to_db.py

This file was deleted.

189 changes: 0 additions & 189 deletions crawler/reader.py

This file was deleted.

2 changes: 1 addition & 1 deletion crawler/wpull_plugin.py
@@ -69,7 +69,7 @@ def deactivate(self):
         self.db_writer.analyze()
 
     def init_db(self):
-        db_alias = "warc_to_db"
+        db_alias = "crawler"
 
         connections.databases[db_alias] = {
             "ENGINE": "django.db.backends.sqlite3",