Add WARC support #128
Comments
Thanks for your suggestion. It seems that libarchive has supported WARC since 2014, but when I tried to mount a WARC file with archivemount or fuse-archive, the mount point was empty. If libarchive works in general, then adding a libarchive backend would also implement this, but the problem with archivemount doesn't bode well. The performance with libarchive wouldn't be optimal anyway because its interface is not designed for random access. By default, wget also compresses each record individually with gzip, which is very well-behaved for random access via rapidgzip. It should be fast and the index should be small. The Common Crawl dataset is also served as warc.gz and would be a very strong use case for performant access. The format itself looks simple enough. It is reminiscent of TAR in that way. For example, I tried
It has all the necessary information such as date, URI, and content length. So yeah, similar to TAR, we could simply collect the offsets for each entry and jump to it. This would avoid parsing the archive from the beginning, which would have to be done with a libarchive backend. However, this dump already shows multiple problems. Do you have any opinion, expectation, or precedent as to how the mounted view should look?
I guess that these conceptual problems are the reason why archivemount and fuse-archive don't work. Alternatively, each WARC record could simply be exposed as a file named by its record number, counting from 0, or maybe even better by its WARC UUID. The mount point would then contain no hierarchy and possibly hundreds of thousands of files with only cryptic file names. The URI would also be exposed via extended file attributes, if that works. This would save a lot of complexity and assumptions on the ratarmount side. Would that be an option for you?
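To make the offset-collecting idea concrete, here is a minimal sketch in Python. It is not ratarmount code: `RecordInfo`, `index_warc`, and `flat_name` are hypothetical names, and it assumes an uncompressed WARC whose records each start with a `WARC/` version line, carry a `Content-Length` header, and are separated by blank lines. Folded header lines are not handled.

```python
from typing import BinaryIO, Dict, Iterator, NamedTuple


class RecordInfo(NamedTuple):
    offset: int          # byte offset of the "WARC/x.y" version line
    payload_offset: int  # byte offset of the first byte after the header block
    length: int          # value of the Content-Length header
    headers: Dict[str, str]


def index_warc(stream: BinaryIO) -> Iterator[RecordInfo]:
    """Yield one RecordInfo per record without reading any record body."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"Expected a WARC version line at offset {offset}")

        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        length = int(headers["content-length"])
        payload_offset = stream.tell()
        yield RecordInfo(offset, payload_offset, length, headers)
        # Jump over the body instead of parsing it, as with TAR entries.
        stream.seek(payload_offset + length)


def flat_name(info: RecordInfo, number: int) -> str:
    """Name a record after its WARC UUID, falling back to its running number."""
    record_id = info.headers.get("warc-record-id", "")
    uuid = record_id.strip("<>").rpartition(":")[2]
    return uuid or str(number)
```

With such an index, reading one record is a seek to `payload_offset` followed by a read of `length` bytes, and the `warc-target-uri` header is available for whatever naming or extended-attribute scheme is chosen.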
I don't have all the answers, but:
There are well-established index formats for WARCs that do what you're describing, collecting offsets for the various pieces of content. They are the basis of how the wayback machine works (the technology, not to be confused with the Internet Archive, a service using similar technology). CDX indexes are one text-based way of doing this, although I know that a BDB format was also in use at some point. Webrecorder has a tool for generating these: CDXJ-Indexer. You might also want to check out WACZ, which bundles the index and the WARC(s) into a single zip file.

The thing I would caution you to bear in mind is that WARCs don't generally contain traditional file system resources. It was probably true in the early days of the web that websites were reflections of some physical filesystem layout on a server, served largely as static content, but that hasn't been true for quite some time. Websites today are more like applications. What you're getting in a WARC (at least to the extent that you're using them as Web ARChives and not as a generic content + metadata container, which I know some people do) is the full set of requests and responses made when crawling a particular site. Some of those requests are for resources that you could map onto a filesystem-like structure, but lots of them aren't, so it would be worth bearing in mind what those resources even mean in the context of a mount point like this. Hope this helps.
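As a rough illustration of how such an index can be used for random access, here is a Python sketch. It assumes the CDXJ line layout of a URL key, a timestamp, and a JSON object separated by single spaces, with `offset` and `length` fields that locate one gzip member inside a warc.gz file. The field names are assumptions to check against the indexer's actual output, and `CdxjEntry`, `parse_cdxj_line`, and `read_compressed_record` are hypothetical helpers, not part of any tool mentioned here.

```python
import gzip
import json
from typing import BinaryIO, NamedTuple


class CdxjEntry(NamedTuple):
    urlkey: str      # sortable form of the URL, e.g. "com,example)/"
    timestamp: str   # capture time, kept as a string
    fields: dict     # decoded JSON object (assumed to hold offset and length)


def parse_cdxj_line(line: str) -> CdxjEntry:
    """Split one index line into URL key, timestamp, and JSON fields."""
    urlkey, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return CdxjEntry(urlkey, timestamp, json.loads(raw_json))


def read_compressed_record(stream: BinaryIO, entry: CdxjEntry) -> bytes:
    """Decompress only the gzip member that the index entry points at."""
    offset = int(entry.fields["offset"])
    length = int(entry.fields["length"])
    stream.seek(offset)
    return gzip.decompress(stream.read(length))
```

This only works when every record is its own gzip member, which is the per-record compression described earlier in the thread.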
While working on #109 / #130, I reached a state that can mount WARC files with libarchive. Without any special treatment, the file hierarchy for hello-world.warc provided by libarchive looks like this:
python3 ratarmount.py -f -d 3 tests/hello-world.warc mounted
tree mounted
Output:
And the file contents of
Trying to mount the test file created with
Adding debug output also shows nothing and there seem to be no errors, i.e., libarchive behaves as if the file were an empty archive. I'd have to check the libarchive implementation source to see why this happens. Maybe it is because, as @jackdos said, none of the WARC records can be mapped onto a filesystem-like structure.
WARC support would be great. It's used by at-scale web archives across the world as the standard file format for web archiving. More information at https://en.wikipedia.org/wiki/WARC_(file_format)
Most Linux distros have wget, whose modern versions can generate one flavour of WARC file, using the --warc-file=file argument.