The chgov-brprotokolle project manages, retrieves and displays historic minutes of the Federal Council, based on the IIIF standard. The live set-up can be seen on the Federal Archives' site. The project is split into four dedicated repositories. This repository, chgov-brprotokolle-server, is the backend for the ingestion of minutes and the interface for SOLR search requests. It is written in TypeScript and is based on the archival-iiif-server. The other repositories are the publicly accessible frontend (chgov-brprotokolle-frontend), a frontend utility that enables OCR display in Mirador (chgov-brprotokolle-mirador-ocr-helper) and the documentation (chgov-brprotokolle-markdown). The frontend is written in React and the frontend utility in plain JavaScript.
The backend server has two major tasks: ingestion and search routing. Search requests are passed more or less directly to the corresponding SOLR instance; the objective is to provide an interface for queries. Ingestion, outlined below, stores data in the SOLR instance and creates IIIF representations of the minutes.
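As a rough illustration of the search routing, the sketch below turns an incoming search request into a URL for Solr's standard select handler. The names SearchRequest and buildSolrSelectUrl, the base URL and the core name are assumptions made for this example and are not taken from the project's code. Only the Solr request parameters q, rows, start and wt are standard.

```typescript
// Hypothetical request shape; the real server may accept different parameters.
interface SearchRequest {
  query: string;  // user-supplied search terms
  rows?: number;  // page size (assumed default: 10)
  start?: number; // offset of the first result (assumed default: 0)
}

// Builds a URL for Solr's /select handler. baseUrl and core are assumptions.
function buildSolrSelectUrl(baseUrl: string, core: string, req: SearchRequest): string {
  const params: Array<[string, string]> = [
    // An empty query falls back to Solr's match-all query.
    ["q", req.query.trim() === "" ? "*:*" : req.query],
    ["rows", String(req.rows ?? 10)],
    ["start", String(req.start ?? 0)],
    ["wt", "json"],
  ];
  const queryString = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Strip trailing slashes so the path is joined with exactly one separator.
  return `${baseUrl.replace(/\/+$/, "")}/${encodeURIComponent(core)}/select?${queryString}`;
}
```

Calling buildSolrSelectUrl("http://localhost:8983/solr/", "minutes", { query: "Bundesrat 1848" }) yields a select URL that can be fetched and whose JSON response can be handed back to the frontend.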
The ingestion pipeline handles either handwritten minutes (e.g. with OCR provided by the Transkribus project) or machine-written minutes (e.g. as PDF files, with no OCR provided), enriches the minutes with the provided metadata and ultimately stores the relevant information in a SOLR instance.
To start the ingestion, files in the appropriate format have to be added to the HOTFOLDER, where the dirWatcher picks them up. What happens next depends on the type of minutes. Handwritten minutes are handled by the collectionBuilder for further processing. Machine-written minutes are ingested as single PDFs, so before further processing the images have to be extracted (imgExtractor) and OCR is then extracted from those images (ocrExtractor).
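The branching described above can be sketched as a small planning function. This is only one reading of the description: the step names are those mentioned in this document, but the MinutesType type, the planIngestionSteps function and the exact ordering are assumptions, not the project's actual pipeline configuration.

```typescript
// Hypothetical type; the real pipeline may distinguish the inputs differently.
type MinutesType = "handwritten" | "machine-written";

// Returns the step names in the order they are assumed to run.
function planIngestionSteps(type: MinutesType): string[] {
  // Handwritten minutes arrive with OCR, machine-written PDFs need
  // image extraction first and OCR extraction afterwards.
  const typeSpecific =
    type === "handwritten"
      ? ["collectionBuilder"]
      : ["imgExtractor", "ocrExtractor"];
  // After the type-specific steps both kinds of minutes share the same path.
  return ["dirWatcher", ...typeSpecific, "manifestCreate", "solrAdd"];
}
```

Keeping the type-specific part in one place makes it visible that both kinds of minutes converge on the same final steps.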
At this point the images, OCR data and metadata are all available, and there is no longer any distinction between machine-written and handwritten minutes.
The OCR data is compiled into a single text file, the OCR plaintext, which is stored together with the images and the OCR data under the DATAFOLDER directory.
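A minimal sketch of how per-page OCR results could be compiled into one plaintext string is shown below. The OcrPage shape and the compilePlaintext function are hypothetical; the actual OCR format (for example the Transkribus output) is richer and is not modelled here.

```typescript
// Hypothetical, simplified OCR result for one page.
interface OcrPage {
  pageNumber: number;
  lines: string[]; // recognised text lines in reading order
}

// Joins all pages into one plaintext string: one line of text per OCR line,
// pages in ascending order, separated by a blank line.
function compilePlaintext(pages: OcrPage[]): string {
  return [...pages] // copy so the caller's array is not reordered
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map((page) =>
      page.lines
        .map((line) => line.trim())
        .filter((line) => line.length > 0) // drop empty OCR lines
        .join("\n")
    )
    .join("\n\n");
}
```

The resulting string could then be written to a single text file next to the images and OCR data.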
The metadata and the known locations of the images, OCR data and OCR plaintext are used to generate the IIIF manifests (manifestCreate).
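To give an idea of what a generated manifest contains, the sketch below builds a minimal manifest with one canvas per page image. It assumes the IIIF Presentation API 3.0 structure; which API version the project actually emits is not stated here. The PageImage type, the buildManifest function, the URL scheme for canvases and the image/jpeg format are all assumptions made for illustration.

```typescript
// Hypothetical description of one page image.
interface PageImage {
  id: string;     // URL of the image
  width: number;  // pixel width
  height: number; // pixel height
}

// Builds a minimal manifest following the IIIF Presentation 3.0 layout:
// Manifest -> Canvas -> AnnotationPage -> painting Annotation -> Image.
function buildManifest(manifestId: string, title: string, pages: PageImage[]) {
  return {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    id: manifestId,
    type: "Manifest",
    label: { none: [title] }, // "none" marks a label without a language
    items: pages.map((page, index) => {
      const canvasId = `${manifestId}/canvas/${index + 1}`; // assumed URL scheme
      return {
        id: canvasId,
        type: "Canvas",
        width: page.width,
        height: page.height,
        items: [
          {
            id: `${canvasId}/page`,
            type: "AnnotationPage",
            items: [
              {
                id: `${canvasId}/annotation`,
                type: "Annotation",
                motivation: "painting",
                body: {
                  id: page.id,
                  type: "Image",
                  format: "image/jpeg", // assumed image format
                  width: page.width,
                  height: page.height,
                },
                target: canvasId, // the image is painted onto its own canvas
              },
            ],
          },
        ],
      };
    }),
  };
}
```

Serialising the returned object with JSON.stringify gives a document that a IIIF viewer such as Mirador could load, provided the image URLs resolve.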
These manifests are delivered by an external web server, which is not part of the backend project.
The pipeline is built so that the solrAdd step finalises the ingestion by adding the relevant information to the SOLR instance.
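The final step can be pictured as a JSON update request against Solr. The sketch below only prepares such a request and does not send it. The MinutesDocument fields, the buildSolrAddRequest function, the base URL and the core name are hypothetical; the real field names are defined by the project's SOLR schema. The /update handler, the commit=true parameter and the JSON array body are standard Solr.

```typescript
// Hypothetical document shape; the real fields come from the SOLR schema.
interface MinutesDocument {
  id: string;
  title: string;
  text: string;        // OCR plaintext
  manifestUrl: string; // location of the IIIF manifest
}

// Describes a POST to Solr's JSON update handler without sending it.
function buildSolrAddRequest(baseUrl: string, core: string, docs: MinutesDocument[]) {
  return {
    // commit=true makes the documents searchable right after the request.
    url: `${baseUrl.replace(/\/+$/, "")}/${encodeURIComponent(core)}/update?commit=true`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Solr accepts a JSON array of documents as the request body.
    body: JSON.stringify(docs),
  };
}
```

The returned url, method, headers and body can be passed to any HTTP client, for example fetch(request.url, request).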
To set up the backend server, a running SOLR instance prepared with the appropriate schema and plugin is mandatory.
The development environment is installed by calling npm install, as this is a Node.js project.
Custom elements for the pipeline can be added as described in the archival-iiif-server documentation.
There are no automated tests. End-to-end runs have to be checked manually.
GNU Affero General Public License (AGPLv3), see LICENSE
This repository is a copy that is updated regularly, so contributions via pull requests are not possible. However, independent copies (forks) are possible under the terms of the MIT license.
- For general questions (and technical support), please contact the Swiss Federal Archives by e-mail at [email protected].
- Technical questions or problems concerning the source code can be posted here on GitHub via the "Issues" interface.