Data Mining Historical Newspaper Metadata (Europeana Newspaper Project)
Newspapers from the collections of European digital libraries are part of the data set that the Europeana Newspapers project processed with OLR (Optical Layout Recognition). The OLR refinement (performed by CCS) consists of describing the structure of each issue and its articles (spatial extent, title and subtitle, classification of content types) using the METS/ALTO formats.
A set of bibliographical and descriptive metadata about the content (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. Shell and XSLT scripts run with Xalan-Java extract some of these metadata from the METS manifest or the OCR files.
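To make the kind of per-page count described above concrete, here is a minimal sketch in Python (not the project's own XSLT or Perl code). It assumes that words are ALTO `String` elements and that text blocks and illustrations are `TextBlock` and `Illustration` elements; the function names and the sample page are invented for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_alto_elements(alto_xml: str) -> dict:
    """Count words, text blocks and illustrations in one ALTO page."""
    root = ET.fromstring(alto_xml)
    counts = {"words": 0, "text_blocks": 0, "illustrations": 0}
    for element in root.iter():
        name = local_name(element.tag)
        if name == "String":          # one String element per OCR'ed word
            counts["words"] += 1
        elif name == "TextBlock":
            counts["text_blocks"] += 1
        elif name == "Illustration":
            counts["illustrations"] += 1
    return counts


# Invented sample page, only to exercise the function.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Le"/>
            <SP/>
            <String CONTENT="Matin"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

if __name__ == "__main__":
    print(count_alto_elements(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a design choice that lets the same sketch read files declared against different ALTO schema versions.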
Detailed presentation:
You can use either the XSLT sheets (driven by DOS scripts) or the Perl script (faster).
Sample documents are stored in the "DOCS" folder. The metadata are generated in a "STATS" folder.
Two DOS shell scripts:
- batch-EN.bat
- xslt.cmd
Two XSLT sheets:
- analyseAltosCCS.xsl
- calculeStatsMETS_CSV.xsl
The XSLT sheets are run with Xalan-Java. The path to the Java bin must be set in xslt.cmd.
For each document, the metadata are stored in the STATS folder in two formats:
- XML (raw metadata, with detailed values for each page)
- CSV (metadata at the issue level)
An aggregated file (metadata.csv) contains all the CSV metadata.
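The aggregation step can be sketched as follows in Python. This is an illustration under stated assumptions, not the batch script itself: it assumes that every per-document CSV starts with the same header row and uses a comma delimiter, and the function name `aggregate_csv` is hypothetical.

```python
import csv
from pathlib import Path


def aggregate_csv(stats_dir: Path, output_name: str = "metadata.csv") -> int:
    """Concatenate per-document CSV files into one file with a single header.

    Returns the number of data rows written.
    """
    output_path = stats_dir / output_name
    # Collect the sources first and skip a previous aggregate, if any.
    sources = sorted(p for p in stats_dir.glob("*.csv") if p.name != output_name)
    rows_written = 0
    header_written = False
    with output_path.open("w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for source in sources:
            with source.open(newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:        # empty file, nothing to copy
                    continue
                if not header_written:    # keep the header of the first file only
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `aggregate_csv(Path("STATS"))` would write `STATS/metadata.csv` from the other CSV files in that folder.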
- Open a DOS terminal.
- Change dir to the batch folder
Faster and richer (more metadata) than the XSLT scripts.
One Perl script. For each document, the metadata are stored in the STATS folder in your preferred formats: XML, JSON, CSV or txt.
- Open a shell terminal.
- Change dir to the batch folder
perl DOCS xml json
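As an illustration of what the choice of output formats means for one issue, the sketch below serialises the same record as JSON and as CSV. It is written in Python rather than Perl, and the field names and values in `record` are invented; the actual keys produced by the Perl script may differ.

```python
import csv
import io
import json


def to_json(record: dict) -> str:
    """Serialise one issue-level record as indented JSON."""
    return json.dumps(record, ensure_ascii=False, indent=2)


def to_csv(record: dict) -> str:
    """Serialise one issue-level record as a header line plus one data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


# Hypothetical record: keys and values are placeholders, not project data.
record = {"title": "Le Petit Parisien", "pages": 6, "articles": 42}

if __name__ == "__main__":
    print(to_json(record))
    print(to_csv(record))
```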
(Made with Highcharts)
The complete set of derived data contains about 4,500,000 atomic metadata items from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) in the Gallica press collections:
- Le Matin
- Le Gaulois
- Le Petit journal illustré
- Le Journal des débats politiques et littéraires
- Le Petit Parisien
- Ouest-Eclair
See Datasets
This work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380).