Data Mining Historical Newspaper Metadata (Europeana Newspaper Project)
Newspapers from European digital library collections are part of the data set processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement (performed by CCS) describes the structure of each issue and its articles (spatial extent, title and subtitle, classification of content types) using the METS/ALTO formats.
A set of bibliographical and descriptive metadata relating to content (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. Shell and XSLT scripts run with Xalan-Java extract some of these metadata from the METS manifest or the OCR files.
Detailed presentation:
You can use either the XSLT sheets (driven by DOS scripts) or the Perl script (faster).
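As an illustration of the kind of page-level counts mentioned above, here is a minimal Python sketch that counts words, text lines and illustrations in one ALTO page. It is not the project's XSLT or Perl code. It relies only on the standard ALTO element names (String, TextLine, Illustration); the function names and the sample page are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the sketch works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_page_stats(alto_xml):
    """Count words, text lines and illustrations in one ALTO page (XML string)."""
    root = ET.fromstring(alto_xml)
    stats = {"words": 0, "lines": 0, "illustrations": 0}
    for element in root.iter():
        name = local_name(element.tag)
        if name == "String":          # one String element per recognized word
            stats["words"] += 1
        elif name == "TextLine":
            stats["lines"] += 1
        elif name == "Illustration":
            stats["illustrations"] += 1
    return stats


# Hypothetical, heavily reduced ALTO page used only to exercise the function.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine><String CONTENT="Le"/><SP/><String CONTENT="Matin"/></TextLine>
    </TextBlock>
    <Illustration ID="IL1"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(alto_page_stats(SAMPLE))
```

Matching on local names rather than a fixed namespace is a deliberate simplification, since the ALTO namespace URI differs between schema versions.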
Sample documents are stored in the "DOCS" folder. The metadata are generated in a "STATS" folder.
Two DOS shell scripts:
- batch-EN.bat
- xslt.cmd
Two XSLT sheets:
- analyseAltosCCS.xsl
- calculeStatsMETS_CSV.xsl
The XSLT sheets are run with Xalan-Java. The path to the Java bin directory must be set in xslt.cmd.
For each document, its metadata are stored in the STATS folder in two formats:
- XML (raw metadata, with detailed values for each page)
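For readers unfamiliar with Xalan-Java, the sketch below shows how a transformation command line can be assembled. It uses Xalan's documented command-line class org.apache.xalan.xslt.Process with its -IN, -XSL and -OUT options. Everything else is an assumption: the function names, the Java and jar locations, and the input and output file names are hypothetical, and the actual content of xslt.cmd may differ.

```python
import subprocess
from pathlib import Path


def xalan_command(java_bin, classpath, source_xml, stylesheet, output_file):
    """Build the argument list for one Xalan-Java transformation."""
    return [
        str(Path(java_bin) / "java"),      # Java executable inside the bin directory
        "-cp", classpath,                  # must include the Xalan jar(s)
        "org.apache.xalan.xslt.Process",   # Xalan command-line entry point
        "-IN", str(source_xml),
        "-XSL", str(stylesheet),
        "-OUT", str(output_file),
    ]


def run_xalan(*args):
    """Run the transformation; raises CalledProcessError if Java reports a failure."""
    subprocess.run(xalan_command(*args), check=True)


# Hypothetical paths, shown only to illustrate the resulting command line.
cmd = xalan_command(
    "/opt/java/bin",
    "xalan.jar",
    "DOCS/sample-mets.xml",
    "calculeStatsMETS_CSV.xsl",
    "STATS/sample.csv",
)
print(" ".join(cmd))
```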
- CSV (metadata at the issue level)
An aggregated file (metadata.csv) contains all the CSV metadata.
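The aggregation step can be pictured with the following Python sketch, which concatenates per-issue CSV files into one metadata.csv. It is an illustration, not the batch script's actual logic. It assumes that each per-issue CSV starts with the same header row and uses a semicolon delimiter; the function name and the column names in the demo are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csv(stats_dir, output_name="metadata.csv", delimiter=";"):
    """Merge every per-issue CSV in stats_dir into one file; return the row count."""
    stats_dir = Path(stats_dir)
    # Skip the aggregated file itself so the function can be re-run safely.
    sources = sorted(p for p in stats_dir.glob("*.csv") if p.name != output_name)
    header = None
    rows = []
    for source in sources:
        with source.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            file_header = next(reader, None)
            if file_header is None:       # empty file: nothing to merge
                continue
            if header is None:            # keep the first header only
                header = file_header
            rows.extend(row for row in reader if row)
    with (stats_dir / output_name).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        if header:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Demo on a temporary folder with two made-up issue files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "issue1.csv").write_text("id;pages\nissue1;4\n", encoding="utf-8")
    Path(tmp, "issue2.csv").write_text("id;pages\nissue2;6\n", encoding="utf-8")
    print(aggregate_csv(tmp), "rows written")
```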
- Open a DOS terminal.
- Change dir to the batch folder
- Run batch-EN.bat
The Perl script is faster and richer (more metadata) than the XSLT scripts.
One Perl script: extractMD.pl
For each document, its metadata are stored in the STATS folder in your preferred formats: XML, JSON, CSV, txt.
- Open a shell terminal.
- Change dir to the batch folder
- Run perl extractMD.pl DOCS xml json
The complete set of derived data contains about 4,500,000 atomic metadata values from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) in the Gallica (www.gallica.fr) press collections:
- Le Matin
- Le Gaulois
- Le Petit journal illustré
- Le Journal des débats politiques et littéraires
- Le Petit Parisien
- Ouest-Eclair
See Datasets
CC0
This work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380)