py_heideltime is a Python wrapper for the multilingual temporal tagger HeidelTime.
For more information about this temporal tagger, please visit the HeidelTime Java standalone version: https://github.com/HeidelTime/heideltime
This wrapper has been developed by Jorge Mendes under the supervision of Professor Ricardo Campos in the scope of the Final Project of the Computer Science degree at the Polytechnic Institute of Tomar, Portugal.
Although some Python packages for HeidelTime already exist (in particular https://github.com/amineabdaoui/python-heideltime or, more recently, https://github.com/PhilipEHausner/python_heideltime), all of them require considerable intervention from the user. In this project, we aim to overcome some of these limitations. Our aim was seven-fold:
- To provide a multi-platform solution (Windows, Linux, macOS);
- To make it user-friendly not only in terms of installation but also in its usage;
- To make it lightweight without compromising its behavior;
- To give the user the chance to choose the granularity (e.g., year, month, etc) of the dates to be extracted;
- To handle texts with emojis (note: heideltime demo and existing packages throw an exception when a text has an emoji);
- To return to the user a normalized version of the text (where each temporal expression is replaced by the normalized HeidelTime version); and
- To retrieve a Time-ML annotated version of the text (as done in the Heideltime demo).
Docker for Windows requires 64-bit Windows 10 Pro with Hyper-V available.
If your system meets these requirements, download it here: (https://docs.docker.com/docker-for-windows/install/#download-docker-for-windows) and click on Get Docker for Windows (Stable)
If your system does not meet the requirements to run Docker for Windows (e.g., 64-bit Windows 10 Home), you can install Docker Toolbox, which uses Oracle VirtualBox instead of Hyper-V. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click on Get Docker Toolbox for Windows
Docker for Mac will launch only if all of these requirements (https://docs.docker.com/docker-for-mac/install/#what-to-know-before-you-install) are met.
If your system meets these requirements, download it here: (https://docs.docker.com/docker-for-mac/install/#download-docker-for-mac) and click on Get Docker for Mac (Stable)
If your system does not meet the requirements to run Docker for Mac, you can install Docker Toolbox, which uses Oracle VirtualBox instead of HyperKit. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click on Get Docker Toolbox for Mac
On Linux, proceed to download here: (https://docs.docker.com/engine/installation/#server)
Execute the following command on your docker machine:
docker pull liaad/py_heideltime
This will retrieve the image from the following repository: https://hub.docker.com/r/liaad/py_heideltime
On your docker machine run the following to launch the image:
docker run -p 9999:8888 liaad/py_heideltime
Then go to your browser and type in the following url:
http://<DOCKER-MACHINE-IP>:9999
where the IP may be the localhost or 192.168.99.100 if you are using a Docker Machine VM.
You will be asked for a token, which you can find on your docker machine prompt. It will be something similar to this: http://eac214218126:8888/?token=ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8. Copy and paste the token (in this example, that would be: ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8) into the browser, and voila, you will have the py_heideltime package ready to run. Keep this token (for future reference) or define a password.
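If you prefer not to copy the token by hand, the snippet below is a minimal sketch of pulling it out of the URL printed on the docker machine prompt. It only uses the Python standard library; the helper name `extract_token` is hypothetical and not part of py_heideltime.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(jupyter_url):
    """Return the value of the 'token' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(jupyter_url).query)
    tokens = query.get("token")
    return tokens[0] if tokens else None


# URL taken from the example above
url = "http://eac214218126:8888/?token=ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8"
print(extract_token(url))
```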
Once you have logged in, proceed by running the notebook that we have prepared for you.
Once you are done go to File - Shutdown.
If later on you decide to play with the same container, you should proceed as follows. The first thing to do is to get the container id:
docker ps -a
Next run the following commands:
docker start ContainerId
docker attach ContainerId (attach to a running container)
Nothing happens in your docker machine, but you are now ready to open your browser as you did before:
http://<DOCKER-MACHINE-IP>:9999
Hopefully, you have saved the token or defined a password. If that is not the case, then you should run the following command to have access to your token (run it after docker start and before docker attach, since docker exec only works on a running container):
docker exec -it <docker_container_name> jupyter notebook list
On your docker machine run the following to launch the image in background mode:
docker run -p 9999:8888 -d liaad/py_heideltime
You can then execute py_heideltime in the prompt. An example is given below:
docker run -p 9999:8888 -d liaad/py_heideltime
py_heideltime -t "August 31st ..." -l "English"
pip install git+https://github.com/JMendes1995/py_heideltime.git
In order to use py_heideltime you must have the Java JDK and Perl installed on your machine, as HeidelTime depends on them.
To install the Java JDK begin by downloading it here. Once it is installed don't forget to add the path to the environment variables. On user variables for Administrator add JAVA_HOME as the Variable name, and the path (e.g., C:\Program Files\Java\jdk-12.0.2\bin) as the Variable value. Then on System variables edit the Path variable and add (e.g., ;C:\Program Files\Java\jdk-12.0.2\bin) at the end of the variable value.
For Perl we recommend that you download and install the following distribution. Once it is installed don't forget to restart your PC.
Note that perl doesn't need to be installed if you are using Anaconda instead of pure Python distribution.
Perl usually comes with Linux, thus you don't need to install it.
To install JAVA: sudo apt install default-jdk
In addition, if your user does not have execute permissions on the Python lib folder, you should execute the following command: sudo chmod 111 /usr/local/lib//dist-packages/py_heideltime/HeidelTime/TreeTaggerLinux/bin/*
We highly recommend you to use this python notebook if you are interested in playing with py_heideltime when using the standalone version.
from py_heideltime import py_heideltime
text = '''
Thurs August 31st - News today that they are beginning to evacuate the London children tomorrow. Percy is a billeting officer. I can't see that they will be much safer here.
'''
Default language is "English" and document_type is "news" which means that having:
results = py_heideltime(text)
or:
results = py_heideltime(text, language='English', document_type='news')
is exactly the same thing and produces the same results.
Please note that running this on Windows may require using the following code instead:
if __name__ == '__main__':
    results = py_heideltime(text)
The output will be a list of 4 elements or an empty list [] if no temporal expression is found in the text. The four elements are:
- a list of tuples with two positions (e.g., ('XXXX-08-31', 'August 31st')). The first one is the detected temporal expression normalized by heideltime. The second is the temporal expression as it was found in the text;
- a normalized version of the text, where each temporal expression is replaced by its normalized heideltime counterpart;
- a TimeML-annotated version of the text;
- the execution time of the algorithm, divided into heideltime_processing (i.e., the time spent by the heideltime algorithm in extracting temporal expressions) and text_normalization (the time spent by the program in labelling the temporal expressions found in the text with a tag).
TempExpressions = results[0]
TempExpressions
[('XXXX-08-31', 'August 31st'),
('PRESENT_REF', 'today'),
('XXXX-XX-XX', 'tomorrow')]
TextNormalized = results[1]
TextNormalized
'Thurs XXXX-08-31 - News PRESENT_REF that they are beginning to evacuate the London children XXXX-XX-XX. Percy is a billeting officer. I can't see that they will be much safer here.'
TimeML = results[2]
TimeML
'Thurs <TIMEX3 tid="t2" type="DATE" value="XXXX-08-31">August 31st</TIMEX3> - News <TIMEX3 tid="t3" type="DATE" value="PRESENT_REF">today</TIMEX3> that they are beginning to evacuate the London children <TIMEX3 tid="t4" type="DATE" value="XXXX-XX-XX">tomorrow</TIMEX3>. Percy is a billeting officer. I can\'t see that they will be much safer here.'
ExecutionTime = results[3]
ExecutionTime
{'heideltime_processing': 4.341801404953003, 'py_heideltime_text_normalization': 0.0}
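Because the returned value is either a 4-element list or an empty list, code that indexes results[0] directly will fail when no temporal expression is found. The following is a minimal sketch of unpacking the output defensively. The helper `unpack_results` and its dictionary keys are hypothetical names chosen for illustration, and the sample list is an abridged copy of the output shown above rather than a live call to py_heideltime.

```python
def unpack_results(results):
    """Split the py_heideltime output into named parts, tolerating the empty-list case."""
    if not results:
        # No temporal expression was found in the text
        return {"expressions": [], "normalized_text": None, "timeml": None, "timing": {}}
    expressions, normalized_text, timeml, timing = results
    return {
        "expressions": expressions,
        "normalized_text": normalized_text,
        "timeml": timeml,
        "timing": timing,
    }


# Abridged copy of the output shown above (first sentence only)
sample = [
    [("XXXX-08-31", "August 31st"), ("PRESENT_REF", "today"), ("XXXX-XX-XX", "tomorrow")],
    "Thurs XXXX-08-31 - News PRESENT_REF that they are beginning to evacuate the London children XXXX-XX-XX.",
    'Thurs <TIMEX3 tid="t2" type="DATE" value="XXXX-08-31">August 31st</TIMEX3> - News '
    '<TIMEX3 tid="t3" type="DATE" value="PRESENT_REF">today</TIMEX3> that they are beginning to evacuate '
    'the London children <TIMEX3 tid="t4" type="DATE" value="XXXX-XX-XX">tomorrow</TIMEX3>.',
    {"heideltime_processing": 4.341801404953003, "py_heideltime_text_normalization": 0.0},
]

parts = unpack_results(sample)
for normalized, surface in parts["expressions"]:
    print(f"{surface!r} was normalized to {normalized!r}")
```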
Besides running py_heideltime with the default parameters, users can also specify more advanced options. These are:
- date granularity: "full" (the highest possible granularity detected will be retrieved); "year" (YYYY will be retrieved); "month" (YYYY-MM will be retrieved); "day" (YYYY-MM-DD will be retrieved);
- document type: "news" (news-style documents); "narrative" (narrative-style documents (e.g., Wikipedia articles)); "colloquial" (English colloquial (e.g., Tweets and SMS)); "scientific" (scientific articles (e.g., clinical trials));
- document creation time: in the format YYYY-MM-DD.
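To make the date granularity options more concrete, here is a rough sketch of what truncating a normalized value to a given granularity could look like. This is not py_heideltime's actual implementation: the function `truncate_to_granularity` is hypothetical, and leaving values that are not ISO-like dates (e.g., PRESENT_REF) untouched is an assumption made for this illustration.

```python
import re

# Matches YYYY, YYYY-MM, YYYY-MM-DD, optionally followed by a time part after "T"
ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?(?:T.*)?$")


def truncate_to_granularity(value, granularity="full"):
    """Illustrative only: cut an ISO-like date down to year, month, or day precision."""
    if granularity == "full":
        return value
    match = ISO_DATE.match(value)
    if match is None:
        # Assumption: values such as 'PRESENT_REF' are returned unchanged
        return value
    keep = {"year": 1, "month": 2, "day": 3}[granularity]
    parts = match.groups()[:keep]
    # A value that is coarser than the requested granularity keeps what it has
    return "-".join(p for p in parts if p is not None)


print(truncate_to_granularity("1939-08-31", "month"))
```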
results = py_heideltime(text, language='English', date_granularity="day", document_type='news', document_creation_time='1939-08-31')
Please note that running this on Windows may require using the following code instead:
if __name__ == '__main__':
    results = py_heideltime(text, language='English', date_granularity="day", document_type='news', document_creation_time='1939-08-31')
The output follows the same patterns as described above.
TempExpressions = results[0]
TempExpressions
[('1939-08-31', 'August 31st'),
('1939-08-31', 'today'),
('1939-09-01', 'tomorrow')]
TextNormalized = results[1]
TextNormalized
'Thurs 1939-08-31 - News 1939-08-31 that they are beginning to evacuate the London children 1939-09-01. Percy is a billeting officer. I can't see that they will be much safer here.'
TimeML = results[2]
TimeML
'Thurs <TIMEX3 tid="t2" type="DATE" value="1939-08-31">August 31st</TIMEX3> - News <TIMEX3 tid="t3" type="DATE" value="1939-08-31">today</TIMEX3> that they are beginning to evacuate the London children <TIMEX3 tid="t4" type="DATE" value="1939-09-01">tomorrow</TIMEX3>. Percy is a billeting officer. I can\'t see that they will be much safer here.'
ExecutionTime = results[3]
ExecutionTime
{'heideltime_processing': 4.341801404953003, 'text_normalization': 0.0}
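If you need the tid and type attributes that only appear in the TimeML string, one option is to pull them out with a regular expression. The sketch below assumes the attribute order (tid, type, value) seen in the sample output above; a proper XML parser would be more robust if the tags carry other attributes. The helper name `parse_timex3` is hypothetical.

```python
import re

# Assumes the attribute order shown in the sample output: tid, type, value
TIMEX3 = re.compile(
    r'<TIMEX3 tid="([^"]*)" type="([^"]*)" value="([^"]*)">(.*?)</TIMEX3>'
)


def parse_timex3(timeml):
    """Return one dict per TIMEX3 tag found in a TimeML-annotated string."""
    return [
        {"tid": tid, "type": timex_type, "value": value, "text": text}
        for tid, timex_type, value, text in TIMEX3.findall(timeml)
    ]


# Fragment of the TimeML output shown above
timeml_sample = (
    'Thurs <TIMEX3 tid="t2" type="DATE" value="1939-08-31">August 31st</TIMEX3> - News '
    '<TIMEX3 tid="t3" type="DATE" value="1939-08-31">today</TIMEX3>'
)

for tag in parse_timex3(timeml_sample):
    print(tag)
```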
py_heideltime --help
Make sure that the input parameters are within quotes.
Default Parameters:
py_heideltime -t "August 31st"
All the Parameters:
py_heideltime -t "August 31st" -l "English" -dg "full" -dt "News" -dct "1939-08-31"
[required]: either specify a text or an input_file path.
----------------------------------------------------------------------------------------------------------------------------------
-t, --text - Input text.
Example: "August 31st".
-i, --input_file - Text file path.
Example: "c:\text.txt".
[required]
----------------------------------------------------------------------------------------------------------------------------------
-l, --language - Language of the text.
Default: "English"
Options:
"English";
"Portuguese";
"Spanish";
"German";
"Dutch";
"Italian";
"French".
[not required]
-----------------------------------------------------------------------------------------------------------------------------------
-dg, --date_granularity - Date granularity
Default: "full"
Options:
"full": means that all types of granularity will be retrieved, from the coarsest to
the finest-granularity.
"day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
"month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
"year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;
-dt, --document_type - Type of the document text.
Default: "News",
Options:
"News": for news-style documents - default param;
"Narrative": for narrative-style documents (e.g., Wikipedia articles);
"Colloquial": for English colloquial (e.g., Tweets and SMS);
"Scientific": for scientific articles (e.g., clinical trials).
-dct, --document_creation_time - Document creation date in the format YYYY-MM-DD. Taken into account when "News" or
"Colloquial" texts are specified.
Example: "2019-05-30".
--help - Show this message and exit.
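When calling the command line tool from a script, passing the parameters as a list avoids quoting problems. The sketch below only assembles the argument list from the flags documented above; it does not run anything. The function name `build_py_heideltime_command` is hypothetical, and treating text and input_file as mutually exclusive is an assumption based on the "either ... or" wording of the help message.

```python
import shlex


def build_py_heideltime_command(text=None, input_file=None, language="English",
                                date_granularity="full", document_type="News",
                                document_creation_time=None):
    """Assemble the py_heideltime CLI call as an argument list."""
    # Assumption: exactly one of text / input_file should be given
    if (text is None) == (input_file is None):
        raise ValueError("Specify exactly one of text or input_file")
    args = ["py_heideltime"]
    args += ["-t", text] if text is not None else ["-i", input_file]
    args += ["-l", language, "-dg", date_granularity, "-dt", document_type]
    if document_creation_time is not None:
        args += ["-dct", document_creation_time]
    return args


args = build_py_heideltime_command(text="August 31st", document_creation_time="1939-08-31")
# shlex.join renders the list as a shell-safe command line (Python 3.8+)
print(shlex.join(args))
```

The resulting list can be handed to subprocess.run(args) on a machine where py_heideltime is installed.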
Docker image is prepared to work with the following languages: English, German, Dutch, Vietnamese, Arabic, Spanish, Italian, French, Chinese, Russian, Croatian, Estonian and Portuguese.
This github package is prepared to work with the following languages: English, Portuguese, Spanish, German, Dutch, Italian, French.
To use py_heideltime with other languages proceed as follows:
- Download from TreeTagger the parameter files
- gunzip < Downloaded file >
- Copy the extracted file to the module folder /py_heideltime/HeidelTime/TreeTagger< your system >/lib/
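On systems without gunzip (e.g., Windows), the last two steps can be done from Python instead. This is a minimal sketch using only the standard library; the function name `install_parameter_file` and the file name in the commented call are hypothetical, and the library folder must be replaced with the path on your own system.

```python
import gzip
import shutil
from pathlib import Path


def install_parameter_file(gz_path, lib_dir):
    """Decompress a downloaded .gz parameter file directly into the TreeTagger lib folder."""
    gz_path = Path(gz_path)
    lib_dir = Path(lib_dir)
    if not lib_dir.is_dir():
        raise FileNotFoundError(f"Library folder not found: {lib_dir}")
    # Dropping the ".gz" suffix gives the extracted file name
    target = lib_dir / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Hypothetical call; substitute the file you downloaded and your own module folder:
# install_parameter_file("example.par.gz", "/path/to/py_heideltime/HeidelTime/TreeTagger< your system >/lib/")
```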
Please check py_rule_based if you are interested in extracting dates by means of a rule-based model solution.
Please check Time-Matters docker image or github if you are interested in detecting the relevance (score) of dates in a text.
Please cite the appropriate paper when using py_heideltime. In general, this would be:
Strötgen, Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 2013.
Other related papers may be found here: