Flexible service to tokenize Latin texts.
Add this line to your application's Gemfile:
gem 'llt-tokenizer'
And then execute:
$ bundle
Or install it yourself as:
$ gem install llt-tokenizer
The LLT's Tokenizer makes use of stem dictionaries. Refer to these instructions on how to set one up.
require 'llt/tokenizer'
t = LLT::Tokenizer.new
tokens = t.tokenize('Arma virumque cano.')
tokens.map(&:to_s)
# => ["Arma", "-que", "virum", "cano", "."]
The Tokenizer accepts several options, either on creation or in a call to #tokenize:
# shifting determines if enclitics shall be moved to
# their functional position
t = LLT::Tokenizer.new(shifting: true)
tokens = t.tokenize('In eoque arma cano.')
tokens.map(&:to_s)
# => ["-que", "In", "eo", "arma", "cano", "."]
# all options can be passed directly to #tokenize to override
# the default options
tokens = t.tokenize('In eoque arma cano.', shifting: false)
tokens.map(&:to_s)
# => ["In", "eo", "-que", "arma", "cano", "."]
# enclitics_marker takes a string that is prefixed to split-off enclitics
t = LLT::Tokenizer.new(enclitics_marker: '--', shifting: false)
tokens = t.tokenize('Arma virumque cano.')
tokens.map(&:to_s)
# => ["Arma", "virum", "--que", "cano", "."]
# splitting of enclitics can be disabled altogether
tokens = t.tokenize('Arma virumque cano.', splitting: false)
tokens.map(&:to_s)
# => ["Arma", "virumque", "cano", "."]
# indexing determines if each token shall receive a consecutive id
tokens = t.tokenize('Arma virumque cano.', indexing: true)
tokens.first.id # => 1
tokens = t.tokenize('Arma virumque cano.', indexing: false)
tokens.first.id # => nil
# merging joins lemmata that often appear with orthographic
# inconsistencies into a single token
tokens = t.tokenize('Quam diu cano?', merging: true)
tokens.map(&:to_s)
# => ["Quamdiu", "cano", "?"]
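To make the interplay of splitting, shifting and enclitics_marker more concrete, here is a minimal, self-contained sketch that reproduces the example outputs above. It is not the gem's implementation: toy_tokenize, EXCEPTIONS and PREPOSITIONS are hypothetical names, the hard-coded word lists stand in for the stem dictionaries, only -que is handled, and the rule that a shifted enclitic also moves in front of a preceding preposition is inferred from the examples rather than taken from the source.

```ruby
# Toy illustration only: NOT how llt-tokenizer works internally.
# Assumption: a word ending in "que" carries the enclitic unless it is
# in a small exception list (the real gem consults stem dictionaries).
EXCEPTIONS   = %w[atque quoque neque itaque].freeze
PREPOSITIONS = %w[in ad ex cum].freeze

def toy_tokenize(text, splitting: true, shifting: true, enclitics_marker: '-')
  # Separate runs of letters from single punctuation marks.
  words = text.scan(/\p{L}+|[[:punct:]]/)

  words.each_with_object([]) do |word, tokens|
    if splitting && word =~ /\A(\p{L}+)que\z/i && !EXCEPTIONS.include?(word.downcase)
      host     = Regexp.last_match(1)
      enclitic = "#{enclitics_marker}que"

      if shifting
        # Put the enclitic in front of its host word, and in front of a
        # directly preceding preposition as well (inferred behaviour).
        position = tokens.length
        position -= 1 if tokens.last && PREPOSITIONS.include?(tokens.last.downcase)
        tokens.insert(position, enclitic)
        tokens << host
      else
        # Keep the enclitic where it was written: right after its host.
        tokens << host << enclitic
      end
    else
      tokens << word
    end
  end
end

p toy_tokenize('Arma virumque cano.')
# => ["Arma", "-que", "virum", "cano", "."]
p toy_tokenize('In eoque arma cano.')
# => ["-que", "In", "eo", "arma", "cano", "."]
p toy_tokenize('In eoque arma cano.', shifting: false)
# => ["In", "eo", "-que", "arma", "cano", "."]
```

The exception list shows why the real tokenizer needs dictionaries: a purely suffix-based rule cannot tell virumque (virum + -que) from words like atque, where que is part of the word itself.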
The returned items are instances of LLT::Token, which can be marked up in a variety of forms:
t = LLT::Tokenizer.new(shifting: false, indexing: true)
tokens = t.tokenize('Arma virumque cano.')
tokens.map(&:to_xml)
# => ["<w>arma</w>", "<w>virum</w>", "<w>-que</w>", "<w>cano</w>", "<pc>.</pc>"]
Standard TEI XML markup is used: w tags for word tokens, pc tags for punctuation. The #to_xml method accepts options of its own; for full coverage see TODO.
puts tokens.map { |token| token.to_xml(indexing: true) }
# <w n="1">Arma</w>
# <w n="2">virum</w>
# <w n="3">-que</w>
# <w n="4">cano</w>
# <pc n="5">.</pc>
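The markup convention itself is simple enough to sketch without the gem. The following stand-alone snippet wraps plain strings in w or pc elements and optionally adds an n attribute, mirroring the output shown above. toy_to_xml is a hypothetical helper, not the gem's #to_xml, and its punctuation test (a token consisting only of punctuation characters) is an assumption made for illustration.

```ruby
require 'cgi'

# Toy illustration only: NOT LLT::Token#to_xml.
# Wraps a token string in a TEI-style element: <pc> for punctuation,
# <w> for everything else, with an optional n attribute for the index.
def toy_to_xml(token, index = nil)
  tag = token.match?(/\A[[:punct:]]+\z/) ? 'pc' : 'w'
  n   = index ? %( n="#{index}") : ''
  # Escape characters that are special in XML (&, <, >, quotes).
  "<#{tag}#{n}>#{CGI.escapeHTML(token)}</#{tag}>"
end

tokens = ['Arma', 'virum', '-que', 'cano', '.']
puts tokens.each_with_index.map { |token, i| toy_to_xml(token, i + 1) }
# <w n="1">Arma</w>
# <w n="2">virum</w>
# <w n="3">-que</w>
# <w n="4">cano</w>
# <pc n="5">.</pc>
```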
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request