N Tuples Analyzer and Filters

rdelbru edited this page Sep 7, 2011 · 5 revisions

N-Tuples Analyzer

SIREn provides a generic analyzer, the TupleAnalyzer, for parsing a field that contains N-Tuples data. The TupleAnalyzer is pre-configured to cover most use cases. By default, it uses a StandardAnalyzer to tokenise the Literal cells and applies the following additional filters to normalise the tokens:

  • URITrailingSlashFilter: It normalises URIs by removing trailing slashes.
"http://xmlns.com/foaf/0.1/" -> "http://xmlns.com/foaf/0.1"
  • URINormalisationFilter: It normalises URIs by breaking them down into subwords and by generating multiple variations. In the example below, the full URI is emitted at the same position as the first subword.
"http://xmlns.com/foaf/0.1/name" ->
(position:token)
0:"http",
1:"xmlns.com",
2:"foaf",
3:"0.1",
4:"name",
0:"http://xmlns.com/foaf/0.1/name"
  • LowerCaseFilter: The original Lucene filter that normalises tokens (of type Literal, URIs, etc.) to lower case.
  • StopFilter: The original Lucene filter that removes stop words.
  • LengthFilter: The original Lucene filter that removes tokens shorter than the minimum length (2 by default) or longer than the maximum length (128 by default).
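
The two URI filters above can be illustrated with a small stand-alone sketch. It is written in plain Python and does not use SIREn or Lucene. The function names `strip_trailing_slash` and `uri_variations` are hypothetical, and the separator set (`:`, `/` and `#`) is an assumption chosen to reproduce the example output shown in the list, not SIREn's actual implementation.

```python
import re
from typing import List, Tuple


def strip_trailing_slash(uri: str) -> str:
    """Mimic the URITrailingSlashFilter example: drop trailing slashes."""
    return uri.rstrip("/")


def uri_variations(uri: str) -> List[Tuple[int, str]]:
    """Mimic the URINormalisationFilter example: subwords plus the full URI.

    Returns (position, token) pairs.
    """
    # Split on runs of ':', '/' and '#'. This separator set is an assumption.
    subwords = [part for part in re.split(r"[:/#]+", uri) if part]
    tokens = list(enumerate(subwords))
    # The full URI shares position 0 with the first subword.
    tokens.append((0, uri))
    return tokens


if __name__ == "__main__":
    print(strip_trailing_slash("http://xmlns.com/foaf/0.1/"))
    for position, token in uri_variations("http://xmlns.com/foaf/0.1/name"):
        print(f'{position}:"{token}"')
```

Emitting the full URI at the same position as the first subword means that a query can match either the complete URI or its individual parts.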

The following example helps to visualise the effects of the TupleAnalyzer on one tuple:

Analysing "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" ."
[http] [www] [w3] [org] [1999] [02] [22] [rdf] [syntax] [ns] [type] [http] [xmlns] [com] [foaf] [0.1] [person] [person]
|                                                                   |
[http://www.w3.org/1999/02/22-rdf-syntax-ns#type]                   [http://xmlns.com/foaf/0.1/person]
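
As a rough illustration of how such a trace could arise, the sketch below reproduces the token stream above in plain Python, without SIREn or Lucene. Everything in it is an assumption made for illustration: the `analyse` function, the regular expressions, and the tiny stop list are hypothetical. Note that this trace splits host names on dots (`[xmlns] [com]`) while keeping `0.1` whole, so the sketch follows the trace rather than the coarser subword example given for the URINormalisationFilter.

```python
import re

# Tiny illustrative stop list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "the"}

# A cell is either <uri> or "literal"; the trailing '.' ends the tuple.
CELL = re.compile(r'<([^>]*)>|"([^"]*)"')
# Keep dotted numbers such as 0.1 whole; otherwise split on non-alphanumerics.
WORD = re.compile(r"\d+(?:\.\d+)+|[A-Za-z0-9]+")


def analyse(tuple_line, min_len=2, max_len=128):
    """Return (words, uris): the normalised subwords and the full URIs."""
    words, uris = [], []
    for uri, literal in CELL.findall(tuple_line):
        text = uri or literal
        for word in WORD.findall(text):
            word = word.lower()  # lower-casing step
            if word in STOP_WORDS:  # stop-word removal step
                continue
            if min_len <= len(word) <= max_len:  # length filtering step
                words.append(word)
        if uri:
            uris.append(uri.lower())  # full URI kept as its own token
    return words, uris


if __name__ == "__main__":
    line = ('<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
            '<http://xmlns.com/foaf/0.1/Person> "A Person" .')
    words, uris = analyse(line)
    print(" ".join(f"[{w}]" for w in words))
    print(" ".join(f"[{u}]" for u in uris))
```

The second `[person]` token comes from the literal "A Person": the word "A" is dropped, and the remaining word is lower-cased.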