N Tuples Analyzer and Filters

N-Tuples Analyzer

SIREn provides a generic analyzer, the TupleAnalyzer, for parsing a field that contains N-Tuples data. The TupleAnalyzer is pre-configured to cover most use cases. By default it uses a StandardAnalyzer to tokenise the Literal cells, plus additional filters to normalise the tokens:

URITrailingSlashFilter: It normalises URIs by removing trailing slashes.

"http://xmlns.com/foaf/0.1/" -> "http://xmlns.com/foaf/0.1"
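The normalisation above can be illustrated with a minimal Python sketch. It is not SIREn's implementation: the function name `strip_trailing_slash` is hypothetical, and removing exactly one trailing slash is an assumption made for the example.

```python
def strip_trailing_slash(uri: str) -> str:
    """Return the URI without its final slash, if it has one.

    Hypothetical helper for illustration only; it assumes a single
    trailing slash is removed and everything else is left untouched.
    """
    return uri[:-1] if uri.endswith("/") else uri


if __name__ == "__main__":
    print(strip_trailing_slash("http://xmlns.com/foaf/0.1/"))
```

URIs that do not end with a slash pass through unchanged in this sketch.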

URINormalisationFilter: It normalises URIs by breaking them down into subwords and generating multiple variations.

"http://xmlns.com/foaf/0.1/name" ->
(position:token)
0:"http",
1:"xmlns.com",
2:"foaf",
3:"0.1",
4:"name",
5:"http://xmlns.com/foaf/0.1/name"
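As a rough sketch of how such a token stream could be produced, the Python function below reproduces the positions listed above. The name `uri_variations` and the delimiter set (`://`, `/`, `#`) are assumptions chosen to match this one example. The actual filter may split on more characters.

```python
import re


def uri_variations(uri: str) -> list:
    """Return (position, token) pairs: the subwords, then the full URI.

    Illustrative only. Splitting on "://", "/" and "#" is an assumption
    made to match the documented example, not SIREn's actual rule set.
    """
    # Drop empty pieces so that repeated delimiters produce no tokens.
    parts = [p for p in re.split(r"://|/|#", uri) if p]
    tokens = list(enumerate(parts))
    # The unsplit URI is emitted last, at the next position.
    tokens.append((len(parts), uri))
    return tokens


if __name__ == "__main__":
    for position, token in uri_variations("http://xmlns.com/foaf/0.1/name"):
        print(f'{position}:"{token}"')
```

Keeping the full URI alongside its subwords lets a query match either the exact URI or any one of its parts.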

LowerCaseFilter: The original Lucene filter that normalises tokens (of type Literal, URIs, etc.) to lower case.
StopFilter: The original Lucene filter that removes stop words.
LengthFilter: The original Lucene filter that removes tokens shorter than the minimum length (2 by default) or longer than the maximum length (128 by default).
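The combined effect of these three filters can be sketched in Python as follows. This illustration does not use the Lucene classes themselves: the function name `filter_tokens` and the small `STOP_WORDS` set are assumptions, and only the length bounds (2 and 128) come from the description above.

```python
# Assumed stop-word subset for illustration; the real list is Lucene's.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def filter_tokens(tokens, min_len=2, max_len=128):
    """Lower-case tokens, then drop stop words and out-of-range lengths.

    Hypothetical helper that mimics the chain LowerCaseFilter ->
    StopFilter -> LengthFilter. Length bounds are inclusive.
    """
    kept = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue  # stop word removed
        if not (min_len <= len(token) <= max_len):
            continue  # too short or too long
        kept.append(token)
    return kept


if __name__ == "__main__":
    print(filter_tokens(["A", "Person"]))
```

With these assumptions, the literal "A Person" is reduced to the single token "person".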

The following example helps to visualise the effects of the TupleAnalyzer on one tuple:

Analysing "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" ."
[http] [www] [w3] [org] [1999] [02] [22] [rdf] [syntax] [ns] [type] [http] [xmlns] [com] [foaf] [0.1] [person] [person]
|                                                                   |
[http://www.w3.org/1999/02/22-rdf-syntax-ns#type]                   [http://xmlns.com/foaf/0.1/person]
