N Tuples Analyzer and Filters
rdelbru edited this page Sep 7, 2011 · 5 revisions
SIREn provides a generic analyzer, the TupleAnalyzer, for parsing a field that contains N-Tuples data. The TupleAnalyzer is pre-configured to cover most use cases. By default, it integrates a StandardAnalyzer for tokenising the Literal cells, and the following additional filters for normalising the tokens:
- URITrailingSlashFilter: It normalises URIs by removing trailing slashes.
"http://xmlns.com/foaf/0.1/" -> "http://xmlns.com/foaf/0.1"
- URINormalisationFilter: It normalises URIs by breaking them down into subwords and by generating multiple variations.
"http://xmlns.com/foaf/0.1/name" ->
(position:token)
0:"http",
1:"xmlns.com",
2:"foaf",
3:"0.1",
4:"name",
5:"http://xmlns.com/foaf/0.1/name"
- LowerCaseFilter: The original Lucene filter that normalises tokens (of type Literal, URIs, etc.) to lower case.
- StopFilter: The original Lucene filter that removes stop words.
- LengthFilter: The original Lucene filter that removes tokens that are too short (default threshold: 2 characters) or too long (default threshold: 128 characters).
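The two URI filters in the list above can be illustrated with a short Python sketch. This is not SIREn's Java implementation: the function names `strip_trailing_slash` and `uri_variations` are hypothetical, and the splitting rule (scheme, host, path segments, then the full URI) is an assumption chosen only to reproduce the position:token listing shown for URINormalisationFilter.

```python
from urllib.parse import urlsplit


def strip_trailing_slash(uri):
    """Drop one trailing slash, approximating the effect described
    for URITrailingSlashFilter (assumption: a single slash is removed)."""
    return uri[:-1] if uri.endswith("/") else uri


def uri_variations(uri):
    """Return (position, token) pairs: the URI subwords followed by the
    full URI. The splitting rule is an illustrative assumption."""
    parts = urlsplit(uri)
    # Scheme and host first, then every non-empty path segment.
    subwords = [parts.scheme, parts.netloc]
    subwords += [segment for segment in parts.path.split("/") if segment]
    subwords = [word for word in subwords if word]
    # The complete URI is kept as a final variation.
    return list(enumerate(subwords + [uri]))


if __name__ == "__main__":
    print(strip_trailing_slash("http://xmlns.com/foaf/0.1/"))
    for position, token in uri_variations("http://xmlns.com/foaf/0.1/name"):
        print(f'{position}:"{token}"')
```

Keeping the full URI as a token next to its subwords is what allows both exact-URI matches and keyword-style matches on parts of the URI.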
The following example helps to visualise the effects of the TupleAnalyzer on one tuple:
Analysing "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" ."
[http] [www] [w3] [org] [1999] [02] [22] [rdf] [syntax] [ns] [type] [http] [xmlns] [com] [foaf] [0.1] [person] [person]
| |
[http://www.w3.org/1999/02/22-rdf-syntax-ns#type] [http://xmlns.com/foaf/0.1/person]
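To make the literal part of this example concrete, here is a rough Python sketch of how the tuple could be split into cells and how the cell "A Person" could end up as the single token [person]. It is only an approximation: the regular expressions, the tiny stop-word set, the length bounds being inclusive, and the function names `split_cells` and `analyse_literal` are all assumptions. The real TupleAnalyzer relies on Lucene's StandardAnalyzer, StopFilter and LengthFilter.

```python
import re

# Matches either a <URI> cell or a "literal" cell.
# Simplification: escaped quotes inside literals are not handled.
CELL = re.compile(r'<([^>]*)>|"([^"]*)"')

STOP_WORDS = {"a", "an", "the"}  # illustrative subset, not Lucene's list
MIN_LEN, MAX_LEN = 2, 128        # defaults quoted above; inclusive bounds assumed


def split_cells(tuple_line):
    """Split one tuple line into (kind, value) cells."""
    cells = []
    for match in CELL.finditer(tuple_line):
        if match.group(1) is not None:
            cells.append(("uri", match.group(1)))
        else:
            cells.append(("literal", match.group(2)))
    return cells


def analyse_literal(text):
    """Lower-case, tokenise, drop stop words, then apply the length bounds."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]


if __name__ == "__main__":
    line = ('<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
            '<http://xmlns.com/foaf/0.1/Person> "A Person" .')
    for kind, value in split_cells(line):
        print(kind, value)
    print(analyse_literal("A Person"))
```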