Removed shingled query analyzer
This analyzer is supposed to incorporate bigrams (pairs of adjacent words) into
the search query.

This is useful because part of the meaning of a sentence comes from the word
order, for example: "book a driving test for someone else" vs "driving someone
else for a test book".

However, this code never worked as intended because the shingle filter was only
applied to queries. A query was broken down into single words and bigrams, but
those tokens were compared against analyzed document text that contained no
bigrams at all.

This means the bigram tokens could never match anything, so the bigram part of
the query only ever behaved as a single word match. This is very confusing when
trying to understand what Rummager is doing.
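
The mismatch can be illustrated with a minimal sketch. The two helper methods
below are simplified stand-ins written for this example, not Rummager code or
the real Elasticsearch analyzers: they skip the stop word and stemming filters
and only model "single words" versus "single words plus bigrams".

```ruby
# Simplified stand-in for the field analysis: lowercase single-word tokens only.
def unigram_tokens(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Simplified stand-in for a shingle filter: single words plus adjacent-word bigrams.
def shingled_tokens(text)
  words = unigram_tokens(text)
  bigrams = words.each_cons(2).map { |pair| pair.join(" ") }
  words + bigrams
end

# The document side contains no bigrams; only the query side is shingled.
indexed_tokens = unigram_tokens("Book a driving test for someone else")
query_tokens = shingled_tokens("driving test")

matching = query_tokens.select { |token| indexed_tokens.include?(token) }
unmatched = query_tokens - matching

puts "matching:  #{matching.inspect}"
puts "unmatched: #{unmatched.inspect}"
```

Under these assumptions the bigram token is the only one left unmatched, so the
shingled query contributes nothing beyond what a plain single word match gives.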

I've changed it to use the normal query analyzer. This will change results
slightly, because the shingles analyzer didn't include synonyms, but the new
analyzer does.
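
As a rough sketch of what this amounts to at the query level: the method below
is a hypothetical, standalone stand-in for the renamed `match_any_terms` (it
takes the search term and analyzer as arguments and skips the `escape` and
`synonym_field` helpers). The analyzer name `example_query_analyzer` is an
assumed placeholder, since the value returned by `query_analyzer` is not shown
in this commit.

```ruby
# Hypothetical standalone version of the query fragment, for illustration only.
def match_any_terms(search_term, fields, analyzer)
  {
    multi_match: {
      query: search_term,
      operator: "or",
      fields: fields,
      analyzer: analyzer,
    }
  }
end

fields = %w(title acronym description indexable_content)

before_change = match_any_terms("driving test", fields, "shingled_query_analyzer")
# "example_query_analyzer" is a placeholder name, not taken from the schema.
after_change = match_any_terms("driving test", fields, "example_query_analyzer")

# Collect the keys whose values differ between the two query fragments.
changed = before_change[:multi_match].reject do |key, value|
  after_change[:multi_match][key] == value
end

puts changed.keys.inspect
```

In this sketch the only key that differs is `analyzer`; the query text,
operator and field list are unchanged.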

Bigram matching was implemented properly as part of the 'new weighting' code a
couple of years ago, but it never went live. This is something that could be
revisited in future.
MatMoore committed Jan 5, 2018
1 parent dff183d commit 806a68c
Showing 4 changed files with 3 additions and 15 deletions.
6 changes: 0 additions & 6 deletions config/schema/elasticsearch_schema.yml
@@ -70,12 +70,6 @@ index:
     filter: [standard, lowercase, old_synonym, stop, stemmer_override, stemmer_english]
     char_filter: [normalize_quotes, strip_quotes]
 
-  # Analyzer used at query time for old-style shingle matching.
-  shingled_query_analyzer:
-    type: custom
-    tokenizer: standard
-    filter: [standard, asciifolding, lowercase, stop, stemmer_override, stemmer_english, old_shingles]
-
   # An analyzer for doing "exact" word matching (but stripping wrapping whitespace, and case insensitive).
   exact_match:
     type: custom
2 changes: 1 addition & 1 deletion lib/search/query_builder.rb
@@ -56,7 +56,7 @@ def query
     core_query.match_phrase("description"),
     core_query.match_phrase("indexable_content"),
     core_query.match_all_terms(%w(title acronym description indexable_content)),
-    core_query.match_bigrams(%w(title acronym description indexable_content)),
+    core_query.match_any_terms(%w(title acronym description indexable_content)),
     core_query.minimum_should_match("all_searchable_text")
   ],
 }
4 changes: 2 additions & 2 deletions lib/search/query_components/core_query.rb
@@ -97,15 +97,15 @@ def match_all_terms(fields)
   }
 end
 
-def match_bigrams(fields)
+def match_any_terms(fields)
   fields = fields.map { |f| synonym_field(f) }
 
   {
     multi_match: {
       query: escape(search_term),
       operator: "or",
       fields: fields,
-      analyzer: "shingled_query_analyzer"
+      analyzer: query_analyzer
     }
   }
 end
6 changes: 0 additions & 6 deletions spec/integration/schema/stemming_spec.rb
@@ -18,12 +18,6 @@
     "It's, It’s Mr. O'Neill" => %w(it it mr oneil)
 end
 
-it "shingled query analyzer" do
-  expect_tokenisation :shingled_query_analyzer,
-    "Hello Hallo" => ["hello", "hello hallo", "hallo"],
-    "H'lo ’Hallo" => ["h'lo", "h'lo hallo", "hallo"]
-end
-
 it "exact match" do
   expect_tokenisation :exact_match,
     "It’s A Small W'rld" => ["it's a small w'rld"]