
Exact phrase matching? #62

Open
dannydan412 opened this issue Jan 18, 2014 · 23 comments

Comments

@dannydan412

Hi!

Does lunr support exact phrase matching (i.e. using quotation marks in a search)? It doesn't seem like it from my initial research. I'd like to try and add this feature to the project. Could someone please give me some pointers on how to implement this?

@olivernn
Owner

At the moment lunr tries to be "clever" by automatically adding a wildcard at the end of your search terms, e.g. a search for "foo" becomes "foo*".

I'd like to move away from this, for exactly this kind of issue; it is currently not possible to do an exact-match search.

I have some plans to change this, so hold off on implementing anything for now. I need to think through how to implement these changes. I'll be sure to keep you in the loop though, and would be very grateful for any help in making these changes.

@dannydan412
Author

Hi Oliver,
Thanks for getting back to me so quickly! Just to clarify - the current problem with lunr.js is that if I search for a phrase such as '"Hello World"' it would also return documents that contain "Hello Great World".
I'm working on a project with a deadline and I was wondering if you have any ideas for a "quick and dirty" solution that I could implement today. Of course I would not commit the code to GitHub. One thought I had was to rebuild the index when phrases are used in the query. This would change the tokenizer's behavior so that it considers the quotes: if something is in quotes, it would be treated as a single token. What are your thoughts on this approach?
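
A rough sketch of what such a quote-aware tokenization could look like, as a standalone function (hypothetical; this is not lunr's built-in tokenizer):

// Hypothetical quote-aware tokenizer: quoted spans become single tokens,
// everything else is split on whitespace.
var tokenizeWithPhrases = function (str) {
  var tokens = []
  var re = /"([^"]+)"|(\S+)/g // either a quoted phrase or a run of non-space characters
  var match
  while ((match = re.exec(str)) !== null) {
    tokens.push((match[1] || match[2]).toLowerCase())
  }
  return tokens
}

tokenizeWithPhrases('say "Hello World" to everyone')
// => ["say", "hello world", "to", "everyone"]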

@olivernn
Owner

It depends. If you only want exact matches, then you can change this code https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301 so that it doesn't do the expanding; changing it to get would return just that token, not any others that are an extension of the term.

Another potential solution is to create n-grams. Basically, if you had the text "The quick brown fox" you would treat multiple words together as a 'token'. For a bi-gram, n = 2, you would end up with tokens "The quick", "quick brown", "brown fox" etc. You could extend this to a greater value of n, depending on the kind of results you get back. Take a look at adding a processor to the pipeline to do this.

Another idea (not fully thought through) would be to use several instances of lunr together. Maybe one with the n-gram index and another with the regular index, or even another with the token expanding, etc.

Sorry I can't be of much help here. The changes I've been thinking about alter the way the indexing works in a fairly substantial way and I need to fully understand the implications, hence it is taking a little while! Personally I wouldn't worry too much about posting your "quick and dirty" code to GitHub. Create a fork of this project and do whatever you like there, and let me know how you get along!

P.S. If you're interested in this kind of thing I can recommend taking a look at - http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf, it might give you a few ideas.

@dannydan412
Author

The n-gram approach sounds like an interesting solution. Isn't using the pipeline too late in this case, though? It operates on tokens, so anything I add to the pipeline would operate on single words. Or am I missing something?

@olivernn
Owner

A pipeline function gets called with three arguments: a token, the index of that token, and all the tokens, so you should be able to do what you want with a pipeline function.

http://lunrjs.com/docs/#Pipeline

@dannydan412
Author

Isn't it too late to add tokens to the list when the pipeline function gets called? The parent method won't iterate through these newly added objects and so they never get copied to the "final" list of tokens.

@olivernn
Owner

Whatever you return from the pipeline function is used as the input to the next.

var pipeline = new lunr.Pipeline
var bigram = function (token, idx, tokens) {
  return token + " " + tokens[idx + 1]
}

pipeline.add(bigram)
pipeline.run(["The", "quick", "brown", "fox"]) // ["The quick", "quick brown", "brown fox", "fox undefined"]

You would probably have to do something about the undefined.

So if you have this bigram function at the end of your pipeline it will spit out the bigrams, which will then be indexed and searchable. Unless I'm missing something!

idx.pipeline.add(bigram)

@dannydan412
Author

The problem is in this case I want the index to contain:
"Quick", "Brown", "Fox", "Quick Brown", "Brown Fox"
And the pipeline can only return a single token.

@olivernn
Owner

Ah, yes, I see now, sorry for the confusion.

You would need two separate instances of lunr in that case, and your code would have to do the search twice and combine the results.

Sorry I haven't been able to help you much with this problem!
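
A sketch of how that two-index combination could look, assuming two hypothetical indexes wordIdx and bigramIdx built over the same documents; one simple way to combine is to intersect the two result sets:

// Sketch: run the query against both indexes and keep only the documents
// that appear in both result sets. wordIdx and bigramIdx are hypothetical
// lunr indexes built over the same documents (the bigram index is assumed
// to apply the same bigram processing to the query).
var phraseSearch = function (phrase) {
  var wordRefs = wordIdx.search(phrase).map(function (r) { return r.ref })
  var bigramRefs = bigramIdx.search(phrase).map(function (r) { return r.ref })

  return wordRefs.filter(function (ref) {
    return bigramRefs.indexOf(ref) !== -1
  })
}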

@dannydan412
Author

You've been extremely helpful! I was able to achieve a similar result by modifying the tokenizer.

@olivernn
Owner

Cool, out of interest, what modifications did you make?

@dannydan412
Author

Let me clean up the code a little bit and I'll post it here.
Here's a snippet from the quick and dirty version:
https://gist.github.com/dannydan412/8564158
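
The gist itself isn't reproduced here; as a rough illustration only, a tokenizer that emits both the single words and the bigrams might look something like this (a hypothetical sketch, not the code from the gist):

// Hypothetical tokenizer producing unigrams plus bigrams.
var unigramBigramTokenizer = function (str) {
  var words = str.toString().toLowerCase().split(/\s+/)
  var tokens = words.slice()
  for (var i = 0; i < words.length - 1; i++) {
    tokens.push(words[i] + " " + words[i + 1])
  }
  return tokens
}

unigramBigramTokenizer("Quick Brown Fox")
// => ["quick", "brown", "fox", "quick brown", "brown fox"]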

@dannydan412
Author

Hey Oliver,

Have you considered adding fuzzy matching support to lunr?

@olivernn
Owner

The latest version (2.0.x) of Lunr supports exact phrase matching and fuzzy matching, more info in the guides.
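
In 2.x a bare search term is no longer expanded with a trailing wildcard; wildcards and fuzzy matching are explicit in the query. Assuming idx is a built lunr 2.x index, a few examples:

idx.search("foo")    // matches the term "foo" (no automatic trailing wildcard)
idx.search("foo*")   // explicit trailing wildcard: "foo", "food", "football", ...
idx.search("foo~1")  // fuzzy match: terms within an edit distance of 1 of "foo"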

@wdiego

wdiego commented Jun 28, 2017

Hey @olivernn, I couldn't find the "exact phrase matching" support that you mentioned in the Lunr guides. Can you show me where to find this in the guides?

@olivernn
Owner

@wdiego ah, now that I re-read this issue, I see that I was confused. I must've thought this issue was about exact term matching, which is now supported. Phrase matching, i.e. "foo bar", is not currently supported, sorry to mislead.

@928PJY

928PJY commented Aug 11, 2017

Hi @olivernn, so is there any plan to support phrase matching?

@olivernn
Owner

@928PJY I want to support it, I just don't know how to implement it in an efficient way yet. I'll re-open this issue.

@olivernn olivernn reopened this Aug 15, 2017
@928PJY

928PJY commented Aug 16, 2017

OK! Thank you @olivernn, if I have any idea, I will let you know!

@jacksongs

Hello @olivernn, I've been trying to use your code from January 2014 above to offer two-word exact phrase matching, but I can't seem to reconcile it with the docs. Wondering if there has been a change since v2. Can you offer any advice?

My intention is to create an index with both single word and two word tokens.

@bengry

bengry commented Oct 28, 2019

@olivernn I'm not sure if this issue covers my use-case, but I thought of asking here before creating a new issue - does lunr support exact phrase matching at the moment, or can I add it using a plugin (.use()) externally? So far I wasn't able to get it working.
To clarify, what I want is that for the following list of texts:

[
  "foo bar",
  "bar foo",
  "foo bar baz",
  "bar que foo",
]

searching for "foo bar" should only return indexes 0 and 2. Index 1 doesn't match since the order is wrong (I searched for "foo bar" but it has "bar foo"), and index 3 doesn't match since it has the word que in between.

I'm using the latest version of lunr as of now (2.3.8).

@biosocket

Couldn't exact-phrase matching be achieved using the position metadata?
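
A rough sketch of that idea in lunr 2.x, assuming the index was built with this.metadataWhitelist = ['position'] and that the original field text is still available per document (documents and the field name "text" are hypothetical here):

// Sketch: post-filter lunr results for an exact phrase. The position metadata
// gives [start, length] character offsets of each matched term in the original
// text; we check whether the raw phrase actually starts at one of them.
var phraseResults = function (idx, documents, phrase, field) {
  var needle = phrase.toLowerCase()
  return idx.search(phrase).filter(function (r) {
    var text = documents[r.ref][field].toLowerCase()
    var meta = r.matchData.metadata
    return Object.keys(meta).some(function (term) {
      var positions = (meta[term][field] || {}).position || []
      return positions.some(function (pos) {
        return text.substr(pos[0], needle.length) === needle
      })
    })
  })
}

phraseResults(idx, documents, "foo bar", "text") // only the documents that contain "foo bar"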

@georg-d

georg-d commented May 12, 2022

For me as a user of a site using lunr / antora, the non-existent "exact phrase search" causes massive redundant manual search effort: for example, configuration file names contain sub-terms / parts that are fairly common component names, so searching foo bar.ssl.bar produces massive amounts of results for foo and bar – all these results need to be checked manually to see whether they really contain the search phrase. In the end, I find out no document contains foo bar.ssl.bar, and all the results only exist because of the automatic "convenience" of not doing an "inconvenient" exact phrase search but a "sub-term search including stemming", so your good intention causes the opposite result 🙉☹ To make the effect tangible: if I had exact phrase search, a task would take 10 seconds instead of the current 10 minutes. Sadly, site: searches in Google etc. also fail because that site has several versions of the same document and Google only searches one of them.

Seeing that this issue is 8 years old: I'd already be happy with a very simple/naive approach, e.g. a toggle like surrounding the search phrase in "" to turn on an exact phrase search mode which does not use any index but crawls live over the plain text like find/grep – while this is technically slower than using an index, it would still finish within 1-1000 milliseconds and the user's task would be completed much more quickly.
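
A minimal sketch of that kind of fallback (documents here is a hypothetical array of { id, text } objects; everything else is a regular lunr index):

// Sketch: if the query is wrapped in quotes, bypass the index and do a plain
// case-insensitive substring scan, otherwise fall back to the normal lunr search.
var search = function (query) {
  var quoted = query.match(/^"(.+)"$/)
  if (quoted) {
    var needle = quoted[1].toLowerCase()
    return documents.filter(function (doc) {
      return doc.text.toLowerCase().indexOf(needle) !== -1
    })
  }
  return idx.search(query)
}

search('"foo bar.ssl.bar"') // exact phrase, grep-style scan over the raw text
search('foo bar')           // regular lunr search

In a real implementation the two branches would need to return results in the same shape.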

Related, but not the same: issue #33, where less exact search terms produce higher scores than exact terms.
