
Exact phrase matching? #62

Open
dannydan412 opened this issue Jan 18, 2014 · 23 comments

Comments

@dannydan412

Hi!

Does lunr support exact phrase matching (i.e. using quotation marks in a search)? It doesn't seem like it from my initial research. I'd like to try and add this feature to the project. Could someone please give me some pointers on how to implement this?

@olivernn
Owner

At the moment lunr tries to be "clever" by automatically adding a wildcard at the end of your search terms, e.g. a search for "foo" becomes "foo*".

I'd like to move away from this, for exactly this kind of issue; it is currently not possible to do an exact-match search.

I have some plans to change this, so hold off on implementing anything for now. I need to think through how to implement these changes. I'll be sure to keep you in the loop though, and would be very grateful for any help in making these changes.

@dannydan412
Author

Hi Oliver,
Thanks for getting back to me so quickly! Just to clarify - the current problem with lunr.js is that if I search for a phrase such as '"Hello World"' it would also return documents that contain "Hello Great World".
I'm working on a project with a deadline and I was wondering if you have any ideas for a "quick and dirty" solution that I could implement today. Of course I would not commit the code to GitHub. One thought I had was to rebuild the index when phrases are used in the query. This would change the tokenizer's behavior so that it considers the quotes: if something is in quotes, it would be treated as a single token. What are your thoughts on this approach?
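
A rough sketch of what such a quote-aware tokenization could look like, as a standalone function (hypothetical; this is not lunr's built-in tokenizer):

// Hypothetical quote-aware tokenizer: quoted spans become single tokens,
// everything else is split on whitespace.
var tokenizeWithPhrases = function (str) {
  var tokens = []
  var re = /"([^"]+)"|(\S+)/g // either a quoted phrase or a run of non-space characters
  var match
  while ((match = re.exec(str)) !== null) {
    tokens.push((match[1] || match[2]).toLowerCase())
  }
  return tokens
}

tokenizeWithPhrases('say "Hello World" to everyone')
// => ["say", "hello world", "to", "everyone"]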

@olivernn
Owner

It depends. If you only want exact matches, then you can change this code https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301 so that it doesn't do the expanding; changing it to get would return just that token, not any others that are an extension of the term.

Another potential solution is to create n-grams. Basically, if you had the text "The quick brown fox" you would treat multiple words together as a 'token'. For a bi-gram, n = 2, you would end up with tokens "The quick", "quick brown", "brown fox" etc. You could extend this to a greater value of n, depending on the kind of results you get back. Take a look at adding a processor to the pipeline to do this.

Another idea (not fully thought through) would be to use several instances of lunr together. Maybe one with the n-gram index and another with the regular index, or even another with the token expanding, etc.

Sorry I can't be of much help here. The changes I've been thinking about alter the way the indexing works in a fairly substantial way and I need to fully understand the implications, hence it is taking a little while! Personally I wouldn't worry too much about posting your "quick and dirty" code to GitHub. Create a fork of this project and do whatever you like there, and let me know how you get along!

P.S. If you're interested in this kind of thing I can recommend taking a look at - http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf, it might give you a few ideas.

@dannydan412
Author

The n-gram approach sounds like an interesting solution. Isn't using the pipeline too late in this case, though? It operates on tokens, so anything I add to the pipeline would operate on single words. Or am I missing something?

@olivernn
Owner

A pipeline function gets called with three arguments: a token, the index of that token, and all the tokens, so you should be able to do what you want with a pipeline function.

http://lunrjs.com/docs/#Pipeline

@dannydan412
Author

Isn't it too late to add tokens to the list when the pipeline function gets called? The parent method won't iterate through these newly added objects and so they never get copied to the "final" list of tokens.

@olivernn
Owner

Whatever you return from the pipeline function is used as the input to the next.

var pipeline = new lunr.Pipeline
var bigram = function (token, idx, tokens) {
  return token + " " + tokens[idx + 1]
}

pipeline.add(bigram)
pipeline.run(["The", "quick", "brown", "fox"]) // ["The quick", "quick brown", "brown fox", "fox undefined"]

You would probably have to do something about the undefined.

So if you have this bigram function at the end of your pipeline it will spit out the bigrams, which will then be indexed and searchable. Unless I'm missing something!

idx.pipeline.add(bigram)

@dannydan412
Author

The problem is in this case I want the index to contain:
"Quick", "Brown", "Fox", "Quick Brown", "Brown Fox"
And the pipeline can only return a single token.

@olivernn
Owner

Ah, yes, I see now, sorry for the confusion.

You would need two separate instances of lunr in that case, and your code would have to do the search twice and combine the results.

Sorry I haven't been able to help you much with this problem!
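
A sketch of how that two-index combination could look, assuming two hypothetical indexes wordIdx and bigramIdx built over the same documents; one simple way to combine is to intersect the two result sets:

// Sketch: run the query against both indexes and keep only the documents
// that appear in both result sets. wordIdx and bigramIdx are hypothetical
// lunr indexes built over the same documents (the bigram index is assumed
// to apply the same bigram processing to the query).
var phraseSearch = function (phrase) {
  var wordRefs = wordIdx.search(phrase).map(function (r) { return r.ref })
  var bigramRefs = bigramIdx.search(phrase).map(function (r) { return r.ref })

  return wordRefs.filter(function (ref) {
    return bigramRefs.indexOf(ref) !== -1
  })
}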

@dannydan412
Author

You've been extremely helpful! I was able to achieve a similar result by modifying the tokenizer.

@olivernn
Owner

Cool, out of interest, what modifications did you make?

@dannydan412
Author

Let me clean up the code a little bit and I'll post it here.
Here's a snippet from the quick and dirty version:
https://gist.github.com/dannydan412/8564158
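
The gist itself isn't reproduced here; as a rough illustration only, a tokenizer that emits both the single words and the bigrams might look something like this (a hypothetical sketch, not the code from the gist):

// Hypothetical tokenizer producing unigrams plus bigrams.
var unigramBigramTokenizer = function (str) {
  var words = str.toString().toLowerCase().split(/\s+/)
  var tokens = words.slice()
  for (var i = 0; i < words.length - 1; i++) {
    tokens.push(words[i] + " " + words[i + 1])
  }
  return tokens
}

unigramBigramTokenizer("Quick Brown Fox")
// => ["quick", "brown", "fox", "quick brown", "brown fox"]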

@dannydan412
Author

Hey Oliver,

Have you considered adding fuzzy matching support to lunr?

@olivernn
Owner

The latest version (2.0.x) of Lunr supports exact phrase matching and fuzzy matching, more info in the guides.
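
In 2.x a bare search term is no longer expanded with a trailing wildcard; wildcards and fuzzy matching are explicit in the query. Assuming idx is a built lunr 2.x index, a few examples:

idx.search("foo")    // matches the term "foo" (no automatic trailing wildcard)
idx.search("foo*")   // explicit trailing wildcard: "foo", "food", "football", ...
idx.search("foo~1")  // fuzzy match: terms within an edit distance of 1 of "foo"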

@wdiego

wdiego commented Jun 28, 2017

Hey @olivernn, I couldn't find the "exact phrase matching" support that you mentioned in the Lunr guides. Can you show me where to find this in the guides?

@olivernn
Owner

@wdiego ah, now that I re-read this issue, I see that I was confused. I must've thought this issue was about exact term matching, which is now supported. Phrase matching, i.e. "foo bar", is not currently supported, sorry to mislead.

@928PJY

928PJY commented Aug 11, 2017

Hi @olivernn, so is there any plan to support phrase matching?

@olivernn
Owner

@928PJY I want to support it, I just don't know how to implement it in an efficient way yet. I'll re-open this issue.

@olivernn olivernn reopened this Aug 15, 2017
@928PJY

928PJY commented Aug 16, 2017

OK! Thank you @olivernn, if I have any idea, I will let you know!

@jacksongs

Hello @olivernn, I've been trying to use your code from January 2014 above to offer two-word exact phrase matching, but I can't seem to reconcile it with the docs. Wondering if there has been a change since v2. Can you offer any advice?

My intention is to create an index with both single word and two word tokens.

@bengry

bengry commented Oct 28, 2019

@olivernn I'm not sure if this issue covers my use-case, but I thought of asking here before creating a new issue - does lunr support exact phrase matching at the moment, or can I add it using a plugin (.use()) externally? So far I wasn't able to get it working.
To clarify, what I want is that for the following list of texts:

[
  "foo bar",
  "bar foo",
  "foo bar baz",
  "bar que foo",
]

searching for "foo bar" should only return indexes 0 and 2. Index 1 doesn't match since the order is wrong (I searched for "foo bar" but it has "bar foo"), and index 3 doesn't match since it has the word que in between.

I'm using the latest version of lunr as of now (2.3.8).

@biosocket

Couldn't exact-phrase matching be achieved using the position metadata?
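
A rough sketch of that idea in lunr 2.x, assuming the index was built with this.metadataWhitelist = ['position'] and that the original field text is still available per document (documents and the field name "text" are hypothetical here):

// Sketch: post-filter lunr results for an exact phrase. The position metadata
// gives [start, length] character offsets of each matched term in the original
// text; we check whether the raw phrase actually starts at one of them.
var phraseResults = function (idx, documents, phrase, field) {
  var needle = phrase.toLowerCase()
  return idx.search(phrase).filter(function (r) {
    var text = documents[r.ref][field].toLowerCase()
    var meta = r.matchData.metadata
    return Object.keys(meta).some(function (term) {
      var positions = (meta[term][field] || {}).position || []
      return positions.some(function (pos) {
        return text.substr(pos[0], needle.length) === needle
      })
    })
  })
}

phraseResults(idx, documents, "foo bar", "text") // only the documents that contain "foo bar"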

@georg-d

georg-d commented May 12, 2022

For me as a user of a site using lunr / antora, the non-existent "exact phrase search" causes massive redundant manual search effort: for example, configuration file names contain sub-terms / parts that are fairly common component names, so searching foo bar.ssl.bar produces massive amounts of results for foo and bar – all these results need to be checked manually to see whether they really contain the search phrase. In the end, I find out no document contains foo bar.ssl.bar, and all the results only exist because of the automatic "convenience" of not doing an "inconvenient" exact phrase search but a "sub-term search including stemming", so your good intention causes the opposite result 🙉☹ To make the effect tangible: if I had exact phrase search, a task would take 10 seconds instead of the current 10 minutes. Sadly, site: searches in Google etc. also fail because that site has several versions of the same document and Google only searches one of them.

Seeing that this issue is 8 years old: I'd already be happy with a very simple/naive approach, e.g. a toggle like surrounding the search phrase in "" to turn on an exact phrase search mode which does not use any index but crawls live over the plain text like find/grep – while this is technically slower than using an index, it would still finish within 1-1000 milliseconds and the user's task would be completed much more quickly.
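
A minimal sketch of that kind of fallback (documents here is a hypothetical array of { id, text } objects; everything else is a regular lunr index):

// Sketch: if the query is wrapped in quotes, bypass the index and do a plain
// case-insensitive substring scan, otherwise fall back to the normal lunr search.
var search = function (query) {
  var quoted = query.match(/^"(.+)"$/)
  if (quoted) {
    var needle = quoted[1].toLowerCase()
    return documents.filter(function (doc) {
      return doc.text.toLowerCase().indexOf(needle) !== -1
    })
  }
  return idx.search(query)
}

search('"foo bar.ssl.bar"') // exact phrase, grep-style scan over the raw text
search('foo bar')           // regular lunr search

In a real implementation the two branches would need to return results in the same shape.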

Related, but not the same: issue #33, where less exact search terms produce higher scores than exact terms.
