Exact phrase matching? #62
At the moment lunr tries to be "clever" by automatically adding a wildcard at the end of your search terms, e.g. a search for "foo" becomes "foo*". I'd like to move away from this because of exactly this kind of issue: it is currently not possible to do an exact match search. I have some plans to change this, so hold off on implementing anything for now. I need to think through how to implement these changes. I'll be sure to keep you in the loop though, and would be very grateful for any help in making these changes. |
Hi Oliver, |
It depends. If you only want exact matches, then you can change this code https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301 so that it does not do the expanding. Another potential solution is to create n-grams. Basically, if you had the text "The quick brown fox" you would treat multiple words together as a 'token'. For a bi-gram, n = 2, you would end up with the tokens "The quick", "quick brown" and "brown fox". You could extend this to larger values of n, depending on the kind of results you get back. Take a look at adding a processor to the pipeline to do this. Another idea (not fully thought through) would be to use several instances of lunr together, maybe one with the n-gram index, another with the regular index, or even another with the token expanding. Sorry I can't be of much help here. The changes I've been thinking about alter the way the indexing works in a fairly substantial way and I need to fully understand the implications, hence it is taking a little while! Personally I wouldn't worry too much about posting your "quick and dirty" code to GitHub. Create a fork of this project and do whatever you need there, and let me know how you get along! P.S. If you're interested in this kind of thing I can recommend taking a look at http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf, it might give you a few ideas. |
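To make the n-gram idea concrete, here is a minimal sketch of building word n-grams from a piece of text. The `ngrams` helper is hypothetical and not part of lunr; it assumes words are separated by whitespace and does no stemming or lower-casing.

```javascript
// Hypothetical helper (not part of lunr): builds word n-grams from raw text.
// Assumes whitespace-separated words; no stemming or case folding is applied.
function ngrams(text, n) {
  const words = text.split(/\s+/).filter(Boolean);
  const grams = [];
  // Slide a window of n words across the text.
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

console.log(ngrams("The quick brown fox", 2));
// → [ 'The quick', 'quick brown', 'brown fox' ]
```

Larger values of n give longer phrases but also a larger index, so n = 2 or n = 3 is probably a sensible place to start.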
The n-gram approach sounds like an interesting solution. Isn't using the pipeline too late in this case, though? It operates on tokens, so anything I add to the pipeline would operate on single words. Or am I missing something? |
A pipeline function will get called with three arguments, a token, the index of that token, and all the tokens, so you should be able to do what you want with a pipeline function. |
Isn't it too late to add tokens to the list when the pipeline function gets called? The parent method won't iterate through these newly added objects and so they never get copied to the "final" list of tokens. |
Whatever you return from the pipeline function is used as the input to the next function in the pipeline.
You would probably have to do something about the undefined. So if you have this bigram function at the end of your pipeline it will spit out the bigrams, which will then be indexed and searchable. Unless I'm missing something!
|
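A rough sketch of the kind of bigram pipeline function discussed above, assuming the `(token, index, tokens)` call signature described in this thread and plain string tokens. The names `bigram` and `runStage` are hypothetical, and `runStage` is only a stand-in for the pipeline so the example runs without lunr. Lunr 2.x, as far as I know, passes token objects rather than strings, so this would need adapting there.

```javascript
// Sketch of a bigram pipeline function. Assumes it is called with
// (token, index, tokens) and that tokens are plain strings.
function bigram(token, index, tokens) {
  const next = tokens[index + 1];
  // The last token has no successor, so there is no bigram to emit.
  if (next === undefined) return undefined;
  return token + " " + next;
}

// Hypothetical stand-in for one pipeline stage (not lunr's own code):
// calls fn for every token and drops undefined results.
function runStage(fn, tokens) {
  return tokens.map(fn).filter((t) => t !== undefined);
}

console.log(runStage(bigram, ["the", "quick", "brown", "fox"]));
// → [ 'the quick', 'quick brown', 'brown fox' ]
```

Note that this replaces the single-word tokens with bigrams rather than adding to them, so an index built this way only matches two-word phrases.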
The problem is in this case I want the index to contain: |
Ah, yes, I see now, sorry for the confusion. You would need two separate instances of lunr in that case, and your code would have to do the search twice and combine the results. Sorry I haven't been able to help you much with this problem! |
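One possible way to combine the two searches is sketched below. It assumes each search returns a list of `{ ref, score }` objects, which is the shape of lunr's search results. The `combineResults` helper and the `phraseBoost` weighting are hypothetical choices for illustration, not something lunr provides.

```javascript
// Hypothetical helper: merges results from a bigram index and a regular
// index. Each result is assumed to look like { ref, score }.
function combineResults(phraseResults, wordResults, phraseBoost = 2) {
  const byRef = new Map();
  for (const r of wordResults) {
    byRef.set(r.ref, r.score);
  }
  // Phrase hits add a boosted score on top of any single-word score.
  for (const r of phraseResults) {
    byRef.set(r.ref, (byRef.get(r.ref) || 0) + r.score * phraseBoost);
  }
  return [...byRef.entries()]
    .map(([ref, score]) => ({ ref, score }))
    .sort((a, b) => b.score - a.score);
}

const wordResults = [{ ref: "a", score: 1 }, { ref: "b", score: 0.5 }];
const phraseResults = [{ ref: "b", score: 1 }];
console.log(combineResults(phraseResults, wordResults));
// → [ { ref: 'b', score: 2.5 }, { ref: 'a', score: 1 } ]
```

Scores from two separate indexes are not directly comparable, so the boost factor would likely need tuning against real queries.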
You've been extremely helpful! I was able to achieve a similar result by modifying the tokenizer. |
Cool, out of interest, what modifications did you make? |
Let me clean up the code a little bit and I'll post it here. |
Hey Oliver, Have you considered adding fuzzy matching support to lunr? |
The latest version (2.0.x) of Lunr supports exact phrase matching and fuzzy matching, more info in the guides. |
Hey @olivernn, I couldn't find the "exact phrase matching" support that you mentioned in the Lunr guides. Can you show me where I can find it? |
@wdiego ah, now that I re-read this issue, I see that I was confused. I must've thought this issue was about exact term matching, which is now supported. Phrase matching, i.e. "foo bar", is not currently supported, sorry to mislead. |
Hi @olivernn, are there any plans to support phrase matching? |
@928PJY I want to support it, I just don't know how to implement it in an efficient way yet. I'll re-open this issue. |
OK! Thank you @olivernn, if I have any ideas, I will let you know! |
Hello @olivernn I've been trying to use your code from January 2014 above to offer two-word exact phrase matching, but I can't seem to reconcile it with the docs. Wondering if there has been a change since v2. Can you offer any advice? My intention is to create an index with both single word and two word tokens. |
@olivernn I'm not sure if this issue covers my use-case, but I thought of asking here before creating a new issue - does lunr support exact phrase matching at the moment, or can I add it using a plugin ( [
"foo bar",
"bar foo",
"foo bar baz",
"bar que foo",
] searching for I'm using the latest version of |
Couldn't exact-phrase matching be achieved using the position meta data? |
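A sketch of how position metadata might be used for this, under some loud assumptions: that each word of the phrase has a list of `[start, length]` character positions within a field (my understanding of what lunr 2.x records when `position` is added to the builder's `metadataWhitelist`), and that adjacent words are separated by at most `maxGap` characters. The `hasAdjacentRun` helper is hypothetical. Collecting the positions per phrase word from a search result is left out, since indexed terms are stemmed and may not match the query words literally.

```javascript
// Hypothetical helper: positionsPerWord[i] holds the [start, length] pairs
// for the i-th word of the phrase within one field of one document.
// Returns true if the words occur consecutively, in order.
function hasAdjacentRun(positionsPerWord, maxGap = 1) {
  if (positionsPerWord.length === 0) return false;
  // Offsets at which a partial match of the phrase currently ends.
  let ends = positionsPerWord[0].map(([start, length]) => start + length);
  for (const positions of positionsPerWord.slice(1)) {
    const nextEnds = [];
    for (const [start, length] of positions) {
      // The next word must begin right after a partial match ends.
      if (ends.some((end) => start >= end && start - end <= maxGap)) {
        nextEnds.push(start + length);
      }
    }
    if (nextEnds.length === 0) return false;
    ends = nextEnds;
  }
  return ends.length > 0;
}

// Positions in the text "bar que foo bar": foo at 8, bar at 0 and 12.
const foo = [[8, 3]];
const bar = [[0, 3], [12, 3]];
console.log(hasAdjacentRun([foo, bar])); // phrase "foo bar" → true
console.log(hasAdjacentRun([bar, foo])); // phrase "bar foo" → false
```

This would run as a post-filter over ordinary search results, so it narrows the result list rather than replacing the index lookup.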
For me as a user of a site using lunr / antora, the missing "exact phrase search" causes a massive amount of redundant manual searching. For example, configuration file names contain sub-terms / parts that are fairly common component names, so searching foo bar.ssl.bar produces massive amounts of results for foo and bar; all these results need to be checked manually to see whether they really contain the search phrase. Finally, I find out no document contains foo bar.ssl.bat and all results are only there because of the automatic "convenience" of doing a "sub term search including stemming" instead of an "inconvenient" exact phrase search, so your good intention causes the opposite result 🙉☹ To make the effect tangible: if I had exact phrase search, a task would take 10 secs instead of the current 10 mins. Sadly, site: search in Google etc. also fails because that site has several versions of the same document and Google only searches one of them. Seeing this issue is 8 years old: I'd already be happy with a very simple/naive approach, e.g. a toggle like surrounding the search phrase in "" to turn on an exact phrase search mode which does not use any index but crawls live over the plain text like find/grep. While this is technically slower than with an index, it still finishes within 1-1000 milliseconds and the user's task is completed much more quickly. Related, but not the same: Issue #33, where less exact search terms produce higher scores than exact terms. |
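The naive quoted-phrase toggle suggested above could look roughly like this. It is only a sketch: `searchWithPhraseMode`, the `{ id, text }` document shape and the `indexSearch` callback are all hypothetical, and the scan is a plain case-insensitive substring match with no index involved.

```javascript
// Hypothetical helper: a query wrapped in double quotes triggers a plain
// substring scan over the raw documents; anything else is passed to
// indexSearch, a caller-supplied function (e.g. a wrapper around an index).
function searchWithPhraseMode(query, docs, indexSearch) {
  const quoted = query.trim().match(/^"(.+)"$/);
  if (!quoted) return indexSearch(query);
  const phrase = quoted[1].toLowerCase();
  return docs
    .filter((doc) => doc.text.toLowerCase().includes(phrase))
    .map((doc) => ({ ref: doc.id }));
}

// The four example documents from earlier in this thread.
const docs = ["foo bar", "bar foo", "foo bar baz", "bar que foo"].map(
  (text, id) => ({ id, text })
);
console.log(searchWithPhraseMode('"foo bar"', docs, () => []));
// → [ { ref: 0 }, { ref: 2 } ]
```

A linear scan like this needs the full document text to be available on the client, which may not hold for sites that ship only a serialized index.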
Hi!
Does lunr support exact phrase matching (i.e. using quotation marks in a search)? It doesn't seem like it from my initial research. I'd like to try to add this feature to the project. Could someone please give me some pointers on how to implement this?