Changed spell-checking to be plugin-based. #120
Conversation
Force-pushed from 5c318b1 to c02e3f7.

Force-pushed from e95f999 to 833330f.
Commit 833330f includes a slightly reworked default language selection based on conversations from this discussion thread.
Hey @dmoonfire, thanks so much for all the efforts you've put into this: this looks great!
Could you expound a bit on this? If we want to keep using a
```coffee
@init()

# We need a couple packages.
multirange = require 'multi-integer-range'
```
Could you clarify why we need the `multi-integer-range` library?
Anyways, this is an interesting problem, as it would seem we'd need to define some rules for how the returned ranges need to be interpreted. For example: if a range is marked as incorrect, but there's an intersecting range that's marked as correct by another provider, what should we do? Is the whole range correct or incorrect? I guess we might define some importance rules in order to prevent this kind of mismatching.
I also wonder if it might be simpler to just expose a word-based API: spell checking, as defined on Wikipedia, extracts words and then compares them against some kind of dictionary, so maybe we could avoid introducing this extra complexity by just having correct and incorrect words? Then, we should easily be able to identify which ranges are valid misspellings and whatnot.
Thoughts? 💭
I thought about a word-based API, but I was thinking that the overhead of native interop would make it less efficient. At least in C# and Java, there is more overhead to jump into the native libraries (as opposed to passing a single large string once), and I thought word-by-word would be excessive. But if that isn't a problem, then `multi-integer-range` wouldn't be needed because we could stop earlier.
My general rule was: if any checker says it is correct, it is correct. We do that in lines 45-47 with a simple union of indexes. Known words and `spell-check-project` both only do correctness checking. I mainly union'd those together because it was more efficient than inverting the range and then adding them into the second rule (below).
The second rule was: a word is incorrect only if every checker says it is incorrect. The multiple locale checkers use this rule. To do that, we keep an integer range of incorrect words for each checker, then get the intersection of all of them to identify which words are incorrect in every single one. I mainly used `multi-integer-range` to make the intersection logic a bit easier to understand using set theory instead of managing it myself (I really hate writing set operations :P).
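A minimal sketch of those two rules in JavaScript, assuming a simplified checker interface (`correctIndexes`/`incorrectIndexes` are hypothetical helpers) and `multi-integer-range`'s class API; the exact method names should be checked against that library's README:

```js
// Sketch only: correctIndexes/incorrectIndexes are assumed helpers that
// return word-index ranges such as [[0, 3], [7, 7]].
const { MultiRange } = require('multi-integer-range')

function misspelledIndexes(text, correctnessCheckers, localeCheckers) {
  // Rule 1: union every index that any correctness checker marks as correct.
  const correct = new MultiRange()
  for (const checker of correctnessCheckers) {
    correct.append(checker.correctIndexes(text))
  }

  // Rule 2: a word is incorrect only if every locale checker flags it,
  // i.e. the intersection of all the incorrect ranges.
  let incorrect = null
  for (const checker of localeCheckers) {
    const flagged = new MultiRange(checker.incorrectIndexes(text))
    incorrect = incorrect ? incorrect.intersect(flagged) : flagged
  }

  // Any index marked correct wins over the intersection of incorrect ones.
  return incorrect ? incorrect.subtract(correct) : new MultiRange()
}
```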
Doing word-by-word would prevent #13, but I think that is a grammar-checker thing, not a spell-checking one.
Got it. I am not sure about the possible overheads of crossing the native boundary, but I'd assume that passing only reference types on the way in might be efficient enough? We might define the `check` function as:

```
function check (words: [string]) -> [bool] {
}
```
Collecting all the `check` results enables us to have a set of words which are valid/invalid, and then we can convert them back to ranges in js land. Thoughts? 💭
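A rough sketch of that flow, under the assumption that `check` has the proposed `[string] -> [bool]` shape; the whitespace tokenizer here is purely for illustration:

```js
// check(words) is assumed to return one boolean per word, true = valid.
function misspelledRanges(text, check) {
  // Naive whitespace tokenizer, for illustration only.
  const tokens = []
  const wordRe = /\S+/g
  let match
  while ((match = wordRe.exec(text)) !== null) {
    tokens.push({ word: match[0], start: match.index, end: match.index + match[0].length })
  }

  // Map the per-word booleans back to character ranges in js land.
  const valid = check(tokens.map((t) => t.word))
  return tokens.filter((_, i) => !valid[i]).map((t) => [t.start, t.end])
}
```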
Now I have opinions about the word API. I banged up a quick gist which has a couple of implementations with low-resolution timing (Date - Date). Mainly, it creates 100k words via `lorem-ipsum` and then puts them through four different combinations:

- The entire buffer in one shot. (493 ms)
- Each word, one at a time. (989 ms)
- With a small `lru-cache`. (1535 ms)
- With an unlimited-life cache. (45 ms)
Now, the `lru-cache` result is probably implementation-specific, but it gives a rough idea of caching the results of the more common words being checked (once "a" is correct, "a" will always be right).
For the unlimited cache: using my own novel as an example, I grabbed the first ten chapters, which came out to 22k words but only 3.9k distinct words. Storing the last 5k words in memory and wiping the cache whenever it exceeds that limit would let you use a fast cache and take advantage of fast lookups across the entire buffer. I'd probably say any word over 20 characters is always rechecked and skips the cache, to avoid polluting it with obscure words.

The idea of the cache is that most languages have a certain redundancy (articles, prepositions, etc.) that fills in most of the text.
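A minimal sketch of that caching rule; `isCorrect` stands in for the underlying spellchecker call, and the 5k-entry and 20-character limits are the values suggested above:

```js
const cache = new Map()
const MAX_ENTRIES = 5000 // settable via config, per the suggestion above

function isCorrectCached(word, isCorrect) {
  // Long words are likely obscure; recheck them and keep them out of the cache.
  if (word.length > 20) return isCorrect(word)
  if (!cache.has(word)) {
    // Wipe the cache when it exceeds the limit rather than tracking LRU order.
    if (cache.size >= MAX_ENTRIES) cache.clear()
    cache.set(word, isCorrect(word))
  }
  return cache.get(word)
}
```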
In the worst case, checking every word of 100k words takes about double the time of sending the entire buffer as a single string.
An advantage of word-by-word checking is that we could eventually have a hook that lets a grammar give us a different rule for tokenization, which would allow us to handle PascalCase and camelCase (#91).
I think with a small cache (5k words, settable via config), it would be advantageous to switch to word-by-word. It would let us drop the dependency on `multi-integer-range` and simplify the logic for marking up the errors. If you are okay with that, I'll rework the implementation and push it back up. If so, do you want it as a delta against this one, or rebased off the head as a single commit? (I'd squash everything before asking you for the final pull.)
> In the worst case, checking every word of 100k words takes about double the time of sending the entire buffer as a single string.
If we manage to wrap this inside a `Task`, I feel like this might already be fine. Moreover, we can let the UI thread be updated at regular intervals, so that it doesn't look like no spell-checking is being performed in the case of a long text. 🐎
Also, I am wondering if you could set up a test where, instead of crossing the native boundary for each word, we pass an array of strings (the words) and return an array of booleans. It seems like it could greatly improve performance, because I'd assume `string`s get passed by reference, and `bool` is a really cheap data type. What do you think?
> I think with a small cache (5k words, settable via config), it would be advantageous to switch to word-by-word. It would let us drop the dependency on `multi-integer-range` and simplify the logic for marking up the errors.
I think it'd be nice to avoid the cache, just so we can keep the memory footprint to a minimum. Also, it seems like we would need to invalidate this cache when e.g. new packages that provide spell-checking get activated, and I'd ❤️ to avoid this extra complexity if we can.
> do you want it as a delta against this one, or rebased off the head as a single commit? (I'd squash everything before asking you for the final pull.)
Personally, I think it's fine to avoid rebasing: feel free to just add new commits on top of the ones you've already pushed, so that this conversation can be read in the future within its context.
@as-cii, glad you like it so far. I'll admit, my understanding of `Task`s is probably the weakest part of this entire thing, so I may have missed something that makes all this possible. :)

When we allow the user to enter a list of locales into the system, I have a listener to catch those changes so I can update the plugins. It starts with the hook in `main.coffee`, which then goes into the manager. That clears out the

Assuming the manager is in the `Task`, I couldn't find an example of passing configuration information or requests to reload from the main Atom process into the `Task` so it knows to reload all the locales before checking. This problem also hung me up on the

I was also using the service interface to get all the additional checkers (such as

Related to that was also how to pass configuration changes for those plugins through the manager to the corresponding plugin on the other side of the `Task`, so they know to update their configurations, reload files, etc. With the locales and known words, I know what to listen for, but arbitrary plugins (

I'd like to use `Task`; it would cut down on the loading time significantly. I'm just missing the required knowledge of how to make it work. Thank you.
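For reference, a minimal sketch of the parent side of the kind of flow control that later commits describe (configuration changes communicated via `send`/`emit`), using Atom's `Task` API; the handler path, event name, message shape, and `updateMisspellingMarkers` helper are all illustrative assumptions:

```js
const { Task } = require('atom')

// The handler module runs in the task (child) process.
const task = new Task(require.resolve('./spell-check-handler'))
task.start()

// Results come back from the task via emit() calls in the handler.
task.on('misspellings', (ranges) => {
  updateMisspellingMarkers(ranges) // hypothetical helper in the main process
})

// Push configuration changes into the task so it can reload locales before
// the next check; the handler would listen with process.on('message').
atom.config.onDidChange('spell-check.locales', ({ newValue }) => {
  task.send({ type: 'reloadLocales', locales: newValue })
})
```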
```coffee
return

# Initialize the spell checker which can take some time.
@spellchecker = new spellchecker.Spellchecker
```
Awesome that you used the new API instead of the deprecated one!
* Changed the package to allow for external packages to provide additional checking. (Closes atom#74)
  - Disabled the task-based handling because of passing plugins.
  - Two default plugins are included: system-based dictionaries and "known words".
  - Suggestions and "add to dictionary" are also provided via interfaces. (Closes atom#10)
  - Modified various calls so they are aware of where the buffer is located.
* Modified the system to allow for multiple plugins/checkers to identify correctness.
  - Incorrect words must be incorrect for all checkers.
  - Any checker that treats a word as valid is considered valid for the buffer.
* Extracted system-based dictionary support into a separate checker.
  - System dictionaries can now check across multiple system locales.
  - Locale selection can be changed via package settings. (Closes atom#21)
  - Multiple locales can be selected. (Closes atom#11)
  - External search paths can be used for Linux and OS X.
  - The default language is based on the process environment, with a fallback to the browser, before finally using `en-US` as a fallback.
* Extracted the hard-coded approved list into a separate checker.
  - The user can add additional "known words" via settings.
  - Added an option to add more known words via the suggestion dialog.
* Updated ignore files and added EditorConfig settings for development.
* Various coffee-centric formatting.
Force-pushed from 9426334 to 79194c3.
- Switched to a task-based system.
  - Reworked the plugins so they return full paths to a `require`able instance.
  - Modified the activation/deactivation to create a single background task for all checking.
  - Initial request for corrections will have a delay while an in-process manager is created.
  - Added flow control via `send` and `emit` to communicate configuration changes with the task process.
- Changed to a word-based searching implementation.
  - Added a simplified tokenizer that includes some accented characters.
  - An internal cache is used for a single round of checking to reduce overhead.
  - Removed the use of integer ranges and simplified processing logic.
- Removed a number of `console.log` calls used for debugging.
- Updated documentation.
  - Split out plugin creation into a separate document with a reference in the `README.md`.
Force-pushed from 79194c3 to dff766e.
Well, I accidentally rewrote the original commit, but I left the files as they were, so the above comments remain true (line numbers and all), but sadly not the commit hash. I believe I have implemented everything you suggested.

I did put in a short-term cache that is used only for the length of a single check on the buffer. With a 25k-word chapter, it made enough of a noticeable difference that I felt it was worthwhile. We don't have to worry about invalidation, because the cache is never persisted beyond the call and (ideally) the correctness of a word shouldn't change from paragraph five to paragraph fifty-five.

I updated it to handle the array-based checking, but I did not change the spellchecker module to do that. It simply checks every word as an individual call. From my earlier tests, it seemed to produce fairly reasonable performance. I didn't touch the loading methods, only how they interacted.

A lot of the code drastically changed with the inclusion of the
There's some discussion of a word-by-word API in this PR's comments. In fact, the spell checker used to be word-by-word but was changed here.
@jeancroy One of the reasons I added the code I did was to handle my personal needs (grave, acute, and macrons), but I think a more holistic approach would be far better. I don't know if using something like XRegExp (MIT) would be a better choice. ES6's new

There is also the problem that things like "bob's" should be one word for most dictionaries. I did an English-centric approach, but I don't like it. Thank you.
So this adds back the word splitting on the Atom side instead of the spellchecker side?

With regards to

Other parts of Atom define a list of non-word characters and then define word characters as anything not on that list. I think that makes sense; I'd rather break the approximation on fancy punctuation than not support some non-Latin alphabets.
I wasn't aware of atom/node-spellchecker#27 and was only going based on feedback from the previous #120 review from @as-cii. This does pretty much go back to tokenization on the Atom/JS side, which sounds like it may be a problem.

I did have a small cache involved, which would cover most natural-language usages; repeated words are only checked once for a given buffer check, which means we wouldn't include the same word repeatedly while processing the lines. For a given line, though, all the unknown words were

I guess the big question is: which input should I optimize for? Line-based, word-for-word, single buffer?

With regards to

I wasn't entirely sure which "it" you meant, but I presume you'd like me to use the "non-word characters" approach instead of the word approach. I was mostly following how JS handled
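A hypothetical sketch of that "non-word characters" approach: list the delimiters, and treat everything else (including accented and non-Latin letters) as word characters. The delimiter set below is illustrative only; note it deliberately omits `-` and `'` so "South-American" and "bob's" survive as single tokens:

```js
// Anything NOT in the delimiter set counts as a word character, so accented
// and non-Latin letters are included without listing them explicitly.
const WORD = /[^\s!"#$%&()*+,./:;<=>?@[\]^_`{|}~]+/g

function tokenize(text) {
  const tokens = []
  let match
  while ((match = WORD.exec(text)) !== null) {
    tokens.push({ word: match[0], start: match.index, end: match.index + match[0].length })
  }
  return tokens
}

// "South-American" and "bob's" come back as single tokens:
tokenize("The South-American rain forest is bob's")
```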
Thanks for updating this pull-request, @dmoonfire, and thanks @jeancroy for mentioning the potential issues we might have if we switch to a word-based API.

Originally, I was mainly concerned about performance, and that's why I suggested passing arrays as opposed to crossing the native boundary for each word. However, the i18n concern feels even more important, because I underestimated the fact that a word represents a different concept based on the language, especially when it comes to non-Western locales. Hence, it seems like we cannot simply pass words into spell-checkers; each one should rather handle its own concept of a word, which might take us back to the previous design.

A question still remains, though: how would we handle cases where two spell-checkers return inconsistent results? For instance:
In that case there's a potential mismatch, so I guess we could either:
In the above case it seems like doing 1) would be okay, because it's likely that

```js
let index = new MarkerIndex
for (let checker of this.checkers) {
  for (let result of checker.check(text)) {
    let intersectingResults = index.findIntersecting(result.range.start, result.range.end)
    for (let intersectingResultId of intersectingResults) {
      // markers is assumed to map result ids to the stored results.
      let intersectingResult = markers.get(intersectingResultId)
      if (intersectingResult.incorrect && result.correct) {
        // remove intersecting result and add current result
      } else if (intersectingResult.incorrect && result.incorrect) {
        // remove intersecting result, and add union of the two results
      } else if (intersectingResult.correct && result.correct) {
        // remove intersecting result, and add union of the two results
      } else { // intersectingResult.correct && result.incorrect
        // break
      }
    }
    if (intersectingResults.length === 0) {
      // add current result
    }
  }
}
```

All the

What do you think? 💭

@dmoonfire: thanks again for your great work on this (and sorry for revisiting my original suggestions), I really think this is going to be awesome! ✨
@as-cii: I like talking about these things. I want to make sure we cover everything, since we all bring different experiences and skills to the table. In the end, this is all about making a good library, not personal egos. :) I also don't mind refactoring. As for changing/revisiting opinions, that's why we discuss things. As new information is presented, we adapt.

> A question still remains though: how would we handle cases where two spell-checkers return inconsistent results?

The general philosophy is that if one checker says it's correct, it is correct. A good example: the word "danke" would be incorrect according to

Originally, I wrote it effectively as a tri-state (correct, incorrect, no opinion). That was to reduce the amount of data passed from the checkers. They passed a list of correct ranges and incorrect ranges, with everything else being ignored. If no checker had an opinion about a given range, then it was considered correct. This would prevent everything from being marked as incorrect if you have a bad locale. In the original implementation I was concerned about using the

Question/Observation 1

It looks like

One possible fix (if

Question/Observation 2

With the proposed implementation, when
We would need the correct marking with the above implementation because of the "if any checker says correct, then it's correct" rule, to avoid

Question/Observation 3

One advantage of

Assuming that you have a 1k-word document with alternative words correct:
For a given

For each of the same ranges,
I like the idea of propagating "valid" state, because then the union of spell-checker results is sort of equivalent to a spell checker with the union of their dictionaries. However, that equivalence mostly exists only if all spell checkers tokenize the text the same way.

I'm not sure how I feel about propagating valid state to a larger area of text. For example, compound words: some parts of the word may be properly spelled while another isn't, yet overall the word isn't properly spelled.

Dictionary A: detects English words, ignores hyphens.

Sentence: "The South-American rain forest is"

Dictionary A: { "South": valid, "American": valid }

Ideally, valid state should not propagate. Or, if the above is too much of an edge case...

Sentence: "South-Omerecan"

Dictionary A: { "South": valid, "Omerecan": invalid }

Ideally, the valid state of "South" should not propagate to "Omerecan" using the invalid state of "South-Omerecan" as a bridge.
I'm not sure how to implement that and still correctly handle "danke cheese" with

In your examples, with the current implementation, you would have a misspelled wavy line underneath the "-". But the corrections wouldn't work, since we pass the full text of the marker ("-") to the checker for suggestions.

One approach is if we used priority to determine correct/incorrect. If two checkers share a priority, then their results are merged so corrects will override, but when a checker of a higher priority says it is incorrect, then it is incorrect regardless of what lower-priority checkers say. That would mean there would be a lot in priority 100 (the locale checker) when you are merging multiple languages together.

The only way that would work is if your second checker (the hyphenated one) only identified incorrect hyphenated words instead of trying to also do normal language checking. Otherwise, if that checker isn't multi-locale aware, it would incorrectly mark non-English words as incorrect even though the user had
The above problem exists because some dictionaries consider parts of a word, while I don't think any dictionary will consider "danke cheese" a compound word.
Oh, I know that "danke cheese" isn't a compound word. The reason I mentioned it is that checkers provide a text range of what is wrong. In your example, "South-Omerecan".
Using my current logic, it would then only mark

Using "danke cheese".
The behavior I'm working on is that if you use

The output of these checkers is the same, but there is nothing in the output that tells me that the first example (South-Omerecan) should favor the incorrect range, while the second example (danke cheese) should merge the results. I need some signal that indicates when to change the behavior. I have the ranges (as above) and priority (which is currently used to order suggestions).

I could add a new property to the checkers to say "when I say incorrect, it is always incorrect" (example 1), but that wouldn't work if checker 2 also did normal language checking (e.g., all of English), because when it encountered "danke" in the second example, it would mark it as incorrect and its signal said that its incorrect is absolute. I would consider that wrong, since it breaks example 2.

I don't have a good answer, because we don't have checkers that produce these conflicting results (we could make one, though), and I think the solution would significantly increase the complexity of this PR.
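For the record, a sketch of the priority idea as floated above, with an assumed per-checker result shape of `{ priority, correct, incorrect }`; this only illustrates the mechanism, not a solution to the conflict just described:

```js
// For one range, consider only the opinions from the highest priority that
// expressed one; within that priority, "correct" overrides "incorrect".
function resolve(results) {
  const opinions = results.filter((r) => r.correct || r.incorrect)
  if (opinions.length === 0) return 'no-opinion'
  const top = Math.max(...opinions.map((r) => r.priority))
  const topOpinions = opinions.filter((r) => r.priority === top)
  return topOpinions.some((r) => r.correct) ? 'correct' : 'incorrect'
}
```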
Force-pushed from 03a2840 to d400492.

Force-pushed from d400492 to a509817.
```coffee
SpellChecker = require 'spellchecker'

# This is the task local handler for the manager so we can reuse the manager
# throughout the life of the task.
SpellCheckerManager = require './spell-check-manager.coffee'
```
Can we remove the `.coffee` extension here?
Hey @dmoonfire, so after looking at the code a second time, I think we might merge this. It's a pretty big pull-request and, as with all redesigns, I am somewhat concerned we might introduce some regression. The work you've done is pretty solid, though, so we might let this bake for a while on master and see if we're introducing any issue. With the bundled spell-checker it works great on my machine. ✨

Is there any package that provides spell-check that I might use to test this out locally?

In the meantime, could you also update this pull-request's description to match what we're currently doing in code? (i.e. the description says we're not using a

Apologies for making you wait so long, and thanks for the great work on this, looking forward to 🚢ping it!
Thank you. Yes, if you use

I'll get those changes done during my lunch break.
* Removed ".coffee" from require statements.
* Moved empty array initialization closer to where it was populated.
Force-pushed from 17fecdc to c228c87.
Thanks @dmoonfire! ⚡ ⚡ ⚡ This will go out in the next beta release.
@as-cii Which release will that be? 1.12? (just want to have it desperately...)
Yes, this was released in Atom beta v1.12.0-beta0 and you can already try it out by downloading the latest beta from https://atom.io/beta.
@as-cii thanks, will try it now.

No luck, got 1.11.0.beta3-0.1.
It actually works! Tested in 1.12.0.beta3 and the ru-RU locale.
Hi, spell checking / setting locales still has troubles in Fedora 25, Atom 1.12.6. See #129.
@dmoonfire - my apologies if this is not the correct place/forum to ask, but I am trying to determine whether the ability to "add to dictionary" is something that should be part of the spell-check plugin, whether it requires another plug-in in addition, or whether all that has been added to close #10 is the interface that makes it possible for an additional plug-in to add custom words to the dictionary.

If the functionality exists, I might have stumbled across a bug that prevents it from working, or perhaps I am experiencing user error or a conflict with additional packages that an update to the README (which I would be happy to make) would fix. I'd also be open to working to add that functionality if it is still needed, and if there is anyone in the community who could mentor me in making a change to an Atom plugin.
The "Add to Dictionary" is actually provided by the individual plugins. For example, the system checker doesn't (we can't add words to the system dictionary) but the known words does (we control that via configuration). Also, my project-based checker also provides words. To implement it:
As for getting everything working, I'm willing to add my two cents. I have a pretty high workload (a surprise release at work, plus publishing two books), but I can usually answer questions or give suggestions without a problem.
I am not sure if I should create a feature request, but: on Windows 7 the Locale Path actually works, contrary to what the description says (and I'd love it to work on Windows 10 as well).

The problem with Locale Path is that it does not accept relative paths. I have used the ZIP distribution of Atom and prepared a portable-only package, but I have now hard-coded the path to the dictionaries in the plugin settings. Possible solutions:
What do you think, @dmoonfire?