
Changed spell-checking to be plugin-based. #120

Merged
merged 6 commits into atom:master from plugin-based-spell-check
Aug 19, 2016

Conversation

dmoonfire
Contributor

@dmoonfire dmoonfire commented Mar 12, 2016

  • Changed the package to allow external packages to provide additional checking. (Closes Allow for pluggable spell-checking modelled after Linter #74)
    • Updated the Task-based system to handle passing known words to the background task.
    • Two default plugins are included: system-based dictionaries and "known words".
    • Suggestions and "add to dictionary" are also provided via interfaces. (Closes Learn spellings / Add words to dictionary #10)
    • Modified various calls so they are aware of where the buffer is located.
  • Modified the system to allow multiple plugins/checkers to identify correctness.
    • Incorrect words must be incorrect for all checkers.
    • If no checker has an opinion, the word is considered correct.
    • Any word that a checker treats as valid is considered valid for the buffer.
  • Extracted system-based dictionary support into a separate checker.
    • System dictionaries can now check across multiple system locales.
    • Locale selection can be changed via package settings. (Closes Add a way to choose the current language #21)
    • Multiple locales can be selected. (Closes Multiple/language dictionaries #11)
    • External search paths can be used for Linux and OS X.
    • The default language is based on the process environment, falling back to the browser locale and finally to en-US.
  • Extracted the hard-coded approved list into a separate checker.
    • Users can add additional "known words" via settings.
    • Added an option to add more known words via the suggestion dialog.
  • Updated ignore files and added EditorConfig settings for development.
  • Various coffee-centric formatting.

@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch 2 times, most recently from 5c318b1 to c02e3f7 on March 12, 2016 19:31
@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch 2 times, most recently from e95f999 to 833330f on March 15, 2016 03:18
@dmoonfire
Contributor Author

Commit 833330f includes a slightly reworked default language selection based on conversations from this discussion thread.

@lee-dohm lee-dohm added the atom label Mar 15, 2016
@as-cii
Contributor

as-cii commented Mar 28, 2016

Hey @dmoonfire, thanks so much for all the efforts you've put into this: this looks great!

Disabled the task-based handling because of passing plugins.

Could you expound a bit on this? If we want to keep using a Task (which seems reasonable), I wonder if we could let service authors expose a function that returns a path which, when required, exports a pure-function. This way we can easily pass those function paths into the Task, which in turn, is free to require them. What am I missing?
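One rough shape for that idea, as a sketch (all names here are hypothetical, not a final API):

# The provider hands over paths instead of live objects, so the Task can
# require each path inside the background process and get a pure
# function/checker back.
module.exports =
  provideSpellCheckers: ->
    checkerPaths: [require.resolve('./known-words-checker')]

# Inside the background task, each path is required on demand:
#   checkers = (require(path) for path in checkerPaths)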

@init()

# We need a couple packages.
multirange = require 'multi-integer-range'
Contributor

Could you clarify why we need the multi-integer-range library?

Anyways, this is an interesting problem, as it would seem we'd need to define some rules for how the returned ranges need to be interpreted. For example: if a range is marked as incorrect, but there's an intersecting range that's marked as correct by another provider, what should we do? Is the whole range correct or incorrect? I guess we might define some importance rules in order to prevent this kind of mismatching.

I also wonder if it might be simpler to just expose a word-based API: spell checking, as defined on Wikipedia, extracts words and then compares them against some kind of dictionary, so maybe we could avoid introducing this extra complexity by just having correct and incorrect words? Then, we should easily be able to identify which ranges are valid misspellings and whatnot.

Thoughts? 💭

Contributor Author

I thought about a word-based API, but I was concerned that the overhead of native interop would make it less efficient. At least in C# and Java, there is more overhead in jumping to the native libraries word by word (as opposed to passing a single large string once), and I thought word-by-word calls would be excessive. But if that isn't a problem, then multi-integer-range wouldn't be needed because we could stop earlier.

My general rule was: if any checker says a word is correct, it is correct. We do that in lines 45-47 with a simple union of indexes. Known words and spell-check-project both only do correctness checking. I mainly unioned those together because it was more efficient than inverting the range and then adding it into the second rule (below).

The second rule was that a word is incorrect only if every checker says it is incorrect. The multiple locale checkers use this rule. To do that, we keep an integer range of incorrect words for each checker, then take the intersection of all of them to identify which words are incorrect in every single one. I mainly used multi-integer-range to make the intersection logic easier to understand using set theory instead of managing it myself (I really hate writing set operations :P).
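A rough sketch of those two rules using multi-integer-range might look like this (the names and result shapes are illustrative, not the package's actual code):

{MultiRange} = require 'multi-integer-range'

# Illustrative merge of checker results over word-index ranges.
mergeResults = (checkerResults) ->
  # Rule 1: the union of everything any checker called correct.
  correct = new MultiRange()
  correct.append result.correctRanges for result in checkerResults

  # Rule 2: a range is incorrect only if every opinionated checker agrees.
  incorrect = null
  for result in checkerResults when result.incorrectRanges.length
    if incorrect?
      incorrect = incorrect.intersect new MultiRange(result.incorrectRanges)
    else
      incorrect = new MultiRange(result.incorrectRanges)

  # Correctness always wins over incorrectness.
  if incorrect? then incorrect.subtract(correct) else new MultiRange()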

Doing word-by-word would prevent #13, but I think that is a grammar-checker concern, not a spell-checking one.

Contributor

Got it. I am not sure about the possible overheads of crossing the native boundary, but I'd assume that passing only reference types on the way in might be efficient enough? We might define the check function as:

function check (words: string[]): boolean[] {
}

Collecting all the check results enables us to have a set of words which are valid/invalid, and then we can convert them back to ranges in js land. Thoughts? 💭
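A sketch of what that conversion back to ranges might look like on the JS side (the tokenizer and the batch check function are assumptions for illustration):

# Check a buffer with a batch word API, then map the failures back to
# character ranges in JS land. (Sketch only; names are illustrative.)
checkBuffer = (text, checker) ->
  words = []
  starts = []
  regex = /\w+/g   # simplistic tokenizer for the sake of the example
  while match = regex.exec(text)
    words.push match[0]
    starts.push match.index

  results = checker.check(words)   # -> array of booleans, true = correct
  misspellingRanges = []
  for correct, i in results when not correct
    misspellingRanges.push [starts[i], starts[i] + words[i].length]
  misspellingRanges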

Contributor Author

Now I have opinions about the word API. I banged up a quick gist which has a couple implementations with a low-resolution timing (Date - Date). Mainly, it creates 100k words via lorem-ipsum and then puts it through four different combinations:

  1. The entire buffer in one shot. (493 ms)
  2. Each word, one at a time. (989 ms)
  3. With a small lru-cache. (1535 ms)
  4. With an unlimited-life cache. (45 ms)

Now, the lru-cache is probably implementation-specific, but it gives a rough idea of caching the results of the more common words being checked (once "a" is correct, "a" will always be right).

For the unlimited cache: using my own novel as an example, I grabbed the first ten chapters, which came out to 22k words but only 3.9k distinct words. Keeping up to 5k words in memory and wiping the cache whenever it exceeds that limit would let you use a fast cache and take advantage of that fast lookup for the entire buffer. I'd probably say any word over 20 characters is always rechecked and skips the cache, to avoid polluting it with obscure words.

The idea of the cache is that most languages have a certain redundancy (articles, prepositions, etc) that fills in most of the language.
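A bounded cache along those lines might look like this sketch (the limits come from the numbers above; everything else is an assumption):

# Size-capped word cache: wipe it when it grows past the limit, and skip
# very long words so rare tokens don't pollute it.
class WordCache
  constructor: (@limit = 5000, @maxWordLength = 20) ->
    @reset()

  reset: ->
    @cache = {}
    @size = 0

  isMisspelled: (word, checkFn) ->
    return checkFn(word) if word.length > @maxWordLength
    unless word of @cache
      @reset() if @size >= @limit
      @cache[word] = checkFn(word)
      @size++
    @cache[word]

# Usage (checker is illustrative):
#   cache.isMisspelled word, (w) -> spellchecker.isMisspelled(w)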

In the worst case, checking 100k words one at a time takes about double the time of sending the entire buffer as a single string.

An advantage of using a word-by-word approach is that we could eventually have a hook that lets a grammar give us a different rule for tokenization, which would allow us to handle PascalCase and camelCase (#91).

I think with a small cache (5k words, settable via config), it would be advantageous to switch to a word-by-word. It would let us drop the dependency on multi-integer-range and simplify the logic for marking up the errors. If you are okay with that, I'll rework the implementation and push it back up. If so, do you want it as a delta against this one or rebase it off the head as a single commit? (I'd squash everything before asking you for the final pull.)

Contributor

In the worst case, checking every word on 100k words is about double what it is to send the entire buffer as a single string.

If we manage to wrap this inside a Task, I feel like this might already be fine. Moreover, we can let the UI thread be updated at regular intervals, so that it doesn't look like no spell-checking is being performed in the case of a long text. 🐎

Also, I am wondering if you could set up a test where, instead of crossing the native boundary for each word, we pass an array of strings (the words) and return an array of booleans. It seems like it could greatly improve performance, because I'd assume strings get passed by reference, and bool is a really cheap data type. What do you think?

I think with a small cache (5k words, settable via config), it would be advantageous to switch to a word-by-word. It would let us drop the dependency on multi-integer-range and simplify the logic for marking up the errors.

I think it'd be nice to avoid the cache, just so we can keep the memory footprint to a minimum. Also, it seems like we would need to invalidate this cache when e.g. new packages that provide spell-checking get activated, and I'd ❤️ to avoid this extra complexity if we can.

do you want it as a delta against this one or rebase it off the head as a single commit? (I'd squash everything before asking you for the final pull.)

Personally, I think it's fine to avoid rebasing: feel free to just add new commits on top of the ones you've already pushed, so that this conversation can be read in the future within its context.

@dmoonfire
Contributor Author

@as-cii, glad you like it so far. I'll admit, my understanding of Tasks is probably the weakest part of this entire thing, so I may have missed something that makes all this possible. :)

When we allow the user to enter a list of locales into the system, I have a listener to catch those changes so I can update the plugins. It starts with the hook in main.coffee, which then goes into the manager. That clears out the localeCheckers in the manager, which are then reloaded the first time they are needed.

Assuming the manager is in the Task, I couldn't find an example of passing configuration information, or requests to reload, from the main Atom process into the Task so it knows to reload all the locales before checking.

This problem also hung me up on the KnownWordsChecker, since we pass the list of known words into that checker (along with a flag that says whether it should provide the "add to dictionary" option, etc.).

I was also using the service interface to get all the additional checkers (such as spell-check-project, available via apm) that the user loaded a package for. These are required by the Atom process and passed in as objects, but I couldn't figure out how to find the path to the module from the object. I could have each checker provide its require path, but that felt like a code smell to me.

Related to that was how to pass configuration changes for those plugins through the manager to the corresponding plugin on the other side of the Task, so they know to update their configurations, reload files, etc. With the locales and known words, I know what to listen for, but arbitrary plugins (spell-check-project, or what I was envisioning as a user-specific dictionary, spell-check-user) wouldn't be known ahead of time.

I'd like to use Task; it would cut down on the loading time significantly. I'm just missing the knowledge required to make it work.

Thank you.

return

# Initialize the spell checker which can take some time.
@spellchecker = new spellchecker.Spellchecker


Awesome that you used the new API instead of the deprecated one!

* Changed the package to allow for external packages to provide additional checking. (Closes atom#74)
  - Disabled the task-based handling because of passing plugins.
  - Two default plugins are included: system-based dictionaries and "known words".
  - Suggestions and "add to dictionary" are also provided via interfaces. (Closes atom#10)
  - Modified various calls so they are aware of where the buffer is located.
* Modified system to allow for multiple plugins/checkers to identify correctness.
  - Incorrect words must be incorrect for all checkers.
  - Any checker that treats a word as valid is considered valid for the buffer.
* Extracted system-based dictionary support into separate checker.
  - System dictionaries can now check across multiple system locales.
  - Locale selection can be changed via package settings. (Closes atom#21)
  - Multiple locales can be selected. (Closes atom#11)
  - External search paths can be used for Linux and OS X.
  - Default language is based on the process environment, falling back to the browser locale and finally to `en-US`.
* Extracted hard-coded approved list into a separate checker.
  - User can add additional "known words" via settings.
  - Added an option to add more known words via the suggestion dialog.
* Updated ignore files and added EditorConfig settings for development.
* Various coffee-centric formatting.
@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch from 9426334 to 79194c3 on May 2, 2016 05:52
- Switched to a task-based system.
    - Reworked the plugins so they return full paths to a `require`able instance.
    - Modified the activation/deactivation to create a single background task for all checking.
    - Initial request for corrections will have a delay while an in-process manager is created.
    - Added flow control via `send` and `emit` to communicate configuration changes with the task process (see the sketch after this commit message).
- Changed to a word-based searching implementation.
    - Added a simplified tokenizer that includes some accented characters.
    - An internal cache is used for a single round of checking to reduce overhead.
    - Removed the use of integer ranges and simplified processing logic.
- Removed a number of `console.log` calls used for debugging.
- Updated documentation.
    - Split out plugin creation into a separate document with a reference in the `README.md`.
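Atom's Task API supports that send/emit flow. A minimal sketch, assuming illustrative file paths, message shapes, and config keys (not the package's exact code):

{Task} = require 'atom'

# Start one background task and keep it for the life of the package.
task = new Task(require.resolve('./spell-check-handler'))
task.start()

# Forward configuration changes from the main process into the task.
atom.config.onDidChange 'spell-check.locales', ({newValue}) ->
  task.send {type: 'reconfigure', locales: newValue}

# Inside spell-check-handler.coffee, the child receives the message:
#   process.on 'message', ({type, locales}) ->
#     manager.reloadLocales(locales) if type is 'reconfigure'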
@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch from 79194c3 to dff766e on May 2, 2016 05:59
@dmoonfire
Contributor Author

Well, I accidentally rewrote the original commit, but I left the files as they were, so the above comments remain true (line numbers and all), though sadly the commit hashes do not.

I believe I have implemented everything you suggested.

I did put in a short-term cache that is used only for the length of a single check on the buffer. With a 25k word chapter, it made enough of a noticeable difference that I felt it was worthwhile. We don't have to worry about invalidation because the cache is never persisted beyond the call, and (ideally) the correctness of a word shouldn't change from paragraph five to paragraph fifty-five.
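That single-pass cache can be as small as this sketch (the checker and names are illustrative):

# Memoize within one buffer check only, so there is nothing to
# invalidate afterwards.
checkWords = (checker, words) ->
  seen = {}
  for word in words
    seen[word] ?= checker.isMisspelled(word)
  (seen[word] for word in words)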

I updated it to handle the array-based checking, but I did not change the spellchecker module to do that. It simply checks every word as an individual call. From my earlier tests, it seemed to produce fairly reasonable performance.

I didn't touch the loading methods, only how they interacted. A lot of the code changed drastically with the inclusion of the Task; it added a lot more complexity in some aspects, tempering the simplification of going to array-based checking.

@jeancroy

jeancroy commented May 2, 2016

There's some discussion on a word-by-word API in this PR's comments.
I just wanted to add that splitting Unicode words is not easy; I expect CJK words to be even harder.

In fact, the spell checker used to be word-by-word but was changed here:
atom/node-spellchecker#27

@dmoonfire
Contributor Author

@jeancroy One of the reasons I added the code I did was to handle my personal needs (graves, acutes, and macrons), but I think a more holistic approach would be far better.

I don't know if using something like XRegExp (MIT) would be a better choice. ES6's new /u regex flag is much better at splitting things than ES5, which would be a reason to wait until that flag is available inside Atom. It seemed like a decision that needed discussion, which is why I went with a "kinda works" approach.

There is also the problem that things like "bob's" should be one word for most dictionaries. I went with an English-centric approach, but I don't like it.

Thank you.

@jeancroy

jeancroy commented May 2, 2016

So this adds back the word splitting on the Atom side instead of the spellchecker side?

With regards to wRegexClass = '[\\w\\u0080-\\u00FF\\u0100-\\u017F\\u0180-\\u024F]'

Other parts of Atom define a list of non-word characters and then define word characters as anything not on that list. I think it makes sense; I'd prefer to break the approximation on fancy punctuation than to not support some non-Latin alphabets.

@dmoonfire
Contributor Author

I wasn't aware of atom/node-spellchecker#27 and was only going on the feedback from @as-cii's earlier review in this PR (#120).

This does pretty much go back to tokenization on the Atom/JS side, which sounds like it may be a problem. I did have a small cache involved, which would cover most natural-language usage; repeated words are only checked once for a given buffer check, which means we don't check the same word repeatedly while processing the lines. For a given line, all the unknown words were joined with words.join(" ") and passed to node-spellcheck, which reduced the interop overhead. This approach doesn't work for opening up /usr/share/dict/words, because there is no redundancy and every word is on its own line, so that would be the worst-case scenario.

I guess the big question is which input should I optimize for? Line-based, word-for-word, single buffer?

With regards to wRegexClass = '[\\w\\u0080-\\u00FF\\u0100-\\u017F\\u0180-\\u024F]'

I wasn't entirely sure which "it" you meant, but I presume you'd like me to use the "non-word characters" approach instead of the word approach. I was mostly following how JS handles \w by including additional characters in the class instead of reversing the class, but I can find an example of a non-word class and split on that instead. A lot of that answer depends on the first question, though, because if I go back to passing the entire text into the checker, it's a moot point. :)
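For comparison, the two tokenizing approaches might look like this (both regexes are illustrative, not the package's final ones):

# Extend \w with Latin-1 Supplement and Latin Extended ranges.
wordRegex = /[\w\u0080-\u00FF\u0100-\u017F\u0180-\u024F]+/g

# Or: list the separators and treat everything else as word characters,
# which degrades on fancy punctuation instead of dropping non-Latin
# alphabets.
nonWordSeparators = /[\s.,:;!?"()\[\]{}<>|\\\/]+/
tokenize = (line) ->
  token for token in line.split(nonWordSeparators) when token.length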

@as-cii
Contributor

as-cii commented May 5, 2016

Thanks for updating this pull-request, @dmoonfire, and thanks @jeancroy for mentioning the potential issues we might have if we switch to a word-based API.

Originally, I was mainly concerned about performance, and that's why I suggested passing arrays as opposed to crossing the native boundary for each word. However, the i18n concern feels even more important, because I underestimated the fact that a word represents a different concept depending on the language, especially when it comes to non-Western locales.

Hence, it seems like we cannot simply pass words into spell-checkers; rather, each one should handle its own concept of a word, which might take us back to the previous design. A question still remains, though: how would we handle cases where two spell-checkers return inconsistent results? For instance:

  • spell-checker-1 might return Range({row: 1, column: 3}, {row: 1, column: 10}) = "incorrect".
  • spell-checker-2 might return Range({row: 1, column: 6}, {row: 1, column: 14}) = "correct".

In that case there's a potential mismatch, so I guess we could either:

  1. Consider the union of the reported ranges to be correct.
  2. Let the spell-checker with the highest priority win.

In the above case it seems like doing 1) would be okay, because it's likely that spell-checker-1 is using a different locale than spell-checker-2, so I see how always favoring correct reports over incorrect ones could make sense. In the original implementation I was concerned about using the multi-integer-range library and how the performance of those intersect queries could have affected the speed of this package, but I believe we can actually use marker-index to perform the same job. The logic might roughly look like the following (pseudocode):

let index = new MarkerIndex
for (let checker of this.checkers) {
  for (let result of checker.check(text)) {
    let intersectingResults = index.findIntersecting(result.range.start, result.range.end)
    for (let intersectingResultId of intersectingResults) {
      let intersectingResult = markers.get(intersectingResultId)
      if (intersectingResult.incorrect && result.correct) {
        // remove intersecting result and add current result
      } else if (intersectingResult.incorrect && result.incorrect) {
        // remove intersecting result, and add union of the two results
      } else if (intersectingResult.correct && result.correct) {
        // remove intersecting result, and add union of the two results
      } else { // intersectingResult.correct && result.incorrect
        // break
      }
    }

    if (intersectingResults.length === 0) {
      // add current result
    }
  }
}

All the findIntersecting queries should be executed in O(log n), so the above routine will take O(mn log n), where m is the number of checkers, and n is the number of results. We should probably perform the translation to bi-dimensional coordinates in a later pass, when we're sure that the marker index contains only valid spell-check results.

What do you think? 💭

@dmoonfire: thanks again for your great work on this (and sorry for revisiting my original suggestions), I really think this is going to be awesome! ✨

@dmoonfire
Contributor Author

@as-cii: I like talking about these things. I want to make sure we cover everything since we all bring in different experiences and skills to the table. In the end, this is all about making a good library, not personal egos. :) I also don't mind refactoring.

As for changing/revisiting opinion, that's why we discuss things. As new information is presented, we adapt.

A question still remains though: how would we handle cases where two spell-checkers return inconsistent results?

The general philosophy is that if one checker says a word is correct, it is correct. A good example is the word "danke", which would be incorrect according to en-US but not de-DE; if you have both checkers, the de-DE checker says it is correct and we go with that. Likewise, the known-words and project checkers only ever identify something as correct.

Originally, I wrote it effectively as a tri-state (correct, incorrect, no opinion). That was to reduce the amount of data passed from the checkers: they passed a list of correct ranges and a list of incorrect ranges, with everything else ignored. If no checker has an opinion about a given range, it is considered correct. This prevents everything from being marked incorrect if you have a bad locale.

In the original implementation I was concerned about using the multi-integer-range library and how the performance of those intersect queries could have affected the speed of this package, but I believe we can actually use marker-index to perform the same job.

Question/Observation 1

It looks like marker-index basically does the same thing as multi-integer-range, but I have a minor concern. Because we do spell-checking inside a task, everything we pass back and forth must be JSON-serializable. From the source, I saw this line, which implies a recursive data structure that wouldn't be JSON-serializable to pass across the boundary. I haven't written this version (obviously); I just want to know if that is going to cause a problem.

One possible fix (if marker-index isn't JSON-serializable) is to use marker-index, then walk through the changes to build up the numeric ranges, which are then passed across the boundary.

Question/Observation 2

With the proposed implementation, when en-US parses a 25k word document (some of my chapters are that long), everything it doesn't say is incorrect (double negative) is correct. We would have to put in a correct marker for those 24k words. We might be able to use a single range, but I couldn't see where marker-index would split a range into two when you insert or union a range; it might be there, I just don't know the library very well. Can it handle saying Range({row: 1, column: 3}, {row: 105, column: 55}) (note the different rows) and then have Range({row: 5, column: 10}, {row: 5, column: 15}) split it into:

  • Range({row: 1, column: 3}, {row: 5, column: 9}) = Correct
  • Range({row: 5, column: 10}, {row: 5, column: 15}) = Incorrect
  • Range({row: 5, column: 16}, {row: 105, column: 55}) = Correct

We would need the correct marking with the above implementation because of the "if any checker says correct, then it's correct" rule, to avoid checker1 saying correct and then checker2 marking it as incorrect. Given that, I'm concerned about memory pressure.

Question/Observation 3

One advantage of multi-integer-range is that it only works on index ranges, which is the output of the checkers. It has relatively low overhead while still allowing set-based operations. The internal implementation is just an array of integer pairs. marker-index, as I read it, is a binary tree, which would have larger overhead for large documents.

Assuming you have a 1k word document with alternating words correct:

  • The marker-index would have 1k regions, half correct and half incorrect, in a binary tree.
  • The multi-integer-range would have the same number of ranges, as integer pairs in two separate arrays.

For a given marker-index range, we would have one Node which has this information:

constructor (parent, leftExtent) {
    this.parent = parent
    this.left = null
    this.right = null
    this.leftExtent = leftExtent
    this.leftMarkerIds = new Set()
    this.rightMarkerIds = new Set()
    this.startMarkerIds = new Set()
    this.endMarkerIds = new Set()
    this.priority = null
    this.id = this.constructor.idCounter++
  }

For each of the same ranges, multi-integer-range would only have [start: number, stop: number].

@jeancroy

jeancroy commented May 5, 2016

I like the idea of propagating the "valid" state, because then the union of spell checker results is roughly equivalent to a spell checker with the union of the dictionaries. However, that equivalence mostly holds only if all the spell checkers tokenize the text the same way.

I'm not sure how I feel about propagating the valid state to a larger area of text, for example with compound words: part of the word may be properly spelled while another part isn't, yet overall the word isn't properly spelled.

Dictionary A: Detects English words, ignores hyphens
Dictionary B: Detects commonly mis-hyphenated words

Sentence: "The South-American rain forest is"

Dictionary A: { "South": valid, "American": valid }
Dictionary B: { "South-American": invalid } // Correction: "South American"

Ideally, the valid state should not propagate.


Or if the above is too much of an edge case...

Sentence: "South-Omerecan"

Dictionary A: { "South": valid, "Omerecan": invalid }
Dictionary B: { "South-Omerecan": invalid } // Correction: "South American"

Ideally, the valid state of "South" should not propagate to "Omerecan" using the invalid state of "South-Omerecan" as a bridge.

@dmoonfire
Contributor Author

I'm not sure how to implement that and still correctly handle "danke cheese" with en-US and de-DE, since each one will mark opposite words as correct/incorrect.

In your examples, with the current implementation, you would have a misspelled wavy line underneath the "-". But the corrections wouldn't work since we pass the full text of the marker ("-") to the checker for suggestions.

One approach is to use priority to determine correct/incorrect. If two checkers share a priority, their results are merged so "correct" overrides; but when a checker of a higher priority says something is incorrect, it is incorrect regardless of what lower-priority checkers say. That would mean a lot would sit at priority 100 (the locale checker) when you are merging multiple languages together. The only way that would work is if your second checker (the hyphenated one) only identified incorrect hyphenated words instead of also trying to do normal language checking. Otherwise, if that checker isn't multi-locale aware, it would incorrectly mark non-English words as incorrect even though the user had de-DE.
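A sketch of that priority rule (all names here are hypothetical, not package code):

# Merge within each priority group as before; the highest-priority
# group that expressed an opinion decides the outcome.
# `mergeGroup` is assumed to return null when no checker in the group
# had an opinion.
resolveByPriority = (resultsByPriority, mergeGroup) ->
  priorities = Object.keys(resultsByPriority).sort (a, b) -> b - a
  for priority in priorities
    merged = mergeGroup resultsByPriority[priority]
    return merged if merged?
  null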

@jeancroy

jeancroy commented May 5, 2016

The above problem exists because some dictionaries consider parts of a word while others consider the compound word.

I don't think any dictionary will consider "danke cheese" a compound word.


@dmoonfire
Contributor Author

Oh, I know that "danke cheese" isn't a compound word. The reason I mentioned it is that checkers provide a text range of what is wrong. In your example, "South-Omerecan":

  • Checker 1 returns 6-14 as incorrect.
  • Checker 2 returns 0-14 as incorrect.

Using my current logic, it would then only mark 6-14 as incorrect, because checker 1 said 0-5 is not incorrect. If I changed that so checker 2 takes priority, we would mark 0-14 as incorrect (which is the right "incorrect" range for this example).

Using "danke cheese".

  • Checker 1 (en-US) returns 0-5 as incorrect (danke is not English).
  • Checker 3 (de-DE) returns 6-12 as incorrect (cheese is not German).

The behavior I'm working toward is that if you use en-US and de-DE at the same time, "danke cheese" should show no incorrect words, because each word is correct for at least one checker.

The output of these checkers is the same, but there is nothing in the output that tells me that the first example (South-Omerecan) should favor the incorrect range while the second example (danke cheese) should merge the results.

I need some signal that indicates when to change the behavior. I have the ranges (as above) and priority (which is currently used to order suggestions). I could add a new property to the checkers that says "when I say incorrect, it is always incorrect" (example 1), but that wouldn't work if checker 2 also did normal language checking (e.g., all of English), because when it encountered "danke" in the second example, it would mark it as incorrect, and its signal says that its "incorrect" is absolute. I would consider that wrong since it breaks example 2.

I don't have a good answer, because we don't have checkers that produce these conflicting results (we could make one, though), and I think the solution would significantly increase the complexity of this PR.

@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch 3 times, most recently from 03a2840 to d400492 on July 7, 2016 03:54
@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch from d400492 to a509817 on July 7, 2016 03:57
SpellChecker = require 'spellchecker'
# This is the task local handler for the manager so we can reuse the manager
# throughout the life of the task.
SpellCheckerManager = require './spell-check-manager.coffee'
Contributor

Can we remove the .coffee extension here?

@as-cii
Contributor

as-cii commented Aug 12, 2016

Hey @dmoonfire, so after looking at the code a second time I think we might merge this. It's a pretty big pull-request and, as with all redesigns, I am somewhat concerned we might introduce some regressions. The work you've done is pretty solid, though, so we might let this bake for a while on master and see if it introduces any issues.

With the bundled spell-checker it works great on my machine. ✨ Is there any package that provides spell-check that I might use to test this out locally?

In the meantime could you also update this pull-request's description to match what we're currently doing in code? (i.e. the description says we're not using a Task, but I think we are).

Apologies for making you wait so long and thanks for the great work on this, looking forward to 🚢ping it!

@dmoonfire
Contributor Author

Thank you.

Yes: spell-check-project provides project-level dictionary support. It is basically the same as known words, but it writes out to a language.json file at the project root, and it handles multiple roots fairly well too. If you include it, go to a misspelled word, and hit Control-:, there will be an item at the bottom to add the word to the project dictionary.

I'll get those changes done during my lunch break.

* Removed ".coffee" from require statements.
* Moved empty array initialization closer to where it was populated.
@dmoonfire dmoonfire force-pushed the plugin-based-spell-check branch from 17fecdc to c228c87 on August 13, 2016 06:05
@as-cii as-cii merged commit c228c87 into atom:master Aug 19, 2016
@as-cii
Contributor

as-cii commented Aug 19, 2016

Thanks @dmoonfire! ⚡ ⚡ ⚡

This will go out in the next beta release.

@73

73 commented Sep 6, 2016

@as-cii Which release will that be? 1.12? (just want to have it desperately...)

@as-cii
Contributor

as-cii commented Sep 6, 2016

Yes, this was released in Atom beta v1.12.0-beta0 and you can already try it out by downloading the latest beta from https://atom.io/beta.

@73

73 commented Sep 6, 2016

@as-cii thanks, will try it now. No luck: got 1.11.0.beta3-0.1.

@neochief

It actually works! Tested in 1.12.0.beta3 and ru-RU locale.

@robert-winkler

Hi, spell-checking and setting locales are still troublesome on Fedora 25, Atom 1.12.6 (#129).
Any idea what could be wrong?
Installing the spell-check-test package fails with /bin/electron: No such file or directory (but it is installed and present!).
OK, let's blame Fedora; I'll try a re-install of all packages related to electron/Atom/node(js).

@recursivelycurious

@dmoonfire - my apologies if this is not the correct place/forum to ask, but I am trying to determine whether the ability to "add to dictionary" should be part of the spell-check package itself, whether it requires an additional plug-in, or whether all that was added to close #10 is the interface that makes it possible for an additional plug-in to add custom words to the dictionary.

If the functionality exists, I might have stumbled across a bug that prevents it from working, or perhaps I am hitting user error or a conflict with additional packages that an update to the README (which I would be happy to make) would fix.

I'd also be open to working to add that functionality if it is still needed and if there is anyone in the community who could mentor me in making a change to an Atom plugin.

@dmoonfire
Contributor Author

The "Add to Dictionary" is actually provided by the individual plugins. For example, the system checker doesn't (we can't add words to the system dictionary) but the known words does (we control that via configuration). Also, my project-based checker also provides words.

To implement it (a minimal sketch follows the list):

  1. Have the plugin return true for the providesAdding property. (Example)
  2. Implement getAddingTargets in the plugin. (Example, project)
  3. Implement add in the plugin. (Example, project)
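A minimal sketch of a checker that supports adding, following those steps (the signatures are illustrative; see the linked examples for the real ones):

class ProjectDictionaryChecker
  # 1. Advertise that this checker can add words.
  providesAdding: -> true

  # 2. Offer one or more targets to show in the suggestion dialog.
  getAddingTargets: (args) ->
    [{label: 'Add to project dictionary'}]

  # 3. Persist the word when the user picks this target.
  add: (args, target) ->
    @knownWords.push args.text
    @writeDictionaryFile()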

As for getting everything working, I'm willing to add my two cents. I have a pretty high workload (a surprise release at work, plus publishing two books), but I can usually answer questions or give suggestions without a problem.

@ethernal

ethernal commented Apr 11, 2017

I am not sure if I should create a Feature Request but:

On Windows 7 the Locale Path actually works, contrary to what the description says (and I'd love it to work on Windows 10 as well).

The problem with Locale Path is that it does not accept relative paths. I used the ZIP distribution of Atom to prepare a portable-only package, but for now I have hard-coded the path to the dictionaries in the plugin settings.

Possible solutions:

  • Allow relative paths in Locale Path
  • Search the .atom folder for dictionaries or a dict subfolder
  • On Windows 10 (not tested yet), have Locale Path take precedence over the OS settings when it is set, or allow us to set the priority (local dictionaries vs. OS dictionaries)

What do you think, @dmoonfire?
