ERRANT v2.2.0
Christopher Bryant committed May 6, 2020
1 parent e1e6066 commit 1a56544
Showing 6 changed files with 33 additions and 57 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

This log describes all the significant changes made to ERRANT since its release.

## v2.2.0 (06-05-20)

1. ERRANT now works with spaCy v2.2. It is 4x slower, but this change was necessary to make it work on Python 3.7.

2. SpaCy 2 uses slightly different POS tags to spaCy 1 (e.g. auxiliary verbs are now tagged AUX rather than VERB) so I updated some of the merging rules to maintain performance.

## v2.1.0 (09-01-20)

1. The character level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of python's native `difflib.SequenceMatcher`. This makes ERRANT 3x faster!
25 changes: 12 additions & 13 deletions README.md
@@ -1,4 +1,4 @@
# ERRANT v2.1.0
# ERRANT v2.2.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

@@ -37,20 +37,23 @@ source errant_env/bin/activate
pip3 install errant
python3 -m spacy download en
```
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.

#### BEA-2019 Shared Task
#### ERRANT and spaCy 2

ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7 however, and so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system to trade speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), this means ERRANT v2.2.0 is **over 4x slower** than ERRANT v2.1.0.

ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores.
There is no way around this if you use Python >= 3.7, but we recommend installing ERRANT v2.1.0 if you use Python < 3.7.
```
pip3 install errant==2.0.0
pip3 install errant==2.1.0
```

#### ERRANT and spaCy 2

ERRANT was originally designed to work with spaCy v1.9.0 and so only officially supports this version. We nevertheless tested ERRANT v2.1.0 with spaCy v2.2.3 and found it to be **over 4x slower and ~2% less accurate**.
#### BEA-2019 Shared Task

This is mainly because spaCy 2 uses a neural system to trade speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), but also because some Universal POS tag mappings changed, and so certain ERRANT rules no longer worked as intended. Although we could offset the accuracy loss by modifying ERRANT rules for the new POS mappings, there is nothing we can do about the significant speed loss, and so do not recommend spaCy 2 with ERRANT at this time.
ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores. You can also use [Codalab](https://competitions.codalab.org/competitions/20228) to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.
```
pip3 install errant==2.0.0
```

## Source Install

@@ -100,10 +103,6 @@ Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and

All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.

#### Runtime

In terms of speed, ERRANT processes ~500 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.

## API

As of v2.0.0, ERRANT now also comes with an API.
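
The version guidance in the README changes above can be condensed into a small helper. This is a minimal sketch and not part of ERRANT: the function name `recommended_errant_version` and its `bea2019` flag are hypothetical, and only the version numbers and the Python 3.7 cut-off come from the README text.

```python
import sys

def recommended_errant_version(python_version=None, bea2019=False):
    """Return the ERRANT version the README suggests, or None if none applies.

    Hypothetical helper for illustration; not part of the ERRANT package.
    """
    # Compare only (major, minor), e.g. (3, 6) or (3, 7).
    major_minor = tuple((python_version or sys.version_info)[:2])
    if bea2019:
        # v2.0.0 matches the BEA-2019 scores but is not compatible with Python >= 3.7.
        return "2.0.0" if major_minor < (3, 7) else None
    # spaCy 1.9.0 does not work on Python >= 3.7, so v2.2.0 is the only option there;
    # on older Pythons the README recommends the faster v2.1.0.
    return "2.2.0" if major_minor >= (3, 7) else "2.1.0"

if __name__ == "__main__":
    print("pip3 install errant==%s" % recommended_errant_version())
```

Returning `None` for the BEA-2019 case on Python >= 3.7 is a design choice of this sketch; it signals that an older Python environment is needed rather than silently suggesting a version that may score differently.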
7 changes: 1 addition & 6 deletions errant/__init__.py
@@ -1,10 +1,9 @@
from importlib import import_module
import logging
import spacy
from errant.annotator import Annotator

# ERRANT version
__version__ = '2.1.0'
__version__ = '2.2.0'

# Load an ERRANT Annotator object for a given language
def load(lang, nlp=None):
@@ -15,10 +14,6 @@ def load(lang, nlp=None):

# Load spacy
nlp = nlp or spacy.load(lang, disable=["ner"])
# Warning for spacy 2
if spacy.__version__[0] == "2":
logging.warning("ERRANT is 4x slower and 2% less accurate with spaCy 2. "
"We strongly recommend spaCy 1.9.0!")

# Load language edit merger
merger = import_module("errant.%s.merger" % lang)
38 changes: 7 additions & 31 deletions errant/en/classifier.py
@@ -2,7 +2,7 @@
import Levenshtein
from nltk.stem import LancasterStemmer
import spacy
import spacy.parts_of_speech as POS
import spacy.symbols as POS

# Load Hunspell word list
def load_word_list(path):
@@ -201,7 +201,7 @@ def get_two_sided_type(o_toks, c_toks):
if o_toks[0].text not in spell and \
o_toks[0].lower_ not in spell:
# Check if both sides have a common lemma
if same_lemma(o_toks[0], c_toks[0]):
if o_toks[0].lemma == c_toks[0].lemma:
# Inflection; often count vs mass nouns or e.g. got vs getted
if o_pos == c_pos and o_pos[0] in {"NOUN", "VERB"}:
return o_pos[0]+":INFL"
@@ -227,7 +227,7 @@ def get_two_sided_type(o_toks, c_toks):

# 3. MORPHOLOGY
# Only ADJ, ADV, NOUN and VERB can have inflectional changes.
if same_lemma(o_toks[0], c_toks[0]) and \
if o_toks[0].lemma == c_toks[0].lemma and \
o_pos[0] in open_pos2 and \
c_pos[0] in open_pos2:
# Same POS on both sides
@@ -316,7 +316,7 @@ def get_two_sided_type(o_toks, c_toks):
if len(set(o_pos+c_pos)) == 1:
# Final verbs with the same lemma are tense; e.g. eat -> has eaten
if o_pos[0] == "VERB" and \
same_lemma(o_toks[-1], c_toks[-1]):
o_toks[-1].lemma == c_toks[-1].lemma:
return "VERB:TENSE"
# POS-based tags.
elif o_pos[0] not in rare_pos:
@@ -328,19 +328,19 @@ def get_two_sided_type(o_toks, c_toks):
# Infinitives, gerunds, phrasal verbs.
if set(o_pos+c_pos) == {"PART", "VERB"}:
# Final verbs with the same lemma are form; e.g. to eat -> eating
if same_lemma(o_toks[-1], c_toks[-1]):
if o_toks[-1].lemma == c_toks[-1].lemma:
return "VERB:FORM"
# Remaining edits are often verb; e.g. to eat -> consuming, look at -> see
else:
return "VERB"
# Possessive nouns; e.g. friends -> friend 's
if (o_pos == ["NOUN", "PART"] or c_pos == ["NOUN", "PART"]) and \
same_lemma(o_toks[0], c_toks[0]):
o_toks[0].lemma == c_toks[0].lemma:
return "NOUN:POSS"
# Adjective forms with "most" and "more"; e.g. more free -> freer
if (o_toks[0].lower_ in {"most", "more"} or \
c_toks[0].lower_ in {"most", "more"}) and \
same_lemma(o_toks[-1], c_toks[-1]) and \
o_toks[-1].lemma == c_toks[-1].lemma and \
len(o_toks) <= 2 and len(c_toks) <= 2:
return "ADJ:FORM"

@@ -369,30 +369,6 @@ def exact_reordering(o_toks, c_toks):
return True
return False

# Input 1: A spacy orig token
# Input 2: A spacy cor token
# Output: Boolean; the tokens have the same lemma
# Spacy only finds lemma for its predicted POS tag. Sometimes these are wrong,
# so we also consider alternative POS tags to improve chance of a match.
def same_lemma(o_tok, c_tok):
# Basic lemmatisation for spacy >= 2 (avoids an error at least)
if spacy.__version__ != "1.9.0":
if o_tok.lemma == c_tok.lemma:
return True
return False
# Multi-POS lemmatisation for spacy 1.9.0
o_lemmas = []
c_lemmas = []
for pos in open_pos1:
# Lemmatise the lower cased form of the word
o_lemmas.append(nlp.vocab.morphology.lemmatize(
pos, o_tok.lower, nlp.vocab.morphology.tag_map))
c_lemmas.append(nlp.vocab.morphology.lemmatize(
pos, c_tok.lower, nlp.vocab.morphology.tag_map))
if set(o_lemmas).intersection(set(c_lemmas)):
return True
return False

# Input 1: An original text spacy token.
# Input 2: A corrected text spacy token.
# Output: Boolean; both tokens have a dependant auxiliary verb.
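
The trade-off behind removing `same_lemma` can be illustrated without spaCy. The sketch below is an assumption-laden toy: `TOY_LEMMAS`, `toy_lemma`, `same_lemma_single` and `same_lemma_multi` are hypothetical names, and the lemma table is invented for illustration only. It contrasts comparing one lemma per predicted POS tag (the new behaviour) with lemmatising under every open-class POS and checking for overlap (the removed behaviour).

```python
# Toy lemma table keyed by (POS, lower-cased word). Purely illustrative:
# in ERRANT the lemmas come from spaCy, not from a dictionary like this.
TOY_LEMMAS = {
    ("NOUN", "saw"): "saw",
    ("VERB", "saw"): "see",
    ("NOUN", "sees"): "sees",
    ("VERB", "sees"): "see",
}
OPEN_POS = ("ADJ", "ADV", "NOUN", "VERB")

def toy_lemma(pos, word):
    # Fall back to the lower-cased word when the table has no entry.
    return TOY_LEMMAS.get((pos, word.lower()), word.lower())

def same_lemma_single(o_word, o_pos, c_word, c_pos):
    # New behaviour: compare the single lemma found for each predicted POS tag.
    return toy_lemma(o_pos, o_word) == toy_lemma(c_pos, c_word)

def same_lemma_multi(o_word, c_word):
    # Removed behaviour: lemmatise under every open-class POS and look for overlap.
    o_lemmas = {toy_lemma(pos, o_word) for pos in OPEN_POS}
    c_lemmas = {toy_lemma(pos, c_word) for pos in OPEN_POS}
    return bool(o_lemmas & c_lemmas)
```

In this toy, if the tagger mislabels "saw" as a noun, `same_lemma_single("saw", "NOUN", "sees", "VERB")` finds no match, whereas `same_lemma_multi("saw", "sees")` still does. The simpler comparison relies on the predicted tag being right.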
10 changes: 5 additions & 5 deletions errant/en/merger.py
@@ -2,11 +2,11 @@
from re import sub
from string import punctuation
import Levenshtein
import spacy.parts_of_speech as POS
import spacy.symbols as POS
from errant.edit import Edit

# Merger resources
open_pos = {POS.ADJ, POS.ADV, POS.NOUN, POS.VERB}
open_pos = {POS.ADJ, POS.AUX, POS.ADV, POS.NOUN, POS.VERB}

# Input: An Alignment object
# Output: A list of Edit objects
@@ -78,11 +78,11 @@ def process_seq(seq, alignment):
return process_seq(seq[:start], alignment) + \
merge_edits(seq[start:end+1]) + \
process_seq(seq[end+1:], alignment)
# Merge same POS or infinitive/phrasal verbs:
# Merge same POS or auxiliary/infinitive/phrasal verbs:
# [to eat -> eating], [watch -> look at]
pos_set = set([tok.pos for tok in o]+[tok.pos for tok in c])
if (len(pos_set) == 1 and len(o) != len(c)) or \
pos_set == {POS.PART, POS.VERB}:
if len(o) != len(c) and (len(pos_set) == 1 or \
pos_set.issubset({POS.AUX, POS.PART, POS.VERB})):
return process_seq(seq[:start], alignment) + \
merge_edits(seq[start:end+1]) + \
process_seq(seq[end+1:], alignment)
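
The effect of the rewritten merge condition is easier to see in isolation. The following is a minimal sketch, assuming POS tags are passed as plain strings rather than the integer `spacy.symbols` IDs the real code compares; `merges_old` and `merges_new` are hypothetical names that mirror the two versions of the `if` test in `process_seq`.

```python
def merges_old(o_pos, c_pos):
    # Condition before this commit: same POS with differing lengths,
    # or exactly the set {PART, VERB} regardless of length.
    pos_set = set(o_pos + c_pos)
    return (len(pos_set) == 1 and len(o_pos) != len(c_pos)) or \
        pos_set == {"PART", "VERB"}

def merges_new(o_pos, c_pos):
    # Condition after this commit: lengths must always differ, and the tags
    # are either all identical or drawn only from AUX, PART and VERB.
    pos_set = set(o_pos + c_pos)
    return len(o_pos) != len(c_pos) and (
        len(pos_set) == 1 or pos_set.issubset({"AUX", "PART", "VERB"}))

if __name__ == "__main__":
    # [eat -> has eaten]: spaCy 2 tags "has" as AUX rather than VERB.
    print(merges_old(["VERB"], ["AUX", "VERB"]), merges_new(["VERB"], ["AUX", "VERB"]))
    # [to eat -> eating]: merged under both conditions.
    print(merges_old(["PART", "VERB"], ["VERB"]), merges_new(["PART", "VERB"], ["VERB"]))
```

Under these assumptions the new condition picks up auxiliary edits such as [eat -> has eaten], which the old one missed once auxiliaries were tagged AUX. It also stops merging PART/VERB sequences of equal length, because the length check now applies to both branches.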
4 changes: 2 additions & 2 deletions setup.py
@@ -10,7 +10,7 @@

setup(
name = "errant",
version = "2.1.0",
version = "2.2.0",
license = "MIT",
description = "The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.",
long_description = readme,
@@ -20,7 +20,7 @@
url = "https://github.com/chrisjbryant/errant",
keywords = ["automatic annotation", "grammatical errors", "natural language processing"],
python_requires = ">= 3.3",
install_requires = ["spacy==1.9.0", "nltk==3.4.5", "python-Levenshtein==0.12.0"],
install_requires = ["spacy>=2.2.0", "nltk==3.4.5", "python-Levenshtein==0.12.0"],
packages = find_packages(),
include_package_data=True,
entry_points = {