PELIC-spelling

Version 1.0
Authors: Ben Naismith, John Starr, Eva Bacas
Contact: bnaismith@pitt.edu

This repo provides information and code about applying spelling correction to the PELIC dataset.

1. Overview

This README.md file introduces the PELIC-spelling repository, which provides information and code for applying spelling correction to the PELIC dataset. To download and find out more about the PELIC dataset, see the PELIC-dataset repository. For information about publications and presentations based on PELIC data, and about the people and parties responsible for the corpus, please visit the Pitt ELI Corpus web page.

Spelling correction is an important element to consider in any corpus study involving learner data. The decision whether to correct texts or not will invariably impact results: in some instances it may be preferable to use the raw text, maintaining its integrity and avoiding an additional layer of processing. However, for other projects, corrected text may provide a more accurate representation of the language features being investigated.

There are three main components to the spelling correction process, each presented in its own Jupyter notebook:

SCOWL_wordlist: In this notebook we decide on a list of what we consider to be real words, using an edited version of the SCOWL wordlists.
PELIC_spelling: In this notebook we create a dataframe of misspellings, apply an automated spelling correction process, and re-incorporate the corrected text into our corpus.
PELIC_spelling_validation: In this notebook we detail a validation of the spell checker. Manual checking of spelling is performed on a sample of PELIC and is then compared to the output of the automated spell checker. The results indicate that the spell checker is highly accurate in terms of the total tokens in PELIC, but conservative, resulting in lower precision. For details, please see the Jupyter notebook.
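As a rough illustration of how a spell checker can be highly accurate over all tokens while still showing lower precision, the sketch below compares manual and automated judgements token by token. The function name, the boolean encoding (True = flagged as misspelled), and all counts are invented for illustration; the validation notebook may define and compute these measures differently.

```python
def validation_metrics(manual, automated):
    """Compare parallel lists of booleans (True = token flagged as misspelled)."""
    pairs = list(zip(manual, automated))
    tp = sum(1 for m, a in pairs if m and a)          # flagged by both
    fp = sum(1 for m, a in pairs if not m and a)      # flagged only by the checker
    fn = sum(1 for m, a in pairs if m and not a)      # flagged only by the annotator
    tn = sum(1 for m, a in pairs if not m and not a)  # flagged by neither
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical sample of 100 tokens (not PELIC results).
manual = [True] * 4 + [False] * 4 + [True] * 2 + [False] * 90
automated = [True] * 4 + [True] * 4 + [False] * 2 + [False] * 90
metrics = validation_metrics(manual, automated)
print(metrics)
```

Because most tokens in any text are spelled correctly, the large number of agreed-upon correct tokens dominates accuracy, while precision depends only on the comparatively few flagged tokens.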

2. Repository contents

The PELIC-spelling repository contains 14 main files:

File	File type	Description
all_names.txt	text	list of over 90,000 names (first and last) from the 1990 US census data, as collected by the names random name generator project
contractions.txt	text	short list of contractions approved as legitimate tokens (not misspellings)
frequency_bigramdictionary_en_243_342.txt	text	bigram frequency dictionary supplied by SymSpell spell correction module
frequency_dictionary_en_82_765.txt	text	frequency dictionary supplied by SymSpell spell correction module
hyphens.txt	text	list of hyphenated words which appear in PELIC and have been approved as legitimate tokens (not misspellings)
PELIC_compiled_spellcorrected.csv	csv	final output of updated `PELIC_compiled.csv` with spelling correction
PELIC_spelling.ipynb	Jupyter notebook	notebook demonstrating how spelling correction is applied to PELIC texts
PELIC-SCOWL.txt	text	a combination of the SCOWL_condensed.txt, contractions.txt, and hyphens.txt lists
README.md	markdown	this file describing the repository
SCOWL_condensed.txt	text	final compiled word list based on SCOWL word lists
SCOWL_supp.txt	text	short list of words manually approved as being legitimate words, e.g. proper names not found in SCOWL
SCOWL_wordlist.ipynb	Jupyter notebook	notebook demonstrating how the SCOWL_condensed word list is created
SCOWL_wordlist.txt	text	the full SCOWL wordlist before condensing
PELIC_spelling_validation.ipynb	Jupyter notebook	manual validation of the spell checker

3. SCOWL wordlist

This notebook produces a definitive list of 'real' words to use when deciding whether a token counts as a word or a non-word. The final output is the SCOWL_condensed.txt file. The primary wordlists are from the SCOWL set of word lists, freely available at http://wordlist.aspell.net/.

The notebook is divided into two main sections:

Exploratory Data Analysis : Here, we examine the various SCOWL dictionaries, which include different language varieties, proper nouns, slang, abbreviations, etc. From this exploration, we opt to include all available dictionaries except the abbreviation dictionaries, because they contain a high number of short strings of letters which may match learner errors. It is possible, however, to include these dictionaries if desired.
Compiling and condensing dictionaries : In the second part of the notebook, SCOWL_condensed is created by combining the various SCOWL dictionaries and then removing duplicates, blanks, and possessives. The final wordlist contains slightly fewer than 500k words.
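The condensing step can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the notebook's actual code: the function name `condense`, the toy lists, and the rule that a possessive is any entry ending in `'s` are all assumptions.

```python
def condense(wordlists):
    """Merge word lists, dropping blanks, possessives, and duplicates."""
    condensed = set()                  # a set removes duplicates automatically
    for words in wordlists:
        for word in words:
            word = word.strip()
            if not word:               # skip blank entries
                continue
            if word.endswith("'s"):    # skip possessives such as "cat's" (assumed rule)
                continue
            condensed.add(word)
    return sorted(condensed)

# Hypothetical toy lists standing in for separate SCOWL dictionaries.
american = ["color", "cat", "cat's", ""]
british = ["colour", "cat", "  "]
print(condense([american, british]))  # ['cat', 'color', 'colour']
```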

4. PELIC spelling

This notebook adds further processing to PELIC_compiled.csv in the PELIC-dataset repo by creating a column of tokens and their parts of speech which have been corrected in terms of spelling.

The notebook is divided into four main sections:

Building a non_words dataframe : We first collect all of the non-words from the PELIC dataset (in PELIC_compiled.csv) by extracting all words which are not found in SCOWL_condensed:

>>> non_words.head()

	tok_lem_POS	sentence	answer_id
0	('beacause', 'beacause', 'NN')	i organized the instructions by time, beacause to make tea people who want to make tea have to follow the instructions step by step.	8
1	('wallmart', 'wallmart', 'NN')	next, you need to buy a box of tea in wallmart or giant eagle.	11
2	('dovn', 'dovn', 'NN')	first, you should take some hot water, you can use dovn, mircowave or other ways.	13
3	('mircowave', 'mircowave', 'VBP')	first, you should take some hot water, you can use dovn, mircowave or other ways.	13
4	('paragragh', 'paragragh', 'NN')	every paragragh's instructions depend on a main idea.	16
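A minimal sketch of this lookup, assuming each text is stored as a list of (token, lemma, POS) tuples and the wordlist is held in a set. The notebook works with pandas dataframes; the plain-Python structures, the function name `find_non_words`, and the rule for skipping punctuation and numbers are illustrative assumptions.

```python
def find_non_words(rows, wordlist):
    """Return one record per token occurrence that is absent from the wordlist."""
    records = []
    for row in rows:
        for tok, lem, pos in row["tok_lem_POS"]:
            word = tok.lower()
            # Ignore tokens with no letters, e.g. punctuation and numbers (assumption).
            if not any(ch.isalpha() for ch in word):
                continue
            if word not in wordlist:
                records.append({
                    "tok_lem_POS": (word, lem.lower(), pos),
                    "answer_id": row["answer_id"],
                })
    return records

# Toy input: a shortened text and a tiny stand-in for SCOWL_condensed.
rows = [{
    "answer_id": 16,
    "tok_lem_POS": [("Every", "every", "DT"), ("paragragh", "paragragh", "NN"), (".", ".", ".")],
}]
wordlist = {"every"}
non_words = find_non_words(rows, wordlist)
print(non_words)
```

Holding the wordlist in a set keeps each membership test at roughly constant time, which matters when the list has several hundred thousand entries.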

Building a dataframe of misspellings and their frequencies : In the non-words dataframe above, each row is an occurrence of a misspelling (i.e. tokens). We then create a dataframe where each row is a misspelling type with frequency information attached:

>>> misspell_df.sample(5)

Index	misspelling	tok_lem_POS	freq
9164	spel	('spel', 'spel', 'VB')	1
5495	invesigate	('invesigate', 'invesigate', 'VB')	1
3645	estmatied	('estmatied', 'estmatied', 'JJ')	1
9313	straigten	('straigten', 'straigten', 'VB')	1
8455	hobbys	('hobbys', 'hobbys', 'NN')	2
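Moving from occurrences (tokens) to types with frequencies amounts to counting identical entries. The sketch below uses `collections.Counter` on a toy list; the variable names are illustrative, and the notebook itself presumably does the equivalent with pandas.

```python
from collections import Counter

# Toy list of non-word occurrences: one (token, lemma, POS) tuple per occurrence.
occurrences = [
    ("hobbys", "hobbys", "NN"),
    ("spel", "spel", "VB"),
    ("hobbys", "hobbys", "NN"),
]

# Count identical tuples, then build one row per misspelling type.
freq = Counter(occurrences)
misspell_table = [
    {"misspelling": tlp[0], "tok_lem_POS": tlp, "freq": count}
    for tlp, count in freq.most_common()  # most frequent types first
]
for row in misspell_table:
    print(row)
```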

Applying spelling correction : Having collected and organized the misspellings, we then correct these occurrences using SymSpell. In SymSpell, complete sentence context is not considered, only bigrams and frequencies. Though this is not ideal, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy: frequency-based criteria for suggestions, without considering co-text beyond bigrams. As such, it is important to remember that the corrected tokens will not be 100% accurate, and this limitation must be taken into consideration when using them.

>>> print(non_words2[['answer_id','misspelling','sentence','final_correction_POS']].sample(5))
# Sample of 5 rows and key columns

answer_id	misspelling	sentence	final_correction_POS
11487	('celemony', 'celemony', 'NN')	Third, the ANON_NAME_0-Ju international movie celemony is opened in my hometown.	('ceremony', 'ceremony', 'NN')
13444	('miliion', 'miliion', 'NN')	200 miliion people	('million', 'million', 'NN')
17707	('korian', 'korian', 'JJ')	Korian pizza is healthier than American pizza.	('korean', 'korean', 'JJ')
35162	('grammer', 'grammer', 'NN')	Although my grammer was not impeccable, they could usually understand what I meant.	('grammar', 'grammar', 'NN')
10839	('comunity', 'comunity', 'NN')	Second, truth make our comunity be truthable sociaty.	('community', 'community', 'NN')
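To make the frequency-and-bigram criterion concrete, here is a deliberately simplified, dependency-free sketch. It is not SymSpell's algorithm or API: it generates candidates one edit away from the misspelling and ranks them by an optional bigram count, then by unigram frequency. All function names and counts are invented for illustration.

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, unigrams, bigrams=None, next_word=None):
    """Pick the known candidate with the best (bigram count, unigram count)."""
    if word in unigrams:               # already a known word: leave it alone
        return word
    candidates = [c for c in edits1(word) if c in unigrams]
    if not candidates:                 # nothing close enough: keep the original
        return word

    def score(candidate):
        bigram_count = 0
        if bigrams and next_word:
            bigram_count = bigrams.get((candidate, next_word), 0)
        return (bigram_count, unigrams[candidate])

    return max(candidates, key=score)

# Invented toy counts, not taken from the SymSpell dictionaries.
unigrams = {"real": 50, "really": 80, "nice": 40, "because": 100}
bigrams = {("real", "nice"): 5}

print(correct("becuase", unigrams))
print(correct("realy", unigrams))
print(correct("realy", unigrams, bigrams, next_word="nice"))
```

With unigram counts alone the toy data favours really for realy, but once the bigram count is consulted the choice flips to real, which illustrates how limited context can lead to a plausible but unintended correction.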

Incorporating corrections into pelic_df : Finally, these corrected tokens are incorporated back into pelic_df, creating a new tok_lem_POS column for easy comparison to the original texts. Below is an example of an original and corrected text:

>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
[(('My', 'my', 'PRP$'), ('friend', 'friend', 'NN'), ('is', 'be', 'VBZ'), ('realy', 'realy', 'JJ'), ('nise', 'nise', 'RB'), ('guy', 'guy', 'NN'), ('.', '.', '.'), ('I', 'i', 'PRP'), ('like', 'like', 'VBP'), ('hem', 'hem', 'JJ'), ('becuase', 'becuase', 'NN'), ('he', 'he', 'PRP'), ('is', 'be', 'VBZ'), ('friendlly', 'friendlly', 'RB'), ('and', 'and', 'CC'), ('lovliy', 'lovliy', 'NN'), ('.', '.', '.'))]

>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('real', 'JJ'), ('nice', 'RB'), ('guy', 'NN'), ('.', '.'), ('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('friendly', 'RB'), ('and', 'CC'), ('lovely', 'NN'), ('.', '.')]

We can see here that many appropriate corrections have been made, including becuase -> because, nise -> nice, friendlly -> friendly, and lovliy -> lovely. Importantly, misspellings that happen to be real words, e.g. hem (which should be him in this case), are not corrected. In addition, because only limited context is considered, there will be some inaccuracies, e.g. realy -> real rather than really (real nice is a frequent bigram).
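The re-incorporation step can be pictured as a lookup over each text's tokens. The sketch below assumes a dictionary mapping each misspelling to its chosen correction and, as the sample output suggests, carries the original POS tag over unchanged; the function name and data structures are illustrative rather than the notebook's own.

```python
def apply_corrections(tokens, corrections):
    """Swap in corrected spellings, keeping each token's original POS tag."""
    corrected = []
    for tok, _lemma, pos in tokens:
        # Tokens absent from the mapping (including real-word errors) pass through.
        fixed = corrections.get(tok.lower(), tok)
        corrected.append((fixed, pos))
    return corrected

# Tokens taken from the example text above; the mapping is a toy stand-in.
tokens = [
    ("I", "i", "PRP"), ("like", "like", "VBP"), ("hem", "hem", "JJ"),
    ("becuase", "becuase", "NN"), ("he", "he", "PRP"),
]
corrections = {"becuase": "because"}
print(apply_corrections(tokens, corrections))
# [('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP')]
```

Since hem is itself a dictionary word, it never enters the list of misspellings and so passes through untouched.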

Overall, spelling correction is an important resource, as it allows for more accurate tracking of what learners may have been intending to write. For example, learners may know a word in every sense except for its spelling. However, as with any automated text manipulation, the added layer of processing can introduce errors into the data, and this must be considered carefully when drawing conclusions from the data.

5. Licenses

PELIC license:
PELIC dataset by Alan Juffs, Na-Rae Han, Ben Naismith is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on a work at https://github.com/ELI-Data-Mining-Group/PELIC-dataset.

SCOWL license: SCOWL Copyright and License Agreement

Copyright 2000-2011 by Kevin Atkinson

Permission to use, copy, modify, distribute and sell these word lists, the associated scripts, the output created from the scripts, and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kevin Atkinson makes no representations about the suitability of this array for any purpose. It is provided "as is" without express or implied warranty.
