Whitelist and Blacklist Documentation
This page documents the creation of the Philter Whitelist and Blacklist as of 09/24/18 (the end of algorithm development after the philter_gamma release and before the philter_delta release).
We assembled a Whitelist of ‘safe’ words using lists of terms and phrases from various medical term libraries and common word lists. First, medical terms from the NLM UMLS Lexicon (Source 1), the NLM MeSH thesaurus (Source 2), the MeSH scope notes (Source 2), the MeSH tree file for biomedical data (Source 3), the MedlinePlus health topic records (Source 4), and medical terms and definitions from SNOMED CT US (Source 5) were added to the Whitelist. Two lists of common medical abbreviations (Source 6) were also added. FDA-approved drug names, dosages, and active ingredients (Source 7) and a list of ICD9 diagnosis names and codes (Source 8) were added as well. Next, lists of the 20,000 most common English words (Source 9) and the 1,000 most common verbs and their tenses (Source 10) were added to the Whitelist.
- Source 1: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexicon/2017/release/LEX/LEXICON
- Source 2: https://www.nlm.nih.gov/mesh/
- Source 3: ftp://nlmpubs.nlm.nih.gov/online/mesh/2017/
- Source 4: https://medlineplus.gov/xml.html
- Source 5: https://www.nlm.nih.gov/healthit/snomedct/us_edition.html
- Source 6: https://nurseslabs.com/medical-terminologies-abbreviations-listcheat-sheet/
- Source 7: https://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm
- Source 8: https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
- Source 9: https://raw.githubusercontent.com/first20hours/google-10000-english/master/20k.txt
- Source 10: https://www.worldclasslearning.com/english/five-verb-forms.html
Next, ~1000 terms that produced false positives when the Whitelist was run in our pipeline were added to the Whitelist. Lastly, all single letters and all numeric and alphanumeric words were removed from the Whitelist. All terms added to the final Whitelist were tokenized on whitespace and symbols and converted to lowercase.
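The normalization described above can be sketched roughly as follows. This is an illustrative sketch only, not the Philter code: the function names (`tokenize`, `keep_token`, `build_whitelist`) are hypothetical, "symbols" is assumed to mean any non-letter, non-digit character, "alphanumeric words" is assumed to mean tokens containing at least one digit, and filtering is assumed to happen after tokenization.

```python
import re

# Assumption: "whitespace and symbols" means any run of characters that are
# not letters or digits (underscore is treated as a symbol too).
_SPLIT_PATTERN = re.compile(r"[\W_]+")


def tokenize(phrase):
    """Split a term or phrase on whitespace and symbols, then lowercase."""
    return [tok.lower() for tok in _SPLIT_PATTERN.split(phrase) if tok]


def keep_token(token):
    """Return False for single letters and for numeric/alphanumeric tokens."""
    if len(token) == 1 and token.isalpha():
        return False  # single letter
    if any(ch.isdigit() for ch in token):
        return False  # purely numeric, or letters mixed with digits
    return True


def build_whitelist(term_lists):
    """Merge several lists of terms into one set of normalized tokens."""
    whitelist = set()
    for terms in term_lists:
        for phrase in terms:
            whitelist.update(tok for tok in tokenize(phrase) if keep_token(tok))
    return whitelist


if __name__ == "__main__":
    # Hypothetical sample entries standing in for the real source lists.
    sample = [["Vitamin B12", "X-ray"], ["Crohn's disease"]]
    print(sorted(build_whitelist(sample)))
    # prints ['crohn', 'disease', 'ray', 'vitamin']
```

In this sketch, "B12" is dropped because it contains digits, and the "X" in "X-ray" and the "s" in "Crohn's" are dropped as single letters once the phrase is split on symbols.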