
Whitelist and Blacklist Documentation

Kathleen Muenzen edited this page Sep 27, 2018 · 1 revision

This page documents the creation of the Philter Whitelist and Blacklist as of 09/24/18, which marks the end of algorithm development after the philter_gamma release and before the philter_delta release.

Whitelist

Current length: 195253

1. Whitelist with all original components (whitelist_061418.json)

We assembled a Whitelist of ‘safe’ words using lists of terms and phrases from various medical term libraries and common word lists. First, medical terms from the NLM UMLS Lexicon (Source 1), the NLM MeSH thesaurus (Source 2), the MeSH scope notes (Source 2), the MeSH tree file for biomedical data (Source 3), the MedlinePlus health topic records (Source 4), and medical terms and definitions from SNOMED CT US (Source 5) were added to the Whitelist. Two lists of common medical abbreviations were then added (Source 6), followed by FDA-approved drug names, dosages, and active ingredients (Source 7) and a list of ICD9 diagnosis names and codes (Source 8). Finally, lists of the 20k most common English words (Source 9) and the 1k most common verbs and their tenses (Source 10) were added to the Whitelist.
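The assembly step amounts to taking the union of several term lists. The following is a minimal sketch of that idea, not the code used to build whitelist_061418.json; the function name `build_whitelist`, the in-memory list inputs, and the choice to serialize as a sorted JSON array are all assumptions made for illustration.

```python
import json


def build_whitelist(sources):
    """Merge several term lists into one de-duplicated set.

    `sources` is an iterable of term lists (hypothetical stand-ins for
    the medical libraries and common-word lists described above).
    """
    whitelist = set()
    for terms in sources:
        for term in terms:
            term = term.strip()
            if term:  # skip empty lines left over from parsing
                whitelist.add(term)
    return whitelist


# Toy inputs standing in for two of the source lists.
medical_terms = ["aspirin", "fever ", "hypertension"]
common_words = ["fever", "the", ""]

merged = build_whitelist([medical_terms, common_words])

# Assumed output layout: a sorted JSON array of terms.
serialized = json.dumps(sorted(merged))
```

Using a set keeps each term once no matter how many sources contain it, which matters when the libraries overlap heavily.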

2. FP addition and cleaning round 1 (whitelist_plus_fps-3_cleaned.json)

Next, ~1000 false positives generated by the Whitelist in our pipeline were added to the Whitelist. Lastly, all single letters and all numeric and alphanumeric words were removed from the Whitelist. All terms added to the final Whitelist were tokenized on whitespace and symbols and converted to lowercase.
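The cleaning round can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the helper names are hypothetical, "symbols" is taken to mean any non-alphanumeric character, and "alphanumeric words" is taken to mean tokens that contain at least one digit.

```python
import re


def tokenize(term):
    """Lowercase a term and split it on whitespace and symbols."""
    # [\W_]+ matches runs of anything that is not a letter or digit.
    return [tok for tok in re.split(r"[\W_]+", term.lower()) if tok]


def is_removable(token):
    """True for single letters and for tokens containing any digit
    (assumed reading of 'numeric and alphanumeric words')."""
    single_letter = len(token) == 1 and token.isalpha()
    has_digit = any(ch.isdigit() for ch in token)
    return single_letter or has_digit


def clean_whitelist(terms, false_positives=()):
    """Add false positives, tokenize everything, and drop unwanted tokens."""
    cleaned = set()
    for term in list(terms) + list(false_positives):
        for token in tokenize(term):
            if not is_removable(token):
                cleaned.add(token)
    return cleaned


example = clean_whitelist(
    ["Type-2 Diabetes", "Vitamin B12", "a", "X-ray"],
    false_positives=["Follow-up"],
)
```

In this toy input, "2", "b12", "a", and "x" are dropped, while "type", "diabetes", "vitamin", "ray", "follow", and "up" are kept.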

Blacklist

Current length: 241745