support kenlm models and surprisal from them #14
…nized text, e.g. whitespace for `kenlm`
…ete CustomEncoding implementation.
… want that? maybe add an option to show but default to disabling it? do we also want bos?
@@ -17,6 +17,7 @@ plotext = "^5.0.2"
 matplotlib = "^3.5.2"
 pandas = "^1.4.3"
 openai = "^0.23.0"
+kenlm = {version = "^0.2.0", optional = true}
we don't want to force kenlm as a dependency; it should only be installed if people need it
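One way to keep an optional dependency out of the import path is a guarded import that fails with an actionable message. The following is a minimal sketch, not code from this PR: the helper name `require_optional` and the `extra` label are hypothetical, and it assumes the package is otherwise importable without kenlm.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with an actionable message.

    `extra` is a hypothetical label for an install extra; it is only used
    in the error text and is not taken from this repository.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is an optional dependency; install it "
            f"(for example via a '{extra}' extra) to use this feature."
        ) from err
```

A model class could then call `require_optional("kenlm", "kenlm")` inside its constructor, so users who never touch n-gram models never need the package.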
    accum += [m.BaseScore(st1, w, st2)]
    st1, st2 = st2, st1
if eos:
    accum += [m.BaseScore(st1, "</s>", st2)]
this part should maybe be made false by default, since it generates a score for EOS, a convention inconsistent with how surprisal is reported for huggingface models
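To make the suggestion concrete, here is a hedged sketch of the scoring loop with `eos` defaulting to `False`. It is not the PR's implementation. The function name `ngram_surprisals` and the `new_state` parameter are hypothetical, conditioning on `<s>` via `BeginSentenceWrite` is an assumption (the BOS question is still open in this PR), and the conversion assumes `BaseScore` returns a log10 probability, as in the kenlm Python bindings.

```python
import math
from typing import Callable, List, Optional, Sequence


def ngram_surprisals(
    model,
    words: Sequence[str],
    eos: bool = False,  # off by default, so no score is emitted for </s>
    new_state: Optional[Callable[[], object]] = None,
) -> List[float]:
    """Per-word surprisal in bits from a kenlm-style model (sketch)."""
    if new_state is None:
        import kenlm  # optional dependency, imported lazily

        new_state = kenlm.State
    st1, st2 = new_state(), new_state()
    # Assumption: condition on <s> as context without scoring it.
    model.BeginSentenceWrite(st1)
    log10_probs = []
    for w in words:
        log10_probs.append(model.BaseScore(st1, w, st2))
        st1, st2 = st2, st1  # the output state becomes the next input state
    if eos:
        log10_probs.append(model.BaseScore(st1, "</s>", st2))
    # Convert log10 probabilities to surprisal in bits.
    return [-lp * math.log2(10) for lp in log10_probs]
```

With `eos=False` the result has one value per input word, which lines up with per-token surprisal lists from other model classes.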
 from transformers import tokenization_utils_base


-def pick_matching_token_ixs(
+def hf_pick_matching_token_ixs(
such a method is not necessary for ngrams, I believe, but need to check how punctuation gets tokenized:
In [10]: [ce] = k.tokenize('hello, my name')
In [11]: ce.tokens
Out[11]: ('hello', ',', 'my', 'name')
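For comparison, the difference between plain whitespace splitting and a split that separates punctuation can be sketched with the standard library. This is only an illustration of the two behaviors, not the tokenizer used by `k.tokenize`; the helper name `split_words_and_punct` is hypothetical.

```python
import re


def split_words_and_punct(text: str) -> list:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Plain whitespace splitting keeps the comma attached to the word.
whitespace_tokens = "hello, my name".split()

# Regex splitting separates it, matching the shape of the output above.
punct_tokens = split_words_and_punct("hello, my name")
```

Which of the two an n-gram model expects depends on how its training corpus was tokenized, so this needs checking against the model in use.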
merging this as it doesn't change any existing behavior; it only adds a new implementation to support the kenlm model class. merging even though we have a few TODOs to address.
in this PR we add support for KenLM models using the KenLM python bindings. note that because KenLM is complicated to install, we don't enforce it as a requirement for the repo, but it should be installed by anyone who wants to use this library for inference with KenLM n-gram models through the `kenlm` python interface