support kenlm models and surprisal from them #14

Merged
merged 10 commits into from
Nov 8, 2023

aalok-sathe
Owner

@aalok-sathe aalok-sathe commented Nov 1, 2023

in this PR we add support for KenLM models using the KenLM Python bindings. note that because KenLM is complicated to install, we don't enforce it as a requirement for the repo; it only needs to be installed if someone wants to use this library to do inference with KenLM n-gram models through the kenlm Python interface

@aalok-sathe aalok-sathe marked this pull request as ready for review November 8, 2023 16:07
@@ -17,6 +17,7 @@ plotext = "^5.0.2"
matplotlib = "^3.5.2"
pandas = "^1.4.3"
openai = "^0.23.0"
kenlm = {version = "^0.2.0", optional = true}
Owner Author

we don't want to force kenlm as a dependency; it should only be installed if people need it
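
One way to keep an optional dependency from breaking the package at import time is to defer the import until the feature is actually used. The following is a minimal sketch of that pattern, not the code in this PR; `require_optional` and `load_kenlm` are hypothetical helper names, and the error message wording is an assumption.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    Hypothetical helper for illustration; not part of this repository.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is an optional dependency and is not installed. "
            f"{install_hint}"
        ) from err


def load_kenlm() -> ModuleType:
    # Deferred import: importing the package itself never touches kenlm,
    # so users without KenLM only see an error if they ask for n-gram models.
    return require_optional(
        "kenlm",
        "Install the KenLM Python bindings to use n-gram models.",
    )
```

With this shape, only code paths that call `load_kenlm()` need the bindings; everything else keeps working when kenlm is missing.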

accum += [m.BaseScore(st1, w, st2)]
st1, st2 = st2, st1
if eos:
    accum += [m.BaseScore(st1, "</s>", st2)]
Owner Author

this should maybe default to False, since it generates a score for EOS, which is a convention inconsistent with how surprisal is computed for huggingface models
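
For context, the state-swapping loop in the hunk above can be sketched as a standalone function. This is an illustration under stated assumptions, not the PR's implementation: the function name `ngram_surprisals`, the `new_state` parameter, starting from the beginning-of-sentence context via `BeginSentenceWrite`, and the conversion to bits are all choices made here. It assumes `BaseScore` returns a log10 probability, as the KenLM Python bindings do, and it sets `eos=False` by default as suggested in the comment above.

```python
import math
from typing import Any, Callable, List, Sequence


def ngram_surprisals(
    model: Any,
    new_state: Callable[[], Any],
    words: Sequence[str],
    eos: bool = False,
) -> List[float]:
    """Per-word surprisal in bits from a KenLM-style model.

    `model` is expected to expose BeginSentenceWrite(state) and
    BaseScore(in_state, word, out_state), like kenlm.Model;
    `new_state` is a zero-argument factory such as kenlm.State.
    """
    st1, st2 = new_state(), new_state()
    model.BeginSentenceWrite(st1)  # assumption: score from the <s> context

    log10_probs = []
    for w in words:
        # BaseScore reads the context from st1 and writes the next one to st2.
        log10_probs.append(model.BaseScore(st1, w, st2))
        st1, st2 = st2, st1  # reuse the two state objects
    if eos:
        # Off by default so the output has one value per input word.
        log10_probs.append(model.BaseScore(st1, "</s>", st2))

    # surprisal = -log2 p = -log10 p * log2(10)
    return [-lp * math.log2(10) for lp in log10_probs]
```

With kenlm installed, a call would look like `ngram_surprisals(kenlm.Model(path), kenlm.State, ["hello", ",", "my", "name"])`, where `path` points to an ARPA or binary model file of your own.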

from transformers import tokenization_utils_base


-def pick_matching_token_ixs(
+def hf_pick_matching_token_ixs(
Owner Author

I believe such a method is not necessary for n-grams, but we still need to check how punctuation gets tokenized:

In [10]: [ce] = k.tokenize('hello, my name')

In [11]: ce.tokens
Out[11]: ('hello', ',', 'my', 'name')
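
To make the punctuation question concrete, here is a small sketch that contrasts plain whitespace splitting with a split that separates punctuation into its own tokens. The regex is only a stand-in chosen for illustration; it is not the tokenizer behind `k.tokenize`, and `split_words_and_punct` is a hypothetical name.

```python
import re

# Runs of word characters, or any single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_words_and_punct(text: str) -> tuple:
    """Split text so each punctuation mark becomes a separate token."""
    return tuple(_TOKEN_RE.findall(text))


# Whitespace splitting keeps the comma attached to the word.
whitespace_tokens = "hello, my name".split()

# The regex split yields the comma as its own token.
punct_tokens = split_words_and_punct("hello, my name")
```

If the n-gram model was trained on text where punctuation is a separate token, scoring `hello,` as one word would most likely hit the unknown-word path, so the two conventions need to match.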

@aalok-sathe
Owner Author

merging this as it doesn't change any existing behavior; it only adds a new implementation to support the kenlm model class. merging even though we have a few TODOs left to address.
