This is a repository for neural language models (LMs) trained on a large corpus of source code and a toolkit to work with such models.
Features:
- Autocompletion and bug prediction with the pre-trained models we provide;
- Using the pre-trained models as a starting point for transfer learning or further fine-tuning;
- Training a model from scratch with a choice of many corpus pre-processing and training options.
This project uses the fastai and PyTorch libraries for NN training and inference. Corpus preprocessing is done with giganticode-dataprep.
- Python version >= 3.6 is required.
To install from PyPI:
pip install giganticode-langmodels
Alternatively, to build from source:
git clone https://github.com/giganticode/langmodels
cd langmodels
python -m venv langmodels-venv
source langmodels-venv/bin/activate
pip install -r requirements.txt
The library is no longer tested under Windows, but most of the functionality is expected to work.
>>> import langmodels.repository as repo
>>> trained_model = repo.load_default_model()
20...
To see which models are available, call the `list_pretrained_models` function.
Set the `cached` parameter to `True` (default is `False`) to display only cached LMs (e.g. when offline).
>>> import langmodels.repository as repo
>>> repo.list_pretrained_models(cached=False)
<BLANKLINE>
ID BPE_MERGES LAYERS_CONFIG ARCH BIN_ENTROPY TRAINING_TIME_MINUTES_PER_EPOCH N_EPOCHS BEST_EPOCH SIZE_ON_DISK_MB TAGS
<BLANKLINE>
langmodel-large-split_10k_2_1024_191007. 10k 1024/2/1024=27726250 AWD_LSTM 2.1455788479 1429 6 5 350 ['BEST', 'DEFAULT']
112241_-_langmodel-large-split_10k_2_102
4_191022.141344_new
langmodel-large-split_10k_1_512_190926.1 10k 512/1/512=0 AWD_LSTM 2.69019493253 479 9 8 91 ['MEDIUM']
20146_new
langmodel-small-split-reversed_10k_1_512 10k 512/1/512=7180977 GRU 4.249997138977051 2 100 97 51 ['BEST_SMALL']
_200117.095729
langmodel-small-split_10k_1_512_190906.1 10k 512/1/512=0 AWD_LSTM 4.73768141172 4 19 18 84 ['TINY']
54943_new
dev_10k_1_10_190923.132328_new 10k 10/1/10=7172 AWD_LSTM 9.15688191092 0 0 -1 1 ['RANDOM']
<BLANKLINE>
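For example, when working offline you can list only the models that are already in the local cache (output is omitted here because it depends on what has been downloaded):

>>> repo.list_pretrained_models(cached=True)  # doctest: +SKIP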
Use the `query_all_models` method to get a list of `ModelDescription` objects:
>>> import langmodels.repository as repo
>>> repo.query_all_models()[0]
ModelDescription(id='langmodel-large-split_10k_2_1024_191007.112241_-_langmodel-large-split_10k_2_1024_191022.141344_new', bpe_merges='10k', layers_config='1024/2/1024=27726250', arch='AWD_LSTM', bin_entropy=2.1455788479, training_time_minutes_per_epoch=1429, n_epochs=6, best_epoch=5, size_on_disk_mb=350, tags=['BEST', 'DEFAULT'])
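Since these are plain objects, they can be filtered with ordinary Python. For instance, the one-liner below is a small sketch that relies only on the `id` and `tags` fields visible in the repr above and picks the ids of all models carrying a given tag:

>>> [m.id for m in repo.query_all_models() if 'BEST' in m.tags]  # doctest: +SKIP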
A model can be loaded by tag or by id.
To load a model to CPU even when a CUDA-capable GPU is available, set the `force_use_cpu` parameter to `True`
(the default is `False`). If no CUDA-capable GPU is available, this parameter is disregarded.
>>> trained_model = repo.load_model_with_tag('BEST')
2...
>>> trained_model = repo.load_model_by_id('dev_10k_1_10_190923.132328_new', force_use_cpu=True)
2...
You can also use a lower-level API to load a model by path:
>>> import os
>>> from langmodels import project_dir
>>> path_to_model = os.path.join(project_dir, 'data', 'models', 'dev_10k_1_10_190923.132328')
>>> trained_model = repo.load_from_path(path_to_model)
2...
Example
>>> import langmodels.repository as repo
>>> trained_model = repo.load_default_model()
2...
>>> trained_model.feed_text('public static main() { if', extension='java')
# this does not change the state of the model:
>>> trained_model.predict_next_full_token(n_suggestions=5)
[('(', 0.67...), (',', 0.23...), ('{', 0.016...), ('new', 0.01...), ('}', 0.01...)]
# adding more context:
>>> trained_model.feed_text('(', extension='java')
>>> trained_model.predict_next_full_token(n_suggestions=3)
[('(', 0.15...), ('1', 0.14...), ('setLength', 0.03...)]
# resetting the state of the model (make it forget the context)
>>> trained_model.reset()
>>> trained_model.predict_next_full_token(n_suggestions=5)
[('new', 0.05...), ('.', 0.04...), ('this', 0.04...), ('*', 0.01...), ('gle', 0.01...)]
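The calls above can also be chained into a simple greedy completion loop. The following is a minimal sketch rather than a library feature; joining the predicted tokens with spaces is a crude approximation of real code layout.

# Sketch (not part of the library): greedy multi-token completion built from
# the feed_text / predict_next_full_token calls shown above.
def complete_greedily(model, context, n_tokens=5, extension='java'):
    model.reset()                                        # start from a clean state
    model.feed_text(context, extension=extension)
    tokens = []
    for _ in range(n_tokens):
        token, probability = model.predict_next_full_token(n_suggestions=1)[0]
        tokens.append(token)
        model.feed_text(token, extension=extension)      # extend the context
    return ' '.join(tokens)                              # naive layout

For example, `complete_greedily(trained_model, 'public static void main', n_tokens=3)` returns a short continuation of the given snippet.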
An LM can be used to calculate cross-entropies for each line of a file. High values can give an idea about unusual/suspicious chunks of code [[1]](#1).
See the [LM Evaluation](#lm-evaluation) section to learn how to calculate cross-entropy for a project/file/string,
and check our VS Code plugin for highlighting suspicious code.
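As a rough illustration of this idea, the sketch below scores each line of a file separately and returns the most surprising lines first. It is not a library API: it reuses `evaluate_on_string` from the LM Evaluation section below and assumes its result exposes a `total()` method with an 'Entropy' key, as `EvaluationResult` does there.

# Sketch (assumed API, see above): rank the lines of a file by cross-entropy.
from langmodels.evaluation import evaluate_on_string

def rank_lines_by_entropy(model, path):
    scores = []
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue                                  # skip blank lines
            model.reset()                                 # score each line independently
            entropy = evaluate_on_string(model, line).total()['Entropy']
            scores.append((entropy, line_number, line.rstrip()))
    return sorted(scores, reverse=True)                   # most "surprising" lines first

To train a model from scratch on the small example corpus bundled with the repository, using the default training configuration: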
>>> import os
>>> from langmodels import project_dir
>>> path_to_corpus = os.path.join(project_dir, 'data', 'dev')
>>> from langmodels.training.training import train
>>> from langmodels.lmconfig.datamodel import *
>>> train(LMTrainingConfig(corpus=Corpus(path=path_to_corpus))) # doctest: +SKIP
More parameters to customize corpus pre-processing, NN architecture, and the training process can be specified:
>>> import os
>>> from langmodels import project_dir
>>> path_to_corpus = os.path.join(project_dir, 'data', 'dev')
>>> from langmodels.training.training import train
>>> from langmodels.lmconfig.datamodel import *
>>> train(LMTrainingConfig(corpus=Corpus(path=path_to_corpus), prep_function=PrepFunction(options=PrepFunctionOptions(no_com=False, no_unicode=True)), arch=GruArch(n_layers=2), training=Training(schedule=RafaelsTrainingSchedule(max_epochs=1)))) # doctest: +SKIP
Below you can see all the default parameters specified explicitly:
>>> import os
>>> from langmodels import project_dir
>>> path_to_corpus = os.path.join(project_dir, 'data', 'dev')
>>> from langmodels.lmconfig.datamodel import *
>>> from langmodels.util.cuda import DeviceOptions
>>> from langmodels.training.training import train
>>> import sys
>>> train(LMTrainingConfig(base_model=None, bs=32, corpus=Corpus(path=path_to_corpus, extensions="java"), prep_function=PrepFunction(corpus_api.bpe, ['10k'], PrepFunctionOptions(no_com=False, no_unicode=True, no_spaces=True, max_str_length=sys.maxsize)), arch=LstmArch(bidir=False, qrnn=False, emb_sz=1024, n_hid=1024, n_layers=3, drop=Dropouts(multiplier=0.5, oute=0.02, outi=0.25, outh=0.15, w=0.2, out=0.1), tie_weights=True, out_bias=True), bptt=200, training=Training(optimizer=Adam(betas=(0.9, 0.99)), sub_epochs=SubEpochs(50000), gradient_clip=0.3, activation_regularization=ActivationRegularization(alpha=2., beta=1.), schedule=RafaelsTrainingSchedule(init_lr=1e-4, mult_coeff=0.5, patience=0, max_epochs=1, max_lr_reduction_times=6), weight_decay=1e-6)), device_options=DeviceOptions(fallback_to_cpu=True), comet=False)
2...
<langmodels.model.TrainedModel object at ...
Training can be run from the command line simply by running the `train` command and passing the path to a config in JSON format
via the `--config` param. To override values in the JSON file (or the default values if `--config` is not specified),
use the `--patch` param.
langmodels train --config="/path/to/json/config.json" --patch="bs=64,arch.drop.multiplier=3.0"
If neither `--config` nor `--patch` is specified, training runs with the default parameters.
The JSON config mirrors the structure of `LMTrainingConfig` with the default parameter values shown above.
Most likely, you will have to override at least the `corpus.path` value.
For more options, run:
langmodels train --help
Langmodels provides an API to evaluate the performance of a language model on a given string, file, or corpus.
>>> from langmodels.evaluation import evaluate_on_string, evaluate_on_file, evaluate_on_path
>>> from pathlib import Path
>>> import tempfile
# Resetting the model's state to make evaluation reproducible
>>> trained_model.reset()
# Evaluate on a string
>>> evaluate_on_string(trained_model, 'import java.lang.collections;')
{'n_samples': 7, 'Entropy': 12.2...}
# Evaluate on a file
>>> file = Path(project_dir) / 'data' / 'dev' / 'valid' / 'StandardDataTypeEmitter.java'
>>> evaluate_on_file(trained_model, file)
{'n_samples': 1528, 'Entropy': 22.9...}
# Evaluate on a corpus
>>> path = Path(project_dir) / 'data' / 'dev' / 'valid'
>>> output_path = Path(tempfile.TemporaryDirectory().name)
>>> evaluate_on_path(trained_model, path, save_to=output_path)
2...
{'n_samples': 1647, 'Entropy': 23.2...}
Evaluation on a big corpus can take a lot of time; therefore, the evaluation result data is saved to disk.
The path to the evaluation data can be specified with the `save_to` parameter. The data can be loaded as follows:
>>> from langmodels.evaluation import EvaluationResult
>>> evaluation = EvaluationResult.from_path(output_path)
For flexibility, you can use the pandas DataFrame API to manipulate the evaluation result data:
`EvaluationResult` is simply a wrapper around a `DataFrame`, which can be accessed via the `data` property:
>>> evaluation.data
n_samples example Entropy
TokenType SubtokenNumber Project
ClosingBracket 1 StandardDataTypeEmitter.java 126 )</t> 29.8...
ClosingCurlyBracket 1 StandardDataTypeEmitter.java 22 }</t> 8.7...
Identifier 1 StandardDataTypeEmitter.java 169 write</t> 11.1...
2 StandardDataTypeEmitter.java 220 sin|k</t> 25.4...
3 StandardDataTypeEmitter.java 24 construct|or|Factory</t> 46.8...
4 StandardDataTypeEmitter.java 28 visit|or|Type|Arguments</t> 64.5...
5 StandardDataTypeEmitter.java 57 em|it|Parameter|ized|TypeName</t> 80.7...
6 StandardDataTypeEmitter.java 2 Standard|Data|Type|E|mit|ter</t> 107.9...
7 StandardDataTypeEmitter.java 8 em|it|Base|Class|And|Inter|faces</t> 131.5...
KeyWord 1 StandardDataTypeEmitter.java 69 for</t> 10.5...
MultilineComment 1 Licence.java 57 /</t> 11.2...
StandardDataTypeEmitter.java 87 /</t> 10.8...
2 Licence.java 32 th|e</t> 31.0...
StandardDataTypeEmitter.java 42 ad|t</t> 30.8...
3 Licence.java 19 li|mit|ations</t> 48.9...
StandardDataTypeEmitter.java 22 em|it|ter</t> 48.3...
4 Licence.java 10 L|ic|en|se</t> 61.6...
StandardDataTypeEmitter.java 10 L|ic|en|se</t> 61.6...
5 StandardDataTypeEmitter.java 1 Data|Type|E|mit|ter</t> 76.8...
NonCodeChar 1 StandardDataTypeEmitter.java 55 @</t> 3.6...
One 1 StandardDataTypeEmitter.java 1 1</t> 9.9...
OpeningBracket 1 StandardDataTypeEmitter.java 126 (</t> 10.2...
OpeningCurlyBracket 1 StandardDataTypeEmitter.java 22 {</t> 10.4...
Operator 1 StandardDataTypeEmitter.java 252 .</t> 9.7...
Semicolon 1 StandardDataTypeEmitter.java 119 ;</t> 9.4...
SpecialToken 1 Licence.java 1 <EOF></t> 15.8...
StandardDataTypeEmitter.java 1 <EOF></t> 14.8...
StringLiteral 1 StandardDataTypeEmitter.java 9 "."</t> 11.6...
2 StandardDataTypeEmitter.java 11 "\n|\n"</t> 10.1...
3 StandardDataTypeEmitter.java 7 " |{|\n"</t> 31.4...
4 StandardDataTypeEmitter.java 3 " |implement|s| "</t> 41.7...
5 StandardDataTypeEmitter.java 9 " |{|\|n|\n"</t> 61.8...
7 StandardDataTypeEmitter.java 5 " | | |@|Overrid|e|\n"</t> 79.2...
8 StandardDataTypeEmitter.java 4 "|Gener|ating| |data| |type| "</t> 101.2...
9 StandardDataTypeEmitter.java 1 " | | |Result|Type| |_|case|("</t> 105.7...
10 StandardDataTypeEmitter.java 3 " | | |v|o|id| |_|case|("</t> 117.2...
11 StandardDataTypeEmitter.java 2 " | | |public| |Result|Type| |_|case|("</t> 136.6...
12 StandardDataTypeEmitter.java 1 " | | |public| |v|o|id| |_|case|("</t> 137.8...
13 StandardDataTypeEmitter.java 1 "|Gener|ating| |multi|ple| |construct|or|s| |f... 179.9...
15 StandardDataTypeEmitter.java 3 " | | |prot|ected| |abstr|act| |Result|Type... 185.1...
16 StandardDataTypeEmitter.java 2 " | | |prot|ected| |abstr|act| |v|o|id| |_|... 194.6...
17 StandardDataTypeEmitter.java 1 " |x|)| |{| |_|default|(|x|)|;| |}|\|n|\n"</t> 243.7...
19 StandardDataTypeEmitter.java 1 " |x|)| |{| |return| |_|default|(|x|)|;| |}|\|... 269.8...
23 StandardDataTypeEmitter.java 1 "\n|\|n| | |public| |abstr|act| |<|Result|Typ... 299.4...
Zero 1 StandardDataTypeEmitter.java 1 0</t> 11.4...
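Because `data` is an ordinary pandas DataFrame, the usual DataFrame operations apply. For instance, a quick way (a sketch; output omitted) to pull out the highest-entropy rows:

>>> evaluation.data.sort_values('Entropy', ascending=False).head(3)  # doctest: +SKIP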
Alternatively, `EvaluationResult` provides `aggregate()` and `total()` methods to look at the data along specific dimensions:
>>> evaluation.aggregate(['TokenType']).data
n_samples example Entropy
TokenType
ClosingBracket 126 )</t> 29.8...
ClosingCurlyBracket 22 }</t> 8.7...
Identifier 508 em|it|Base|Class|And|Inter|faces</t> 32.0...
KeyWord 69 for</t> 10.5...
MultilineComment 280 /</t> 25.6...
NonCodeChar 55 @</t> 3.6...
One 1 1</t> 9.9...
OpeningBracket 126 (</t> 10.2...
OpeningCurlyBracket 22 {</t> 10.4...
Operator 252 .</t> 9.7...
Semicolon 119 ;</t> 9.4...
SpecialToken 2 <EOF></t> 15.3...
StringLiteral 64 " |{|\n"</t> 73.8...
Zero 1 0</t> 11.4...
>>> evaluation.total()
{'n_samples': 1647, 'Entropy': 23.2...}
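Aggregation over several dimensions at once should work the same way; the call below is a sketch and assumes `aggregate()` accepts any combination of the index levels shown above:

>>> evaluation.aggregate(['TokenType', 'SubtokenNumber']).data  # doctest: +SKIP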
When evaluation is done on a file or a string, the line of each token and its position within the line are saved by default. The version of LM-Powered that is currently under development uses this information to visualize entropies for each token.
>>> from langmodels.evaluation import evaluate_on_string
>>> evaluation = evaluate_on_string(trained_model, 'import java.lang.collections;')
>>> evaluation.data
n_samples example Entropy
TokenType SubtokenNumber LinePosition TokenPosition
Identifier 1 0 1 1 java</t> 11.2...
3 1 lang</t> 13.0...
2 0 5 1 collection|s</t> 27.5...
KeyWord 1 0 0 1 import</t> 11.5...
NonCodeChar 1 0 2 1 .</t> 11.3...
4 1 .</t> 3.7...
Semicolon 1 0 7 1 ;</t> 7.0...
Evaluation can be customized by passing an `EvaluationOptions` object with the desired metrics and characteristics.
You can also specify `n_processes` to use for pre-processing and the `batch_size` to be used for inference:
>>> from langmodels.evaluation import *
>>> evaluate_on_path(trained_model, path, save_to=output_path, batch_size=3, n_processes=1, evaluation_options=EvaluationOptions(metric_names=['Entropy'], characteristics=[TokenType()]))
2...
>>> evaluation = EvaluationResult.from_path(output_path)
>>> evaluation.data
n_samples example Entropy
TokenType
ClosingBracket 126 )</t> 29.8...
ClosingCurlyBracket 22 }</t> 8.7...
Identifier 508 type|Arguments</t> 32.0...
KeyWord 69 for</t> 10.5...
MultilineComment 280 /</t> 25.6...
NonCodeChar 55 .</t> 3.6...
One 1 1</t> 9.9...
OpeningBracket 126 (</t> 10.2...
OpeningCurlyBracket 22 {</t> 10.4...
Operator 252 .</t> 9.7...
Semicolon 119 ;</t> 9.4...
SpecialToken 2 <EOF></t> 15.3...
StringLiteral 64 " | | |prot|ected| |abstr|act| |Result|Type... 73.8...
Zero 1 0</t> 11.4...