There are several models built into the baseline, including BiLSTM, Transformer and ConvNet approaches. The documentation here covers these models and their expected results, along with notes on the API design and how to write your own models.
Our baseline tagger is a state-of-the-art model architecture that supports flexible embeddings, including contextual embeddings and one or more sets of pre-trained word embeddings. It has been shown that character-level modeling is important in deep models for capturing the morpho-syntactic structure needed for tagging tasks.
The code uses both word embeddings and character-compositional word embeddings. For character-level processing, a character-vector depth is selected along with a word-vector depth.
The character-level embeddings are based on Character-Aware Neural Language Models (Kim, Jernite, Sontag and Rush) and dos Santos 2014 (though the latter's tagging model is quite different). Unlike dos Santos' approach, one or more parallel filters are applied during the convolution (as in the Kim approach). Unlike the Kim approach, residual connections between like-sized filters are used, and, since they improve performance for tagging, word vectors are used as well.
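To make the character-compositional idea concrete, here is a minimal PyTorch sketch of parallel filters with max-over-time pooling. The class and parameter names are hypothetical, and the residual connections over like-sized filters described above are omitted for brevity:

```python
import torch
import torch.nn as nn

class CharConvWordEmbedding(nn.Module):
    """Illustrative character-compositional word embedding (hypothetical,
    not the library's actual class): parallel conv filters over character
    embeddings, max-over-time pooled and concatenated."""

    def __init__(self, num_chars, char_dim=30, filter_widths=(1, 2, 3, 4, 5), feats_per_width=30):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, feats_per_width, w, padding=w // 2) for w in filter_widths]
        )

    def forward(self, chars):
        # chars: (batch, num_words, num_chars_per_word) integer indices
        B, T, W = chars.shape
        x = self.char_embed(chars.reshape(B * T, W)).transpose(1, 2)  # (B*T, char_dim, W)
        # Each parallel filter produces one feature map, pooled over time
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1).view(B, T, -1)  # (B, T, total feature maps)
```

The pooled character features would then be concatenated with the word vectors before the encoder.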
Twitter is a challenging data source for tagging problems. The TweetNLP project includes hand-annotated POS data. The original taggers used for this task are described here. The baseline they compared their algorithm against achieved 83.38% accuracy, and their final model, with many custom features, reached 89.37% accuracy. Our simple BLSTM baseline below, with no custom features and a very coarse approach to compositional character-to-word modeling, significantly outperforms this.
To run our default model:
```
mead-train --config config/twpos.json
```
To change the backend between TensorFlow and PyTorch, just change the `backend` parameter (to `tensorflow` or `pytorch`).
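For example, a config could be switched to PyTorch with a few lines of Python. This is a minimal sketch assuming `backend` is a top-level key in the JSON config, as in the stock configs:

```python
import json

# Load an existing mead config and flip it to the PyTorch backend
# (assumes "backend" is a top-level key, as in the stock configs)
with open("config/twpos.json") as f:
    config = json.load(f)

config["backend"] = "pytorch"  # or "tensorflow"

with open("config/twpos-pyt.json", "w") as f:
    json.dump(config, f, indent=2)
```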
For NER, we report an F1 score at each validation pass, and use F1 for early stopping.
For tasks that require global coherency, like NER tagging, it has been shown that using a linear-chain CRF to learn a transition matrix between label states, in conjunction with the RNN output, improves performance. A simple technique that performs nearly as well is to replace the CRF with a coarse constrained decoder at inference time that simply checks for IOBES violations and effectively zeros out those transitions, as sketched below.
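The masking idea can be sketched in a few lines of Python (an illustrative helper, not the library's implementation):

```python
import numpy as np

def iobes_transition_mask(labels):
    """Build a 0/1 mask over label-to-label transitions that zeros out
    moves which violate the IOBES scheme. `labels` is a list of tag
    strings like ["O", "B-PER", "I-PER", "E-PER", "S-LOC", ...]."""
    n = len(labels)
    mask = np.ones((n, n))
    for i, prev in enumerate(labels):
        for j, cur in enumerate(labels):
            p_tag, _, p_type = prev.partition("-")
            c_tag, _, c_type = cur.partition("-")
            # I-x or E-x may only follow B-x or I-x of the same type
            if c_tag in ("I", "E") and not (p_tag in ("B", "I") and p_type == c_type):
                mask[i, j] = 0.0
            # B-x, S-x and O may not follow a dangling B-x or I-x
            if p_tag in ("B", "I") and c_tag in ("B", "S", "O"):
                mask[i, j] = 0.0
    return mask  # multiply into transition scores before greedy decoding
```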
Our default mead configuration is SoTA among models without contextual embeddings or co-training, and can be run with this command:
```
mead-train --config config/conll.json
```
- We find (along with most other researchers) that IOBES (BMES) on average outperforms BIO (IOB2). We also find that IOB2 outperforms IOB1 (the original format in which the data is provided). Note that these details do not change the model itself; they just train it in a way that seems to help it learn better, and when comparing implementations it is important to note which of the three formats is used. A sketch of the conversion is shown below.
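For reference, here is a minimal sketch of the IOB2-to-IOBES conversion (an illustrative helper, not the library's implementation):

```python
def iob2_to_iobes(tags):
    """Convert one sentence's IOB2 tags to IOBES. Singleton spans become
    S- tags and span-final I- tags become E- tags."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # A B- not continued by an I- of the same type is a singleton
            out.append("S-" + tag[2:] if nxt != "I-" + tag[2:] else tag)
        elif tag.startswith("I-"):
            # An I- not continued by an I- of the same type ends the span
            out.append("E-" + tag[2:] if nxt != "I-" + tag[2:] else tag)
        else:
            out.append(tag)
    return out

# e.g. iob2_to_iobes(["B-PER", "I-PER", "O", "B-LOC"])
#      -> ["B-PER", "E-PER", "O", "S-LOC"]
```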
Our constrained decoder can be run with the following command:
```
mead-train --config config/conll-no-crf.json
```
WNUT17 is a task that deals with NER for rare and emerging entities on Twitter. Scores are typically much lower on this task than for traditional NER tagging.
Our default model provides a strong baseline for WNUT17, and can be trained with this command:
```
mead-train --config config/wnut.json
```
Our constrained decoder model can be run as follows:
```
mead-train --config config/wnut-no-crf.json
```
OntoNotes 5.0 is an NER dataset that is larger than CONLL2003 and contains more entity types. The dataset was created from OntoNotes 5.0 using the train/dev/test splits from version 4 of the CONLL2012 shared task.
Our default CRF model can be run with the following:
```
mead-train --config config/ontonotes.json
```
SNIPS NLU is a dataset created by SNIPS for benchmarking chatbot systems. It includes intent detection and slot filling; this is the slot-filling model, and it can be trained with the following:
```
mead-train --config config/snips.json
```
We have done extensive testing on our tagger architecture. Here are the results after running each model 10 times per configuration. All NER models are trained with IOBES tags. The ELMo configurations use our ELMo Embeddings addon.
Config | Dataset | Model | Metric | Mean | Std | Min | Max |
---|---|---|---|---|---|---|---|
twpos.json | twpos-v03 | CNN-BLSTM-CRF | acc | 90.75 | 0.140 | 90.53 | 91.02 |
conll.json | CONLL2003 | CNN-BLSTM-CRF | f1 | 91.47 | 0.247 | 91.15 | 92.00 |
conll-no-crf.json | CONLL2003 | CNN-BLSTM-constrain | f1 | 91.44 | 0.232 | 91.17 | 91.90 |
conll-elmo.json | CONLL2003 | CNN-BLSTM-CRF | f1 | 92.26 | 0.157 | 92.00 | 92.48 |
wnut.json | WNUT17 | CNN-BLSTM-CRF | f1 | 40.33 | 1.13 | 38.38 | 41.99 |
wnut-no-crf.json | WNUT17 | CNN-BLSTM-constrain | f1 | 40.59 | 1.06 | 37.96 | 41.71 |
ontonotes.json | ONTONOTES | CNN-BLSTM-CRF | f1 | 87.41 | 0.166 | 87.14 | 87.74 |
snips.json | SNIPS | CNN-BLSTM-CRF | f1 | 96.04 | 0.28 | 95.39 | 96.35 |
You can use `tag-text.py` to load a sequence tagger checkpoint and predict labels for new text.
Loss functions are defined by the decoders. If a CRF is used, the loss is the CRF loss averaged over the number of examples in the mini-batch. When using a word-level loss, the loss is the sum of the cross-entropy loss of each token, averaged over the number of examples in the mini-batch. Both of these are batch-level losses.
When reporting the loss every nsteps, for the CRF the loss is the total CRF loss divided by the number of examples in the last nsteps worth of mini-batches. For the word-level loss it is the total word-level loss divided by the number of examples in the last nsteps worth of batches.
The epoch loss is the total loss averaged over the total number of examples in the epoch.
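As a sketch, the word-level batch loss could be computed as follows in PyTorch (the tensor shapes are assumptions; the CRF variant would analogously divide the summed CRF negative log-likelihood by the batch size):

```python
import torch.nn.functional as F

def word_level_batch_loss(logits, labels, mask):
    """Sum of per-token cross entropy, averaged over the examples in the
    mini-batch. Assumed shapes: logits (B, T, C), labels (B, T), and a
    0/1 float padding mask (B, T)."""
    batch_size = logits.size(0)
    # cross_entropy expects (B, C, T) input with (B, T) targets
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # (B, T)
    return (per_token * mask).sum() / batch_size
```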
Accuracy is computed on the token level and F1 is computed on the span level.
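Here is a minimal sketch of span-level F1 over IOBES tags (illustrative helpers, not the library's metric code):

```python
def iobes_spans(tags):
    """Extract (type, start, end) spans from one IOBES tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "S"):
            start = i
        if prefix in ("S", "E") and start is not None:
            spans.append((label, start, i))
            start = None
    return spans

def span_f1(gold, pred):
    """Micro F1 over exact span matches across sentences, the usual
    CoNLL-style metric. gold and pred are lists of tag sequences."""
    g = {(k,) + s for k, sent in enumerate(gold) for s in iobes_spans(sent)}
    p = {(k,) + s for k, sent in enumerate(pred) for s in iobes_spans(sent)}
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```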
There is a base interface `TaggerModel` defined for taggers inside of baseline/model.py, and a specific framework-dependent sub-class `TaggerModelBase` that implements that interface and also implements the base layer class defined by the underlying deep-learning framework (a `Model` in Keras, a `Module` in PyTorch). The `TaggerModelBase` fulfills the `create()` function required by the base interface (`TaggerModel`), but leaves an abstract method for creation of the layers, `create_layers()`, and an abstract forward method (`forward()` for PyTorch and `call()` for Keras).
While this interface provides lots of flexibility, its sub-class `AbstractEncoderTaggerModel(TaggerModelBase)` provides much more utility and structure for tagging, and is typically the best place to extend from when writing your own tagger.
The `TaggerModelBase` follows a single basic pattern for modeling tagging, consisting of 4 basic steps:
- embeddings
- encoder (transduction)
- projection to logits space (the number of labels)
- decoder (typically a constrained greedy decoder or a CRF)
All of the MEAD-Baseline tagger models reuse steps 1 and 3 and define their own encoders by overriding the `init_encode()` method. These hooks are called from the concrete implementation of `create_layers()`, and the forward method is implemented by a simple flow that executes these layers. A sketch of a custom encoder follows.
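Here is a hedged sketch of what a custom encoder might look like; the `register_model` import paths, the `init_encode()` signature, and the `output_dim` contract shown here are assumptions for illustration, not the library's exact API:

```python
import torch.nn as nn
from baseline.model import register_model                       # assumed import path
from baseline.pytorch.tagger import AbstractEncoderTaggerModel  # assumed import path

class Transpose12(nn.Module):
    """Swap time and feature dims so Conv1d sees (B, C, T)."""
    def forward(self, x):
        return x.transpose(1, 2)

@register_model(task="tagger", name="my-conv")
class ConvTaggerModel(AbstractEncoderTaggerModel):
    """Swap the transducer for a single conv layer; the embeddings, the
    final projection and the decoder are all inherited unchanged."""

    def init_encode(self, **kwargs):
        # Assumed contract: return a module mapping (B, T, insz) -> (B, T, hsz)
        # that advertises its width via .output_dim for the projection layer
        insz = self.embeddings.output_dim  # assumption: stacked-embeddings width
        hsz = int(kwargs.get("hsz", 200))
        encoder = nn.Sequential(
            Transpose12(), nn.Conv1d(insz, hsz, 3, padding=1), nn.ReLU(), Transpose12()
        )
        encoder.output_dim = hsz
        return encoder
```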
Most taggers can be composed by providing the encoder layer and wiring in a set of embeddings (usually a combination of word features and character-compositional word features concatenated together) and a decoder (typically a `GreedyTaggerDecoder` or a `CRF`). For example, a CNN-BiLSTM-CRF can be composed by providing:
- embeddings via concatenation of `LookupTableEmbeddingsModel` and `CharConvEmbeddingsModel`
- an encoder, e.g. `RNNTaggerModel` (for fine-tuned models this is typically a pass-through)
- the default linear projection
- a decoder via `CRF`
For fine-tuning a language model like BERT, the typical approach is to remove the head from the model (i.e., the final linear projection out to the vocabulary and the normalizing softmax) and to put a final linear projection to the output number of classes in its place. In MEAD-Baseline, the headless LM is loaded as an embedding object and passed into an `EmbeddingsStack`, so the encoder itself should be a pass-through. Here is an example of fine-tuning BERT:
```
mead-train --config config/conll-bert.json
```
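For intuition, here is a minimal PyTorch sketch of the head-replacement idea (`headless_bert` and the module interface are hypothetical; in MEAD-Baseline the headless LM actually arrives as an embedding object inside an `EmbeddingsStack`):

```python
import torch.nn as nn

class BertForTagging(nn.Module):
    """Sketch of fine-tuning: a headless encoder (no vocab projection or
    softmax) followed by a fresh linear projection to the label space."""

    def __init__(self, headless_bert, hidden_dim, num_labels):
        super().__init__()
        self.encoder = headless_bert   # hypothetical headless LM module
        self.proj = nn.Linear(hidden_dim, num_labels)

    def forward(self, tokens):
        hidden = self.encoder(tokens)  # (B, T, hidden_dim) contextual states
        return self.proj(hidden)       # (B, T, num_labels) logits per token
```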
- Do not override `create()` from the base class unless absolutely necessary; `create_layers()` is the typical extension point for `TaggerModelBase`
- Use the mead layers (8 mile) API to define your layers
  - This will minimize any incompatibility and make it easy to switch frameworks later
- Use `AbstractEncoderTaggerModel` for simple models by overriding the `init_encode()` method. If you can do this, avoid overriding `create_layers()`