U-RaNER: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition

This repository contains the code for the winning system (DAMO-NLP) of the SemEval 2023 MultiCoNER II shared task, which ranked first in 9 out of 13 tracks. [Rankings], [Paper].

U-RaNER is a unified retrieval-augmented system. We use both the Wikipedia and Wikidata knowledge bases to build our retrieval module, and we explore an infusion approach to make more context visible to the model.

Guide

Quick Start

Datasets

To make the code easier to run, please download our pre-processed datasets.

datasets: TBD


Trained Models

Since our test-phase submission involved 100+ trained models, we only release the trained models for the English (monolingual) and Multilingual tracks.

ckpts: TBD


Training on MultiCoNER II Datasets

adaseq train -c examples/SemEval2023_MultiCoNER_II/configs/wiki2048/bn.yaml
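
To train on another language track, point -c at the corresponding config file (assuming the wiki2048 directory contains one YAML per language code), e.g.:

adaseq train -c examples/SemEval2023_MultiCoNER_II/configs/wiki2048/en.yaml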

Experiment Results

| Method | Data | bn | de | en | es | fa | fr | hi | it | pt | sv | uk | zh | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Official Baseline | orig | 1.07 | 64.61 | 36.97 | 49.07 | 41.28 | 41.39 | 2.89 | 43.13 | 39.85 | 69.22 | 62.08 | 48.46 | 41.67 |
| BERT-CRF | orig | 77.06 | 73.17 | 60.68 | 65.04 | 59.40 | 61.41 | 83.80 | 71.12 | 63.94 | 68.40 | 65.71 | 72.60 | 68.53 |
| RaNER | wiki128 | 89.12 | 76.78 | 71.32 | 68.24 | 76.76 | 74.61 | 88.78 | 83.43 | 76.70 | 77.06 | 78.26 | 75.84 | 78.08 |
| U-RaNER | wiki2048 | 91.24 | 82.41 | 81.86 | 80.82 | 81.25 | 83.48 | 90.85 | 87.66 | 83.42 | 82.06 | 83.27 | 78.94 | 83.94 |

Config File

You can find all the config files here.

Building U-RaNER System

Retrieval Augmentation Module

Following [wang-etal-2022-damo], we build a multilingual KB based on the Wikipedia dumps of the 12 languages to search for related documents. We download the latest (2022.10.21) version of the Wikipedia dump from Wikimedia and convert it to plain text. For the detailed processing procedure, please refer to [KB-NER].
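
As a rough illustration, a BM25 lookup over the processed Wikipedia text could look like the following sketch. The index name `wiki_en`, the field `text`, and the Elasticsearch 8.x Python client are illustrative assumptions, not this repository's actual setup:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Hypothetical BM25 retrieval over a pre-built index of Wikipedia paragraphs.
# The index name "wiki_en" and the field "text" are illustrative assumptions.
es = Elasticsearch("http://localhost:9200")

def retrieve(query, k=10):
    resp = es.search(index="wiki_en", query={"match": {"text": query}}, size=k)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

print(retrieve("Lionel Messi joined Paris Saint-Germain")[:2])
```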

In addition, we explore enhancing our retrieval system with Wikidata, a free and entity-centric knowledge base. Every Wikidata entity has a page consisting of a label, several aliases, descriptions, and one or more entity types. The detailed entity linking procedure can be found in our paper.
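
For illustration, such an entity page can be fetched from the public Wikidata API as in the sketch below. This is a generic `wbgetentities` call, not our entity linking pipeline; Q42 is just an example id:

```python
import requests

# Fetch label, aliases, description, and entity types ("instance of", P31)
# for one Wikidata entity. Q42 (Douglas Adams) is only an example id.
def fetch_entity(qid, lang="en"):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid,
                "languages": lang, "format": "json"},
    ).json()
    ent = resp["entities"][qid]
    return {
        "label": ent["labels"][lang]["value"],
        "aliases": [a["value"] for a in ent.get("aliases", {}).get(lang, [])],
        "description": ent["descriptions"][lang]["value"],
        "types": [c["mainsnak"]["datavalue"]["value"]["id"]
                  for c in ent.get("claims", {}).get("P31", [])],
    }

print(fetch_entity("Q42"))
```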


Named Entity Recognition Module

We use xlm-roberta-large (XLM-R) [conneau-etal-2020-unsupervised] as the initial checkpoint. The token representations are fed into a CRF layer to obtain the conditional probability of the tag sequence, and the model is trained by maximizing this conditional probability, i.e., minimizing the cross-entropy loss. Additionally, we adopt the infusion approach of [lewis2020retrieval] for better information interaction, which provides a more extensive contextual view to the model and thus enables better utilization of the retrieved context. Please refer to our code for more details.
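
A minimal sketch of this encoder-plus-CRF architecture, assuming PyTorch with the transformers and pytorch-crf packages (a simplified stand-in, not our AdaSeq implementation):

```python
import torch.nn as nn
from transformers import XLMRobertaModel
from torchcrf import CRF  # pip install pytorch-crf

class XlmrCrfTagger(nn.Module):
    def __init__(self, num_tags):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Token representations from XLM-R, projected to per-tag emission scores.
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # CRF log-likelihood of the gold tags; negating it gives the loss,
            # so minimizing the loss maximizes the conditional probability.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)
```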


Multi-stage Fine-tuning

[wang-etal-2022-damo] experimentally demonstrates that multi-stage fine-tuning (MSF) can leverage the annotations from all tracks, thereby improving performance and accelerating training. In addition, we observe that inconsistent training set sizes across language tracks can also degrade model performance. We use an increasing batch size and an upsampling strategy to address this issue; a sketch of the upsampling idea follows the table below.

| Method | bn | de | en | es | fa | fr | hi | it | pt | sv | uk | zh | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RaNER w/ batch 4 | 82.02 | 80.82 | 85.60 | 88.46 | 85.27 | 87.53 | 86.56 | 89.80 | 87.26 | 89.77 | 89.17 | 68.59 | 85.07 |
| RaNER w/ batch 128 | 88.09 | 83.23 | 85.87 | 89.40 | 85.59 | 88.18 | 89.57 | 91.84 | 88.97 | 90.01 | 88.97 | 72.11 | 86.82 |
| RaNER w/ scaling | 90.82 | 86.27 | 85.86 | 89.88 | 86.15 | 88.70 | 90.99 | 91.50 | 89.24 | 90.85 | 88.95 | 75.71 | 87.91 |
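
As a rough illustration of the upsampling strategy, smaller tracks can be replicated until every language contributes a comparable number of examples per epoch. This is a simplified sketch under that assumption, not our exact sampling code:

```python
import random

def upsample(datasets):
    """Replicate smaller language tracks up to the size of the largest one.

    datasets: dict mapping language code -> list of training examples.
    """
    target = max(len(data) for data in datasets.values())
    balanced = {}
    for lang, data in datasets.items():
        copies, remainder = divmod(target, len(data))
        # Whole copies plus a random top-up so every track reaches `target`.
        balanced[lang] = data * copies + random.sample(data, remainder)
    return balanced
```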

Ensemble Module

The ensemble module ranks all spans in the predictions by the number of votes in descending order and selects the spans that receive more than 50% of the votes for the final prediction.

python -m tools.experiment.ensemble
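
The core voting logic can be sketched as follows. This is a simplified stand-in for tools.experiment.ensemble, assuming each prediction is a collection of (start, end, type) spans produced by one model:

```python
from collections import Counter

def ensemble(predictions):
    # Count, for each span, how many models predicted it.
    votes = Counter(span for pred in predictions for span in set(pred))
    threshold = len(predictions) / 2
    # Rank spans by votes (descending) and keep those with > 50% of the votes.
    return [span for span, count in votes.most_common() if count > threshold]
```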

Citation

@inproceedings{tan2023damonlp,
      title={DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition},
      author={Zeqi Tan and Shen Huang and Zixia Jia and Jiong Cai and Yinghui Li and Weiming Lu and Yueting Zhuang and Kewei Tu and Pengjun Xie and Fei Huang and Yong Jiang},
      year={2023},
      eprint={2305.03688},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}