We prompt ChatGPT to generate an instruction-following dataset for NER. The dataset comprises 45,889 input-output pairs, encompassing 240,725 entities and 13,020 distinct entity types. It covers entity types from a wide range of domains, from the general domain (e.g., Person) to the clinical domain (e.g., Medical Condition). Moreover, we observe variations in granularity among the entity types: for instance, County is a subset of Location, and Input Device is a subset of Product. This extensive coverage of entity types makes the dataset well suited for distilling capabilities from LLMs across various domains. The data construction prompt and a breakdown of entity types by frequency are shown in the tables below:
Data Construction Prompt |
---|
System Message: You are a helpful information extraction system.<br>Prompt: Given a passage, your task is to extract all entities and identify their entity types. The output should be in a list of tuples of the following format: [("entity 1", "type of entity 1"), ... ].<br>Passage: {input_passage} |
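For illustration, here is a minimal sketch of how a passage could be run through this prompt with the OpenAI Python client; the model name, decoding settings, and output parsing are our assumptions rather than the released pipeline.

```python
# Hypothetical sketch of the construction loop (model choice, temperature,
# and parsing are assumptions, not the paper's exact pipeline).
import ast
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_MESSAGE = "You are a helpful information extraction system."
PROMPT_TEMPLATE = (
    "Given a passage, your task is to extract all entities and identify "
    "their entity types. The output should be in a list of tuples of the "
    'following format: [("entity 1", "type of entity 1"), ... ]. '
    "Passage: {input_passage}"
)

def extract_entities(passage: str) -> list[tuple[str, str]]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user",
             "content": PROMPT_TEMPLATE.format(input_passage=passage)},
        ],
    )
    text = response.choices[0].message.content
    try:
        # The model is asked for a Python-style list of tuples; parse it safely.
        parsed = ast.literal_eval(text[text.index("["):text.rindex("]") + 1])
        return [(str(entity), str(entity_type)) for entity, entity_type in parsed]
    except (ValueError, SyntaxError, TypeError):
        return []  # skip malformed generations
```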
Head/Tail Split | % of Total Frequency | # of Unique Entity Types | Example Entity Types |
---|---|---|---|
Top 1% | 74% | 130 | Person, Organization, Location, Date, Concept, Product, Event, Technology, Group, Medical Condition, ... |
Top 1%-10% | 19% | 1172 | Characteristic, Research, County, Module, Unit, Feature, Cell, Package, Anatomical Structure, Equipment, ... |
All the rest | 7% | 11718 | Attribute Value, Pokemon, Immune Response, Physiology, Animals, Cell Feature, FAC, Input Device, Ward, Broadcast, ... |
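To make the head/tail split concrete, here is a small sketch of how such statistics can be computed from entity-type counts; the function name and toy `Counter` input are illustrative.

```python
from collections import Counter

def frequency_buckets(type_counts: Counter, cuts=(0.01, 0.10)) -> None:
    """Print what share of all entity mentions each frequency bucket covers."""
    ranked = type_counts.most_common()           # entity types sorted by count
    total = sum(type_counts.values())
    prev = 0
    for cut in (*cuts, 1.0):
        hi = max(prev + 1, round(len(ranked) * cut))
        share = sum(count for _, count in ranked[prev:hi]) / total
        print(f"types ranked {prev + 1}-{hi}: {hi - prev} types, "
              f"{share:.0%} of all mentions")
        prev = hi

# Toy example; the real dataset has 13,020 types and 240,725 entity mentions.
frequency_buckets(Counter({"Person": 500, "Location": 300, "Date": 150,
                           "Pokemon": 3, "Ward": 2}))
```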
Unlike existing work that tunes models to perform diverse tasks, we present a general recipe for instruction tuning toward a specific task: the pretrained model is tuned for a broad application class, such as open NER.
Conversation-style Instruction Tuning Template |
---|
A virtual assistant answers questions from a user based on the provided text.<br>User: Text: X_passage<br>Assistant: I've read this text.<br>User: What describes t_1 in the text?<br>Assistant: y_1<br>...<br>User: What describes t_T in the text?<br>Assistant: y_T |
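A minimal sketch of how one training conversation could be rendered from this template; the function and field names are ours, not necessarily those of the released code.

```python
import json

def build_conversation(passage: str, queries: dict[str, list[str]]) -> list[dict]:
    """Render one multi-turn training example; `queries` maps each sampled
    entity type t_i to its gold entities y_i (an empty list for negatives)."""
    turns = [
        {"role": "system", "content": "A virtual assistant answers questions "
                                      "from a user based on the provided text."},
        {"role": "user", "content": f"Text: {passage}"},
        {"role": "assistant", "content": "I've read this text."},
    ]
    for entity_type, entities in queries.items():
        turns.append({"role": "user",
                      "content": f"What describes {entity_type} in the text?"})
        # Each answer y_i is serialized as a JSON list of entity strings.
        turns.append({"role": "assistant", "content": json.dumps(entities)})
    return turns
```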
Negative sampling proves important: the table below compares F1 per evaluation domain when training with no negative entity types, with uniformly sampled negatives, and with frequency-weighted negatives.
Negative Sampling Strategy | Movie | Restaurant | AI | Literature | Music | Politics | Science | Avg |
---|---|---|---|---|---|---|---|---|
None | 19.1 | 19.1 | 25.1 | 39.5 | 42.7 | 48.9 | 26.2 | 31.5 |
Uniform | 42.5 | 29.0 | 42.5 | 53.3 | 57.4 | 56.8 | 52.6 | 47.7 |
Frequency | 42.4 | 31.7 | 53.5 | 59.4 | 65.0 | 60.8 | 61.1 | 53.4 |
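Below is a sketch of the frequency-weighted variant, which performed best above: entity types absent from a passage are drawn as negatives with probability proportional to their corpus frequency. The function signature and `k` are illustrative assumptions.

```python
import random
from collections import Counter

def sample_negative_types(gold_types: set[str], type_freq: Counter,
                          k: int, seed: int = 0) -> list[str]:
    """Draw k distinct negative entity types, weighted by corpus frequency."""
    rng = random.Random(seed)
    candidates = [t for t in type_freq if t not in gold_types]
    weights = [type_freq[t] for t in candidates]
    chosen: set[str] = set()
    while len(chosen) < min(k, len(candidates)):
        # Weighted draw; repeat until k distinct types are collected.
        chosen.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(chosen)
```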
Benchmark: The Universal NER benchmark encompasses 43 NER datasets across 9 domains: general, biomedical, clinical, STEM, programming, social media, law, finance, and transportation.
Zero-shot Performance: UniversalNER surpasses existing instruction-tuned models of the same size (e.g., Vicuna) by a large margin. More importantly, UniversalNER outperforms ChatGPT in terms of average F1, demonstrating that our proposed targeted distillation from diverse inputs yields models with superior performance on a broad application class while keeping the model relatively small. Per-domain breakdowns likewise show consistent improvements of UniversalNER over ChatGPT.
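Comparisons like these are measured with entity-level F1. Below is a minimal sketch of that metric under the assumption of strict matching on (entity, type) pairs; it is our implementation, not the released evaluation script.

```python
def entity_f1(gold: list[set[tuple[str, str]]],
              pred: list[set[tuple[str, str]]]) -> float:
    """Micro-averaged strict F1 over (entity, type) pairs, one set per passage."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)
```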
Supervised Fine-tuning Performance: For a fair comparison, we train UniversalNER-7B on the same training data as InstructUIE-11B. The table below shows that UniversalNER-7B achieves an average F1 of 84.78% on the 20 datasets, surpassing BERT-base and InstructUIE-11B by 4.69 and 3.62 points, respectively. This experiment demonstrates the effectiveness of our model in the supervised setting.
Dataset | BERT-base | InstructUIE-11B | UniversalNER-7B |
---|---|---|---|
ACE05 | 87.30 | 79.94 | 86.69 |
AnatEM | 85.82 | 88.52 | 88.65 |
BC2GM | 80.90 | 80.69 | 82.42 |
BC4CHEMD | 86.72 | 87.62 | 89.21 |
BC5CDR | 85.28 | 89.02 | 89.34 |
Broad Tweet Corpus | 58.61 | 80.27 | 81.25 |
CoNLL03 | 92.40 | 91.53 | 93.30 |
FabNER | 64.20 | 78.38 | 81.87 |
FindVehicle | 87.13 | 87.56 | 98.30 |
GENIA | 73.30 | 75.71 | 77.54 |
HarveyNER | 82.26 | 74.69 | 74.21 |
MIT Movie | 88.78 | 89.58 | 90.17 |
MIT Restaurant | 81.02 | 82.59 | 82.35 |
MultiNERD | 91.25 | 90.26 | 93.73 |
NCBI | 80.20 | 86.21 | 86.96 |
OntoNotes | 91.11 | 88.64 | 89.91 |
PolyglotNER | 75.65 | 53.31 | 65.67 |
TweetNER7 | 56.49 | 65.95 | 65.77 |
WikiANN | 70.60 | 64.47 | 84.91 |
WikiNeural | 82.78 | 88.27 | 93.28 |
Avg | 80.09 | 81.16 | 84.78 |
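As a quick sanity check, the averages and the 4.69- and 3.62-point gaps cited above can be recomputed directly from the per-dataset scores:

```python
bert = [87.30, 85.82, 80.90, 86.72, 85.28, 58.61, 92.40, 64.20, 87.13, 73.30,
        82.26, 88.78, 81.02, 91.25, 80.20, 91.11, 75.65, 56.49, 70.60, 82.78]
instruct_uie = [79.94, 88.52, 80.69, 87.62, 89.02, 80.27, 91.53, 78.38, 87.56,
                75.71, 74.69, 89.58, 82.59, 90.26, 86.21, 88.64, 53.31, 65.95,
                64.47, 88.27]
universal_ner = [86.69, 88.65, 82.42, 89.21, 89.34, 81.25, 93.30, 81.87, 98.30,
                 77.54, 74.21, 90.17, 82.35, 93.73, 86.96, 89.91, 65.67, 65.77,
                 84.91, 93.28]

def avg(xs):
    return sum(xs) / len(xs)

print(f"{avg(bert):.2f} {avg(instruct_uie):.2f} {avg(universal_ner):.2f}")
# -> 80.09 81.16 84.78
print(f"{avg(universal_ner) - avg(bert):.2f}")          # -> 4.69
print(f"{avg(universal_ner) - avg(instruct_uie):.2f}")  # -> 3.62
```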
@article{zhou2023universalner,
title={UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition},
author={Wenxuan Zhou and Sheng Zhang and Yu Gu and Muhao Chen and Hoifung Poon},
year={2023},
eprint={2308.03279},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This website is adapted from Nerfies and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are additionally restricted to uses that follow the license agreements of LLaMA, ChatGPT, and the original datasets used in the benchmark. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on it should not be used outside of research.