UniversalNER

A case study on targeted distillation from LLMs


University of Southern California · Microsoft Research   *Equal Contribution

We propose a general recipe for targeted distillation, where we train student models using mission-focused instruction tuning for a broad application class such as open information extraction. We show that this can maximally replicate the LLM's capabilities for the given application class, while preserving its generalizability across semantic types and domains. Using NER as a case study, we successfully distill these capabilities from LLMs into a much smaller model, UniversalNER, which can recognize diverse types of entities or concepts in text corpora from a wide range of domains. UniversalNER surpasses existing instruction-tuned models of the same size (e.g., Alpaca, Vicuna) by a large margin, and shows substantially better performance than ChatGPT.

Instruction Data for NER

Download the dataset from [Dataset]

We prompt ChatGPT to generate an instruction-following dataset for NER. The dataset comprises 45,889 input-output pairs, encompassing 240,725 entities and 13,020 distinct entity types. It covers entity types from various domains, ranging from the general domain (e.g., Person) to the clinical domain (e.g., Medical Condition). Moreover, we observe variations in granularity among the entity types: for instance, County is a subset of Location, and Input Device is a subset of Product. These characteristics offer extensive coverage of entity types, making the data suitable for distilling capabilities from LLMs across various domains. A breakdown of entity types by frequency is shown in the table below:

Data Construction Prompt
System Message: You are a helpful information extraction system.

Prompt: Given a passage, your task is to extract all entities and identify their entity types. The output should be in a list of tuples of the following format: [("entity 1", "type of entity 1"), ... ].

Passage: {input_passage}
Head/Tail | Frequency | Unique Entity Types | Example Entity Types
Top 1% | 74% | 130 | Person, Organization, Location, Date, Concept, Product, Event, Technology, Group, Medical Condition, ...
Top 1%-10% | 19% | 1,172 | Characteristic, Research, County, Module, Unit, Feature, Cell, Package, Anatomical Structure, Equipment, ...
All the rest | 7% | 11,718 | Attribute Value, Pokemon, Immune Response, Physiology, Animals, Cell Feature, FAC, Input Device, Ward, Broadcast, ...
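
For illustration, here is a minimal sketch of issuing the construction prompt above through the OpenAI chat API and parsing the tuple-list reply. The model name, decoding settings, and parsing logic are our assumptions rather than the exact pipeline used to build the dataset.

```python
# Minimal sketch (assumptions: gpt-3.5-turbo, pre-1.0 openai API, and that
# the reply is a well-formed Python-style list of tuples).
import ast
import openai

SYSTEM = "You are a helpful information extraction system."
PROMPT = (
    "Given a passage, your task is to extract all entities and identify "
    "their entity types. The output should be in a list of tuples of the "
    'following format: [("entity 1", "type of entity 1"), ... ].\n\n'
    "Passage: {passage}"
)

def extract_entities(passage: str) -> list[tuple[str, str]]:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed; the paper distills from ChatGPT
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": PROMPT.format(passage=passage)},
        ],
        temperature=0,
    )
    text = response["choices"][0]["message"]["content"].strip()
    try:
        # The prompt requests a Python-style list of tuples, so
        # ast.literal_eval can parse a well-formed reply safely.
        return [tuple(pair) for pair in ast.literal_eval(text)]
    except (ValueError, SyntaxError):
        return []  # skip malformed generations
```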

Mission-Focused Instruction Tuning

Unlike existing work that tunes models to perform diverse tasks, we present a general recipe for instruction tuning toward a specific task, where the pretrained model is tuned for a broad application class such as open NER.

  • Conversation-style Instruction Tuning: We adopt a conversation-style tuning format, where the language model (LM) is presented with a passage as input. Then, for each entity type that appears in the output, we transform it into a natural-language question and tune the LM to generate a structured output in the form of a JSON list containing all entities of the queried type in the passage. We treat the reference entities (the assistant responses y_1, ..., y_T in the template below) as gold tokens and apply the language modeling objective on these tokens.
  • Conversation-style Instruction Tuning Template
    A virtual assistant answers questions from a user based on the provided text.
    User: Text: X_passage
    Assistant: I've read this text.
    User: What describes t_1 in the text?
    Assistant: y_1
    ...
    User: What describes t_T in the text?
    Assistant: y_T
  • Negative sampling: During tuning, we randomly sample negative entity types, i.e., entity types that do not appear in the passage, as queries, and set the expected outputs to empty JSON lists. Negative entity types are sampled with probability proportional to their frequency in the entire dataset. This approach greatly improves the instruction tuning results, as the ablation below shows (F1 by evaluation domain); a sketch of the conversation construction, including this sampling, follows the list.
  • Negative Sampling Strategy | Movie | Restaurant | AI | Literature | Music | Politics | Science | Avg
    None | 19.1 | 19.1 | 25.1 | 39.5 | 42.7 | 48.9 | 26.2 | 31.5
    Uniform | 42.5 | 29.0 | 42.5 | 53.3 | 57.4 | 56.8 | 52.6 | 47.7
    Frequency | 42.4 | 31.7 | 53.5 | 59.4 | 65.0 | 60.8 | 61.1 | 53.4
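
To make the format concrete, below is a minimal sketch of turning a single annotated passage into conversation-style training turns with frequency-proportional negative sampling; the helper names and the number of negatives per passage are our assumptions, not the exact UniversalNER pipeline.

```python
# Minimal sketch of turning one annotated passage into conversation-style
# training turns with frequency-proportional negative sampling.
# Helper names and the number of negatives per passage are assumptions,
# not the exact UniversalNER pipeline.
import json
import random

def build_conversation(passage, entities, type_counts, num_negatives=2):
    """passage: str; entities: list of (mention, entity_type) tuples;
    type_counts: dict mapping entity type -> frequency over the whole dataset."""
    turns = [
        ("User", f"Text: {passage}"),
        ("Assistant", "I've read this text."),
    ]
    # Positive queries: one question per entity type present in the passage,
    # answered with a JSON list of all entities of that type.
    by_type = {}
    for mention, etype in entities:
        by_type.setdefault(etype, []).append(mention)
    for etype, mentions in by_type.items():
        turns.append(("User", f"What describes {etype} in the text?"))
        turns.append(("Assistant", json.dumps(mentions)))
    # Negative queries: entity types absent from the passage, sampled with
    # probability proportional to their dataset-wide frequency (with
    # replacement, for simplicity); the expected output is an empty list.
    candidates = [t for t in type_counts if t not in by_type]
    if candidates:
        weights = [type_counts[t] for t in candidates]
        for etype in random.choices(candidates, weights=weights, k=num_negatives):
            turns.append(("User", f"What describes {etype} in the text?"))
            turns.append(("Assistant", "[]"))
    return turns
```

During tuning, the language modeling loss is applied only to the Assistant turns, matching the objective described above.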
Please check out the UniversalNER model checkpoints on [Models].
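
For a quick start, the sketch below loads a checkpoint with Hugging Face transformers and queries it with the conversation template above. The model identifier and the exact USER/ASSISTANT formatting are assumptions for illustration; substitute the checkpoint name and prompt format from [Models].

```python
# Illustrative inference sketch; the model id is an assumption (see [Models]).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Universal-NER/UniNER-7B-type"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

passage = "Elon Musk founded SpaceX in 2002."
entity_type = "organization"

# Mirror the conversation-style tuning template.
prompt = (
    "A virtual assistant answers questions from a user based on the "
    "provided text.\n"
    f"USER: Text: {passage}\n"
    "ASSISTANT: I've read this text.\n"
    f"USER: What describes {entity_type} in the text?\n"
    "ASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))  # e.g., ["SpaceX"]
```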

Performance

Universal NER Benchmark -- the largest NER benchmark to date

Benchmark: The Universal NER benchmark encompasses 43 NER datasets across 9 domains, including general, biomedical, clinical, STEM, programming, social media, law, finance, and transportation domains. An overview of the data distribution is shown below.

Zero-shot Performance: UniversalNER surpasses existing instruction-tuned models of the same size (e.g., Vicuna) by a large margin. More importantly, UniversalNER outperforms ChatGPT in terms of average F1. This demonstrates that our proposed targeted distillation from diverse inputs yields models with superior performance on a broad application class while maintaining a much smaller model size. Per-domain breakdowns also show consistent improvements of UniversalNER over ChatGPT.

Figure: Distribution of the Universal NER benchmark.
Figure: Zero-shot performance on different domains.

Supervised Multitask Fine-tuning: New SoTA with a single model across different datasets

Supervised Fine-tuning Performance: For a fair comparison, we train UniversalNER-7B using the same training data as InstructUIE-11B. The table below shows that UniversalNER-7B achieves an average F1 of 84.78% on the 20 datasets, surpassing BERT-base and InstructUIE-11B by 4.69 and 3.62 absolute F1 points, respectively. This experiment demonstrates the effectiveness of our model in the supervised setting.

Dataset | BERT-base | InstructUIE-11B | UniversalNER-7B
ACE05 | 87.30 | 79.94 | 86.69
AnatEM | 85.82 | 88.52 | 88.65
bc2gm | 80.90 | 80.69 | 82.42
bc4chemd | 86.72 | 87.62 | 89.21
bc5cdr | 85.28 | 89.02 | 89.34
Broad Tweet Corpus | 58.61 | 80.27 | 81.25
CoNLL03 | 92.40 | 91.53 | 93.30
FabNER | 64.20 | 78.38 | 81.87
FindVehicle | 87.13 | 87.56 | 98.30
GENIA | 73.30 | 75.71 | 77.54
HarveyNER | 82.26 | 74.69 | 74.21
MIT Movie | 88.78 | 89.58 | 90.17
MIT Restaurant | 81.02 | 82.59 | 82.35
MultiNERD | 91.25 | 90.26 | 93.73
ncbi | 80.20 | 86.21 | 86.96
OntoNotes | 91.11 | 88.64 | 89.91
PolyglotNER | 75.65 | 53.31 | 65.67
TweetNER7 | 56.49 | 65.95 | 65.77
WikiANN | 70.60 | 64.47 | 84.91
wikiNeural | 82.78 | 88.27 | 93.28
Avg | 80.09 | 81.16 | 84.78
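
The NER scores throughout this page are entity-level F1. For reference, here is a minimal sketch of micro-averaged entity-level F1 over predicted and gold (mention, type) tuples; multiset matching per example is our assumption, and each benchmark's exact convention (span offsets, label normalization) may differ.

```python
# Minimal sketch of entity-level micro F1 over (mention, type) tuples.
from collections import Counter

def micro_f1(preds, golds):
    """preds, golds: lists (one entry per example) of lists of
    (mention, entity_type) tuples."""
    tp = fp = fn = 0
    for pred, gold in zip(preds, golds):
        pc, gc = Counter(pred), Counter(gold)
        overlap = sum((pc & gc).values())  # multiset intersection
        tp += overlap
        fp += sum(pc.values()) - overlap
        fn += sum(gc.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```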

Examples of Diverse Entity Recognition

BibTeX


@article{zhou2023universalner,
    title={UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition},
    author={Wenxuan Zhou and Sheng Zhang and Yu Gu and Muhao Chen and Hoifung Poon},
    year={2023},
    eprint={2308.03279},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgement

This website is adapted from Nerfies and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, ChatGPT, and the original datasets used in the benchmark. The dataset is licensed CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.