A case study on targeted distillation from LLMs

University of Southern California   ▶ Microsoft Research   *Equal Contribution

We propose a general recipe for targeted distilling where we train student models using mission-focused instruction tuning for a broad application class such as open information extraction. We show that this can maximally replicate LLM’s capabilities for the given application class, while preserving its generalizability across semantic types and domains. Using NER as a case study, we successfully distill these capabilities from LLMs into a much smaller model UniversalNER that can recognize diverse types of entities or concepts in text corpora from a wide range of domains. UniversalNER surpasses existing instruction-tuned models at the same size (e.g., Alpaca, Vicuna) by a large margin, and shows substantially better performance to ChatGPT.

Instruction Data for NER

Download the dataset from [Dataset]

We prompt ChatGPT to generate a instruction-following dataset for NER. The dataset comprises 45,889 input-output pairs, encompassing 240,725 entities and 13,020 distinct entity types. The dataset contains entity types from various domains, ranging from the general domain (e.g., Person) to the clinical domain (e.g., Medical Condition). Moreover, we observe variations in granularity among the entity types. For instance, Countyis the subset of Location, and Input Device is a subset of Product. These data characteristics offer extensive coverage of entity types, making them suitable for distilling capabilities from LLMs across various domains. A divide of entity types according to frequency is shown in the table below:

Data Construction Prompt
System Message: You are a helpful information extraction system.

Prompt: Given a passage, your task is to extract all entities and identify their entity types. The output should be in a list of tuples of the following format: [("entity 1", "type of entity 1"), ... ].

Passage: {input_passage}
Head/Tail Frequency Unique Entity Types Example Entity Types
Top 1% 74% 130 Person, Organization, Location, Date, Concept, Product, Event, Technology, Group, Medical Condition, ...
Top 1%-10% 19% 1172 Characteristic, Research, County, Module, Unit, Feature, Cell, Package, Anatomical Structure, Equipment, ...
All the rest 7% 11718 Attribute Value, Pokemon, Immune Response, Physiology, Animals, Cell Feature, FAC, Input Device, Ward, Broadcast, ...

Mission-Focused Instruction Tuning

Unlike the existing work that tunes the models to do diverse tasks, we present a general recipe of instruction tuning for a specific task, where the pretrained model is tuned for a broad application class such as open NER.

  • Conversation-style Instruction Tuning: We adopt a conversation-style tuning format, where the language model (LM) is presented with a passage as input. Then, for each entity type that appears in the output, we transform it into a natural language question. Subsequently, we tune the LM to generate a structured output in the form of a JSON list containing all entities of the query type in the passage. We consider the reference entities (highlighted below) as gold tokens and apply a language modeling objective on these tokens.
  • Conversation-style Instruction Tuning Template
    A virtual assistant answers questions from a user based on the provided text.
    User: Text: Xpassage
    Assistant: I've read this text.
    User: What describes t1 in the text?
    Assistant: y1
    User: What describes tT in the text?
    Assistant: yT
  • Negative sampling: During tuning, we randomly sample negative entity types from the collection of all entity types that do not appear in the passage as queries and set the expected outputs as empty JSON lists. The sampling of negative entity types is done with a probability proportional to the frequency of entity types in the entire dataset. This approach greatly improves the instruction tuning results.
  • Negative Sampling Strategy Movie Restaurant AI Literature Music Politics Science Avg
    None 19.1 19.1 25.1 39.5 42.7 48.9 26.2 31.5
    Uniform 42.5 29.0 42.5 53.3 57.4 56.8 52.6 47.7
    Frequency 42.4 31.7 53.5 59.4 65.0 60.8 61.1 53.4
Please check out "UniversalNER" model checkpoint on [Models].


Universal NER Benchmark -- the largest NER benchmark to date

Benchmark: The Universal NER benchmark encompasses 43 NER datasets across 9 domains, including general, biomedical, clinical, STEM, programming, social media, law, finance, and transportation domains. An overview of the data distribution is shown below.

Zero-shot Performance: UniversalNER surpasses existing instruction-tuned models at the same size (e.g., Vicuna) by a large margin. More importantly, UniversalNER outperform ChatGPT in terms of average F1. This demonstrates that our proposed targeted distillation from diverse inputs yields models that have superior performance on a broad application class while maintaining a relatively small model size. Domain breakdowns also show the improvements of UniversalNER over ChatGPT.

Distribution of Universal NER benchmark.
Zero-shot performance on different domains.

Supervised Multitask Fine-tuning: New SoTA with a single model across different datasets

Supervised Fine-tuned Performance: For a fair comparison, we train UniversalNER-7B using the same training data in InstructUIE-11B. Results in the table below show UniversalNER-7B achieves an average F1 of 84.78% on the 20 datasets, surpassing both BERT-base and InstructUIE-11B by 4.69% and 3.62%, respectively. This experiment demonstrates the effectiveness of our model in the supervised setting.

Dataset BERT-base InstructUIE-11B UniversalNER-7B
ACE05 87.30 79.94 86.69
AnatEM 85.82 88.52 88.65
bc2gm 80.90 80.69 82.42
bc4chemd 86.72 87.62 89.21
bc5cdr 85.28 89.02 89.34
Broad Tweet Corpus 58.61 80.27 81.25
CoNLL03 92.40 91.53 93.30
FabNER 64.20 78.38 81.87
FindVehicle 87.13 87.56 98.30
GENIA 73.3 75.71 77.54
HarveyNER 82.26 74.69 74.21
MIT Movie 88.78 89.58 90.17
MIT Restaurant 81.02 82.59 82.35
MultiNERD 91.25 90.26 93.73
ncbi 80.20 86.21 86.96
OntoNotes 91.11 88.64 89.91
PolyglotNER 75.65 53.31 65.67
TweetNER7 56.49 65.95 65.77
WikiANN 70.60 64.47 84.91
wikiNeural 82.78 88.27 93.28
Avg 80.09 81.16 84.78

Examples on Diverse Entity Recognition


      title={UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition}, 
      author={Wenxuan Zhou and Sheng Zhang and Yu Gu and Muhao Chen and Hoifung Poon},


This website is adapted from Nerfies and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models.

Usage and License Notices: The data, code and checkpoint is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, ChatGPT, and the original dataset used in the benchmark. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.