Language-Independent Named Entity Recognition (II)
Named entities are phrases that contain the names of persons,
organizations, locations, times and quantities.
Example:
[ORG U.N. ]
official
[PER Ekeus ]
heads
for
[LOC Baghdad ]
.
The shared task of
CoNLL-2003
concerns language-independent named entity recognition.
We will concentrate on four types of named entities: persons,
locations, organizations and names of miscellaneous entities that do
not belong to the previous three groups.
The participants of the shared task will be offered training and test
data for two languages.
They will use the data for developing a named-entity recognition
system that includes a machine learning component.
For each language, additional information (lists of names and
non-annotated data) will be supplied as well.
The challenge for the participants is to find ways of incorporating
this information in their system.
Background information
Named Entity Recognition (NER) is a subtask of Information Extraction.
Different NER systems were evaluated as part of the Sixth Message
Understanding Conference in 1995
(MUC6).
The target language was English.
The participating systems performed well.
However, many of them used language-specific resources for performing
the task, and it is unknown how they would have performed on a
language other than English [PD97].
Since 1995, NER systems have been developed for several European
languages and a few Asian languages.
There have been at least two studies that have applied one NER system
to different languages.
Palmer and Day [PD97] have used statistical methods
for finding named entities in newswire articles in Chinese, English,
French, Japanese, Portuguese and Spanish.
They found that the difficulty of the NER task was different for the
six languages but that a large part of the task could be performed
with simple methods.
Cucerzan and Yarowsky [CY99] used both
morphological and contextual clues for identifying named entities in
English, Greek, Hindi, Rumanian and Turkish.
With minimal supervision, they obtained overall F measures between 40
and 70, depending on the language.
In the shared task at
CoNLL-2002,
twelve different learning systems were applied to data in Spanish and
Dutch.
Software and Data
The CoNLL-2003 shared task data files contain four columns separated by
a single space.
Each word has been put on a separate line and there is an empty line
after each sentence.
The first item on each line is a word, the second a part-of-speech (POS)
tag, the third a syntactic chunk tag and the fourth the named entity
tag.
The chunk tags and the named entity tags have the format I-TYPE which
means that the word is inside a phrase of type TYPE.
Only when two phrases of the same type immediately follow each other
does the first word of the second phrase receive the tag B-TYPE, to
show that it starts a new phrase.
A word with tag O is not part of a phrase.
Here is an example:
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
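The format above can be read with a few lines of Python. The sketch below is illustrative only (the function names are our own, not part of the task software): it groups the four-column lines into sentences and converts the I-TYPE/B-TYPE named entity tags into entity spans.

```python
def read_sentences(lines):
    """Group whitespace-separated four-column lines into sentences.

    Each sentence becomes a list of (word, pos, chunk, ne) tuples;
    an empty line ends a sentence.
    """
    sentence, sentences = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            word, pos, chunk, ne = line.split()
            sentence.append((word, pos, chunk, ne))
    if sentence:
        sentences.append(sentence)
    return sentences


def entities(ne_tags):
    """Extract (type, start, end) spans from the tags of one sentence.

    I-TYPE continues (or starts) a phrase of type TYPE; B-TYPE starts
    a new phrase directly after a phrase of the same type; O is
    outside any phrase.  end is exclusive.
    """
    spans, start, current = [], None, None
    for i, tag in enumerate(ne_tags + ["O"]):  # sentinel flushes the last span
        kind = None if tag == "O" else tag.split("-", 1)[1]
        if current is not None and (kind != current or tag.startswith("B-")):
            spans.append((current, start, i))
            current = None
        if kind is not None and current is None:
            current, start = kind, i
    return spans
```

For the example sentence above, `entities` returns `[("ORG", 0, 1), ("PER", 2, 3), ("LOC", 5, 6)]`.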
The data consists of three files per language: one training file and
two test files, testa and testb.
The first test file will be used in the development phase for finding
good parameters for the learning system.
The second test file will be used for the final evaluation.
There are data files available for English and German.
The German files contain an extra column (the second) which holds the
lemma of each word.
The English data is a collection of news wire articles from the
Reuters
Corpus.
The annotation was done by people at the University of Antwerp.
For copyright reasons, we only make the annotations available.
In order to build the complete data sets you will need access to the
Reuters Corpus.
It can be obtained for research purposes without any charge from
NIST.
The German data is a collection of articles from the Frankfurter
Rundschau.
The named entities have been annotated by people at the University
of Antwerp.
Only the annotations are available here.
In order to build these data sets you need access to the
ECI Multilingual Text Corpus.
It can be ordered from the
Linguistic Data Consortium
(2003 non-member price: US$ 35.00).
Results
Sixteen systems participated in the CoNLL-2003 shared task.
They used a wide variety of machine learning techniques and
different feature sets.
Here is the result table for the English test set:
+------------+-----------+---------+-----------+
| English    | precision | recall  | F         |
+------------+-----------+---------+-----------+
| [FIJZ03]   | 88.99%    | 88.54%  | 88.76±0.7 |
| [CN03]     | 88.12%    | 88.51%  | 88.31±0.7 |
| [KSNM03]   | 85.93%    | 86.21%  | 86.07±0.8 |
| [ZJ03]     | 86.13%    | 84.88%  | 85.50±0.9 |
| [CMP03b]   | 84.05%    | 85.96%  | 85.00±0.8 |
| [CC03]     | 84.29%    | 85.50%  | 84.89±0.9 |
| [MMP03]    | 84.45%    | 84.90%  | 84.67±1.0 |
| [CMP03a]   | 85.81%    | 82.84%  | 84.30±0.9 |
| [ML03]     | 84.52%    | 83.55%  | 84.04±0.9 |
| [BON03]    | 84.68%    | 83.18%  | 83.92±1.0 |
| [MLP03]    | 80.87%    | 84.21%  | 82.50±1.0 |
| [WNC03]*   | 82.02%    | 81.39%  | 81.70±0.9 |
| [WP03]     | 81.60%    | 78.05%  | 79.78±1.0 |
| [HV03]     | 76.33%    | 80.17%  | 78.20±1.0 |
| [DD03]     | 75.84%    | 78.13%  | 76.97±1.2 |
| [Ham03]    | 69.09%    | 53.26%  | 60.15±1.3 |
+------------+-----------+---------+-----------+
| baseline   | 71.91%    | 50.90%  | 59.61±1.2 |
+------------+-----------+---------+-----------+
+------------+-----------+---------+-----------+
| German     | precision | recall  | F         |
+------------+-----------+---------+-----------+
| [FIJZ03]   | 83.87%    | 63.71%  | 72.41±1.3 |
| [KSNM03]   | 80.38%    | 65.04%  | 71.90±1.2 |
| [ZJ03]     | 82.00%    | 63.03%  | 71.27±1.5 |
| [MMP03]    | 75.97%    | 64.82%  | 69.96±1.4 |
| [CMP03b]   | 75.47%    | 63.82%  | 69.15±1.3 |
| [BON03]    | 74.82%    | 63.82%  | 68.88±1.3 |
| [CC03]     | 75.61%    | 62.46%  | 68.41±1.4 |
| [ML03]     | 75.97%    | 61.72%  | 68.11±1.4 |
| [MLP03]    | 69.37%    | 66.21%  | 67.75±1.4 |
| [CMP03a]   | 77.83%    | 58.02%  | 66.48±1.5 |
| [WNC03]    | 75.20%    | 59.35%  | 66.34±1.3 |
| [CN03]     | 76.83%    | 57.34%  | 65.67±1.4 |
| [HV03]     | 71.15%    | 56.55%  | 63.02±1.4 |
| [DD03]     | 63.93%    | 51.86%  | 57.27±1.6 |
| [WP03]     | 71.05%    | 44.11%  | 54.43±1.4 |
| [Ham03]    | 63.49%    | 38.25%  | 47.74±1.5 |
+------------+-----------+---------+-----------+
| baseline   | 31.86%    | 28.89%  | 30.30±1.3 |
+------------+-----------+---------+-----------+
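The F rates in both tables are F(beta=1), the harmonic mean of precision and recall, so each row can be reproduced from its precision and recall columns. A minimal check in Python (illustrative only, not the official evaluation script):

```python
def f1(precision, recall):
    """F(beta=1): the harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)

# Top English row [FIJZ03]: 88.99% precision, 88.54% recall
print(round(f1(88.99, 88.54), 2))  # 88.76, matching the table
```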
Here are some remarks on these results:
-
The baseline results were produced by a system that only selects
complete, unambiguous named entities which appear in the
training data.
-
The significance intervals for the F rates
have been obtained with bootstrap resampling
[Nor89].
F rates outside of these intervals are assumed to be significantly
different from the related F rate (p<0.05).
-
The results of the system of
[WNC03]
for the English test data were corrected after the submission
deadline (new F=82.69; see their paper).
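The resampling idea behind these significance intervals can be sketched in Python. This is an illustrative reconstruction, not the actual evaluation code: draw the test sentences with replacement many times, recompute F on each resample, and use the spread of the resulting scores.

```python
import random


def f1(tp, fp, fn):
    """F(beta=1), in %, from true positive, false positive and
    false negative entity counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 200 * precision * recall / (precision + recall)


def bootstrap_f1(sentence_counts, resamples=1000, seed=0):
    """sentence_counts: one (tp, fp, fn) triple per test sentence.

    Resamples the sentences with replacement, recomputes F on each
    resample, and returns the mean and standard deviation of the
    bootstrap F distribution.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(resamples):
        sample = [rng.choice(sentence_counts) for _ in sentence_counts]
        tp = sum(s[0] for s in sample)
        fp = sum(s[1] for s in sample)
        fn = sum(s[2] for s in sample)
        scores.append(f1(tp, fp, fn))
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5
```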
A discussion of the shared task results can be found in the introduction paper
[TD03].
Related information
References
This is a list of papers that are relevant for this task.
CoNLL-2003 Shared Task Papers
- [TD03]
Erik F. Tjong Kim Sang and Fien De Meulder,
Introduction to the CoNLL-2003 Shared Task: Language-Independent
Named Entity Recognition.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 142-147.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
(with corrections)
sheets:
[ps]
[ps.gz]
[pdf]
- [BON03]
Oliver Bender, Franz Josef Och and Hermann Ney,
Maximum Entropy Models for Named Entity Recognition.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 148-151.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [CMP03a]
Xavier Carreras, Lluís Màrquez, and Lluís Padró,
Learning a Perceptron-Based Named Entity Chunker via Online
Recognition Feedback.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 156-159.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [CMP03b]
Xavier Carreras, Lluís Màrquez, and Lluís Padró,
A Simple Named Entity Extractor using AdaBoost.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 152-155.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [CN03]
Hai Leong Chieu and Hwee Tou Ng,
Named Entity Recognition with a Maximum Entropy Approach.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 160-163.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [CC03]
James R. Curran and Stephen Clark,
Language Independent NER using a Maximum Entropy Tagger.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 164-167.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [DD03]
Fien De Meulder and Walter Daelemans,
Memory-Based Named Entity Recognition using Unannotated Data.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 208-211.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [FIJZ03]
Radu Florian, Abe Ittycheriah, Hongyan Jing and Tong Zhang,
Named Entity Recognition through Classifier Combination.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 168-171.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [Ham03]
James Hammerton,
Named Entity Recognition with Long Short-Term Memory.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 172-175.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [HV03]
Iris Hendrickx and Antal van den Bosch,
Memory-based one-step named-entity recognition:
Effects of seed list features, classifier stacking, and
unannotated data.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 176-179.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [KSNM03]
Dan Klein, Joseph Smarr, Huy Nguyen and Christopher D. Manning,
Named Entity Recognition with Character-Level Models.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 180-183.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [MMP03]
James Mayfield, Paul McNamee and Christine Piatko,
Named Entity Recognition using Hundreds of Thousands of Features.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 184-187.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [ML03]
Andrew McCallum and Wei Li,
Early results for Named Entity Recognition with Conditional Random
Fields, Feature Induction and Web-Enhanced Lexicons.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 188-191.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [MLP03]
Robert Munro, Daren Ler, and Jon Patrick,
Meta-Learning Orthographic and Contextual Models for Language
Independent Named Entity Recognition.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 192-195.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [WP03]
Casey Whitelaw and Jon Patrick,
Named Entity Recognition Using a Character-based Probabilistic
Approach.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 196-199.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
- [WNC03]
Dekai Wu, Grace Ngai and Marine Carpuat,
A Stacked, Voted, Stacked Model for Named Entity Recognition.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 200-203.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
(with corrections)
system output:
[tgz]
[files]
- [ZJ03]
Tong Zhang and David Johnson,
A Robust Risk Minimization based Named Entity Recognition System.
In:
Proceedings of CoNLL-2003,
Edmonton, Canada, 2003, pp. 204-207.
paper:
[ps]
[ps.gz]
[pdf]
[bibtex]
system output:
[tgz]
[files]
Other related publications
A paper that is related to the topic of this shared task is the
EMNLP-99 paper by Cucerzan and Yarowsky [CY99].
Interesting papers about using unsupervised data, though not for
named entity recognition, are those of
Mitchell [Mit99]
and
Banko and Brill [BB01].
- [BB01]
Michele Banko and Eric Brill,
Scaling to Very Very Large Corpora for Natural Language
Disambiguation.
In Proceedings of ACL 2001,
Toulouse, France, 2001, pp. 26-33.
http://www.research.microsoft.com/users/mbanko/ACL2001VeryVeryLargeCorpora.pdf
- [Bor99]
Andrew Borthwick,
A Maximum Entropy Approach to Named Entity
Recognition.
PhD thesis, New York University, 1999.
http://cs.nyu.edu/cs/projects/proteus/publication/papers/borthwick_thesis.ps
- [BV00]
Sabine Buchholz and Antal van den Bosch,
Integrating seed names and n-grams for a named entity list and
classifier,
In: Proceedings of LREC-2000, Athens, Greece, June
2000, pp. 1215-1221.
http://ilk.kub.nl/downloads/pub/papers/ilk.0002.ps.gz
- [CM03]
Xavier Carreras and Lluís Màrquez,
Phrase Recognition by Filtering and Ranking with Perceptrons.
In "Proceedings of the International Conference on Recent Advances
in Natural Language Processing, RANLP-2003", Borovets, Bulgaria, 2003.
http://www.lsi.upc.es/~nlp/papers/2003/ranlp2003-cm.ps.gz
- [CMP02]
Xavier Carreras, Lluís Màrquez and Lluís
Padró,
Named Entity Extraction using AdaBoost.
In:
Proceedings of CoNLL-2002,
Taipei, Taiwan, 2002, pp. 167-170.
http://www.cnts.ua.ac.be/conll2002/ps/16770car.ps
- [CBFR99]
Nancy Chinchor, Erica Brown, Lisa Ferro and Patty Robinson,
1999 Named Entity Recognition Task Definition,
MITRE, 1999.
http://www.nist.gov/speech/tests/ie-er/er_99/doc/ne99_taskdef_v1_4.pdf
- [Col02]
Michael Collins,
Ranking Algorithms for Named-Entity Extraction: Boosting and the
Voted Perceptron
In Proceedings of ACL 2002,
University of Pennsylvania, PA, 2002.
http://www.ai.mit.edu/people/mcollins/papers/finalNEacl2002.ps
- [CS99]
Michael Collins and Yoram Singer,
Unsupervised models for named entity classification.
In Proceedings of the 1999 Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large
Corpora,
University of Maryland, MD, 1999.
http://citeseer.nj.nec.com/collins99unsupervised.html
- [CY99]
Silviu Cucerzan and David Yarowsky,
Language independent named entity recognition combining
morphological and contextual evidence.
In Proceedings of 1999 Joint SIGDAT Conference on EMNLP and
VLC,
University of Maryland, MD, 1999.
http://citeseer.nj.nec.com/cucerzan99language.html
- [Mit99]
Tom M. Mitchell,
The Role of Unlabeled Data in Supervised Learning.
In Proceedings of the Sixth International Colloquium on
Cognitive Science,
San Sebastian, Spain, 1999.
http://citeseer.nj.nec.com/mitchell99role.html
- [MMG99]
Andrei Mikheev, Marc Moens and Claire Grover,
Named Entity Recognition without Gazetteers,
In Proceedings of EACL'99,
Bergen, Norway, 1999, pp. 1-8.
http://www.ltg.ed.ac.uk/~mikheev/papers_my/eacl99.ps
- [Nor89]
Eric W. Noreen,
Computer-Intensive Methods for Testing Hypotheses.
John Wiley & Sons,
1989.
- [PD97]
David D. Palmer and David S. Day,
A Statistical Profile of the Named Entity Task.
In Proceedings of Fifth ACL Conference for Applied Natural
Language Processing (ANLP-97),
Washington D.C., 1997.
http://crow.ee.washington.edu/people/palmer/papers/anlp97.ps
- [TKS02]
Erik F. Tjong Kim Sang,
Introduction to the CoNLL-2002 Shared Task: Language-Independent
Named Entity Recognition.
In:
Proceedings of CoNLL-2002,
Taipei, Taiwan, 2002, pp. 155-158.
http://www.cnts.ua.ac.be/conll2002/ps/15558tjo.ps
Last update: December 05, 2005.
erik.tjongkimsang@ua.ac.be,
fien.demeulder@ua.ac.be