Text chunking consists of dividing a text into syntactically correlated groups of words. For example, the sentence "He reckons the current account deficit will narrow to only # 1.8 billion in September ." can be divided as follows:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .

Text chunking is an intermediate step towards full parsing. It was the shared task for CoNLL-2000. Training and test data for this task are available. The data consist of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands.

The goal of this task is to put forward machine learning methods which, after a training phase, can recognize the chunk segmentation of the test data as well as possible. The training data can be used for training the text chunker. The chunkers will be evaluated with the F rate (F-score), which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [Rij79]. The precision and recall numbers will be computed over all types of chunks.
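As an illustration of the evaluation measure, here is a minimal Python sketch. It assumes that chunks are represented as (start, end, type) triples and that a predicted chunk counts as correct only when it matches a chunk in the annotated data exactly. The function names f_score and evaluate are hypothetical, and this is not the official evaluation software.

```python
def f_score(precision: float, recall: float) -> float:
    """Combine precision and recall as F = 2*P*R / (R+P)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (recall + precision)


def evaluate(gold_chunks: set, predicted_chunks: set) -> tuple:
    """Return (precision, recall, F) in percent over all chunk types.

    Assumption: each chunk is a hashable (start, end, type) triple and a
    predicted chunk is correct only if it is identical to a gold chunk.
    """
    correct = len(gold_chunks & predicted_chunks)
    precision = 100.0 * correct / len(predicted_chunks) if predicted_chunks else 0.0
    recall = 100.0 * correct / len(gold_chunks) if gold_chunks else 0.0
    return precision, recall, f_score(precision, recall)
```

For example, f_score(93.45, 93.51) gives approximately 93.48.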

Background Information

In 1991, Steven Abney proposed to approach parsing by first finding correlated chunks of words [Abn91]. Lance Ramshaw and Mitch Marcus approached chunking with a machine learning method [RM95]. Their work has inspired many others to study the application of learning methods to noun phrase chunking. Other chunk types have not received the same attention as NP chunks. The most complete work is [BVD99], which presents results for NP, VP, PP, ADJP and ADVP chunks. [Vee99] works with NP, VP and PP chunks. [RM95] recognized arbitrary chunks but classified every non-NP chunk as a VP chunk. [Rat98] recognized arbitrary chunks as part of a parsing task but did not report on the chunking performance.

Software and Data

The training and test data consist of three columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag as derived by the Brill tagger and the third its chunk tag as derived from the WSJ corpus. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O

The O chunk tag is used for tokens which are not part of any chunk. Instead of the part-of-speech tags of the WSJ corpus, the data set uses tags generated by the Brill tagger. Performance with the corpus tags would be better, but it would be unrealistic because no perfect part-of-speech tags are available for novel text.
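The file format described above can be read with a few lines of code. The following Python sketch is only an illustration: the function names read_sentences and extract_chunks are hypothetical, chunk spans are represented as (start, end, type) with an exclusive end, and an I- tag that does not continue an open chunk of the same type is assumed to start a new chunk.

```python
EXAMPLE = """
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O
"""


def read_sentences(lines):
    """Yield sentences as lists of (word, pos_tag, chunk_tag) triples."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:  # an empty line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos_tag, chunk_tag = fields
        sentence.append((word, pos_tag, chunk_tag))
    if sentence:  # last sentence without a trailing empty line
        yield sentence


def extract_chunks(sentence):
    """Return (start, end, type) spans, end exclusive, from B-/I-/O tags."""
    chunks = []
    start, current = None, None
    for i, (_, _, tag) in enumerate(sentence):
        prefix, _, chunk_type = tag.partition("-")
        # An open chunk ends at O, at any B- tag, or when the type changes.
        if current is not None and (prefix != "I" or chunk_type != current):
            chunks.append((start, i, current))
            start, current = None, None
        # A chunk starts at a B- tag, or at an I- tag with no open chunk.
        if prefix in ("B", "I") and current is None:
            start, current = i, chunk_type
    if current is not None:
        chunks.append((start, len(sentence), current))
    return chunks


sentences = list(read_sentences(EXAMPLE.splitlines()))
chunks = extract_chunks(sentences[0])
```

For the example sentence this yields eight chunks, matching the bracketed segmentation shown at the top of this page. The final token "." carries the O tag and belongs to no chunk.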


Eleven systems were applied to the CoNLL-2000 shared task. The systems used a wide variety of techniques. Here is an overview of the performance of these 11 systems on the test set, together with other results (*) on this data set published after the workshop:

   | system   | precision |   recall  ||     F     ||
   | [ZDJ01]  |   94.29%  |   94.01%  ||   94.13   || (*)
   | [KM01]   |   93.89%  |   93.92%  ||   93.91   || (*)
   | [CM03]   |   94.19%  |   93.29%  ||   93.74   || (*)
   | [KM00]   |   93.45%  |   93.51%  ||   93.48   ||
   | [Hal00]  |   93.13%  |   93.51%  ||   93.32   ||
   | [TKS00]  |   94.04%  |   91.00%  ||   92.50   ||
   | [ZST00]  |   91.99%  |   92.25%  ||   92.12   ||
   | [Dej00]  |   91.87%  |   92.31%  ||   92.09   ||
   | [Koe00]  |   92.08%  |   91.86%  ||   91.97   ||
   | [Osb00]  |   91.65%  |   92.23%  ||   91.94   ||
   | [VB00]   |   91.05%  |   92.03%  ||   91.54   ||
   | [PMP00]  |   90.63%  |   89.65%  ||   90.14   ||
   | [Joh00]  |   86.24%  |   88.25%  ||   87.23   ||
   | [VD00]   |   88.82%  |   82.91%  ||   85.76   ||
   | baseline |   72.58%  |   82.14%  ||   77.07   ||

The baseline result was obtained by assigning to each token the chunk tag which was most frequently associated with its part-of-speech tag. At the workshop, all 11 systems outperformed the baseline. Most of them (six of the eleven) obtained an F-score between 91.5 and 92.5. Two systems performed considerably better: Support Vector Machines used by Kudoh and Matsumoto [KM00] and Weighted Probability Distribution Voting used by Van Halteren [Hal00]. The papers associated with the participating systems can be found in the reference section below.
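The baseline idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program that produced the baseline figures in the table: the function names train_baseline and predict_baseline are hypothetical, ties between equally frequent chunk tags are broken by whichever tag was seen first, and part-of-speech tags that never occurred in training are assumed to receive the O tag.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Map each part-of-speech tag to its most frequent chunk tag.

    `sentences` is a list of sentences, each a list of
    (word, pos_tag, chunk_tag) triples.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for _, pos_tag, chunk_tag in sentence:
            counts[pos_tag][chunk_tag] += 1
    return {pos: tags.most_common(1)[0][0] for pos, tags in counts.items()}


def predict_baseline(model, pos_tags, default="O"):
    """Assign a chunk tag to every part-of-speech tag in a sentence.

    Assumption: unseen part-of-speech tags get the `default` tag.
    """
    return [model.get(pos, default) for pos in pos_tags]


# Tiny made-up training set, only to show the calling convention.
toy_training = [
    [("the", "DT", "B-NP"), ("deficit", "NN", "I-NP")],
    [("a", "DT", "B-NP"), ("account", "NN", "B-NP"), ("deficit", "NN", "I-NP")],
]
toy_model = train_baseline(toy_training)
toy_prediction = predict_baseline(toy_model, ["DT", "NN", "XX"])
```

Because this baseline looks only at the current part-of-speech tag, it cannot use any context to decide where a chunk begins.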

Related information


This reference section contains two parts: first the papers from the shared task session at CoNLL-2000 and then the other related publications.

CoNLL-2000 Shared Task Papers

Other related publications

Last update: April 19, 2011. erikt(at)