NP Bracketing

In 1995 Lance Ramshaw and Mitch Marcus put forward a standard data set for NP chunking: recognizing non-overlapping text parts that consist of noun phrases (NPs) [RM95]. These NPs were non-recursive and lacked post-modifying phrases such as prepositional phrases. The recognition of these NPs would be a useful step towards a complete parsing process. Therefore we propose to expand this NP chunking data set and use it for an extended task: NP bracketing, the recognition of all noun phrase structures in a text.

The NP chunking data set put forward by Ramshaw and Marcus consists of two parts: training data and test data. Both parts have been extracted from the Wall Street Journal corpus (WSJ). The training part contains sections 15-18 of this corpus and the test part consists of section 20. In order to make the data usable for the NP bracketing task, it needs to be extended with an annotation for the complete noun phrase structure. This information can be extracted from the WSJ corpus as well. The location of the chunking data files and the bracketing data files can be found in the Software and Data section on this page.

The performance of the machine algorithm is measured with two scores: precision and recall. Precision measures how many noun phrases found by the algorithm are correct and the recall rate contains the percentage of NPs defined in the corpus that were found by the chunking program. The machine learning algorithm is assumed to output a balanced bracketing structure which means that every opening bracket must match with a closing bracket and vice versa. The two rates can be combined in one measure: the F rate in which F = 2*precision*recall / (recall+precision) [Rij79].

The following results have been reported for this data set (CR=crossing rate):

             +------+-----------+--------++-------++
             |  CR  | precision | recall ||   F   ||
   +---------+------+-----------+--------++-------++
   | [TKS00] |      |   90.00%  | 78.38% || 83.79 || 
   | [TKS99] | 0.14 |   91.28%  | 76.06% || 82.98 || 
   +---------+------+-----------+--------++-------++

[KD00] have done similar work for both NPs and VPs. They obtained similar results with more training data but without using lexical information. [Bra99] has reported NP bracketing results for German.

NP chunking and NP bracketing are two intermediate steps to achieving the goal of the TMR network Learning Computational Grammars, to learn the structure of noun phrases.

Software and Data

Related information

References


Last update: May 08, 2005. erikt@uia.ua.ac.be