Clause Identification

Clauses are word sequences which contain a subject and a predicate. Here is an example of a sentence and its clauses obtained from Wall Street Journal section 15 of the Penn Treebank [MSM93]:

   (S The deregulation of railroads and trucking companies
      (SBAR that
          (S began in 1980)
      ) 
      enabled
      (S shippers to bargain for transportation)
      .
   )

The clauses of this sentence have been enclosed between brackets. A tag next to the open bracket denotes the type of the clause.

In the CoNLL-2001 shared task, the goal is to identify clauses in text. Training and test data for this task are available. This data consists of the same partitions of the Wall Street Journal part (WSJ) of the Penn Treebank as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The clause segmentation of the data has been derived from the Penn Treebank by a program written by Sabine Buchholz from Tilburg University, The Netherlands.

The shared task consists of three parts: identifying clause start positions, recognizing clause end positions and building complete clauses. We have not used clauses labeled with FRAG or RRC, and all clause labels have been converted to S. The goal of this task is to come forward with machine learning methods which after a training phase can recognize the clause segmentation of the test data as well as possible. For all three parts of the shared task, the clause segmentation methods will be evaluated with the F rate, which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [Rij79].

Background Information

There have been some earlier studies in identifying clauses. [Abn90] used a clause filter as a part of his CASS parser. It consists of two parts: one for recognizing basic clauses and one for repairing difficult cases (clauses without subjects and clauses with additional VPs). [Eje96] showed that a parser can benefit from automatically identified clause boundaries in discourse. [Lef98] built a rule-based algorithm for finding clauses in English and Portuguese texts. [Ora00] used memory-based learning techniques for finding clauses in the Susanne corpus. His system included a rule-based post-processing phase for improving clause recognition performance.

Software and Data

The train and test data consist of four columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second a part-of-speech tag derived by the Brill tagger, the third a chunk tag generated by a chunker [TKS00] and the fourth a corresponding clause tag extracted from the Penn Treebank. The chunk tags contain two parts: one stating whether the word is chunk initial (B) or not (I), and one holding the type of the chunk (NP, VP, PP, etcetera). There are two varieties of chunk tags: one for start/end parts which uses S (start), E (end) and X (neither), and one for the complete clause segmentation which contains (S* (start), *S) (end), * (neither) and several combinations such as (S(S*S) (two clause starts and one clause end). Here is an example:

              The   DT   B-NP  S/X/(S*
     deregulation   NN   I-NP  X/X/*
               of   IN   B-PP  X/X/*
        railroads  NNS   B-NP  X/X/*
              and   CC      O  X/X/*
         trucking   NN   B-NP  X/X/*
        companies  NNS   I-NP  X/X/*
             that  WDT   B-NP  S/X/(S*
            began  VBD   B-VP  S/X/(S*
               in   IN   B-PP  X/X/*
             1980   CD   B-NP  X/E/*S)S)
          enabled  VBD   B-VP  X/X/*
         shippers  NNS   B-NP  S/X/(S*
               to   TO   B-VP  X/X/*
          bargain   VB   I-VP  X/X/*
              for   IN   B-PP  X/X/*
   transportation   NN   B-NP  X/E/*S)
                .    .      O  X/E/*S)

In this example, the fourth column contains the clause tags for part 1, 2 and 3 of the shared task separated by slashes. In the third column, the O chunk tag is used for tokens which are not part of any chunk.

There are two evaluation programs (Perl) available: one for parts 1 and 2 (conlleval1) and one for part 3 (conlleval3). The input of the programs should consist of a file which is the same as the test data but which contains an additional final column which holds the results of that should be evaluated. The programs should be invoked as conlleval1 < file

http://www.cnts.ua.ac.be/conll2001/clauses/clauses.tgz
The data sets and evaluation software for this shared task in one gzipped tar file. You can also retrieve these files one by one: data and software. The first two columns in the data files have been extracted from the [RM95] NP chunking data which is available from: ftp://ftp.cis.upenn.edu/pub/chunker/
http://ilk.uvt.nl/team/sabine/homepage/software.html
The Perl script that was used for generating these training and test data sets from the Penn Treebank. It has been written by Sabine Buchholz from Tilburg University.

Results

Six systems have participated in the CoNLL-2001 shared task. They used a wide variety of machine learning techniques. Here is an overview of their performance on the test data of part 3 of the shared task (full clause identification) of the systems that have participated in the shared task together with other results (*) for this data set that were published after the workshop:

              +-----------+-----------++-----------++
     test org | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CMPR02] |   90.18%  |   72.59%  ||   80.44   || (*)
   | [CM01]   |   84.82%  |   73.28%  ||   78.63   ||
   | [MP01]   |   70.89%  |   65.57%  ||   68.12   ||
   | [TKS01]  |   76.91%  |   60.61%  ||   67.79   ||
   | [PG01]   |   73.75%  |   60.00%  ||   66.17   ||
   | [Dej01]  |   72.56%  |   54.55%  ||   62.27   ||
   | [Ham01]  |   55.81%  |   45.99%  ||   50.42   ||
   +----------+-----------+-----------++-----------++
   | baseline |   98.44%  |   31.48%  ||   47.71   ||
   +----------+-----------+-----------++-----------++

The baseline results were produced by a system which only put clause brackets around sentences. All of the participating systems outperformed the baseline. Most systems obtained an F-rate between 62 and 68. One performed below the rest [Ham01] but has not used all training data. The system of Xavier Carreras and Luís Màrquez [CM01] outperformed all other systems both on the main part of the shared task (F=78.63) as the other two parts. It uses AdaBoost applied to decision trees.

Xavier Carreras has reported errors in the test data set testb3 which concerned the presence of duplicate clauses: (S(S words S)S). These clauses have been removed on August 3, 2003. Here are the results of the systems that participated in shared task for the corrected test data set:

              +-----------+-----------++-----------++
     test cor | precision |   recall  ||     F     ||
   +----------+-----------+-----------++-----------++
   | [CM03]   |   87.99%  |   81.01%  ||   84.36   || (*)
   | [CMPR02] |   90.18%  |   78.11%  ||   83.71   || (*)
   | [CM01]   |   84.82%  |   78.85%  ||   81.73   ||
   | [MP01]   |   70.85%  |   70.51%  ||   70.68   ||
   | [TKS01]  |   76.91%  |   65.22%  ||   70.58   ||
   | [PG01]   |   73.75%  |   64.56%  ||   68.85   ||
   | [Dej01]  |   72.56%  |   58.69%  ||   64.89   ||
   | [Ham01]  |   55.81%  |   49.49%  ||   52.46   ||
   +----------+-----------+-----------++-----------++
   | baseline |   98.44%  |   33.88%  ||   50.41   ||
   +----------+-----------+-----------++-----------++

The correction influences the recall and F rates, all of which improve.

The papers associated with the participating systems can be found in the reference section below.

Related information

http://www.cnts.ua.ac.be/conll2001/
Home page of the workshop on Computational Natural Language Learning (CoNLL-2001)

References

This reference section contains two parts: first the papers from the shared task session at CoNLL-2001 and then the other related publications.

CoNLL-2001 Shared Task Papers

Note: at the workshop some of the participants have presented results that are different from the ones mentioned in their paper. Whenever possible an update of the paper with the improved results is available alongside the original version.

[TD01]
Erik F. Tjong Kim Sang and Hervé Déjean, Introduction to the CoNLL-2001 Shared Task: Clause Identification. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
original: [abstract] [ps] [pdf] [bibtex]
update: [abstract] [ps] [pdf] [bibtex]
sheets: [ps] [pdf]
[CM01]
Xavier Carreras and Luís Màrquez, Boosting Trees for Clause Splitting. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
[ps] [pdf] [bibtex] [system output]
[Dej01]
Hervé Déjean, Using ALLiS for Clausing. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
[abstract] [ps] [pdf] [bibtex] [system output]
[Ham01]
James Hammerton, Clause identification with Long Short-Term Memory. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
[ps] [pdf] [bibtex] [system output]
[MP01]
Antonio Molina and Ferran Pla, Clause Detection using HMM. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
original: [ps] [pdf] [bibtex] [system output]
update: paper not available [system output]
[PG01]
Jon D. Patrick and Ishaan Goyal, Boosted Decision Graphs for NLP Learning Tasks. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
original: [abstract] [ps] [pdf] [bibtex] [system output]
update: paper not available [system output]
[TKS01]
Erik F. Tjong Kim Sang, Memory-Based Clause Identification. In: Proceedings of CoNLL-2001, Toulouse, France, 2001.
original: [ps] [pdf] [bibtex] [system output]
update: [ps] [pdf] [bibtex] [system output]

Other related publications

[Abn90]
Steven Abney, Rapid Incremental Parsing with Repair. In "Proceedings of the 8th New OED Conference: Electronic Text Research", University of Waterloo, Ontario, 1990.
http://whorf.sfs.nphil.uni-tuebingen.de/~abney/90j.ps.gz
[Bri94]
Eric Brill, Some Advances in Rule-Based Part of Speech Tagging. In "Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94)", Seattle, Washington, 1994.
ftp://ftp.cs.columbia.edu/pub/cs4999/brill94.ps
[CM03]
Xavier Carreras and Lluís Màrquez, Phrase Recognition by Filtering and Ranking with Perceptrons. In "Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP-2003", Borovets, Bulgaria, 2003.
http:// www.lsi.upc.es/~nlp/papers/2003/ranlp2003-cm.ps.gz
[CMPR02]
Xavier Carreras, Luís Màrquez, Vasin Punyakanok and Dan Roth, Learning and Inference for Clause Identification. In "Proceedings of the 13th European Conference on Machine Learning", ECML'02, Helsinki, Finland, 2002.
http://www.lsi.upc.es/~nlp/papers/2002/ecml02-cmpr.ps.gz
[Eje96]
Eva Ejerhed, Finite State Segmentation of Discourse into Clauses. In "Proceedings of the ECAI '96 Workshop on Extended finite state models of language", ECAI '96, Budapest, Hungary, 1996.
http://www.kornai.com/ECAI/ejerhed.ps.gz
[Lef98]
Vilson J. Leffa, Clause processing in complex sentences. In: "Proceedings of LREC'98", Granada, Espanha, 1998.
http://atlas.ucpel.tche.br/~leffa/granada.pdf
[MSM93]
Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz, Building a large annotated corpus of English: the Penn Treebank, In: "Computational Linguistics", 19:2, 1993.
http://morph.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html (paper)
http://morph.ldc.upenn.edu/Catalog/LDC95T7.html (corpus information)
[Ora00]
Constantin Orasan, A hybrid method for clause splitting in unrestricted English texts, In: "Proceedings of ACIDCA'2000", Monastir, Tunisia, 2000.
http://www.wlv.ac.uk/sles/compling/papers/orasan-00.pdf
[RM95]
Lance A. Ramshaw and Mitchell P. Marcus, Text Chunking Using Transformation-Based Learning. In: "Proceedings of the Third ACL Workshop on Very Large Corpora", Association for Computational Linguistics, 1995.
ftp://ftp.cis.upenn.edu/pub/chunker/wvlcbook.ps.gz
[Rij79]
C.J. van Rijsbergen, "Information Retrieval". Buttersworth, 1979.
[TKS00]
Erik F. Tjong Kim Sang. Text Chunking by System Combination. In "Proceedings of CoNLL-2000 and LLL-2000", Lisbon, Portugal, 2000.
http://www.cnts.ua.ac.be/conll2000/ps/15153tjo.ps
[TKS02]
Erik F. Tjong Kim Sang, Memory-Based Shallow Parsing, In Journal of Machine Learning Research, volume 2 (March), 2002, pp. 559-594.
http://arXiv.org/abs/cs.CL/0204049

Last update: April 19, 2011 erikt(at)xs4all.nl