Using Wiktionary to build an Italian part-of-speech tagger

  Tom De Smedt (Computational Linguistics Research Group, University of Antwerp)
  Fabio Marfia (Dipartimento di Elettronica, Politecnico di Milano)

 

Pattern contains part-of-speech taggers for a number of languages (including English, Spanish, German, French and Dutch). Part-of-speech tagging is useful in many data mining tasks. A part-of-speech tagger takes a string of text and identifies the sentences and the words in the text along with their word type. The word type or part-of-speech can vary according to a word's role in the sentence. For example, in English, can can be a verb or a noun: in "Can I have a can of soda?", the first can is a verb and the second is a noun.

The output takes the following form:

Can I have a can of soda ?
MD PRP VB DT NN IN NN .

POS-tag MD indicates a modal verb, PRP a personal pronoun, VB a verb, DT a determiner, NN a noun and IN a preposition. The tags are part of the Penn Treebank II tagset.
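
For comparison, Pattern's bundled English parser produces this kind of output (a quick illustration; the exact tags depend on the tagger's lexicon and rules):

from pattern.en import parse

print parse("Can I have a can of soda?", chunks=False)
# Can/MD I/PRP have/VB a/DT can/NN of/IN soda/NN ?/.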

Pattern uses Brill's algorithm to construct its part-of-speech taggers. Other algorithms are more robust, but a Brill tagger is fast and compact (about 1 MB of data), which makes it a good fit for Pattern. There are many languages for which Pattern does (or did) not have a tagger, for example Italian.

Brill's algorithm essentially produces a lexicon of known words and their part-of-speech tag, along with rules that predict the tag of unknown words and rules that change a word's tag according to its role in the sentence.
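
In other words, tagging boils down to a lexicon lookup followed by rule application. Below is a minimal sketch of the lookup step (not Pattern's actual implementation), assuming a lexicon dictionary and NN as the default tag for unknown words:

def tag(words, lexicon, default="NN"):
    # Look up each word in the lexicon; unknown words get the default tag.
    # Morphological and contextual rules would then correct these tags.
    return [(w, lexicon.get(w, lexicon.get(w.lower(), default))) for w in words]

print tag(["Can", "I", "have", "a", "can", "of", "soda", "?"], lexicon={
    "can": "MD", "I": "PRP", "have": "VB", "a": "DT", "of": "IN", "soda": "NN", "?": "."})
# [('Can', 'MD'), ('I', 'PRP'), ('have', 'VB'), ('a', 'DT'),
#  ('can', 'MD'), ('of', 'IN'), ('soda', 'NN'), ('?', '.')]

Note that both occurrences of can end up tagged MD; the contextual rules in step 6 exist precisely to repair such cases.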

In the past, written text (e.g., 1 million words) had to be tagged manually by human annotators and then fed to the algorithm. Manual annotation is expensive and time-consuming. Today many resources are freely available. One such resource is Wiktionary, where many people collaborate to produce a free multilingual dictionary.

 


1. Mining Wiktionary for part-of-speech tags

If you take a look at: http://en.wiktionary.org/wiki/Index:Italian/a, you'll see a list of thousands of Italian words that start with a- together with their part-of-speech tag. Since Wiktionary's content is free, we can mine the HTML of the page to automatically populate a lexicon. We can also mine the pages for words starting with b-, c-, and so on, to expand our lexicon.

The following script uses the pattern.web module to accomplish this. The URL class has a download() method that retrieves the HTML from a given web address. The DOM class takes a string of HTML and transforms it into a tree of nested elements. We can then search the tree with CSS selectors for the elements we need, i.e., the words and their type:

from pattern.web import URL, DOM

url = "http://en.wiktionary.org/wiki/Index:Italian/"

lexicon = {}
for ch in "abcdefghijklmnopqrstuvwxyz0":
    print ch, len(lexicon)
    # Download the HTML source of each Wiktionary page (a-z).
    html = URL(url + ch).download(throttle=10, cached=True)
    # Parse the HTML tree.
    dom = DOM(html)
    # Iterate through the list of words and parse the part-of-speech tags.
    # Each word is a list item:
    # <li><a href="/wiki/additivo">additivo</a><i>n adj</i></li>
    for li in dom("li"):
        try:
            word = li("a")[0].content
            pos = li("i")[0].content.split(" ")
            if word not in lexicon:
                lexicon[word] = []
            lexicon[word].extend(pos)
        except:
            pass

We end up with a lexicon dictionary that contains about 100,000 words, each linked to a list of part-of-speech tags. For example: la → DT, PRP, NN.

We don't have any tags for punctuation marks, but we can add them manually:

for punctuation, tag in (
  (u".", "."), (u'"', '"'), (u"+", "SYM"), (u"#", "#"),
  (u"?", "."), (u'“', '"'), (u"-", "SYM"), (u"$", "$"),
  (u"!", "."), (u'”', '"'), (u"*", "SYM"), (u"&", "CC"),
  (u"¡", "."), (u"(", "("), (u"=", "SYM"), (u"/", "CC"),
  (u":", ":"), (u")", ")"), (u"<", "SYM"), (u"%", "CD"),
  (u";", ":"), (u",", ","), (u">", "SYM"), (u"@", "IN"), (u"...", ".")):
    lexicon[punctuation] = [tag] # Store tags as a list, like the Wiktionary entries above.

 


2. Mining Wiktionary for word inflections 

In many languages, words inflect according to tense, mood, person, gender and number. This is true for verbs (discussed later) and often for nouns and adjectives. In Italian, the plural form of the noun affetto (affection) is affetti, while the plural feminine form of the adjective affetto (affected) is affette. Unfortunately, the inflected forms are not always in the Wiktionary index. We need to mine deeper to retrieve them. This is a time-consuming process. We need to set a high throttle between requests to avoid being blacklisted by Wiktionary's servers.

This script defines an inflect() function. Given a word, it returns a dictionary of word forms:

from pattern.web import URL, DOM, plaintext
import re

def inflect(word, language="Italian"):
    inflections = {}
    url = "http://en.wiktionary.org/wiki/" + word.replace(" ", "_") 
    dom = DOM(URL(url).download(throttle=10, cached=True))
    pos = ""
    # Search the header that marks the start for the given language:
    # <h2><span class="mw-headline" id="Italian">Italian</span></h2>
    e = dom("#" + language)[0].parent
    while e is not None: # e = e.next_sibling
        if e.type == "element":
            if e.tag == "hr": # Horizontal line = next language.
                break
            if e.tag == "h3": # <h3>Adjective [edit]</h3>
                pos = plaintext(e.content.lower())
                pos = pos.replace("[edit]", "").strip()[:3].rstrip("ouer") + "-"
            # Parse inflections, using regular expressions.
            s = plaintext(e.content)
            # affetto m (f affetta, m plural affetti, f plural affette)
            if s.startswith(word):
                for gender, regexp, i in (
                  ("m" , r"(" + word + r") m", 1),
                  ("f" , r"(" + word + r") f", 1),
                  ("m" , r"(" + word + r") (mf|m and f)", 1),
                  ("f" , r"(" + word + r") (mf|m and f)", 1),
                  ("m" , r"masculine:? (\S*?)(,|\))", 1),
                  ("f" , r"feminine:? (\S*?)(,|\))", 1),
                  ("m" , r"(\(|, )m(asculine)? (\S*?)(,|\))", 3),
                  ("f" , r"(\(|, )f(eminine)? (\S*?)(,|\))", 3),
                  ("mp", r"(\(|, )m(asculine)? plural (\S*?)(,|\))", 3),
                  ("fp", r"(\(|, )f(eminine)? plural (\S*?)(,|\))", 3),
                  ( "p", r"(\(|, )plural (\S*?)(,|\))", 2),
                  ( "p", r"m and f plural (\S*?)(,|\))", 1)):
                    m = re.search(regexp, s, re.I)
                    if m is not None:
                        # {"adj-m": "affetto", "adj-fp": "affette"}
                        inflections[pos + gender] = m.group(i)
            #print s
        e = e.next_sibling
    return inflections
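
For the adjective affetto discussed above, a call to inflect() should return something along these lines (the exact keys depend on which sections the Wiktionary entry lists):

print inflect("affetto")
# {"adj-m": "affetto", "adj-f": "affetta", "adj-mp": "affetti", "adj-fp": "affette", ...}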

We can add a call to inflect() for each noun, adjective or verb in the inner loop of our miner (see step 1):

if any(tag in pos for tag in ("n", "v", "adj")):
    for pg, w in inflect(word).items():
        p, g = pg.split("-") # pos + gender: ("adj", "f")
        if w not in lexicon:
            lexicon[w] = []
        if p not in lexicon[w]:
            lexicon[w].append(p)

 


3. Mining Wikipedia for texts

The lexicons bundled in Pattern are about 500KB to 1MB in file size. If we save our Italian lexicon as a file, it is about 2MB (or 4MB with the inflections from step 2). We may want to reduce it by removing less important words. Which words can we remove? Not la; it is clearly important in Italian. We can assess a word's importance by counting how many times it occurs in written text.

The following script uses the pattern.web module to retrieve Italian texts from Wikipedia. The Wikipedia class has a search() method that returns a WikipediaArticle. We then use the pattern.vector module to count the words in articles:

from pattern.web import Wikipedia
from pattern.vector import words

frequency = {}
# Spreading activation.
# Parse links from seed article & visit those articles.
links, seen = set(["Italia"]), {}
while len(links) > 0:
    try:
        article = Wikipedia(language="it").search(links.pop(), throttle=10)
        seen[article.title] = True
        # Parse links from article.
        for link in article.links:
            if link not in seen:
                links.add(link)
        # Parse words from article. Count words.
        for word in words(article.string):
            if word not in frequency:
                frequency[word] = 0
            frequency[word] += 1
        print sum(frequency.values()), article.title
    except:
        pass
    # Collect a reliable amount of words (e.g., 1M).
    if sum(frequency.values()) > 1000000:
        break

#top = sorted((count, word) for word, count in frequency.items())
#top = top[-1000:]
#print top

We can also strengthen the word counts by including contemporary newspaper articles:

from glob import glob

# Up-to-date newspaper articles:
for f in glob("repubblica-*.txt"):
    for word in words(open(f).read()):
        if word not in frequency:
            frequency[word] = 0
        frequency[word] += 1

We end up with a frequency dictionary of about 1,000,000 words (115,000 unique words) and their word count. For example, di occurs 70,000 times, la occurs 30,000 times and indecifrabilmente (indecipherably) occurs a single time. This is a word that we could remove and replace with a morphological rule: -mente → RB (adverb). Morphological rules are discussed in step 5.
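
As a rough sketch of the idea (assuming the lexicon from step 1 and the frequency dictionary above), we could prune rare words whose tag is predictable from such a rule. Step 4 below takes a more general approach and simply sorts the lexicon by frequency:

# Rough sketch: drop words that occur at most once in the mined texts
# and whose tag can be recovered by the -mente -> RB suffix rule.
for word in list(lexicon.keys()):
    if frequency.get(word, 0) <= 1 and word.endswith("mente"):
        del lexicon[word]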

 


4. Preprocessing a CSV-file

This is a good time to store the data (so we don't need to rerun the miner). We map Wiktionary's word tags to Penn Treebank II, and combine the entries in lexicon and frequency. We then use pattern.db to store the result as a CSV-file.

from pattern.db import Datasheet

PENN = {  "n": "NN",
          "v": "VB",
        "adj": "JJ",
        "adv": "RB",
    "article": "DT",
       "prep": "IN",
       "conj": "CC",
        "num": "CD",
        "int": "UH",
    "pronoun": "PRP",
     "proper": "NNP" 
}
     
SPECIAL = ["abbr", "contraction"]
special = set()

csv = Datasheet()
for word, pos in lexicon.items():
    if " " not in word:
        f = frequency.get(word, frequency.get(word.lower(), 0))
        # Map to Penn Treebank II tagset.
        penn  = [PENN[tag] for tag in pos if tag in PENN]
        penn += [tag for tag in pos if tag in ("SYM", ".", ",", ":", "\"", "(", ")", "#", "$")]
        penn  = ", ".join(penn)
        # Collect tagged words in the .csv file.
        csv.append((f, word, penn))
        # Collect special words for post-processing.
        for tag in SPECIAL:
            if tag in pos:
                special.add(word)

csv.columns[0].sort(reverse=True)
csv.save("it-lexicon.csv")

print special

We end up with a CSV-file of Italian words and their part-of-speech tag, sorted by frequency:

Frequency   Word    Part of speech
71,655      di      IN
44,934      e       CC
32,216      il      DT
29,378      la      DT, PRP, NN
26,998      che     PRP, JJ, CC
26,702      in      IN
23,617      a       IN
22,581      del
18,577      per     IN
16,824      della

Distribution of Italian words. Top five is di, e, il, la, che.


As shown, the distribution of words approximates Zipf's law: the most frequent word occurs nearly twice as often as the second most frequent word, and so on. The top 10% most frequent words cover 90% of Italian language use. This implies that we can remove part of "Zipf's long tail" (words that occur only once). If we have a lexicon that covers the top 10% and tag all unknown words as NN, we have a tagger that is about 90% accurate. This is the baseline. We can improve it by 1-5% by determining good morphological and contextual rules for unknown words.
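
We can check this directly with the frequency dictionary from step 3 (a quick sanity check; the exact figure depends on the texts we mined):

counts = sorted(frequency.values(), reverse=True)
total = float(sum(counts))
# Share of all word occurrences covered by the top 10% most frequent words.
print sum(counts[:len(counts) / 10]) / total # Roughly 0.9.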

 


5. Morphological rules based on word suffixes

When we remove words from the lexicon (to reduce file size), the tagger may no longer recognize some words. By default, it will tag unknown words as NN. We can improve the tags of unknown words using morphological rules. Examine the English en-morphology.txt to see the rule format.

One way to predict tags is to look at word suffixes. For example, English adverbs usually end in -ly. In Italian they end in -mente. The following script determines the most frequent tag for each word suffix:

from pattern.db import Datasheet
from collections import defaultdict

lexicon = {}
for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
    lexicon[word] = tags.split(", ")

# {"mente": {"RB": 2956.0, "JJ": 8.0, NN: "2.0"}}
suffix = defaultdict(lambda: defaultdict(float))
for w in lexicon:
    if len(w) > 5:
        x = w[-5:] # Last 5 characters.
        for tag in lexicon[w]:
            suffix[x][tag] += 1.0

# Map the dictionary to a list sorted by total tag count.
suffix = [(sum(tags.values()), x, tags) for x, tags in suffix.items()]
suffix = sorted(suffix, reverse=True)

for n, x, tags in suffix[:100]:
    # Relative count per tag (0.0-1.0).
    # This shows the tag distribution per suffix more clearly.
    tags = [("%.3f" % (i/n), tag) for tag, i in tags.items()]
    tags = sorted(tags, reverse=True)
    print x, n, tags

Suffix   Frequency   Parts of speech
-mente   2,969       99% RB + 0.5% JJ + 0.5% NN
-zione   2,501       99% NN + 0.5% JJ + 0.5% NNP
-abile   1,400       97% JJ + 2% NN + 0.5% RB + 0.5% NNP
-mento   1,375       99% NN + 0.5% VB + 0.5% JJ
-atore   1,218       84% NN + 16% JJ

We can also run the script for suffixes of 4 or 3 characters. We then manually construct an it-morphology.txt file with interesting rules. For example: -mente → RB has a high coverage (2,969 words in the lexicon) and a high precision (99%). We add this rule to the ruleset:

NN mente fhassuf 5 RB x
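
The same reasoning can be applied to the other suffixes in the table. For example, -zione and -mento need no rule (unknown words are tagged NN by default anyway), while -abile → JJ looks like another good candidate:

NN abile fhassuf 5 JJ x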

 


6. Contextual rules

When we constructed a CSV-file (see step 4), we saw that some words can have multiple tags, depending on their role in the sentence. In English, in "I can", "you can" or "we can", can is a verb. In "a can" and "the can" it is a noun. We could generalize this in two contextual rules: PRP + can → VB, and DT + can → NN. Examine the English en-context.txt to see the rule format.
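
In that file, a word-specific rule of this kind can be written with the WDPREVTAG command (old tag, new tag, preceding tag, word), the same command we use for Italian at the end of this step. Assuming can were listed as NN in the lexicon, the first rule would read:

NN VB WDPREVTAG PRP can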

We can create contextual rules by hand. We can also analyze a corpus of tagged texts (a treebank) to predict how word tags change according to the surrounding words. However, a corpus of tagged texts implies that it was tagged with another part-of-speech tagger. It is a thin line between using someone else's tagger and plagiarizing someone else's tagger. We should contact the authors and/or cite their work.

For Italian, we can use the freely available WaCky corpus (Baroni, Bernardini, Ferraresi & Zanchetta, 2009). The following script reads 1 million words from the WaCky MultiTag Wikipedia corpus. For words that can have multiple tags, it records the tag of the preceding word and its frequency:

from pattern.db import Datasheet
from codecs import open
from collections import defaultdict

ambiguous = {}
for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
    tags = tags.split(", ")
    tags = [tag for tag in tags if tag]
    if len(tags) != 1 and int(frequency) > 100:
        ambiguous[word] = (int(frequency), tags)

# Map TANL tags to Penn Treebank II.
# medialab.di.unipi.it/wiki/Tanl_POS_Tagset
TANL = {
     "A": "JJ",
     "B": "RB",
     "C": "CC", "CC": "CC", "CS": "IN",
     "D": "DT",
     "E": "IN",
    "FF": ",", "FS": ".", "FB": "(",
     "I": "UH",
     "N": "CD",
     "P": "PRP", "PP": "PRP$",
     "R": "DT",
     "S": "NN", "SP": "NNP", 
     "T": "DT",
     "V": "VB", "VM": "MD"
}

# Word tags linked to frequency of preceding word tag:
# {"le": {"DT": {"IN": 1580}, "PRP": {"VB": 105}}}
context = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

window = [] # [(word1, tag1), (word2, tag2), (word3, tag3)]

for i, s in enumerate(open("/downloads/wikiMT", encoding="utf-8")):
    s = s.split("\t")
    if i > 1000000:
        break
    if i > 1 and len(s) >= 3:
        word, tag = s[0:2] # ("l'", "RD", "il")
        tag = TANL.get(tag[:2]) or \
              TANL.get(tag[:1]) or tag
        window.append((word, tag))
    if len(window) > 3:
        window.pop(0)
    if len(window) == 3 and window[1][0] in ambiguous:
        w1, tag1 = window[0] # word left
        w2, tag2 = window[1] # word that can have multiple tags
        w3, tag3 = window[2] # word right
        context[w2][tag2][tag1] += 1

We can then examine the output, sorted by word frequency:

for word in reversed(sorted(ambiguous, key=lambda k: ambiguous[k][0])):
    print word
    for tag in context[word]:
        left = context[word][tag]
        s = float(sum(left.values()))
        left = [("%.2f" % (n / s), x) for x, n in left.items()]
        left = sorted(left, reverse=True)
        print "\t", int(s), tag, left[:5]

Word   Tag   Frequency   Preceding part-of-speech tag
la     DT    14,379      31% VB + 27% IN + 9% CC
la     PRP   346         58% VB + 12% CC + 11% PRP
la     NN    7           43% NN + 43% IN + 14% VB
che    PRP   7,938       38% NN + 28% + 12% JJ
che    IN    2,789       46% VB + 17% RB + 13% NN
che    DT    129         32% NN + 24% + 8% VB
che    CC    123         33% JJ + 17% NN + 15% IN

What can we learn from the output? The word la (DT, PRP or NN?) we can simply tag as DT in our lexicon, since the other cases are negligible (2%). The word che (PRP, IN, DT or CC?) we can tag as PRP in our lexicon (covering 72% of all cases) and create a rule VB + che → IN (covering another 12%). We manually add this rule to the ruleset:

PRP IN WDPREVTAG VB che

We can run variations of the above script that look at the tag of the word after, or at the tags of the words both before and after.

Reference:

Baroni, M., Bernardini, S., Ferraresi, A. & Zanchetta, A. (2009). The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3), 209–226.

 


7. Subclassing the pattern.text Parser class

In summary, we constructed an it-lexicon.csv with the frequency and part-of-speech tags of known words (steps 1-4) together with an it-morphology.txt (step 5) and an it-context.txt (step 6). We can use these to create a parser for Italian by subclassing the base Parser in the pattern.text module. The pattern.text module has base classes for Parser, Lexicon, Morphology, etc. Take a moment to review the source code, and the source code of other parsers in Pattern. You'll notice that all parsers follow the same simple steps. A template for new parsers is included in pattern.text.xx.

The Parser base class has the following methods with default behavior:

  • Parser.find_tokens()  finds sentence markers (.?!) and splits punctuation marks from words,
  • Parser.find_tags()    finds word part-of-speech tags,
  • Parser.find_chunks()  finds words that belong together (e.g., the black cats),
  • Parser.find_labels()  finds word roles in the sentence (e.g., subject and object), 
  • Parser.find_lemmata() finds word base forms (cats → cat)
  • Parser.parse()        executes the above steps on a given string.

We will need to redefine find_tokens() with rules for Italian abbreviations and contractions (e.g., dell'anno = di + l' + anno). Remember the special set in step 4? It contains the data we need:

from pattern.text import Parser

ABBREVIATIONS = [
    "a.C.", "all.", "apr.", "b.c.", "c.m.", "C.V.", "d.C.", 
    "Dott.", "ecc.", "egr.", "giu.", "Ing.", "orch.", "p.es.", 
    "Prof.", "prof.", "ql.co.", "Spett."
]

CONTRACTIONS = {
     "all'": "all' ",
    "anch'": "anch' ",
       "c'": "c' ",
    "coll'": "coll' ",
     "com'": "com' ",
    "dall'": "dall' ",
    "dell'": "dell' ",
     "dev'": "dev' ",
     "dov'": "dov' ",
      "mo'": "mo' ",
    "nell'": "nell' ",
    "sull'": "sull' "
}

class ItalianParser(Parser):
    
    def find_tokens(self, tokens, **kwargs):
        kwargs.setdefault("abbreviations", ABBREVIATIONS)
        kwargs.setdefault("replace", CONTRACTIONS)
        return Parser.find_tokens(self, tokens, **kwargs)

We can then create an instance of the ItalianParser and feed it our data. We need to convert it-lexicon.csv to an it-lexicon.txt file in the right format (a word and its tag on each line). This only needs to happen the first time, of course.

from pattern.db import Datasheet
from codecs import open

w = []
for frequency, word, tags in Datasheet.load("it-lexicon.csv"):
    if int(frequency) >= 1: # Adjust to tweak file size.
        for tag in tags.split(", "):
            if tag:
                w.append("%s %s" % (word, tag)); break

open("it-lexicon.txt", "w", encoding="utf-8").write("\n".join(w))

Load the lexicon and the rules in an instance of ItalianParser:

from pattern.text import Lexicon

lexicon = Lexicon(
        path = "it-lexicon.txt", 
  morphology = "it-morphology.txt", 
     context = "it-context.txt", 
    language = "it"
)

parser = ItalianParser(
     lexicon = lexicon,
     default = ("NN", "NNP", "CD"),
    language = "it"
)

def parse(s, *args, **kwargs):
    return parser.parse(s, *args, **kwargs)

It is still missing features (notably lemmatization), but our Italian parser is essentially ready for use:

print parse("Il gatto nero faceva le fusa.")
Il gatto nero faceva le fusa .
DT NN JJ VB DT NN .

In the next steps, we will look at how we can enrich the parser with a lemmatizer, based on verb conjugation and noun singularization. 
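
As a preview, here is a naive sketch of such a lemmatizer that redefines find_lemmata() with a few suffix rules. It assumes find_lemmata() receives a list of [word, tag] tokens and should append a lemma to each, as in Pattern's other parsers; a real implementation would use the conjugation data mined in step 8:

class ItalianParser(Parser):

    def find_tokens(self, tokens, **kwargs):
        kwargs.setdefault("abbreviations", ABBREVIATIONS)
        kwargs.setdefault("replace", CONTRACTIONS)
        return Parser.find_tokens(self, tokens, **kwargs)

    def find_lemmata(self, tokens, **kwargs):
        # Naive sketch: reduce plural nouns to a singular form
        # with simple suffix rules (gatti -> gatto, case -> casa).
        for token in tokens:
            word, tag = token[0], token[1]
            lemma = word.lower()
            if tag.startswith("NN"):
                if lemma.endswith("i"):
                    lemma = lemma[:-1] + "o"
                elif lemma.endswith("e"):
                    lemma = lemma[:-1] + "a"
            token.append(lemma)
        return tokens

Calling parse(s, lemmata=True) would then include these lemmata in the output.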

 


8. Mining Wiktionary for verb conjugations

Italian verbs inflect by person, tense and mood. For example, the first-person singular present indicative of essere (to be) is sono (I am). We can mine the verb conjugation tables from Wiktionary for frequent verbs, and use the data to expand our lexicon or to build a lemmatizer.

The verb conjugation table for a given verb is on the same page that we mined with inflect() in step 2. So, for many words the HTML may already be cached locally and the process should not take too long.

from pattern.web import URL, DOM, plaintext as plain

MOOD, TENSE, PARTICIPLE = (
    ("indicative", "conditional", "subjunctive", "imperative"),
    ("present", "imperfect", "past historic", "future"),
    ("present participle", "past participle")
)

def conjugate(verb, language="Italian"):
    url  = URL("http://en.wiktionary.org/wiki/%s" % verb)
    dom  = DOM(url.download(throttle=10, cached=True))
    conj = {"infinitive": verb}
    mood = None
    for table in dom("table.inflection-table"):
        # Search the header that marks the start for the given language:
        # <h2><span class="mw-headline" id="Italian">Italian</span></h2>
        h2 = table.parent.parent
        while h2 is not None and getattr(h2, "tag", "") != "h2":
            h2 = h2.previous
        if h2 is None or getattr(h2("span")[0], "id", "") != language:
            continue # Table belongs to another language; skip it.
        for tr in table("tr"):
            for th in tr("th"):
                # <th>indicative</th>
                if th.content in MOOD:
                    mood = th.content
                # <th>present</th><td>sono</td><td>sei></td>...
                if th.content in TENSE:
                    conj[th.content, mood] = [plain(td.content) for td in tr("td")]
                # <th>gerund</th><td>essendo</td>
                if th.content in PARTICIPLE:
                    conj[th.content] = plain(th.next.next.content)
            # <th>imperative</th></tr><tr><td></td><td>sii</td>...
            if mood == "imperative" and len(tr("th")) == 0:
                conj["present", mood] = [plain(td.content) for td in tr("td")]
        return conj
    return {}
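
For essere, the returned dictionary should contain entries along these lines (assuming the layout of the Wiktionary conjugation table has not changed):

print conjugate("essere")
# {"infinitive": "essere",
#  "present participle": "essente",
#  "past participle": "stato",
#  ("present", "indicative"): ["sono", "sei", "è", "siamo", "siete", "sono"],
#  ...}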