100 days of web mining

In this experiment, we collected Google News stories at regular 1-hour intervals between November 22, 2010, and March 8, 2011, resulting in a set of 6,405 news stories. We grouped these per day and then determined the top daily keywords using tf-idf, a measure of how important a word is to one document relative to the rest of the collection. For example, if the word news is mentioned every day, it is not particularly unique on any given day.
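
To make the idea concrete, here is a minimal tf-idf sketch in plain Python 3, without Pattern. It assumes the common textbook definition (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact formula Pattern uses may differ. The function name tfidf() and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {word: score} dictionary per document.

    documents is a list of word lists (hypothetical toy input).
    tf  = count of the word / number of words in the document
    idf = log(number of documents / number of documents with the word)
    """
    n = len(documents)
    df = Counter()
    for words in documents:
        df.update(set(words))  # Count each word once per document.
    scores = []
    for words in documents:
        tf = Counter(words)
        scores.append({w: (c / len(words)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

# Three hypothetical "days" of news, each reduced to a few words.
days = [
    "news attack island news".split(),
    "news arrest wikileaks".split(),
    "news unrest egypt news".split(),
]
scores = tfidf(days)
for day in scores:
    print(sorted(day, key=day.get, reverse=True)[:2])
```

The word news occurs in every document, so its idf is log(1) = 0 and it never surfaces as a keyword, however often it is repeated.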

To set up the experiment we used the Pattern web mining module for Python.
The basic script is simple enough:

from pattern.web    import Newsfeed, plaintext
from pattern.db     import date
from pattern.vector import Model, Document, LEMMA
news, url = {}, 'http://news.google.com/news?output=rss'
for story in Newsfeed().search(url, cached=False):
    d = str(date(story.date, format='%Y-%m-%d'))
    s = plaintext(story.description)
    # Each key in the news dictionary is a date: news is grouped per day.
    # Each value is a dictionary of id => story items.
    # We use hash(story.description) as a unique id to avoid duplicate content.
    news.setdefault(d, {})[hash(s)] = s

Your code will probably have some preprocessing steps to save and load the mined news updates.

m = Model()
for day, stories in news.items():
    # The loop variable is called day so it does not shadow the imported date().
    s = stories.values()
    s = ' '.join(s).lower()
    # Each day of news is a single document.
    # By adding all documents to a model we can calculate tf-idf.
    m.append(Document(s, stemmer=LEMMA, exclude=['news', 'day'], name=day))
for document in m:
    print document.name
    print document.keywords(top=10)

In the image below, important words (i.e., events) that occurred across multiple days are highlighted (we took a word's document frequency as an indication). You might remember the North Korean artillery attack on a South Korean island on November 23, 2010, the arrest of Julian Assange of WikiLeaks at the beginning of December, the shooting of congresswoman Gabrielle Giffords, the unrest in Egypt and the subsequent ousting of Hosni Mubarak, and the Libyan revolt.




Simultaneously, we mined Twitter messages containing the words I love or I hate – 35,784 love-tweets and 35,212 hate-tweets in total. One would expect a correlation between important media events and strongly voiced opinions on Twitter, right? Not so. Out of all hate-tweets, only one matched a media event. On November 24, the most discussed word in hate-tweets was food, correlating with news on December 1st (but this relation is not very meaningful).

The name Mubarak (for example) was only mentioned five times in our Twitter-corpus (e.g., in "I love you as much as Mubarak loves his chair", or in "How do I hate thee, Mubarak? Let the people count the ways"). The names of Gabrielle Giffords or Julian Assange were never mentioned. Perhaps we missed a number of tweets correlating with media events. Perhaps the Twitter buzz does not discuss news in terms of I love or I hate. Instead, consider these tweets from the hate-corpus which are exemplary in terms of language use: "I hate when dudes text me boring shit, whats up uvsvorjibne go fuck yourself!", or "I hate that my son is making me watch this dumb ass movie. I'm gonna fart and see how long it takes for him to notice". The word ass occurs 2,439 times. Phrases containing I love or I hate seem to be used predominantly by teenagers. Perhaps adults retweet or express their opinions in more intricate language forms such as irony or sarcasm.

We then calculated document frequency for each word in the hate-corpus. A higher document frequency indicates that the word is present in more documents (i.e., bundles of daily tweets). By taking the most frequent words we get an idea of words habitually used in the corpus during a 100-day timeframe:

m = Model.load('tweet-hate.pickle')
words = m.vector.keys()
# Pair each word with its document frequency and sort, highest first.
words = sorted(((m.df(w), w) for w in words), reverse=True)
print words[:10]
Top 10: bitch shit school girl time fuck friend ass nigga justin bieber
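
For readers without a saved model at hand, document frequency can be sketched in plain Python 3. This assumes document frequency is expressed as the fraction of documents containing a word; the function name document_frequency() and the sample bundles are hypothetical.

```python
from collections import Counter

def document_frequency(documents):
    """Map each word to the fraction of documents that contain it."""
    counts = Counter()
    for text in documents:
        # A set ensures a word counts once per document, however often it occurs.
        counts.update(set(text.lower().split()))
    return {w: c / len(documents) for w, c in counts.items()}

# Hypothetical daily bundles of tweets (one string per day).
daily_bundles = [
    "i hate school i hate mondays",
    "i hate the cold and i hate school",
    "i hate waking up early",
]
df = document_frequency(daily_bundles)
# Sort by descending frequency, then alphabetically to break ties.
top = sorted(df.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top)
```

Words that score 1.0 appear in every daily bundle; those are the habitual words of the corpus rather than the words tied to a particular day.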


Daily drudge

Assuming we have a collection of hate-tweets organised as a list of (tweet, date)-tuples, it is not difficult to group the tweets by day and look at the difference between weekdays and weekends. In general, this difference is small, likely because Twitter messages are retweeted across days.

daily = {}
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
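for tweet, date in hate_tweets:
    # Collect tweets in a dictionary indexed by weekday.
    # Assumes date is a datetime-like object: weekday() returns 0 for Monday.
    daily.setdefault(days[date.weekday()], []).append(tweet)

m = Model()
for k, v in daily.items():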
    m.append(Document(' '.join(v).lower(), name=k, stemmer=LEMMA))
for document in m:
    print document.name
    print document.keywords(10)

Here are the top keywords of hate-tweets grouped by day:

Monday     Tuesday    Wednesday  Thursday   Friday     Saturday   Sunday
5622       6610       5689       5727       4596       3472       3573
monday     school     shit       shit       shit       shit       shit
shit       shit       time       time       time       time       girl
time       time       school     girl       girl       girl       time
school     bitch      girl       bitch      school     ass        bitch
bitch      damn       bitch      talk       bitch      bitch      fuck
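
The second row of the table holds the number of hate-tweets per weekday. As a small sketch in plain Python 3, those counts can be compared with the weekend average (the variable names are our own):

```python
# Tweet counts per weekday, copied from the table above.
counts = {
    'mon': 5622, 'tue': 6610, 'wed': 5689, 'thu': 5727,
    'fri': 4596, 'sat': 3472, 'sun': 3573,
}
# Average number of tweets on a weekend day.
weekend = (counts['sat'] + counts['sun']) / 2.0
# The weekday with the most tweets, and how it compares to the weekend.
busiest = max(counts, key=counts.get)
ratio = counts[busiest] / weekend
print(busiest, round(ratio, 2))
```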

Traffic is heaviest on Tuesday, with almost twice as many tweets as on a weekend day. The litany of swear words is constant across different days. If we filter these out we arrive at a new level of universal annoyance:

Monday     Tuesday    Wednesday  Thursday   Friday     Saturday   Sunday
monday     sleep      sick       sick       sick       sleep      katie
morning    sick       snow       home       song       home       monday
sleep      cold       cold       cold       sleep      night      justin
home       home       song       morning    cold       wake       sick
sick       morning    morning    snow       morning    hair       real

Here, tweets appear to be preoccupied with sleep, early mornings, bad weather and sickness (our data is from November to March). On Saturday, however, hair and nighttime play a more prominent role. Sunday is Justin Bieber-bashing day. Some more filtering reveals the importance of cars and dates on Friday, movies on Saturday, games on Sunday, and mothers in general:

Monday     Tuesday    Wednesday  Thursday   Friday     Saturday   Sunday
monday     mad        nigga      mom        night      night      monday
math       night      talk       nigga      kid        wake       mom
teacher    weather    kid        talk       wake       mom        game
mom        annoy      mad        stuff      date       movie      fan
wrong      talk       question   suck       car        house      watch



The pattern.en.wordlist module has a number of lists of words (ACADEMIC, PROFANITY, TIME) that can be used to filter noise from a document. For example, academic words include domain, research and technology; profanity includes words such as shit and hell.

from pattern.en.wordlist import ACADEMIC
from pattern.vector import Document
d = Document(open("paper.txt").read(), exclude=ACADEMIC) 
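
The effect of such an exclusion list can be sketched in plain Python 3. The NOISE set below is a hypothetical stand-in for a list such as PROFANITY, and keywords() simply ranks words by raw count rather than by tf-idf.

```python
from collections import Counter

# Hypothetical stand-in for a word list such as PROFANITY.
NOISE = {'shit', 'hell', 'damn'}

def keywords(text, exclude=(), top=3):
    """Return the most frequent words in text, skipping excluded words."""
    exclude = set(exclude)
    words = [w for w in text.lower().split() if w not in exclude]
    return [w for w, count in Counter(words).most_common(top)]

tweets = "damn cold morning shit cold morning snow hell cold"
unfiltered = keywords(tweets)
filtered = keywords(tweets, exclude=NOISE)
print(unfiltered)
print(filtered)
```

With the noise words removed, a topical word (snow) takes the place that a swear word occupied in the unfiltered ranking.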


Twitter is the new shampoo.

In another experiment, we mined Twitter for 35,371 tweets containing the words is the new, such as: "green is the new gold" or "lipstick is the new trend". We can parse the nouns in the comparison from each tweet and bundle them in a graph. Calculating centrality should then give us an idea of new concepts pointing to newer concepts, pointing to the newest concept.
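
As a rough illustration of the extraction step, the sketch below uses a regular expression in plain Python 3 instead of a parser. This is a simplification: a regular expression cannot tell nouns from other words, so it only approximates noun-phrase matching. The name comparisons() is made up; the example tweets are the two quoted above.

```python
import re

# One word on each side of "is the new" (an approximation of NP matching).
IS_THE_NEW = re.compile(r"(\w+) is the new (\w+)", re.IGNORECASE)

def comparisons(tweets):
    """Yield (old, new) pairs, e.g. ('GOLD', 'GREEN') for "green is the new gold"."""
    for tweet in tweets:
        match = IS_THE_NEW.search(tweet)
        if match:
            new, old = match.group(1).upper(), match.group(2).upper()
            # The older concept comes first, so an edge old -> new
            # points towards the newer concept.
            yield old, new

tweets = ["green is the new gold", "lipstick is the new trend", "hello world"]
edges = list(comparisons(tweets))
print(edges)
```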




We calculated eigenvector centrality (the measure that PageRank is based on) on the full graph. It scores each node by how many other nodes are (indirectly) connected to it, so nodes with a high eigenvector weight can be considered more important. We still get a lot of profanity noise, but green energy, China as an emerging economy and handheld devices such as Google's Android phone are salient examples of trends surfacing in 2010-2011. The top 100 for X is-the-new Y includes:

Twitter black green shit hell money food China ass Android Perry hipster
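
To show how such a ranking can come about, here is a minimal PageRank-style power iteration in plain Python 3. It is a sketch under our own assumptions (damping factor 0.85, a fixed number of iterations, rank of dead-end nodes spread evenly) and not necessarily how Pattern or NodeBox compute centrality. The edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: score} for a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for a, b in edges:
        out[a].append(b)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        # Each node passes its rank on in equal parts along its outgoing edges.
        for a in nodes:
            for b in out[a]:
                new[b] += damping * rank[a] / len(out[a])
        rank = new
    return rank

# Hypothetical edges pointing from an older concept to a newer one.
edges = [('GOLD', 'GREEN'), ('BLACK', 'GREEN'), ('GREEN', 'TWITTER')]
rank = pagerank(edges)
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(rank[node], 3))
```

In this toy graph the node that everything eventually points to receives the highest score, which is the intuition behind reading the top of the list as the newest concepts.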


Source code

The visualization was realized in NodeBox for OpenGL, a Python module that generates 2D interactive animation using OpenGL. It comes with functionality for drawing graphs – more specifically with a Graph object that is identical to pattern.graph.Graph. Output from Pattern can easily be plugged into NodeBox. Essentially, we have the Twitter data stored in a Datasheet on which we create an iterator function. Each time comparison() is called it returns a (concept1, concept2, date)-tuple (or None), where concept2 is the new concept1:

from pattern.db     import Datasheet, date
from pattern.en     import parse, Sentence
from pattern.search import search

rows = iter(Datasheet.load('the_new.txt'))
def comparison(exclude=['', 'i', 'me', 'it', 'side', 'he', 'mine']):
    r = next(rows, None) # Fetch next row in the matrix (None when exhausted).
    if r is None:
        return None
    p = Sentence(parse(r[2], lemmata=True, light=True))
    m = search('NP be the new NP', p)
    if not m:
        return None # No "A is the new B" comparison found in this tweet.
    a = m[0].constituents()[ 0].head.string.upper().strip('.!\'"#')
    b = m[0].constituents()[-1].head.string.upper().strip('.!\'"#')
    if a not in exclude and b not in exclude:
        # Additionally, we could check if a and b occur in WordNet
        # to get "clean" nouns in the output.
        return b, a, date(r[-1])

We load a portion of the comparisons into a graph. This is the tricky part. If we load all of them, we don't have anything to animate. If we load too few, there may not be enough hooks to connect new concepts to.

from nodebox.graphics import *
from nodebox.graphics.physics import Graph

g = Graph()
for i in range(700):
    n = comparison()
    if n is not None:
        b,a,d = n
        g.add_node(b, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
        g.add_node(a, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
        g.add_edge(b, a, stroke=(1,1,1,0.1))
g = g.split()[0]
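
The last line keeps only the first subgraph returned by split(). Assuming split() returns the unconnected subgraphs sorted largest first, the idea can be sketched in plain Python 3 as follows; components() is a hypothetical helper, not part of NodeBox or Pattern.

```python
def components(edges):
    """Group nodes into connected components, largest first.

    Edges are treated as undirected for the purpose of grouping.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node collects one component.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in group:
                group.add(n)
                stack.extend(neighbors[n] - group)
        seen |= group
        groups.append(group)
    return sorted(groups, key=len, reverse=True)

# Hypothetical comparisons: one cluster of three nodes, one isolated pair.
edges = [('GOLD', 'GREEN'), ('BLACK', 'GREEN'), ('CATS', 'DOGS')]
groups = components(edges)
print(sorted(groups[0]))
```

Keeping only the largest component gives the animation a single connected core, so that newly arriving comparisons have something to attach to.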

Next we implement a NodeBox draw() loop, which is called each frame of animation.
It updates and draws the graph, and incrementally adds new comparisons to it.

def draw(canvas): 
    background(0.18, 0.22, 0.28)
    translate(300, 300)
    for i in range(4): # Add up to 4 new nodes per frame.
        n = comparison()
        if n:
            b,a,d = n
            if a in g or b in g:
                if a in g: g[a].text.fill.alpha = 0.75
                if b in g: g[b].text.fill.alpha = 0.75
                g.add_node(a, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
                g.add_node(b, stroke=(1,1,1,0.1), text=(1,1,1,0.5), fontsize=7)
                g.add_edge(b, a, stroke=(1,1,1,0.1))
    for n in g.nodes:
        # Nodes with more connections grow bigger.
        n.radius = 3 + n.weight*6 + len(n.links)*0.5

canvas.size = 600, 600
canvas.fps = 40