pattern.table

The pattern.table module offers a convenient way to work with tabular data. It can be used to store and analyze data retrieved with the pattern.web module in a uniform way, i.e. as a Unicode CSV file – instead of relying on custom text files.

It can be used by itself or with other pattern modules: web | table | en | search  | vector | graph.


Documentation

 


Table

A Table is a matrix of rows and columns, where each row and column can be retrieved as a list. Values can be any kind of Python object, but only str, unicode, int, float, bool and None are (by default) correctly imported from file with Table.load().

table = Table(rows=[])table = Table.load(path, separator=',', decoder=lambda j,v: v)
table.rows                  # List of rows (each row = list of values).
table.columns               # List of columns (each column = list of values).

table[i]                    # Row with index i.
table[i,j]                  # Value in row i at column j.

table.insert(i, row, default=None)
table.append(row, default=None)
table.extend(rows, default=None)
table.map(function=lambda value: value)table.copy(rows=ALL, columns=ALL)
table.slice(i, j, n, m)
table.group(j, function=FIRST, key=lambda value: value)
table.save(path, separator=',', encoder=lambda j,v: v)

  • Table.insert() and Table.append() fill missing columns (at the right) with a default value.
    Table.columns.insert() and Table.columns.append() fill missing rows with the default.
    The given list is copied to the table, i.e. changes to the original list are not reflected in the table.
  • Table.map() applies the given function to each item in the table.
  • Table.copy() returns a new Table from a selective list of row and/or column indices.
  • Table.slice() returns a new Table, from row i and column j spanning n rows and m columns.

Sorting rows & columns

Table.rows and Table.columns are lists, in which each item is a list of row or column values. They can be sorted just like any other Python list. Table.columns.sort() makes the most sense if you use its extra order argument, which is a list of columns indices:

>>> from pattern.table import Table, uid, pprint
>>> t = Table(rows=[
>>>     [uid(), 'broccoli', 'vegetable', 'green' ],
>>>     [uid(), 'turnip',   'vegetable', 'purple'],
>>>     [uid(), 'banana',   'fruit',     'yellow']])
>>> t.columns.sort(order=[0,2,1,3])
>>> pprint(t, fill='')

1  vegetable  broccoli   green
2  vegetable  turnip     purple 
3  fruit      banana     yellow

Table.columns[j].sort() sorts all the rows in the table according to the values in column j. This is similar to clicking on a column header in a datasheet view.

Table.rows.swap() and Table.columns.swap() swaps the two rows / columns with given indices.

Grouping rows

Table.group() returns a new Table with unique values in column j by grouping rows with a function.

table.group(j, function=FIRST, key=lambda value: value)

The function takes a list of column values as input and returns a single value. It can also be a list of functions, one for each separate column. A TypeError will be raised when a function cannot handle the data in its column. The key argument can be used to prepare the values in column j for comparison, for example: key=lambda date: date.year groups a column of Date objects by year. 

Function Description
FIRST Returns the first item in the list.
LAST Returns the last item in the list.
COUNT Returns the number of items in the list.
MAX Returns the highest value in the list.
MIN Returns the lowest value in the list.
SUM Returns the sum of all (numeric) values in the list.
AVG Returns the average of all (numeric) values in the list.
STDEV Returns the standard deviation (the amount of variation from the average).

For example:

>>> t = Table(rows=[
>>>     [uid(), 'broccoli',  'vegetable', 'green' ],
>>>     [uid(), 'turnip',    'vegetable', 'purple'],
>>>     [uid(), 'asparagus', 'vegetable', 'white' ],
>>>     [uid(), 'banana',    'fruit',     'yellow'],
>>>     [uid(), 'orange',    'fruit',     'orange']])
>>> g = t.copy(columns=[2,0]) # A copy with the type and id columns.
>>> g = g.group(0, COUNT)     # Group by type, count rows per type.
>>> pprint(g, fill='')

vegetable  3
fruit      2

This example uses a custom function to concatenate all names per type:

>>> g = t.copy(columns=[2,1])
>>> g = g.group(0, function=lambda list: '+'.join(list))
>>> pprint(g, fill='')

vegetable  broccoli+turnip+asparagus
fruit      banana+orange    

Table utility commands: uid | index | flip | pprint

The pattern.table module has three commands that are useful in combination with a Table:

uid()
index(list)flip(table)pprint(table, truncate=40, padding=" ", fill=".")

The uid() command returns a unique number, starting from 1, useful for a column index.

The flip() command returns a new Table with the given table's columns as rows, and vice versa.

The pprint() command pretty-prints the table, with columns aligned vertically. The truncate parameter defines the maximum column width, padding the spacing between columns and fill the character used to outline columns.

The index() command returns a value → index dictionary that is efficient for lookup operations. It is faster to do: value in dict instead of: value in list, and: dict[value] instead of: list.index(value). This can be used (for example) to append a batch of rows to a large table with a unique column. An index can be created for the column and used to check if the new row can be added:

X = index(table.columns(j))
u = [row for row in rows if row[j] not in X] # Assumes 'rows' is unique.
table.rows.extend(u)

Note that the index is "static": if the column changes the index will have to recreated.

 


CSV

Comma-separated values is a simple text format for a table, where each row is a new line and each value is separated by a comma. Table.save() exports the table to a Unicode CSV file. Table.load() returns a new Table from a given CSV file.

table = Table.load(path, separator=',', decoder=lambda j,v: v)

table.save(path, separator=',', encoder=lambda j,v: v)

When reading a CSV file, the text is parsed and table values converted back to  unicode, int, float, bool and None if possible. For other data types a decoder needs to be defined. This function takes each (column index, column item) and is expected to parse the item (a Unicode string) into the correct data format (e.g. a Date). When saving a CSV file, an encoder needs to be given that returns a string from a data type.

Date objects are saved and loaded as strings, but it is easy to convert these back to dates:

>>> from pattern.table import Table, date
>>> t = Table(rows=[
>>>     ['moon landing',   date('July 20, 1969')],
>>>     ['united nations', date('24 October 1945')]
>>> ])
>>> t.save('history.txt')
>>> t = Table.load('history.txt', decoder=lambda j,v: j==1 and date(v) or v)
>>> print t
>>> print type(t.columns[1][0])

[[u'moon landing', '1969-07-20 00:00:00'],
 [u'united nations', '1945-10-24 00:00:00']]
<class 'pattern.table.Date'>

 


Date

It is often useful to keep a date stamp for each row in the table. The Date object is a convenient subclass of Python's datetime.datetime. The date() command returns a Date object. A date can be generated from an int (POSIX timestamp), a string or NOW. Optionally, an input format (e.g. "%d/%m/%Y") and a print format can be given. The default print format is "YYYY-MM-DD hh:mm:ss"

date(int)date(NOW, format=DEFAULT)
date(string)date(string, format=DEFAULT)date(string, inputformat, format=DEFAULT)date(year, month, day, hours, minutes, seconds, format=DEFAULT)

If no input format is given, a few common formats will be tried:

Format Example
%Y-%m-%d %H:%M:%S 2010-09-21 09:27:01
%a, %d %b %Y %H:%M:%S %z Tue, 9 Sep 2010 17:58:28 +0000
%Y-%m-%dT%H:%M:%SZ 2010-09-20T09:27:01Z
%Y-%m-%d %H:%M 2010-09-20 09:27
%Y-%m-%d 2010-09-20
%d/%m/%Y 20/09/2010
%d %B %Y 9 september 2010
%B %d %Y September 9 2010
%B %d, %Y September 09, 2010

All formats used in pattern.web's web services (e.g. Bing search result) are automatically detected.
For a full overview of date format syntax, see: http://docs.python.org/library/time.html#time.strftime.

 

Date calculations

The time() command can be used to add (or subtract) time to a Date:

time(days=0, seconds=0, minutes=0, hours=0)

>>> from pattern.table import date, time
>>> d  = date('4 november 2011')
>>> d += time(days=2, hours=5)
>>> print d

2011-11-06 05:00:00