Sunday, July 20, 2008

Web Trigrams

Chris Harrison's Web Trigrams

I love envisioning information, and the Internet has made this infinitely easier. My brother passed on a link to Chris Harrison's experiments in visualizing Google's n-gram data. From Harrison's site:

"Back in late 2006, Google released a massive set of web n-gram data (basically pieces of sentences). A trigram (n=3), for example, might be "I like food" or "frog is tasty." Each n-gram is also labeled with the number of times it appeared in Google's corpus. The entire archive, which is almost 100GB uncompressed, has unigrams (n=1) through fivegrams (n=5). The data set is offered through the LDC for those who are interested (link).

As soon as I got my hands on the data, I quickly got to work on some straight forward visualizations. The first type compares two sets of trigrams, each starting with a different word. One visualization compares 'He' with 'She', while the other uses 'I' and 'You'. In the case of the 'He' vs. 'She', the top 120 trigrams for each were identified. The frequencies of the second word in the trigrams were combined and sorted, and rendered in decreasing frequency-of-use order. A similar process was used to create a ranking for the third (and final) word in the trigrams. Words are sized according to the square root of their use frequencies. The color-coded lines act like paths (a tree structure), enumerating all of the trigrams. The process was identical for the 'I' and 'You' version, except that only the top 75 trigrams were used.

These visual comparisons allow us to see differences in how the two subjects are used - both where they are similar and diverge. For example, among the top 120 trigrams, 'He' and 'She' have many common second words. However, they differ on some interesting ones, for example, only 'he' connects to 'argues', while only 'she' connects to 'love' (within the top 120)."
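For the curious, here is a rough Python sketch of the ranking steps Harrison describes: take the top trigrams for each starting word, combine the second-word frequencies, sort them in decreasing order, and size each word by the square root of its count. This is only my reading of his description, not his actual code. The function names (`top_trigrams`, `rank_position`, `word_size`) and the toy counts are made up for illustration, and the real Google data set is far too large to hold in a dictionary like this.

```python
from collections import Counter
from math import sqrt


def top_trigrams(counts, first_word, n):
    """Return the n most frequent (trigram, count) pairs starting with first_word."""
    matching = [(t, c) for t, c in counts.items() if t[0] == first_word]
    return sorted(matching, key=lambda tc: tc[1], reverse=True)[:n]


def rank_position(trigram_sets, position):
    """Combine counts of the word at `position` across all sets,
    returned in decreasing frequency order."""
    totals = Counter()
    for trigrams in trigram_sets:
        for trigram, count in trigrams:
            totals[trigram[position]] += count
    return totals.most_common()


def word_size(count, scale=1.0):
    """Size a word by the square root of its use frequency."""
    return scale * sqrt(count)


# Hypothetical toy counts, NOT real figures from the Google corpus.
toy_counts = {
    ("he", "said", "that"): 900,
    ("he", "argues", "that"): 400,
    ("he", "was", "a"): 100,
    ("she", "said", "that"): 700,
    ("she", "love", "him"): 225,
    ("she", "was", "a"): 300,
}

# Harrison used the top 120 (or 75) trigrams; the toy data only supports 2.
he = top_trigrams(toy_counts, "he", 2)
she = top_trigrams(toy_counts, "she", 2)

second_words = rank_position([he, she], 1)
sizes = {word: word_size(count) for word, count in second_words}
```

Passing `2` instead of `1` as the position gives the same ranking for the third word. The square root keeps the most common words from dwarfing everything else on the page, which is presumably why Harrison chose it.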

Visit Chris Harrison's Web Trigrams.
