Google Ngram Viewer

The Google Labs Ngram Viewer is the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases (‘United States of America’) in books over time. You’ll be searching through over 5.2 million books: ~4% of all books ever published!
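To make "frequency" concrete, here is a minimal sketch. It assumes that a word's frequency in a given year is its number of occurrences divided by the total number of words in that year's books. The counts are made up for illustration and are not taken from the corpus, and `relative_frequency` is a hypothetical helper, not part of any Google tool.

```python
def relative_frequency(word_counts, total_counts):
    """Return {year: share of all words} for years present in both inputs."""
    freqs = {}
    for year, count in word_counts.items():
        total = total_counts.get(year, 0)
        if total > 0:  # skip years with no recorded text
            freqs[year] = count / total
    return freqs


# Illustrative numbers only (assumed, not real corpus data).
banana_counts = {1900: 50, 1950: 300, 2000: 900}
total_words = {1900: 1_000_000, 1950: 2_000_000, 2000: 3_000_000}

freqs = relative_frequency(banana_counts, total_words)
for year in sorted(freqs):
    print(year, f"{freqs[year]:.6%}")
```

Dividing by the yearly total is what makes different years comparable: far more books were printed in 2000 than in 1900, so raw counts alone would rise even for a word whose popularity never changed.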

There are lots of different things you can check, like your favorite word (Supercalifragilisticexpialidocious), person (President Taft; Chief Justice Taft), or holiday tradition (Christmas Tree).

It can be fun to compare things, too; whether it’s people (Galileo, Darwin, Freud, Einstein), pieces of music (Beethoven’s First, Beethoven’s Second, Beethoven’s Third, Beethoven’s Fourth, Beethoven’s Fifth, Beethoven’s Sixth, Beethoven’s Seventh, Beethoven’s Eighth, Beethoven’s Ninth), facts about grammar (sneaked, snuck), or increasingly precise values for the speed of light (‘2.99796, 2.997925, 2.99792458’).

The browser allows you to search different collections of books (called ‘corpora’). You’ll definitely want to try taking advantage of more than one corpus. For instance, compare ‘centre, center’ in both American and British English. Corpora are available in English, Chinese, French, German, Hebrew, Russian, and Spanish, so you can examine effects in many different cultures and compare them to one another (‘feminism’ in English vs. ‘féminisme’ in French, for instance). If you look carefully, you can occasionally see evidence of censorship (such as ‘Marc Chagall’ in the German corpus under the Nazis).

But even with all that data, you’ll need to carefully interpret your results. Some effects are due to changes in the language we use to describe things (‘The Great War’ vs. ‘World War I’). Others are due to actual changes in what interests us (note how ‘slavery’ peaks during the Civil War and during the Civil Rights movement.)

Watch out for the time period you are looking into: the best data is for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project. The other corpora are smaller, and can’t be used to go as far back in time; their metadata has also not been subjected to as much scrutiny as that of the English corpus for the two centuries from 1800 to 2000.
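If you work with a trajectory outside the browser, one simple control is to trim it to the years where the data is best. The sketch below assumes the trajectory is stored as a plain mapping from year to frequency. The values are invented for illustration, and `restrict_to_reliable_years` is a hypothetical helper; only the 1800 and 2000 bounds come from the text above.

```python
# Bounds recommended above for the English corpus.
RELIABLE_START, RELIABLE_END = 1800, 2000


def restrict_to_reliable_years(trajectory, start=RELIABLE_START, end=RELIABLE_END):
    """Keep only the years inside [start, end], in chronological order."""
    return {
        year: freq
        for year, freq in sorted(trajectory.items())
        if start <= year <= end
    }


# Invented values (assumed, not real corpus data).
trajectory = {1750: 0.00002, 1800: 0.00001, 1900: 0.00003, 2000: 0.00004, 2005: 0.00009}

trimmed = restrict_to_reliable_years(trajectory)
print(trimmed)
```

For the other corpora you would likely want a later start year, since they do not reach as far back in time; the right bound is something to check for each corpus rather than assume.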

Basically, if you’re going to use this corpus for scientific purposes, you’ll need to do careful controls to make sure it can support your application. As with any other piece of evidence about the human past, the challenge with culturomic trajectories lies in their interpretation. In our paper, and in its supplementary online materials, we give many examples of controls, and of methods for interpreting trajectories.

This browser is based on the ‘Bookworm’ browser created by Jean-Baptiste Michel, Yuan Kui Shen, and Erez Lieberman Aiden at the Cultural Observatory at Harvard University. An extremely detailed description may be found in our Science publication, which is available once you register (don’t worry, it’s free – you don’t need to subscribe) at Science. If you find this tool helpful for your research, please cite:

Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative analysis of culture using millions of digitized books. Science. Published Online Ahead of Print: 12/16/2010.

Good luck and happy browsing!
