FAQ

It turns out that if you do something that appears on the front page of the New York Times, people will ask you a lot of questions about it. Instead of answering one-by-one, we decided to collect many of the commonly asked questions about our recent paper in Science in one place. Most of these issues are addressed in great detail in either the paper or the Supplemental Online Materials, but those can be a somewhat daunting and technical read, so we made this page to provide people with a convenient reference.

 


I. Big picture

1. Is this supposed to replace close reading of texts?

Absolutely not. Anyone who has appreciated the work of a great artist – say, Shakespeare – or an insightful scholar – say, Michael Walzer’s Exodus and Revolution – couldn’t possibly think that quantitative approaches can replace close reading.

Quite the opposite is true: quantitative methods can be a great source of ideas that can then be explored further by studying primary texts.

2. How does this relate to other methods in the humanities?

Our hope is that the culturomic approach will be able to supplement existing techniques.

3. But you can’t quantify culture.

That doesn’t mean you can’t use quantitative methods to get insight into culture. Did you ever notice how, when you walk around a doctor’s office, there are lots of signs that give you ‘Health’-related information? And how elementary schools are the same way, except that the walls are packed with information about what the kids are ‘learning’? And how corporate headquarters tend to have those big motivational posters about things like ‘Value Creation’? Even if the buildings were emptied of all people and objects, you could probably still tell which was which from the ‘writing on the wall’. In fact, a robot could do this, just by counting the instances of ‘health’, ‘learning’, and ‘Value Creation’ that it finds on the walls.

In short, when you quantify n-grams that are associated with particular aspects of culture, you can learn something about what a society is thinking about in a particular time and place.

4. I’m really excited about this tool and I want to use it in my classes.

Great! That’s a wonderful idea. But please make sure you read our paper and the supplemental materials first. As with any form of measurement, it’s important to understand the limitations of this tool before you try to use it.

5. What’s the best way to find out about the data and how it was generated?

The materials appearing on our website, on the N-gram Viewer, and even in this Q & A, are meant to be brief and handy guides. The most authoritative and comprehensive source is our paper in Science and its Supplemental Online Materials.

6. How do I cite results that come from the data or the viewer?

If you’re writing an academic publication, please cite:

Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010).


II. About the paper

1. Why did you publish your paper in Science instead of in a humanities journal?

Our goal was to pick a journal that could (i) perform a thorough review on a highly technical and multidisciplinary paper; (ii) turn the paper around quickly; and (iii) get the paper exposure to both scientists and humanists. Science does the first two very well, and, because of its excellent connections in the press, had the potential to reach humanists as well.

Just as importantly, when you’ve spent four years working on a project, a lot of people have put their faith in you to make it a success. We’ve had a fair bit of experience writing for journals like Science, and we were confident in our ability to write a paper about this project that Science would take seriously. We didn’t want to take any unneeded risks.

2. Why were there no humanists involved in this project?

The premise is incorrect. Erez studied Philosophy at Princeton as an undergrad and did a master’s degree in Jewish History working with Elisheva Carlebach. Two of our other authors, Joseph Pickett (PhD, English Language and Literature, UMichigan) and Dale Hoiberg (PhD, Chinese Literature, UChicago), are the Executive Editor of the American Heritage Dictionary and the Editor-in-Chief of the Encyclopaedia Britannica, respectively. In addition, we were in contact with many humanists throughout the life of the project.

But more than just wrong, it’s irrelevant. What matters is the quality of the data and the analyses in the paper and what it means for how we think about a great variety of phenomena – not the degrees we happen to hold or not to hold. If what we seek is a serious conversation about this work, we shouldn’t exclude anyone who has something significant and thoughtful to say. That would be a shame.


III. Data contents

1. Why didn’t you make the full text available?

It’s under copyright.

2. But frequency timelines for all the N-grams aren’t as good as the full text. Why do we only get N-grams?

Here’s how that came about. We knew that the full text would not be released any time soon, if ever, given the copyright issues. If the full text corpus could not be released, we thought it was wrong – from a scientific standpoint – to use the full text corpus to do analyses and then publish a paper containing results that could not be evaluated by the scientific community at large.

Instead, we resolved to identify an alternative data type, based on the full text corpus, with the following two properties: (i) it would enable us to clearly demonstrate the transformative nature of the full corpus, and (ii) it would be maximally innocuous from the copyright standpoint. Ultimately we chose tables showing the frequency of N-grams over time.

At this point, we drew ourselves a line in the sand: if we couldn’t release the N-grams tables in full, we wouldn’t publish anything at all. Of course, we had no guarantee whatsoever that Google would release the N-grams. It was a huge risk and on many occasions (especially after the Publisher lawsuit was filed!) we thought this work would never see the light of day.

Over the course of the four-year project, we were eventually able to convincingly make the case that N-gram tables were an extraordinarily powerful tool and that they should be released. This is to a great extent a testament to our collaborators Dan Clancy, Peter Norvig, and Jon Orwant, who went to bat for the project time and time again. We consider it very fortunate that this effort succeeded.

3. But the same word can mean different things at different times.

The problem of shifting word meanings over the centuries is a real challenge. It’s often possible, though sometimes difficult, to deal with this issue in a particular case of interest. (As a simple example, try replacing the more general ‘comic’ with the more specific ‘comic book’.) But it’s impossible to do in an automated way for all the n-grams.

This is why we spend much of the paper talking about n-grams that correspond to dates and to people’s names: these n-grams tend to be much more constant in their meaning over time. Even so, formidable challenges can arise: we spend many pages in the Supplemental Online Materials showing how to associate an n-gram timeline with a specific person.

4. All these n-grams are out of context. I can’t tell how they are being used.

That’s true, but only to an extent. There are at least two tricks you can use to help alleviate the problem.

  1. Picking highly specific N-grams. If you want to track an n-gram that is ambiguous, consider tracking a closely related N-gram that isn’t. For instance, suppose you were interested in tracking the country Chad. Of course, ‘Chad’ can also be the name of a person, or a controversial part of a ballot. Consider tracking ‘Republic of Chad’ instead.
  2. Use the samples. To estimate the frequency with which a certain n-gram is used in a certain way in a certain year, look at the hits from that year by eye and classify the contexts yourself. Then extrapolate by assuming that this breakdown is representative of all hits in that year. Sure, it’s a lot of manual labor, but using this method it becomes possible to estimate frequencies of n-grams in arbitrarily subtle contexts (the sketch below shows the arithmetic).
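
For the second trick, here is a minimal sketch of the extrapolation arithmetic in Python. The function name and the numbers in the example are purely hypothetical; the point is just the sample-based estimate.

    def estimate_contextual_count(total_hits, sample_size, sample_matches):
        """Extrapolate from a hand-classified random sample of hits to an
        estimate of how many hits that year occur in the context of interest."""
        return total_hits * (sample_matches / sample_size)

    # Hypothetical example: you inspected 50 of the 3,000 hits for 'Chad' in a
    # given year and found that 12 referred to the country, so you'd estimate
    # estimate_contextual_count(3000, 50, 12)  ->  720 country-related hits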

5. We don’t know what is in the corpus. Can you give us a bibliography?

The supplemental materials of the paper describe how the corpora were constructed in painstaking detail. In addition to providing a great deal of information, we hope those sections impress upon readers the seriousness with which we took our task and the care we took (indeed, the years we took!) to get it right. If we had to distill everything down to a single sentence, it would be this: the vast majority of the books from 1800-2000 come from Google’s library partners, and so the composition of the corpus reflects the kinds of books that libraries tend to acquire.

But, as folks like Ted Underwood and Matthew Jockers have noted, we haven’t released a full, 5.2 million book bibliography telling the public which books are in which corpora. And we agree with them: that’s a big issue. We hoped to release a full list of all books in the various corpora together with the paper, but have not received permission to do so yet.

Unlike the N-gram tables, we did not draw a line in the sand about the corpus bibliographies: it would have been wrong to keep the N-grams data out of public hands on account of the lack of a bibliography.

6. Why did you exclude periodicals?

At the time we compiled the corpus, the dating of periodicals in Google Books was very poor. Leaving them in without fixing the dates would have dramatically decreased the quality of the N-grams tables, more than counterbalancing the advantage of having periodicals represented. Trying to fix the dates would have taken an enormous amount of time and introduced additional biases into the underlying data. As a result, we chose to exclude periodicals from this first dataset.

7. Books don’t represent all of culture; you need other types of texts, too.

Of all the efforts underway to digitize historical texts, the Google Books effort is by far the most sophisticated and advanced. That made books the obvious place to start.

But, as we emphasize in the paper, books are only one aspect of culture. We hope to create similar resources based on magazines, journals, newspapers, patents, and many other aspects of recorded history in the future. Had we waited for all that to be done before writing a paper, we could easily have kept working for another decade without a publication. It would have been wrong to keep this hidden away until that was done.

8. I want to break this down by subject.

So do we. But it’s hard enough to do this well that we didn’t include it in the first paper.

9. Why didn’t you do part-of-speech tagging?

We didn’t have time to do it for the first paper.

10. Why didn’t you account for how many copies of each book were printed?

Because that is totally impossible.


IV. Data quality

1. The OCR is not perfect.

Optical Character Recognition, or OCR, is the set of algorithms that transform a picture of a word into the corresponding digital text. Better OCR means that the computer ‘reads’ the word in the picture correctly more often.

For our initial dataset, we got rid of the texts with the worst OCR. Especially in the two-century English period (1800-2000) that is the focus of our initial paper, the OCR quality reflects the cutting edge of what machines can do today. Obviously there are still lots of errors. That’s intrinsic to any project of this scale: the full text of the corpus contains trillions of letters! Even if only one in a thousand is misrecognized, that still leaves you with billions of errors.

For comparison, when the draft of the human genome was first published in 2001, 91% of the sequence had less than 1 error in 10,000 DNA bases (the rest was worse). Since the human genome is 3 billion letters long, the 91% that was high-quality (for the standards of the time) contained hundreds of thousands of errors. Culturomicists, like genomicists, need to bear such issues in mind.

Please see the Supplemental Online Materials of our paper for more details.

2. The OCR tends to consistently make certain errors, like misreading the letter s in older books as an f.

In older English books, the letter ‘s’, except when it appeared at the end of a word, was written differently than it is today. This so-called ‘medial s’ looks kind of like the letter f, and so it confuses the OCR, which misclassifies it as an f. Thus the word ‘husband’ might be misread as ‘hufband.’ This is the most prominent systematic OCR error in our data.

Fortunately, the medial s disappeared shortly after 1800, so it only affects the very first years of the 1800-2000 target period that we focused on.

Furthermore, it’s an easy fix for almost all n-grams. Instead of using the data for your target n-gram, use the sum of all n-grams involving the medial and non-medial forms: for instance, you might sum ‘husband’ and ‘hufband’. This isn’t currently possible with the online N-gram viewer, but is very easy to do once you’ve downloaded the raw data.
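
For those working with the downloaded files, here is a rough sketch of the summing trick in Python. It assumes the 1-gram files are tab-separated with the n-gram, year, and match count as the first three columns; check the documentation of the version you download, since the exact layout may differ, and the filename in the usage comment is only illustrative.

    import csv
    from collections import defaultdict

    def summed_counts(path, variants):
        """Sum yearly match counts over a set of spelling variants,
        e.g. {'husband', 'hufband'}, from a downloaded 1-gram file."""
        totals = defaultdict(int)
        with open(path, encoding='utf-8') as f:
            for row in csv.reader(f, delimiter='\t'):
                ngram, year, match_count = row[0], int(row[1]), int(row[2])
                if ngram in variants:
                    totals[year] += match_count
        return dict(sorted(totals.items()))

    # counts = summed_counts('eng-1gram-file.csv', {'husband', 'hufband'})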

3. The metadata dates are not perfect.

That’s absolutely right. In fact, back in 2008 when we first started making N-gram timelines, we used to regularly find huge, completely spurious peaks all over the place. There are a lot of reasons for this. For example, some metadata providers assign any book whose date is unknown the date 1899; others use 1905; still others use different dates. There’s no easy way to tell which books these are. Furthermore, all issues of a periodical were often assigned to the date of first publication of that periodical, leading to huge, totally erroneous peaks in those ‘founding’ years for many n-grams.

We spent about a year working on this, systematically excluding millions of books whose metadata seemed ‘fishy’ in any way in the effort to ‘clean up’ the signal.

There are still a lot of errors; if you pick a random year and choose a random book, there’s a 1 in 20 chance that the date is wrong. But this is much better than it was before our cleanup, when the error rate was 1 in 4. Furthermore, the errors are much less systematic; far fewer spurious peaks appear, especially in the period from 1800-2000 in English. Instead, the errors are spread out much more evenly among all years.

This is very apparent when you look at the trajectories corresponding to years like ‘1950’.

4. What’s happening before 1800?

By and large, OCR quality, dating accuracy, and the sheer volume of available data are all much bigger issues before 1800 than after. That’s why our paper doesn’t use any data from before 1800.

Please see the Supplemental Online Materials of our paper for more details.

5. What’s happening after 2000?

Before 2000, most of the books in Google Books come from library holdings. But when the Google Books project started back in 2004, Google started receiving lots of books from publishers. This dramatically affects the composition of the corpus in recent years and is why our paper doesn’t use any data from after 2000.

Please see the Supplemental Online Materials of our paper for more details.

6. There are obviously very significant issues in terms of language classification, data volume, and dating in the Russian, Chinese, and Hebrew corpora.

Yes, there absolutely are. For instance, the Hebrew corpus contains many texts that transition back and forth from Hebrew to Aramaic within a single line. Please see the Supplemental Online Materials of our paper for more details.

7. What’s in the English fiction corpus?

Crucially, it’s not just actual works of fiction! The English fiction corpus contains some fiction and lots of fiction-associated work, like commentary and criticism. We created the fiction corpus as an experiment meant to explore the notion of creating a subject-specific corpus. We don’t actually use it in the main text of our paper because the experiment isn’t very far along. Even so, a thoughtful data analyst can do interesting things with this corpus, for instance by comparing it to the results for English as a whole. Please see the Supplemental Online Materials of our paper for more details.

8. Why didn’t you just release the data from English between 1800 and 2000?

The rest of the data is harder to use, but it can enable new insights if you’re willing to invest lots of your time and be really careful. We considered releasing a far smaller subset of the n-grams that are a bit easier to use, but we decided that it would be much better to release more data and trust people to approach it in a thoughtful and scientifically responsible way.

9. Will the data and metadata get better?

Of course.

10. But if the data changes, my results will change.

In order to make sure that your work can always be reproduced, we’ll always preserve the older versions of the data intact.


V. Data Interpretation

1. What are the main types of error and what are their primary consequences for interpreting individual n-grams?

The three most often encountered sources of error and bias in the data are:

  1. Optical Character Recognition (OCR). Sometimes, the word in the corpus doesn’t match the word in the book because the computer ‘misread’ it.
    1. Primary Consequence 1: Spurious N-gram Formation. Spurious 1-grams can form due to an OCR error on a different 1-gram (for instance, the spurious word ‘beft’ arises when the OCR mistakes a medial ‘s’ for an ‘f’, an error that is common up to the first decade of the 19th century).
    2. Primary Consequence 2: Overall reduction in the frequency of an n-gram. For the most part, what OCR error does to a given n-gram between 1800 and 2000 is to cause any instance of it to be incorrectly recognized with some finite probability. This will tend to reduce the overall frequency of the n-gram, reducing the amplitude of its trajectory but not changing its overall shape.
  2. Dating. Sometimes, a book is assigned to the wrong date.
    1. Primary Consequence: Spurious formation of small peaks. If an n-gram is very common in a given year, the mis-dating of a single book won’t do much. But if a book containing an n-gram is mistakenly assigned to a year in which that n-gram is rare, a spurious peak can arise. This is especially noticeable for n-grams that are very common today but didn’t exist until the last few decades: you’ll see lots of little peaks from books erroneously assigned a too-early date.
  3. Library acquisition bias. These books came from libraries, which means that they reflect the process by which libraries choose which books to acquire and preserve.
    1. Primary Consequence: When you see an upward trend, you can’t tell whether people were more interested in a word, phrase, or topic, and hence used the n-gram more in published works, or whether libraries were taking the topic more seriously and acquiring more books that pertain to it. (The same is true, mutatis mutandis, for downward trends.)

2. When there are so many sources of error, how can you interpret the data?

All data has errors, especially in the early days of a new form of measurement. Some of the errors are random, whereas others systematically skew the results. Open up any paper in any reasonable scientific journal, and you’ll see all kinds of jagged graphs and spurious peaks and irrelevant features. More than anything else, what makes someone a good scientist is the ability to interpret data effectively in the presence of an array of red herrings and potential confounds.

The most crucial thing that makes such interpretation possible is detailed knowledge of how the data was collected and intimate familiarity with the data itself (the kind that results from staring at it, in different ways and through different lenses, for months and years).

Every single time you try to do a culturomic analysis, you need to make sure your conclusions are not confounded by any of the classic types of error. As an example, let’s look at figure 4E of the paper, a study of changes in usage frequency of the Nazis, and of people they censored, during the Third Reich.

Here’s what we wrote about it in the paper:

We probed the impact of censorship on a person’s cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose “undesirable”, “degenerate” work was banned from libraries and museums and publicly burned (26-28). We plotted median usage in German for five such lists: artists (100 names), as well as writers of Literature (147), Politics (117), History (53), and Philosophy (35) (Fig 4E). We also included a collection of Nazi party members [547 names, ref (7)]. The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase; ref (7)].

As we note in the supplemental materials, the list of suppressed artists comes directly from the catalog of the Nazis’ ‘degenerate’ art exhibition, as reconstructed recently for an exhibit at the LA County Museum of Art. The ‘degenerate’ art exhibition was a government-sponsored exhibit that cruelly mocked art the Nazis disliked. The other four blacklists were created by Wolfgang Hermann. Finally, the list of Nazis is one that we compiled automatically, by gathering all individuals in Wikipedia who were assigned to several Wikipedia categories associated with the Nazis.

Note that the inclusion of the latter group is an example of a classic scientific technique: the use of a ‘control group’ to be contrasted to the victims of censorship. The use of careful controls is invaluable in demonstrating that the property being contrasted (being a person on the blacklist vs. being a Nazi) is driving the effect observed.

Of course, the most striking feature of the data is that there is a dramatic difference between the trajectories of the suppressed groups and those of the Nazis (i.e., the control group). This is consistent with the hypothesis that these groups were effectively censored, whereas the fame of the Nazis increased because of the regime. Still, we’re not done. Here are some of the questions you might ask yourself before drawing a conclusion.

Could it be random noise? To check, we might look at (i) statistical significance; (ii) individual trajectories from the various groups, to see whether they are consistent with the overall trend; and (iii) results for many different lists of Nazis and censorship victims culled from different places; presumably, if this is dumb luck, then different and largely or completely disjoint lists should not all exhibit the same tendency. In fact, all three of these checks suggest that the effect is real.
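
As an illustration of check (ii), here is a sketch of how one might compute a group’s median trajectory to compare individual curves against. It assumes you have already extracted, from the downloaded German data, a dict mapping each name on a list to its yearly normalized frequencies; the variable names are hypothetical.

    from statistics import median

    def group_median_trajectory(trajectories, years):
        """trajectories: {name: {year: normalized frequency}} for one list of people.
        Returns the median frequency across the group for each year, so that
        individual curves can be checked for consistency with the group trend."""
        return {year: median(t.get(year, 0.0) for t in trajectories.values())
                for year in years}

    # med = group_median_trajectory(suppressed_philosophers, range(1920, 1946))
    # Plotting each individual trajectory against `med` shows whether the decline
    # during the Third Reich is broad-based or driven by just a few names.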

Could the dramatic difference be the result of OCR? No, that’s inconceivable.

Could it result from misdated books? No; there’s no reason to believe that books containing names of Nazis are so selectively misdated to the period of the Third Reich, and that books containing the names of censorship victims exhibit such a strong tendency to get misdated away from that period.

(Another way to rule out the above two possibilities is to check whether the same effects are seen in English; but they aren’t, even though the English corpus OCR is identical and the date metadata was assembled using the same algorithm. We illustrate a typical example of the difference between the corpora in Figure 4A using the trajectory of ‘Marc Chagall’ in English and in German.)

Finally, could this be the result of a bias in library acquisitions during the period? That is, were more books about the censorship victims being published than our corpus reflects, and fewer about the Nazis, with libraries tending to avoid the former and embrace the latter?

Sure. But of course, that’s the point: German libraries were being told to take these books off their shelves, and rioters were being told to burn them. So whether the bias is entirely due to publication rates, or entirely due to library acquisition/retention patterns, or some combination of the two, what we are seeing is consistent with the hypothesis that the blacklisted groups were being censored.

Finally, it’s worth noting that the censorship seemed to be much more effective against some groups (e.g., those writing about philosophy) than others (e.g., those writing about history). Again, the fact that we’ve carefully compared a number of groups that are very similar except for one variable (what they write about) makes it more reasonable to believe that the difference in the magnitude of the observed effect reflects the Nazis’ greater effectiveness in suppressing one group vs. another.

Why might that be?

Sounds like a job for close reading.


VI. N-grams viewer

1. Why don’t you take into account that there are more books around now than there were back in 1800?

We do. When we plot an n-gram, either in the paper or on the viewer, we always divide the number of matches in a given year (stored in the n-gram tables) by the total number of words in the relevant corpus in that year. We do this separately for every single year in the plot.
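
In case it helps to see the normalization spelled out, here is a minimal sketch in Python. It assumes you have the yearly match counts for your n-gram and the yearly total word counts for the corpus (the downloadable data includes per-year totals; check the format of the version you use).

    def normalized_frequency(match_counts, total_words):
        """Divide each year's match count by the total number of words
        in the corpus for that year.

        match_counts -- {year: matches for the n-gram in that year}
        total_words  -- {year: total word (1-gram) count of the corpus}
        """
        return {year: count / total_words[year]
                for year, count in match_counts.items()
                if total_words.get(year)}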

2. Sometimes I have a term that’s mostly zero until the middle of the 20th century, but beforehand I still see little peaks or plateaus that appear from out of nowhere and then disappear.

Errors in the date assigned to a book can sometimes lead to little peaks from out of nowhere; for instance if a book from 1985 gets misdated as 1885 and brings all of its 1985 lingo with it. This is especially common in 1899 and 1905 for reasons described in the question about metadata quality. This can also happen because a book first published in 1885 is reprinted much later, with a preface written much later, and the new edition is scanned and assigned a date of 1885.

In such a case, you’ll see a small peak (without smoothing) or a small plateau (with smoothing).

After you’ve stared at a thousand of these, you’ll almost always correctly recognize such features as noise without giving them a second thought.

3. Sometimes I see a really big peak for a query in a particular year, and I can’t explain it.

Occasionally there’s an individual book that is just obsessed with a particular n-gram (for instance, a biography). If that n-gram is extremely rare in general, this will lead to a disproportionate increase in the frequency of that n-gram in the entire dataset for that year. If smoothing is turned on, that huge peak can become a big plateau.

Of course, this affects counts of how frequently an n-gram appears in books (the only view currently available in the N-gram Viewer), but not the fraction of books containing a particular n-gram (which is available in the publicly downloadable data). So it is really easy to tell when this is happening if you download the raw data.
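
Here is a sketch of how one might use those two kinds of counts to spot one-book spikes. It assumes each year’s row in the raw tables provides both a match count and a count of distinct books containing the n-gram; the heuristic and its threshold are our own illustration, not anything from the paper.

    def one_book_suspects(yearly, threshold=5.0):
        """yearly: {year: (match_count, volume_count)} for one n-gram.
        Flags years whose matches-per-book ratio is far above the n-gram's
        median ratio -- a crude sign that a single obsessed book dominates."""
        ratios = {y: m / v for y, (m, v) in yearly.items() if v}
        if not ratios:
            return []
        med = sorted(ratios.values())[len(ratios) // 2]
        return sorted(y for y, r in ratios.items() if r > threshold * med)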

4. The N-grams Viewer isn’t powerful enough. I can’t use it to do X.

Many things are impossible because the underlying n-grams tables aren’t a powerful enough data type: for instance, knowing how frequently the same book talks about two given topics. We’re sorry about that.

Many other things, like figuring out which words are most likely to collocate with a target word, summing and averaging over many trajectories, running thousands of queries in batch, and retrieving raw data, are possible using n-grams tables, but not enabled at the present time through the N-grams viewer.

In these cases, please remember that the viewer is provided for the convenience of the public and, for the computationally savvy, as an advertisement for the underlying data. What you can do if you have a copy of the data on your computer dwarfs what you can do with the browser, and it always will.

For instance, not a single analysis done in the paper is possible using the N-grams browser. If you’ve got the skills and you’re excited about the n-grams, download the data.
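
As one example of what the raw tables make easy, here is a sketch of a batch query: pulling the trajectories of a whole list of n-grams in a single pass over the downloaded files. As before, it assumes tab-separated rows beginning with the n-gram, year, and match count, and the filename in the usage comment is illustrative.

    import csv
    from collections import defaultdict

    def batch_trajectories(paths, targets):
        """Collect {ngram: {year: match_count}} for every target n-gram found,
        scanning each downloaded file exactly once."""
        targets = set(targets)
        out = defaultdict(dict)
        for path in paths:
            with open(path, encoding='utf-8') as f:
                for row in csv.reader(f, delimiter='\t'):
                    if row[0] in targets:
                        out[row[0]][int(row[1])] = int(row[2])
        return dict(out)

    # traj = batch_trajectories(['eng-1gram-part-00.csv'], ['slavery', 'Slavery'])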

5. Sometimes I click through to the individual books, and there are lots of mistakes.

A really important thing to bear in mind is that the current N-grams viewer links to a search for your n-gram over all Google Books, not just the ones in the corpus. That means you see lots of erroneous hits that were actually filtered out in producing the n-grams data.

We understand that this is far from ideal. Given the deadline, the alternative would have been to have no convenient way of getting from the N-grams viewer to the primary texts, and that would have been much worse. And remember, the viewer is a work-in-progress.


VII. About culturomics

1. Why did you call your approach ‘culturomics’?

Various fields with the suffix “-omics” (genomics, proteomics, transcriptomics, and a host of others) have emerged in recent years. These approaches tend to rely on (i) hundreds or thousands of people in massive, multi-institutional and multi-national consortia, (ii) novel technologies enabling the assembly of vast datasets containing a very specific type of data, and (iii) the deployment of sophisticated computational and quantitative methods in order to interpret the resulting data. These fields have created data resources and computational infrastructures that have energized biology.

The effort to digitize and analyze the world’s books has proceeded along these lines. We hope that it will be a forerunner of similar efforts whose goal will be to digitize and analyze other aspects of recorded history, from newspapers, to manuscripts, to incunabula, to artwork. Such efforts would create resources of extraordinary power for scholars and scientists interested in the study of human culture.

So, it’s not the most beautiful word, but it captures what we think we’re trying to do.

2. How does this relate to corpus linguistics?

The data we have released can be used to study certain linguistic phenomena in phenomenal detail. But that’s just one application. Most of what we do in the first paper isn’t linguistics at all.

3. How does this relate to “humanities computing” and “digital humanities”?

Culturomics is part of what's known as "humanities computing" or the "digital humanities". Of course, the digital humanities is a very broad field, comprising a vast array of ways in which computation can help humanists. It includes such things as tools that aid in teaching, citation, and collaboration, as well as digital collections of various types.

Culturomics is much more narrowly defined: its goal is to digitize and analyze data about culture on extremely large scales: all books, all newspapers, all manuscripts, etc.
