{"id":251,"date":"2010-12-23T12:37:23","date_gmt":"2010-12-23T12:37:23","guid":{"rendered":"http:\/\/www.iamslic.org\/blog\/?p=251"},"modified":"2010-12-23T12:37:23","modified_gmt":"2010-12-23T12:37:23","slug":"google-opens-books-to-new-cultural-studies","status":"publish","type":"post","link":"https:\/\/www.iamslic.org\/blog\/?p=251","title":{"rendered":"Google Opens Books to New Cultural Studies"},"content":{"rendered":"<p>I found this publication to be most interesting. Try it out with names, subject topics, research trends, etc. Most interesting is the comparison with geographic locations.<\/p>\n<p>Yours, Michael J. Gomez<\/p>\n<p>Science 17 December 2010:<br \/>\nVol. 330 no. 6011 p. 1600<br \/>\nDOI: 10.1126\/science.330.6011.1600<\/p>\n<p>* News of the Week<\/p>\n<p>Digital Data<br \/>\nGoogle Opens Books to New Cultural Studies<\/p>\n<p>1. John Bohannon<\/p>\n<p>In March 2007, a young man with dark, curly hair and a Brooklyn accent knocked on the door of Peter Norvig, the head of research at Google in Mountain View, California. It was Erez Lieberman Aiden, a mathematician doing a Ph.D. in genomics at Harvard University, and he wanted some data. Specifically, Lieberman Aiden wanted access to Google Books, the company&#8217;s ambitious&#8212;and controversial&#8212;project to digitally scan every page of every book ever published.<\/p>\n<p>By analyzing the growth, change, and decline of published words over the centuries, the mathematician argued, it should be possible to rigorously study the evolution of culture on a grand scale. &#8220;I didn&#8217;t think the idea was crazy,&#8221; recalls Norvig. 
&#8220;We were doing the scanning anyway, so we would have the data.&#8221;<\/p>\n<p>The first explorations of the Google Books data are now on display in a study published online this week by Science (<a href=\"http:\/\/www.sciencemag.org\/content\/early\/2010\/12\/16\/science.1199644.abstract\" target=\"_blank\">www.sciencemag.org\/content\/early\/2010\/12\/16\/science.1199644.abstract<\/a>). The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians. &#8220;The ambition is enormous,&#8221; says Nicholas Dames, a literary scholar at Columbia University.<\/p>\n<p>Figure: &#8220;CREDITS: J. B. MICHEL ET AL.; WORDLE.COM&#8221;<\/p>\n<p>The project almost didn&#8217;t get off the ground because of the legal uncertainty surrounding Google Books. Most of its content is protected by copyright, and the entire project is currently under attack by a class action lawsuit from book publishers and authors. Norvig admits he had concerns about the legality of sharing the digital books, which cannot be distributed without compensating the authors. But Lieberman Aiden had an idea. By converting the text of the scanned books into a single, massive &#8220;n-gram&#8221; database&#8212;a map of the context and frequency of words across history&#8212;scholars could do quantitative research on the tomes without actually reading them. That was enough to persuade Norvig.<\/p>\n<p>Lieberman Aiden teamed up with fellow Harvard Ph.D. student Jean-Baptiste Michel. The pair were already exploring ways to study written language with mathematical techniques borrowed from evolutionary biology. 
Their 2007 study of the evolution of English verbs, for example, made the cover of Nature. But they had never contended with the amount of data that Google Books offered. It currently includes 2 trillion words from 15 million books, about 12% of every book in every language published since the Gutenberg Bible in 1450. By comparison, the human genome is a mere 3-billion-letter poem.<\/p>\n<p>Michel took on the task of creating the software tools to explore the data. For the analysis, they pulled in a dozen more researchers, including Harvard linguist Steven Pinker. The first surprise, says Pinker, is that books contain &#8220;a huge amount of lexical dark matter.&#8221; Even after excluding proper nouns, more than 50% of the words in the n-gram database do not appear in any published dictionary. Widely used words such as &#8220;deletable&#8221; and obscure ones like &#8220;slenthem&#8221; (a type of musical instrument) slipped below the radar of standard references. By the research team&#8217;s estimate, the size of the English language has nearly doubled over the past century, to more than 1 million words. And vocabulary seems to be growing faster now than ever before.<\/p>\n<p>It was also possible to measure the cultural influence of individual people across the centuries. For example, notes Pinker, tracking the ebb and flow of &#8220;Sigmund Freud&#8221; and &#8220;Charles Darwin&#8221; reveals an ongoing intellectual shift: Freud has been losing ground, and Darwin finally overtook him in 2005.<\/p>\n<p>Analysis of the n-gram database can also reveal patterns that have escaped the attention of historians. Aviva Presser Aiden led an analysis of the names of people that appear in German books in the first half of the 20th century. (She is a medical student at Harvard and the wife of Erez Lieberman Aiden.) 
A large number of artists and academics of this era are known to have been censored during the Nazi period, for being either Jewish or &#8220;degenerate,&#8221; such as the painter Pablo Picasso. Indeed, the n-gram trace of their names in the German corpus plummets during that period, while it remains steady in the English corpus.<\/p>\n<p>Once the researchers had identified this signature of political suppression, they analyzed the &#8220;fame trace&#8221; of all people mentioned in German books across the same period, ranking them with a &#8220;suppression index.&#8221; They sent a sample of those names to a historian in Israel for validation. Over 80% of the people identified by the suppression index are known to have been censored&#8212;for example, because their names were on blacklists&#8212;proving that the technique works. But more intriguing, there is now a list of people who may have been victims of suppression unknown to history.<\/p>\n<p>&#8220;This is a wake-up call to the humanities that there is a new style of research that can complement the traditional styles,&#8221; says Jon Orwant, a computer scientist and director of digital humanities initiatives at Google. In a nod to data-intensive genomics, Michel and Lieberman Aiden call this nascent field &#8220;culturomics.&#8221; Humanities scholars are reacting with a mix of excitement and frustration. If the available tools can be expanded beyond word frequency, &#8220;it could become extremely useful,&#8221; says Geoffrey Nunberg, a linguist at the University of California, Berkeley. 
&#8220;But calling it &#8216;culturomics&#8217; is arrogant.&#8221; Nunberg dismisses most of the study&#8217;s analyses as &#8220;almost embarrassingly crude.&#8221;<\/p>\n<p>Although he applauds the current study, Dames has a score of other analyses he would like to perform on the Google Books corpus that are not yet possible with the n-gram database. For example, a search of the words in the vicinity of &#8220;God&#8221; could reveal &#8220;semantic shifts&#8221; over history, Dames says. But the current database only reveals the five-word neighborhood around any given term.<\/p>\n<p>Orwant says that both the available data and analytical tools will expand: &#8220;We&#8217;re going to make this as open-source as possible.&#8221; With the study&#8217;s publication, Google is releasing the n-gram database for public use. The current version is available at <a href=\"http:\/\/www.culturomics.org\/\" target=\"_blank\">www.culturomics.org<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I found this publication to be most interesting. Try it out with names, subject topics, research trends, etc. Most interesting is the comparison with geographic locations. Yours, Michael J. Gomez Science 17 December 2010: Vol. 330 no. 6011 p. 
1600 DOI: 10.1126\/science.330.6011.1600 * News of the Week Digital Data Google Opens Books to New Cultural Studies [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[8],"tags":[98],"_links":{"self":[{"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/251"}],"collection":[{"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=251"}],"version-history":[{"count":0,"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/251\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.iamslic.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}