Archive for December, 2010

Google Opens Books to New Cultural Studies

I found this publication to be most interesting. Try it out with names, subject topics, research trends, etc. most interesting the comparision with geographic locations.

Yours, Michael J. Gomez

Science  17 December 2010:
Vol. 330 no. 6011 p. 1600
DOI: 10.1126/science.330.6011.1600

* News of the Week

Digital Data
Google Opens Books to New Cultural Studies

1. John Bohannon

In March 2007, a young man with dark, curly hair and a Brooklyn accent knocked on the door of Peter Norvig, the head of research at Google in Mountain View, California. It was Erez Lieberman Aiden, a mathematician doing a Ph.D. in genomics at Harvard University, and he wanted some data. Specifically, Lieberman Aiden wanted access to Google Books, the company’s ambitious—and controversial—project to digitally scan every page of every book ever published.

By analyzing the growth, change, and decline of published words over the centuries, the mathematician argued, it should be possible to rigorously study the evolution of culture on a grand scale. “I didn’t think the idea was crazy,” recalls Norvig. “We were doing the scanning anyway, so we would have the data.”

The first explorations of the Google Books data are now on display in a study published online this week by Science (www.sciencemag.org/content/early/2010/12/16/science.1199644.abstract). The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians. “The ambition is enormous,” says Nicholas Dames, a literary scholar at Columbia University.
Figure
View larger version:

* In this page
* In a new window

“CREDITS: J. B. MICHEL ET AL.; WORDLE.COM”

The project almost didn’t get off the ground because of the legal uncertainty surrounding Google Books. Most of its content is protected by copyright, and the entire project is currently under attack by a class action lawsuit from book publishers and authors. Norvig admits he had concerns about the legality of sharing the digital books, which cannot be distributed without compensating the authors. But Lieberman Aiden had an idea. By converting the text of the scanned books into a single, massive “n-gram” database—a map of the context and frequency of words across history—scholars could do quantitative research on the tomes without actually reading them. That was enough to persuade Norvig.

Lieberman Aiden teamed up with fellow Harvard Ph.D. student Jean-Baptiste Michel. The pair were already exploring ways to study written language with mathematical techniques borrowed from evolutionary biology. Their 2007 study of the evolution of English verbs, for example, made the cover of Nature. But they had never contended with the amount of data that Google Books offered. It currently includes 2 trillion words from 15 million books, about 12% of every book in every language published since the Gutenberg Bible in 1450. By comparison, the human genome is a mere 3-billion-letter poem.

Michel took on the task of creating the software tools to explore the data. For the analysis, they pulled in a dozen more researchers, including Harvard linguist Steven Pinker. The first surprise, says Pinker, is that books contain “a huge amount of lexical dark matter.” Even after excluding proper nouns, more than 50% of the words in the n-gram database do not appear in any published dictionary. Widely used words such as “deletable” and obscure ones like “slenthem” (a type of musical instrument) slipped below the radar of standard references. By the research team’s estimate, the size of the English language has nearly doubled over the past century, to more than 1 million words. And vocabulary seems to be growing faster now than ever before.

It was also possible to measure the cultural influence of individual people across the centuries. For example, notes Pinker, tracking the ebb and flow of “Sigmund Freud” and “Charles Darwin” reveals an ongoing intellectual shift: Freud has been losing ground, and Darwin finally overtook him in 2005.

Analysis of the n-gram database can also reveal patterns that have escaped the attention of historians. Aviva Presser Aiden led an analysis of the names of people that appear in German books in the first half of the 20th century. (She is a medical student at Harvard and the wife of Erez Lieberman Aiden.) A large number of artists and academics of this era are known to have been censored during the Nazi period, for being either Jewish or “degenerate,” such as the painter Pablo Picasso. Indeed, the n-gram trace of their names in the German corpus plummets during that period, while it remains steady in the English corpus.

Once the researchers had identified this signature of political suppression, they analyzed the “fame trace” of all people mentioned in German books across the same period, ranking them with a “suppression index.” They sent a sample of those names to a historian in Israel for validation. Over 80% of the people identified by the suppression index are known to have been censored—for example, because their names were on blacklists—proving that the technique works. But more intriguing, there is now a list of people who may have been victims of suppression unknown to history.

“This is a wake-up call to the humanities that there is a new style of research that can complement the traditional styles,” says Jon Orwant, a computer scientist and director of digital humanities initiatives at Google. In a nod to data-intensive genomics, Michel and Lieberman Aiden call this nascent field “culturomics.” Humanities scholars are reacting with a mix of excitement and frustration. If the available tools can be expanded beyond word frequency, “it could become extremely useful,” says Geoffrey Nunberg, a linguist at the University of California, Berkeley. “But calling it ‘culturomics’ is arrogant.” Nunberg dismisses most of the study’s analyses as “almost embarrassingly crude.”

Although he applauds the current study, Dames has a score of other analyses he would like to perform on the Google Books corpus that are not yet possible with the n-gram database. For example, a search of the words in the vicinity of “God” could reveal “semantic shifts” over history, Dames says. But the current database only reveals the five-word neighborhood around any given term.

Orwant says that both the available data and analytical tools will expand: “We’re going to make this as open-source as possible.” With the study’s publication, Google is releasing the n-gram database for public use. The current version is available at www.culturomics.org.

Comments off

Greetings

I wish you health, happiness, peace and joy during the holiday season and for 2011.

Happy Christmas

Marcia,  IAMSLIC President

Comments (1)

Membership Committee

Firstly a welcome to Sally Taylor as the new CYAMUS rep on the Committee and thanks to outgoing Janet Webster for her years of work and good ideas.
Do you know what the Committee does and how to contact your reps? http://www.iamslic.org/people/committees-taskforces/membership
Do you have ideas or a passion to help recruit new members? If so, we want to hear from you.
Do you remember your password and how to access the Membership Directory, Resource Sharing etc.?
We send an information note in English, French or Spanish to new members with passwords and how to access the various services, systems and people. Old members who have forgotten ‘how to’ can request a copy – and just this once my email follows.
Jean Collins on behalf of Membership Committee collinsjean1@gmail.com

Comments off

Volunteer opportunity in the Galapagos Islands

Dear IAMSLIC members,

I have been living in the Galapagos for the past three months as Volunteer Librarian at the Charles Darwin Research Station. It has been a wonderful opportunity to manage a small library with such a specialized collection and to live in such a unique environment.

I will be leaving at the end of April, so we are looking for a new volunteer to start in April or May 2011 for a 6-12 month period. We are accepting applications now and plan to make a decision by late January.

If you are interested, please see the volunteer posting at:

http://darwinfoundation.org/english/pages/interna.php?txtCodiInfo=11
[click on the PDF Volunteer for FCD Librarian]

Please email me at sally.taylor@ubc.ca if you have any questions.

Sally Taylor

My lunch spot

Comments off

WAVES – Our renewed Online Library Catalogue

Now available!  WAVES the renewed online library catalogue

The Fisheries and Oceans Canada libraries are proud to present our new WAVES catalogue. One-stop access to the collections of all 11 Department of Fisheries and Oceans (DFO) Canada and Canadian Coast Guard (CCG) libraries is easier than ever. Choose the new simple search, or go to the familiar advanced search to access one of the most comprehensive collections of marine science literature anywhere.

Full text DFO and CCG publications are easy to find and access online with a simple click of the mouse.

The new WAVES catalogue is available around the world on the Internet.

Bookmark our new url: http://waves-vagues.dfo-mpo.gc.ca/waves-vagues/eng.htm 

Marcia Croy-Vanwely, IAMSLIC President

Fisheries & Oceans Canada

Comments off