November 17, 2008
Google Digitization Project Meeting
Ed Van Gemert, UW Libraries; Frances Haugen, Google
Ed Van Gemert:
The focus of UW Madison’s digitization efforts has been on sending historical documents and government documents. Google digitizes 10,000 of our items every 6 months; we have a 500,000-item commitment for the duration of the agreement. No money changes hands. Outcomes include greater public access to documents; our Google partnership; and the recent settlement agreement between Google, Authors’ Guild, and AAUP. The agreement builds a book registry for in-copyright material, with the ability to secure an institutional subscription to all content, in and out of copyright. We will evaluate the value of such a subscription for networking across the UW System. We have digitized 160,000 items to this point, with public-domain materials included in MadCat. We are also preserving the digital files at the University of Michigan.
Thanks to Project Manager Irene Zimmerman, WHS partners for contributing about a third of the content, the Google project management team, UW Madison Legal Services, and the campus administration.
Haugen works in the area of search quality, exploring new ways to expose content in books. Why Google does what it does: to “organize the world’s information and make it universally accessible and useful.”
Book search has three main goals: comprehensiveness (digitizing all the world’s books), accessibility (any person, anywhere), and diversity (a broad reach of subjects and languages, intentionally scanning books from under-represented areas). The in-print books that can be shown come from publishers in the partner program, and the out-of-print items that can be shown come from libraries; the other 75% of content is hidden. Every day, users preview 40% of partner books and 17% of public domain books. Every month, users preview 81% of partner books and 78% of public domain books. Preview means at least one page read, but when the threshold is set at ten pages read, the numbers still stay high (e.g. 60% of public domain books are previewed).
An application of interest for Google book search in education is real-time collaboration. Google tools can be used for collaboration, research projects, real-time feedback on essays, peer editing, cooperative note-taking, and online assignment submission. Examples of applications for education:
- Can take a quote from a primary source document and do a book search for books that include/discuss quote and context around it.
- Can use publication date in book search: “War date:1861-1865”
- Can highlight part of the book (public domain content), create a document and paste in the link to bring in image from book / snippet.
- Can compose course packs
- Can share a “library”, labels & reviews.
Question and Answer:
- Google is exploring ways to make a standard-form book more readable on a phone.
- What does Google do to help users become aware of features of tools and links between them? They don’t distribute these things very well.
- How do they protect in-copyright books when users can take multiple snippets and string them together? Snippets are pre-computed: each page is divided into units, and some lines can never be seen, even though a search can match them. In addition, 10% of the pages are blacklisted. If a copyright owner opts out, then you can only see a metadata page.
- The sales team talks directly to publishers: by making your book accessible, you can dramatically improve sales. Google is looking for more ways to give value back to rights-holders, for example, data on how books are read (e.g. heatmaps).
- How is relevancy ranking being developed in the full-text book search environment? This is a multilayered process: a guide to the book, and help with reading and using content. Possibly, development of sub-corpora. Data from human interaction with books makes new functionality possible.
- Some problems are technological (e.g. scan quality is a processing issue that Google can handle better with sophisticated de-warping technologies); some problems are operational.
- Do you envision a day when scholars can work with structured markup, rather than the images? With the settlement, there will be a research corpus.
- Algorithmic creation of structured markup from images may enable every book to be an e-book.
- The settlement outlines procedures for determining whether a book is in copyright.
11/21/08 Sarah McDaniel for the University Library Committee