Brewster Kahle of the Internet Archive opened Museums and the Web 2007 with an inspiring keynote address. He argued that providing universal access to all published human knowledge is within our grasp. Yes, you read that correctly: public access to all published texts, audio, video, etc. is possible and practical with our current infrastructure. We’re talking Sumerian tablets to the latest thing deposited in the U.S. Library of Congress.
And, Kahle reports, this project can be undertaken relatively economically. Take books, for example. Kahle estimates that maybe 100 million books have ever been published. One book, he said, can be digitized to a size of 1 MB, so 100 million books = 100 terabytes. The computer hardware to store 100 terabytes costs about $150,000 and fits in a podium-sized cabinet.
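For anyone who wants to check the storage arithmetic, here is a rough sketch using the two figures Kahle gave (about 1 MB per book, about 100 million books). The function name and the decimal unit conversion (1 TB = 1,000,000 MB) are my own assumptions, not anything from the talk.

```python
# Back-of-the-envelope storage estimate from the keynote's figures.
MB_PER_BOOK = 1              # Kahle's estimate: one digitized book is about 1 MB
TOTAL_BOOKS = 100_000_000    # Kahle's estimate of all books ever published
MB_PER_TB = 1_000_000        # assumption: decimal units (1 TB = 10^6 MB)


def storage_tb(num_books, mb_per_book=MB_PER_BOOK):
    """Return the storage needed for num_books, in terabytes."""
    return num_books * mb_per_book / MB_PER_TB


if __name__ == "__main__":
    print(storage_tb(TOTAL_BOOKS))  # all published books: 100.0 TB
    print(storage_tb(1_000_000))    # one million books: 1.0 TB
```

At these figures a million books need only about one terabyte; it takes the full 100 million to fill the 100-terabyte cabinet.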
Anyone can download these books to read them, or could order books via print on demand. Printing and binding a 100-page book costs $1. So your average book, at roughly 300 pages, would be about $3 to print and give away to folks in regions of the world where there aren’t many books to be had. Kahle cited a Harvard study that concluded it costs $3 to lend a book to people, so why not, Kahle asked, give the books away? And Kahle’s group did this very thing, putting together bookmobiles to be sent to such remote areas as rural Uganda to print books. (Problem: They hadn’t yet digitized the right books, the ones that would be in demand in Uganda. It appears the Ugandans found the selection lacking.) One alternative to print on demand for developing countries: the $100 laptop from MIT.
Of course, there are also the costs of scanning and digitizing all these books to be taken into account. It costs about 10¢/page to do this domestically, or $10/book to send the tomes to India or China, have them scanned, and sent back. Scanning books domestically, Kahle estimates, would cost $30 million per million books.
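The scanning numbers can be sanity-checked the same way. This is only a sketch: the 300-page average is my assumption (it is the page count that makes 10¢/page come out to $30 million per million books), and the function names are made up for illustration.

```python
# Rough comparison of the two scanning options mentioned in the keynote.
DOMESTIC_CENTS_PER_PAGE = 10     # from the talk: about 10 cents/page domestically
OFFSHORE_DOLLARS_PER_BOOK = 10   # from the talk: $10/book via India or China
PAGES_PER_BOOK = 300             # assumption: average length implied by the totals


def domestic_cost(num_books, pages_per_book=PAGES_PER_BOOK):
    """Dollar cost of scanning num_books domestically at a per-page rate."""
    return num_books * pages_per_book * DOMESTIC_CENTS_PER_PAGE / 100


def offshore_cost(num_books):
    """Dollar cost of shipping num_books abroad for scanning at a flat rate."""
    return num_books * OFFSHORE_DOLLARS_PER_BOOK


if __name__ == "__main__":
    books = 1_000_000
    print(f"Domestic: ${domestic_cost(books):,.0f}")  # Domestic: $30,000,000
    print(f"Offshore: ${offshore_cost(books):,.0f}")  # Offshore: $10,000,000
```

Under that assumption, domestic scanning works out to $30 a book, three times the offshore rate.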
Any giant book digitization project must contend with copyright issues and access to the physical books. Out-of-copyright works are free of legal constraints, but the printed copies themselves aren’t always readily available for scanning. They’re in private collections or being preserved in archives, for example. In-copyright books are a sticky wicket. Kahle reports that out-of-print books are, by definition, not commercially viable, and thus negotiating the right to print them noncommercially is apparently not too difficult. Books that are still in print, however, will probably have to be digitized by their publishers, who will want, in turn, to keep them under commercial lock and key.
Currently, the Internet Archive digitizes books at a rate of 12,000 per month.
Kahle is also interested in audio files. He estimates there are two to three million published audio works. However, rights issues are, in Kahle’s words, “thornier” than for texts. Still, commercial recordings aside, there are plenty of folk cultures that may want to preserve and distribute their aural culture yet lack the resources to do so. The Internet Archive promises such groups unlimited storage and bandwidth forever, for free. The Archive offers the same deal for legal “bootleg” recordings of rock concerts, where the bands gave fans permission to record their shows. Before putting these online, the Archive secures permission from the bands once again: as Kahle points out, there’s a big difference between recording a concert to swap on tape and putting that concert online for the world to access. Currently, the Archive has 36,000 concerts online. The audio files also include speeches, radio, commercials, and more. Such materials cost $10/disk or $10 per hour of audio to digitize.
Moving on to moving images: There are 150,000 to 200,000 films that have been released theatrically. Over half of these, Kahle reports, are Indian. Currently, 800 of these movies that have fallen into the public domain are available on the Internet Archive. It costs $100 to $200 per hour of movie to digitize celluloid. The Archive is also recording material from 20 television networks worldwide, but has not placed its million hours of TV content online because of copyright restrictions. The one exception? You can find TV broadcasts from the week of September 11, 2001.
By opening up the servers to anyone with a movie to upload, the Internet Archive is serving subcultures that aren’t widely known. These include speed runs–videos of people navigating entire computer games in record time–and animation of Lego bricks.
The Internet Archive is perhaps most famous for the Wayback Machine, which has been collecting web pages since 1996. The Archive collects the entire web every two months. The two-month window is a key one because the average web page life is 100 days–that is, after 100 days it’s likely to have been changed or deleted.
Kahle referenced historical burning and pillaging of libraries, and emphasized the importance of having more than one copy of the Archive. Accordingly, the Archive has given one copy to the new Library of Alexandria. The Archive is also being copied onto servers in Amsterdam. Currently, the primary Archive is in San Francisco and represents a petabyte of information. For those of you keeping track at home, that’s 1,000,000,000,000,000 bytes.
Kahle pointed out that facilitating the sharing of materials that don’t originate within an institution remains a novel idea. Still, respectful cultural institutions that don’t make a profit have a good record of getting permissions and rights for materials of all sorts. The exception is with software, which is protected by the Digital Millennium Copyright Act. The Archive spent $30,000 in legal fees to eke out three years of permission to reproduce the materials.
Kahle’s passion for the subject peaked when he discussed the political and social issues surrounding the project. He was especially emphatic that the aggregate material currently owned by cultural institutions must not end up behind commercial gates. He declared such a scenario would be “a nightmare.” Already, he reminded the audience, academic content in journals is inaccessible to most people, locked behind paid institutional subscriptions accessible only on corporate sites.
His motto? “Public or Perish.”