Memory institutions know the headaches of storing their ever-expanding physical collections: fire, flood, access and space over the long term. But storing digital assets presents even more diverse challenges: attacks by hackers, deep fakes, censorship, and the unforeseeable cost of storing bits for centuries. Could a new approach, decentralized storage, offer some solutions? That was the focus of an Internet Archive webinar on February 24.
In the utopian version of decentralized storage, there would be collaborative, authenticated, co-hosted collections. Wendy Hanamura, Director of Partnerships at the Internet Archive, said this would make information less prone to censorship and less vulnerable to a security breach. “Taken together, resiliency, persistence, self-certification and interoperability — that is the promise of decentralized storage,” she said.
Librarians and archivists are a key part of creating a solution that is networked, said Jonathan Dotan, Founder of the Starling Lab, the first major research lab devoted to Web 3.0 technologies.
“As a community, if we can all come together to guarantee the integrity of information, we’re in a unique position to create a new foundation of digital trust,” Dotan said. “When we think about decentralization, it’s not a single destination. It’s an unfolding process in which we continually strive to bring more and more diverse nodes into our system. And the more diverse those nodes are, the more that they’re going to be able to store and verify information.”
Other speakers at the webinar included Arkadiy Kukarkin, Decentralized Web Lead Engineer for the Internet Archive, and Dominick Marino, Senior Solutions Architect and Ecosystem lead at STORJ.
Register for the next session: Keeping Your Personal Data Personal: How Decentralized Identity Drives Data Privacy March 31 @ 1pm PT / 4pm ET Register here
I was fourteen years old when I watched the Egyptian revolution unfold before my eyes. One of the main things people protested against was the degree of censorship everyone was subjected to. Book bans in particular were popular for many decades leading up to the revolution. Interestingly, eleven years after the revolution, I am seeing the same arguments the Egyptian government made in Egypt for book bans made here in America by local school boards and politicians. My experience has taught me that, regardless of content, book banning is harmful because it weakens the democratic process and works against making societies cohesive.
The Egyptian government extensively banned books during the latter half of the 20th century. Those in power argued for the need for more parental and educational control. They also made arguments focused on the effect certain books have on polarizing the public on race, politics, religion, or sex and the importance of maintaining social order and decorum. Books discussing political and religious themes were banned with the most frequency, including a novel by Nobel laureate Naguib Mahfouz. Thus, the product of book banning was the revolution—many years later. In essence, the revolution was an amalgamation of a seemingly unidentified people, split among social, religious, and political lines, coming together to reconcile the calamities of over half a century. Namely, the effects of being unable to discuss relevant and pertinent ideas and issues—a side effect of book bans.
As I am wrapping up my last semester in law school I see parallels of what happened in Egypt taking place in America: people split among political, social, and sometimes religious lines. They are divided over issues that have come up partly due to discrimination, police brutality, and more recently and intensely, book bans.
In Florida, a school removed 16 books pending review because they contained “obscene material,” including Khaled Hosseini’s The Kite Runner and Toni Morrison’s Beloved. In Washington, a school district removed Harper Lee’s To Kill a Mockingbird because of its depiction of race relations and use of racist language. And last month, a school board in Tennessee voted unanimously to ban Maus, a Pulitzer Prize-winning graphic novel about the Holocaust. The school board argued this book should not be taught in classrooms because it contains material that is inappropriate for students, specifically because “of its unnecessary use of profanity and nudity and its depiction of violence and suicide.” In other words, people and local governments are making the same arguments in support of book bans that I heard growing up. Specifically, they stress the need for more parental control, the inappropriateness of discussing sexuality, and the dangers of debating race. The same harmful effects I saw in Egypt, I see here: book banning is weakening the democratic process and working against making society more tolerant and cohesive.
Perhaps it is necessary to remind ourselves why we read in the first place. We read to empower ourselves and others. We read to learn perspectives and perhaps to develop our own. We read to understand the power of ideas and the effect they had and continue to have on us as a society. We read to open mental doors and windows of tolerance. We read to challenge ourselves, to reach new heights and understanding. We could disagree with many books, sure, but that is precisely why we read: to critically think about issues and better ourselves and our society in the process. Stated differently, we read to maintain and strengthen the social threads that weave our communities tightly together.
Book-banning in today’s online world is largely a political act. Books may not be available in local libraries, but they remain available on the Internet and in online libraries like the Internet Archive’s, where you can borrow them for free. In a way, the Internet Archive plays a similar role to that of the Internet in pre-revolution Egypt: it is a space where people can read, listen, and watch uploaded works and items compiled in one place. But online libraries aren’t completely safe either. For example, the Goliath of the publishing world, Penguin Random House (PRH), used copyright law as an excuse to effectively ban Maus from the Internet Archive’s digital shelves. PRH made it clear that they wanted to assert total control over this banned book in order to maximize its own profits in the wake of the Tennessee School Board’s decision. This has the same impact on society as book banning.
When societies censor books, they threaten to lose their culture and, in time, their identity. By banning books, societies jeopardize their political and social institutions because books are the primary tool to spread and develop ideas. With the fight to ban books extending to the online world, the threat has become as clear as ever. You could argue that book banning is about many things—the illusion of parental control, the polarization of the public, or disagreement on topics like race, politics, or sex. However, the bottom line remains clear: book bans serve no one, and no society can overcome its issues by banning books.
Lavanya Singh was eager to write lots of code after her freshman year of college, but she knew it was hard to find a place that would give her a chance. Then she landed a spot with the Google Summer of Code (GSoC) program working at the Internet Archive.
Paired with Mark Graham, director of The Wayback Machine, Singh was asked to create a systematic way to archive news sources from all around the world.
“Mark basically gave me that problem and said: ‘Go figure it out,’” she recalls, grateful for the challenge, the tight-knit community at the Internet Archive, and the mentorship provided throughout the project. “The Internet Archive really trusts their interns and gives you an opportunity to do huge scale technical projects that are going to be useful in the long run.”
The experience gave Singh skills and confidence that led to other internships and a job as a software engineer, following graduation this spring from Harvard University with a degree in computer science and philosophy.
For 17 years, GSoC has given more than 18,000 students from 112 countries the chance to learn about programming up close. Google selects students (called “contributors”) and matches them with organizations doing open-source projects. All told, the students have created 40 million lines of code since the program’s inception in 2005. It has helped launch careers, like Singh’s, and provided a pipeline of potential employees for the 746 organizations that have participated. Google recently posted its Google Summer of Code timeline for 2022 for applicants for the paid positions, which last 12 weeks.
“It is truly a benefit and service to students. For some, it can be transformational,” said Singh’s mentor, Graham, of the Internet Archive. “But it also helps us. It’s a way to learn about new talent. And it’s a way for the Internet Archive to increase our visibility and demonstrate that we are part of this community of organizations.”
GSoC provides an infrastructure to match promising programmers with projects that can be difficult to find and is especially relevant now with people working remotely, said Brenton Cheng, a senior engineer with the Internet Archive.
“It’s been an incredible way by which people all over the world can get opportunities to work with companies, creating openings that might not be available to them otherwise,” said Cheng, who has mentored several student contributors over the years.
Staff assign mini-projects designed to give students hands-on experience and a sense of accomplishment. Students are also included in team meetings, invited to give input and present their work, said Cheng.
Recent GSoC projects and contributors:
Rakesh Chinta focused on building advanced features for the existing Chrome extension for the Wayback Machine (2017);
Zhengyue Cheng created a “map” of the web via the Wayback Machine (2018);
Kanchan Joshi improved site navigation for Archive.org (2019);
Giacomo Cignoni made a significant contribution with his BookReader Selection & Dark Mode project. He worked to give public domain works the ability to have text selection over the book page images (2020);
Tabish Shaikh helped improve the adoption of Open Library with his Adoption of BookLovers project – redesigning the Book Page and making it clearer what services were offered (2020);
Nolan Windham worked on the Open Book Genome Project. It centered on the ability for computers and machines to read a book on our behalf, and extract metadata that can then be made publicly useful to the world. Through the process, nearly 10,000 new books were added to the lending system (2021);
Xin Yue Chen focused on linking Wikipedia references to Internet Archive books (2021).
“We’re helping to train the next generation of developers,” Cheng said. “On the flip side, we really believe in our mission. Quite often, the people who work with the Google Summer of Code program continue to contribute with us as volunteers or sometimes even become employees.”
It’s a mutual win and an awesome program that has helped a lot of students find connections with companies, added Cheng. The program is a way for young people to show their initiative and is advertised as a way to “flip bits not burgers” in the summer.
“It’s a chance to contribute to a larger organization and maybe set themselves on a different prospective path to their future,” Cheng said.
Mek, who leads the Open Library team at the Internet Archive, said the four GSoC students he’s worked with have made substantial improvements through their projects.
“We were able to make progress in a variety of different areas that we may not otherwise have had the bandwidth to focus on,” said Mek.
Being involved in GSoC has dramatically increased the number of volunteers who are interested in participating within the Open Library ecosystem. It prompted the Internet Archive to streamline the volunteer page and create an intake form. There has also been an effort to organize and label projects for new volunteers.
The GSoC experience led the Internet Archive to structure its own internship and fellowship opportunities. And it has provided the organization with a means to find qualified staff.
Anish Kumar Sarangi, a student GSoC contributor in 2018, joined the Internet Archive as an employee in May 2020. During his summer experience, Sarangi worked on development of the Chrome extension, “Wayback Machine.” Today it is used by thousands of people to help them archive URLs, access archived content from broken links and perform other functions to help make the web more useful and reliable.
“I gained a lot of knowledge and experience. Everyone was very encouraging and supportive,” said Sarangi, of the summer program. He now works from India in software development for the Internet Archive and has been a mentor with the program himself. His advice to others considering applying: “Please get involved in the community. You can get guidance and grow further in the organization.”
For scholars, especially those in the humanities, the library is their laboratory. Published works and manuscripts are their materials of science. Today, to do meaningful research, that also means having access to modern datasets that facilitate data mining and machine learning.
On March 2, the Internet Archive launched a new series of webinars highlighting its efforts to support data-intensive scholarship and digital humanities projects. The first session focused on the methods and techniques available for analyzing web archives at scale.
“If we can have collections of cultural materials that are useful in ways that are easy to use — still respectful of rights holders — then we can start to get a bigger idea of what’s going on in the media ecosystem,” said Internet Archive Founder Brewster Kahle.
Just what can be done with billions of archived web pages? The possibilities are endless.
Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, shared some of the technical issues libraries should consider and tools available to make large amounts of digital content available to the public.
The Internet Archive gathers information from the web through different methods including global and domain crawling, data partnerships and curation services. It preserves different types of content (text, code, audio-visual) in a variety of formats.
Social scientists, data analysts, historians and literary scholars make requests for data from the web archive for computational use in their research. Institutions use its service to build small and large collections for a range of purposes. Sometimes the projects can be complex and it can be a challenge to wrangle the volume of data, said Bailey.
The Internet Archive has worked on a project reviewing changes to the content of 800,000 corporate home pages since 1996. It has also done data mining for a language analysis project, with custom extractions for Icelandic, Norwegian and Irish translation.
Transforming data into useful information requires data engineering. As librarians consider how to respond to inquiries for data, they should look at their tech resources, workflow and capacity. While such datasets are more complicated to produce, their potential has expanded given the size, scale and longitudinal analysis that can now be done.
“We are getting more and more computational use data requests each year,” Bailey said. “If librarians, archivists, cultural heritage custodians haven’t gotten these requests yet, they will be getting them soon.”
Up next in the Library as Laboratory series:
The next webinar in the series will be held March 16, and will highlight five innovative web archiving research projects from the Archives Unleashed Cohort Program. Register now.
The audio archive contains recordings ranging from alternative news programming, to Grateful Dead concerts, to Old Time Radio shows, to book and poetry readings, to original music uploaded by our users.
Founded in 2005, Librivox is a community of volunteers from all over the world who record audio versions of public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages.
The Live Music Archive is a community committed to providing the highest quality live concerts in a lossless, downloadable format, along with the convenience of on-demand streaming.
The Internet Arcade is a web-based library of arcade (coin-operated) video games from the 1970s through to the 1990s, emulated in JSMAME, part of the JSMESS software package. Containing hundreds of games ranging through many different genres and styles, the Arcade provides research, comparison, and entertainment in the realm of the Video Game Arcade.
“We are big supporters of libraries because they allow equal access to knowledge and preserve culture,” said Wilt, whose independent press based in Minneapolis sells its books at a discount to nonprofits. “From a publishing standpoint, our authors care about being read so we want to get our books to as many people as possible.”
The Internet Archive recently bought the entire catalog of books from 11:11 Press and made them available online for controlled digital lending to one person at a time.
“Honestly, I don’t know why anyone would not want to have their books in a library, especially the Internet Archive, which is more relevant now than it has been any other time,” Wilt said. “It used to be the library of the future. But in our era of remote learning and people working from home, the Internet Archive is the library of the present. You don’t have to go into an actual physical building. It’s available for anyone with an internet connection. It’s probably the most relevant lending institution at the moment.”
“[Internet Archive] used to be the library of the future. But in our era of remote learning and people working from home, the Internet Archive is the library of the present.”
Andrew Wilt, editor, 11:11 Press
In business for four years, 11:11 Press publishes an eclectic mix of titles that Wilt describes as “disruptive literature.” Its authors push the boundaries. Some books have a very heavy, theoretical and academic focus while others are about everyday working people. There are books of poetry, short stories, novels, and hybrid work. The aim is to give exposure to underrepresented voices and offer an alternative to what is produced by mainstream publishers.
“We’re kind of this lighthouse trying to find those people who are actively looking for something that’s new and exciting,” said Wilt.
From the 11:11 Press Catalog
In one of the 11:11 Press “theory fiction” titles, Zer000 Excess, images are “glitching out” within the text, leading the reader to consider what meaning is being created. Jake Reber wrote the book using Microsoft PowerPoint 2007 – the only version of the software with identifiable software features known to produce these “glitches.” Authors like Reber intentionally use these embedded software tools incorrectly in order to get distortion. “Like the early punk bands who put fuzz in their music, we’re trying to add that distortion in the work,” said Wilt.
Human Tetris, by Vi Khi Nao and Ali Raz, presents digital dating as an all-too-honest, newspaper-style set of queer dating profiles. It was written as a collaboration between two different voices building a lattice of interlocking online identities.
The publisher features “dangerous writing,” which uses fiction as the buffer to draw on personal experience. For authors in this genre, fiction is the lie that tells the truth. “We want to encourage writers to go to those uncharted territories of the self. What you find might be hard to look at, but if you pull back the layers, there’s something unique and beautiful there,” Wilt said.
Jinnwoo (Ben Webb) is a writer, musician, visual artist, and author of the book Little Hollywood published by 11:11 Press. It consists of B-grade movie scripts with paper doll cut outs. The idea is to engage the reader by having them cut out the dolls and use the scripts. “Going to those dark places with honesty encourages the reader to be more mindful, more present, which leads to more empathy,” Wilt said.
Did you know? Thanks to the innovative partnership between the Internet Archive and Better World Books—our favorite online bookstore—patrons who browse to the 11:11 Press books at archive.org have a direct link to purchase new copies of the books in print via Better World Books.
In its next catalog, 11:11 Press will be coming out with a 520-page Illustrated Old Testament and corresponding painting. This 9-by-12-inch book, which will sell for $150, is too religious for some and too secular for others, making it a perfect product for a small press, Wilt said. Another upcoming book will be a compilation of short stories by the late Peter Christopher who helped start the dangerous writing movement.
As a small press, Wilt said the focus isn’t to write with marketing in mind but rather for authors to write the stories only they can tell. The hope is for 11:11 Press to create something greater to help benefit society and get people to think in a different way. “Reading authors who courageously face their lives, their past, their future, encourages us, the readers, to do the same,” he said.
Wilt said he anticipates other independent publishers will follow suit in selling their works to the Internet Archive. “Small presses drive innovation. This is where experimentation occurs,” he said. “Our top priority is sharing knowledge.”
The bet, to be revisited a decade and a year later, would be whether the URL of their wager at Long Bets would survive to a point in the semi-distant future.
That is, this day, February 22nd, 2022, (2/22/2022).
Therefore, the Internet Archive shall receive a $1,000 donation from Mr. Keith and Mr. Haughey ($500 apiece), provided from an escrow account that has held the funds since the day of the wager. (We shout out to the Bletchley Park Trust, a worthwhile historical organization, who will not be getting the donation but who are deserving of yours.)
It would be easy enough to declare it a win for the idea of “the web” and that regardless of concerns brought up about the Internet’s ongoing issues, we can still find hope. So certainly, let us all applaud that things worked this way, and the URL’s 11-year consistency is a bright beam of light, online.
In many ways, however, the bet is at best a bittersweet victory, and in its darkest interpretation, a small oasis in a desert.
To understand The Long Bets, you need to understand The Long Now.
The Long Now Foundation is a non-profit geared towards projects and approaches to thinking that chronologically leave the average human lifespan in the dust, focusing on 10,000-year timelines and on sustaining cultural contexts for a hundred lifetimes and beyond.
Currently, the Long Now and its ideals are expressed in both a very nice performance and drinking space called The Interval, and a number of stylish projects and websites to bring this realm of thinking into focus.
The most prominent and first major project was the Clock of the Long Now, a project to make a time-accurate clock that would function for 10,000 years. The as-yet incomplete project goal is to build a clock deep in a mountain range and set it off, ticking occasionally (but on time, doing so) for the next ten millennia.
Long Bets is a facilitation of long-term thinking: a neutral, fair and equitable way for years-in-the-future bets to be made between parties, each contributing funds towards the prize. It is traditional that the recipients of the prizes be organizations not run or controlled by either bettor.
Browsing the betting page, the bets range from the humorous to the aspirational, from specific sports outcomes to predictions around space travel, vehicle autonomy and economics. They’re a joyride of thought and conversation starters, as they’re meant to be.
And among them is Bet #601.
Jeremy Keith and Matt Haughey are both veterans of The Web as it has historically been described; each has made his voice heard by crowds online and off, describing the nature of websites. Their careers have (deservedly) benefitted greatly from the power of interlinked websites.
They both recall the start of the world wide web, and both internalized the rules and mores that followed its birth. They were well-qualified to debate the longevity of URLs and the position that a specific URL would hold across time.
That said, Keith was skeptical. Haughey was optimistic.
Like a lot of its neighbors, Bet #601 is too clever by half; the bet states that it is won or lost depending on the availability of the bet at the URL the bet is hosted at. That is, two situations exist to judge the outcome: Either the URL https://longbets.org/601 exists, at which point the bet is lost, or it does not exist, at which point the bet is won.
(Strikingly, if the bet had been won, the Internet Archive would possibly be the only place to browse the site in its original form, where it would have then helped prove the funds should go to Bletchley Park Trust. The continued reliance on the Wayback Machine as the vault of the Web’s lost memories would have persisted, in a very sharp and slightly less financially-beneficial way. Such is the price of memory.)
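The resolution rule is simple enough to write down in a few lines. The Python below is only an illustrative sketch, not anything Long Bets runs: the function name `bet_601_outcome` is made up for this example, and treating a 2xx HTTP status (after any redirects) as "the URL exists" is an assumption, since the bet itself does not define existence in those terms.

```python
from typing import Tuple


def bet_601_outcome(final_status_code: int) -> Tuple[str, str]:
    """Return (winning bettor, charity receiving the stake) for Bet #601.

    Assumption for illustration only: the URL "exists" if the final
    HTTP status for longbets.org/601 is in the 2xx range.
    """
    url_exists = 200 <= final_status_code < 300
    if url_exists:
        # The prediction that the URL would vanish fails, so the
        # optimist wins and the stake goes to the Internet Archive.
        return ("Haughey", "Internet Archive")
    # The URL is gone, so the skeptic's prediction holds.
    return ("Keith", "Bletchley Park Trust")
```

Under that assumption, a plain `200 OK` on the day of judgment sends the $1,000 to the Internet Archive, while a `404` would have sent it to the Bletchley Park Trust.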
For the record, here are the statements made by each bettor about their arguments for the wager:
Jeremy Keith
Jeremy Keith: “Cool URIs don’t change” wrote Tim Berners-Lee in 01999, but link rot is the entropy of the web. The probability of a web document surviving in its original location decreases greatly over time. I suspect that even a relatively short time period (eleven years) is too long for a resource to survive. I would love to be proven wrong.
Matt Haughey
Matthew Haughey: Though much of the web is ephemeral in nature, now that we have surpassed the 20 year mark since the web was created and gone through several booms and busts, technology and strategies have matured to the point where keeping a site going with a stable URI system is within reach of anyone with moderate technological knowledge. My oldest sites are going on 13 years old at the time of this bet and the original URL scheme still functions via 301 redirects to a final format we selected about six years ago.
This should be it.
But it’s worth noting how the context of this bet has changed over time. And issues with the continued evolution of the web strike at the heart of the point the bet was trying to make.
THE URL:
The Long Now Foundation, intending to maintain its footing for as long as absolutely possible, has a strong vested interest in its URLs staying stable. That interest shows in its hosting structure, the setup of the webpages themselves, and its clean, static URLs (longbets.org/601 is a very simple address, lacking any ornamentation or dependence on programming language extensions or dynamic rendering). The domain name longbets.org is registered until June of 2022 as of this writing, but was first registered in June of 2001, twenty years ago, which bodes well for continued survival.
If you’re going to bet that a URL is going to stick around, on a website run by an organization that expresses its character by the longevity of its projects, staking your bet on a specific URL from that organization is a pretty safe bet.
THE WEB:
Both of the parties in the bet clearly think of “the web” as being a set of interacting links between websites, but even by 2011, the idea of a “website” was beginning to experience direct collision with the ever-centralizing, ever-shifting audience of online life. Mobile access was a quirk in the 1990s and an oddity growing into a majority in the 2000s, and now, in the present day, phones with screens are the “home computer” of vast percentages of internet patrons.
In the interconnection of the world, it is harder and harder to think of a “website” where “platforms” rule the roost. A user is more likely to have an account name, or a public identity, than to ever utter the phrase “http” in their daily activity, or maybe even their year. The clear goal of many firms is to dissolve the consideration of the URI or URL, with many of the previous protocols of the earlier Web forgotten. The question becomes less of “will this URL survive” and more of “will the idea of the URL survive?”
THE IDEA OF LONGEVITY AND CHANGE FOR DIGITAL DATA:
Finally, the overarching fact of the situation is that sites like Long Bets are part of a philosophy of the web that is rapidly shrinking. Points of data and dependable signifiers of content and individuals were once the destination. That’s long changed; they are but stops along the way, flotsam and jetsam that ride in nebulous platforms that dominate online life. While Jeremy Keith and Matt Haughey maintain personal websites, they have rapidly become like homesteads that jut out in the center of towering skyscrapers and apartment blocks. Future generations will think of “the web” as much as they think of “the roads”; intensely interesting to a few, below the watermark of consciousness to the rest.
As we move into this even-more-ethereal version of the Web, where objects, materials and locations possess data as much as pages and links ever did, the Internet Archive will do its best to keep up and grow to match the challenge.
From the hundreds of libraries using Controlled Digital Lending (CDL) to meet the needs of their communities to the many working groups and vendors investigating its potential, it’s clear that this innovative library practice is on the rise.
Want to learn more about what’s going on across the community? Join us for a public webinar at 11am PT on March 10 to hear from active projects, including:
Controlled Digital Lending Implementers group;
NISO’s grant from The Mellon Foundation to support the development of a consensus standards framework for implementing CDL;
Boston Library Consortium’s efforts around CDL for interlibrary loan;
CDL Co-Op (ILL & resource sharing);
Internet Archive, with an update on the publishers’ lawsuit against CDL & libraries.
Presentations will be followed by a facilitated Q&A. Whether you are new to Controlled Digital Lending or have already implemented it in your library, this session will give everyone an update on where the community is today & where it’s going.
Community Update: Controlled Digital Lending March 10 @ 11am PT / 2pm ET – Register
On our current web, most platforms are controlled by a central authority—a company, government, or individual—that maintains the code, data and servers. Ultimately, consumers must trust that those central authorities will do what is in their best interest.
“In order to have ease of use, we have ceded control to these big platforms, and they manage our access to information, our privacy, our security, and our data,” explained Wendy Hanamura, Director of Partnerships at the Internet Archive, who led the workshop.
In contrast, the decentralized web is built on peer-to-peer technologies. Users could conceivably own their data. Rather than relying on a few dominant platforms, you could potentially store and share information across many nodes, addressing concerns about censorship, persistence and privacy.
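To make the idea of storing and sharing across many nodes a little more concrete, here is a minimal Python sketch of content addressing, one common building block of peer-to-peer storage. It is a toy model under stated assumptions, not how any particular decentralized network is implemented: the `Node` class and the `content_id` and `fetch` functions are hypothetical names invented for this illustration, and SHA-256 is simply a convenient choice of hash.

```python
import hashlib
from typing import Dict, List, Optional


def content_id(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """A hypothetical peer that stores blobs keyed by their content ID."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self.blobs[cid] = data
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        return self.blobs.get(cid)


def fetch(cid: str, nodes: List[Node]) -> Optional[bytes]:
    """Ask each node in turn; accept only bytes whose hash matches the ID."""
    for node in nodes:
        data = node.get(cid)
        if data is not None and content_id(data) == cid:
            return data
    return None  # no node holds an intact copy
```

Because the identifier is derived from the content, a reader can verify any copy without trusting the node that served it, and the item stays reachable as long as at least one node still holds an intact copy. That is the sense in which spreading copies across independent hosts can help with persistence and censorship resistance.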
“It is still very early days for the decentralized web,” Hanamura said. “All of us still have time to contribute and to influence where this technology goes.”
At the event, Mai Ishikawa Sutton, founder & editor at COMPOST Mag, explained how her publication can be viewed over the decentralized web using IPFS and Hypercore, while using Creative Commons licensing to openly share its contents. In addition, Paul Frazee demonstrated Beaker Browser, an experimental browser that allows users to build peer-to-peer websites on the decentralized web.
The current system, Web 2.0, relies on content living on web servers in a certain location.
“This is a problem because [publishers] want to change it. They want to update it. They … go out of business. They want to merge with somebody. And it goes away,” said Brewster Kahle, founder of the Internet Archive, noting that the average life of a web page is 100 days. The Wayback Machine was built to back up those web pages after the fact, but there is a need to build better decentralized technology that preserves a copy as the content is created, he said. “The Web should have a time axis.”
According to Kahle, a future decentralized web would look much the same to the user, but could build features such as privacy, resilience and persistence right into the code. It could also create new revenue models for creative works. For example, a decentralized web could enable buyers to make direct micropayments to creators rather than licensing works through iTunes or Amazon.
“This is a good time for us to try to make sure we guide this technology toward something we actually want to use,” Kahle said. “It’s an exciting time. We in the library world should keep focused on trying to make robust information resources available and make it so people see things in context. We want a game with many winners so we don’t end up with just one or two large corporations or publishers controlling what it is we see.”
From web archives to television news to digitized books & periodicals, dozens of projects rely on the collections available at archive.org for computational & bibliographic research across a large digital corpus. This series will feature six sessions highlighting the innovative scholars that are using Internet Archive collections, services and APIs to support data-driven projects in the humanities and beyond.
Want to participate? Register below! Do you have a research project that uses materials from the Internet Archive? We’re offering a Lightning Talks session at the end of our series to give more people an opportunity to share their research with the world. Simply complete our online form to be considered. Submission guidelines.
Many thanks to the program advisory group:
Dan Cohen, Vice Provost for Information Collaboration and Dean, University Library and Professor of History, Northeastern University
Makiba Foster, Library Regional Manager for the African American Research Library and Cultural Center, Broward County Library
Mike Furlough, Executive Director, HathiTrust
Harriett Green, Associate University Librarian for Digital Scholarship and Technology Services, Washington University Libraries
Session Details & Registration:
March 2 @ 11am PT / 2pm ET
Supporting Computational Use of Web Collections
Jefferson Bailey, Internet Archive; Helge Holzmann, Internet Archive
What can you do with billions of archived web pages? In our kickoff session, Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, will take attendees on a tour of the methods and techniques available for analyzing web archives at scale.
Applications of Web Archive Research with the Archives Unleashed Cohort Program
Launched in 2020, the Cohort program is engaging with researchers in a year-long collaboration and mentorship with the Archives Unleashed Project and the Internet Archive, to support web archival research.
Web archives provide a rich resource for exploration and discovery! As such, this session will feature the program’s inaugural research teams, who will discuss the innovative ways they are exploring web archival collections to tackle interdisciplinary topics and methodologies. Projects from the Cohort program include:
AWAC2 — Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset—Valérie Schafer (University of Luxembourg)
Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves—Shana MacDonald (University of Waterloo)
Mapping and tracking the development of online commenting systems on news websites between 1996–2021—Robert Jansma (University of Siegen)
Crisis Communication in the Niagara Region during the COVID-19 Pandemic—Tim Ribaric (Brock University)
Viral health misinformation from Geocities to COVID-19—Shawn Walker (Arizona State University)
Hundreds of Books, Thousands of Stories: A Guide to the Internet Archive’s African Folktales
Laura Gibbs, Educator, writer & bibliographer
Join educator & bibliographer Laura Gibbs as she gives attendees a guided tour of the African folktales in the Internet Archive’s collection. Laura will share her favorite search tips for exploring the treasure trove of books at the Internet Archive, and how to share the treasures you find with colleagues, students, and fellow readers. Laura will demo how you can blog and tweet, how you can fit hundreds of books into a slideshow and squeeze thousands of stories into a spreadsheet, and how you can even publish your own book-of-books, creating a digital bibliography guide. After learning how Laura created the “Reader’s Guide to African Folktales at the Internet Archive,” maybe you’ll be inspired to make a reader’s guide of your own!
Television as Data: Opening TV News for Deep Analysis and New Forms of Interactive Search
Roger Macdonald, Founder, TV News Archive; Kalev Leetaru, Data Scientist, GDELT
How can treating television news as data create fundamentally new kinds of opportunities for both computational analysis of influential societal narratives and the creation of new kinds of interactive search tools? How could derived (non-consumptive) metadata be open-access and respectful of content creator concerns? How might specific segments be contextualized by linking them to related analysis, like professional journalist fact checking? How can tools like OCR, AI language analysis and knowledge graphs generate terabytes of annotations making it possible to search television news in powerful new ways?
For nearly a decade, the Internet Archive’s TV News Archive has enabled closed captioning keyword search of a growing archive that today spans nearly three million hours of U.S. local and national TV news (2,239,000+ individual shows) from mid-2009 to the present. This public interest library is dedicated to helping journalists, scholars, and the public compare, contrast, cite, and borrow specific portions of the collection. Using a range of algorithmic approaches, users are moving beyond simple captioning search towards rich analysis of the visual side of television news. In this session, Roger Macdonald, founder of the TV News Archive, and Kalev Leetaru, collaborating data scientist and GDELT Project founder, will report on experiments applying full-screen OCR, machine vision, speech-to-text and natural language processing to assist exploration, analysis and data visualization of this vast television repository. They will survey the resulting open metadata datasets and demonstrate the public search tools and APIs they’ve created that enable powerful new forms of interactive search of television news. They will also show what it looks like to ask questions of more than a decade of television news.
Analyzing Biodiversity Literature at Scale
Martin R. Kalfatovic, Smithsonian Library & Archives; JJ Dearborn, Biodiversity Heritage Library Data Manager
Imagine the great library of life, the library that Charles Darwin said was necessary for the “cultivation of natural science” (1847). And imagine that this library is not just hundreds of thousands of books printed from 1500 to the present, but also the data contained in those books that represents all that we know about life on our planet. That library is the Biodiversity Heritage Library (BHL). The Internet Archive has provided an invaluable platform for the BHL to liberate taxonomic names, species descriptions, habitat descriptions and much more. Connecting and harnessing the disparate data from over five centuries is now BHL’s grand challenge. The unstructured textual data generated at the point of digitization holds immense untapped potential. Tim Berners-Lee provided the world with a semantic roadmap to address this global deluge of dark data and Wikidata is now executing on his vision. As we speak, BHL’s data is undergoing rapid transformation from legacy formats into linked open data, fulfilling the promise to evaporate data silos and foster bioliteracy for all humankind.
Martin R. Kalfatovic (BHL Program Director and Associate Director, Smithsonian Library and Archives) and JJ Dearborn (BHL Data Manager) will explore how books in BHL become data for the larger biodiversity community.
Do you have a quick project briefing you’d like to share in two minutes or less? Fresh research, cool collections and wild ideas welcome! Submit a proposal now to give a live or pre-recorded lightning talk at our closing session. Submission guidelines.