Since the beginning of (Twitter) time, the United States Library of Congress, the official research library that serves the United States Congress, has archived all tweets. Yes, all tweets. Every single tweet, ranging from @BarackObama’s tweet that received over 4.6 million likes, to the latest update on what Aunt Betty ate for lunch, has been archived since 2006.
As of midnight on December 31, 2017, the Library of Congress has stopped the practice.
Document the emergence of online social media for future generations
“It was a landmark decision for the Library of Congress to begin collecting Twitter posts… The very notion that the research arm of the United States Congress—which also happens to be the country’s oldest federal cultural institution and the largest library in the world—had interest in the public prattle of people publishing thoughts and links in real time, at 140 characters at a time, seemed farfetched. But the deal was done. With help from Twitter itself, the institution acquired all public tweet text…” – Fortune
According to the Library of Congress, “The Library saw an opportunity to document the emergence of online social media for future generations.” The original Gift Agreement between Twitter and the Library of Congress provided the complete collection of all tweets. Although the collection has been archived, it is currently embargoed and not open to the public.
Scaling back the collection
According to The New York Times, “After archiving every single public message posted on Twitter since the social media platform was introduced in March 2006, the institution will soon scale back its approach to collecting them.”
“After this time, the Library will continue to acquire tweets but will do so on a very selective basis,” – The Library of Congress.
The desire to scale back is not surprising given the volume of Twitter traffic. It’s estimated that around 6,000 tweets are posted every second, which corresponds to approximately 200 billion tweets per year. When the original agreement was signed, the traffic was just over 19 billion tweets per year.
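The arithmetic behind that estimate can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a constant rate of 6,000 tweets per second and a 365-day year; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the tweet volume quoted above.
# Assumptions: a constant 6,000 tweets per second and a 365-day year.
TWEETS_PER_SECOND = 6_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Figure quoted above for the year the original agreement was signed.
TWEETS_PER_YEAR_AT_SIGNING = 19e9

tweets_per_year = TWEETS_PER_SECOND * SECONDS_PER_YEAR
growth_factor = tweets_per_year / TWEETS_PER_YEAR_AT_SIGNING

print(f"Tweets per year: about {tweets_per_year / 1e9:.0f} billion")
print(f"Growth since the agreement: roughly {growth_factor:.0f}x")
```

Under these assumptions the total comes to about 189 billion tweets per year, which rounds to the 200 billion quoted above and is roughly ten times the volume at the time of signing.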
The decision also reflects the changes in the social media platform since its inception, such as increased use of images. The Library of Congress archived only the text portions of the tweets. Images, GIFs, embedded videos, links, and metadata associated with each tweet have not been saved.
One of this generation’s most significant legacies
In the future, the Library of Congress plans to provide public access to all tweets that have been archived. What gems are hidden among the tweets? What will we learn about popular sentiment?
For now, the embargo on the archived tweets will remain in place. Once the embargo is lifted, imagine the analysis that could take place. What will we learn when we analyze this treasure trove of historic tweets? In the blog post Analyzing Twitter with MATLAB, Toshi shares some ideas on how to use the Twitter API to analyze tweets with MATLAB. The analyses and MATLAB code in that post include:
- Sentiment Analysis
- Tweet Content Visualization
- Does Follower Count Really Matter? Going Viral on Twitter
- Visualizing the Retweet Social Graph
“The Twitter Archive may prove to be one of this generation’s most significant legacies to future generations. Future generations will learn much about this rich period in our history, the information flows, and social and political forces that help define the current generation,” stated the United States Library of Congress in the white paper that explained its decision to selectively archive tweets.