Author: Clare Lanigan
The challenge of preserving social media is an important topic in the contemporary data landscape. In the case of Twitter, millions of tweets are posted every day, and the conversations that happen on Twitter form an essential record of our time; but like all records, this conversation can disappear if not adequately preserved. Vint Cerf of Google recently spoke to the media about the danger of a “digital dark age” as current storage methods become obsolete. To most people, especially those working in digital preservation, this came as no surprise.
Finding sustainable, efficient ways to gather, preserve and provide access to social media archival data is the driving force behind the Social Repository of Ireland, a joint project of the Digital Humanities and Journalism group at the Insight Centre for Data Analytics and the Digital Repository of Ireland at the Royal Irish Academy. Over the past year or so, the Social Repository of Ireland has investigated the feasibility of developing an effective social media archiving tool for Twitter data relating to significant events in Ireland. During our research, we have identified some important issues that anyone thinking of setting up a Twitter archive needs to be aware of. In this article, we look at those issues, examining the historical relationship between developers and Twitter, changes to the Developer Rules over time, and how other projects have fared when attempting to gather and preserve tweet data in a social media archive.
Developers and Twitter
Twitter makes its API available to developers to allow them to build tools that work with Twitter data. Before 2011, most projects of this kind (for example, Gnip, Topsy and Datasift) operated independently, but since then many of them have become official Twitter Partners. This is the result of several changes in Twitter’s Developer Policy over the years, changes which have alternately delighted and devastated developers. The most (in)famous of these came in 2011, when Twitter made it much more difficult for third-party developers to gather and syndicate tweet data in any meaningful way. To many, these changes were surprising, considering the relative openness and freedom Twitter had allowed developers prior to 2011.
2006 – 2010: the open years
Between 2006 (the year of Twitter’s birth) and 2010 a number of tools and projects, both proprietary and openly accessible, used Twitter’s API to develop scraper and aggregation tools. At that time, Twitter’s developer policy did not explicitly prohibit this kind of use. Projects such as Storify, Topsy and TwapperKeeper were launched. During this period, Twitter had a stronger focus on open data and making public tweets reasonably available. This approach was centred on the idea of Twitter content as an archive of our time: a ‘legacy approach’. It reached its zenith in 2010, when Twitter signed an agreement with the Library of Congress to archive the entire Twitterstream from 2006 onwards, including all future tweets. This appeared to reflect a commitment to the principles of open data and archival transparency.
Changes to the Developer Rules, 2011
The ‘legacy approach’ described above appeared to change in 2011, when Twitter revised its Developer Rules, with noticeably less emphasis on making tweet data openly accessible to applications not owned by Twitter. This change may have been prompted by the worldwide recession, which was at its height at the time. While it is not possible to say for certain what Twitter’s motivations were, the company may have hoped to gain new revenue streams by partnering with and monetising the various tweet scrapers and aggregators, such as Topsy, TwapperKeeper and Datasift, that third-party companies and programmers had developed. Many larger tools and projects became official Twitter partners (e.g. Gnip, Hootsuite).
The text of the 2011 Developer Rules is no longer available, but the essence of the changes was that third-party apps and tools were no longer permitted to ‘replicate the core Twitter experience’. This was described in more detail by Ryan Sarver, at the time Director of Platform at Twitter:
“Developers have told us that they’d like more guidance from us about the best opportunities to build on Twitter. More specifically, developers ask us if they should build client apps that mimic or reproduce the mainstream Twitter consumer client experience. The answer is no.”
Sarver also made explicit Twitter’s desire to create a ‘less fragmented’ experience for Twitter users by reducing the number of ‘consumer client apps that are not owned or operated by Twitter’.
Third-party developers were not explicitly barred from gathering or syndicating Twitter data, but they were expected to keep within a certain size (‘size’ in this context referring to the number of user tokens an app needed on a daily basis). The number of user tokens allowed per day varied from 50,000 to 100,000, and the new Developer Rules stated that apps wishing to extend their user tokens needed to contact Twitter for permission. Even then, it was not specified what exactly an app needed to do to gain that permission. The rules seemed vague, perhaps to ensure that Twitter would retain control over as many apps and tools as possible.
Realistically, Twitter was not able to shut down widely used apps such as TweetDeck, even though they were technically violating the new Developer Rules. Instead, Twitter appeared to adopt a policy of partnership: TweetDeck, Hootsuite, Datasift and Gnip were among the products that became Certified Product Partners.
Part of the motivation behind the new rules may have been Twitter’s need to monetise users’ tweets. Around the time of the Developer Rules change, Twitter suspended products developed by the company UberMedia that it believed were violating its trademarks and the privacy of users. Crucially, in its takedown notice, Twitter stated that the products were ‘changing the content of users’ Tweets in order to make money.’ Combined with the new restrictions placed on third-party tools and apps, and the Certified Partnership Program for apps that had already passed a certain size, this focus on monetisation indicated that Twitter wished to keep the financial profit from user tweets within the company itself.
Many developers were unhappy with the new rules. Some speculated that their severity would drive innovators away from Twitter, but the service remains as popular as ever, so any project that wishes to analyse data relating to news events is still obliged to rely on Twitter’s API.
The Developer Rules have been relaxed slightly since 2011, but they remain somewhat restrictive for third-party apps. Many of the projects that shut down in 2011 did not restart; in some cases this may have had as much to do with the end of their funding streams as with Twitter’s restrictions. Others were absorbed into larger products, e.g. TwapperKeeper into HootSuite (which eventually partnered with Twitter).
Since then, Twitter has generally not interfered with data scrapers that operate commercially and behind a paywall, even those that are not official Twitter Partners. Its 2014 purchase of Gnip, its largest proprietary data reseller and already a leader in the field, appears to represent a decision by Twitter to take complete financial control of the reselling of its data.
Data scraping and collection carried out by non-commercial or research tools and projects is still potentially vulnerable, as shown by the case of ScraperWiki, a tool that allows users to build basic data scraping tools without requiring programming knowledge. Although Twitter does allow data gathering for non-commercial purposes, the Developer Rules are remarkably vague as to what constitutes ‘syndication’ (see below for the relevant text from the Rules). Twitter’s own Archive service (introduced in 2012) appears to hold the non-commercial monopoly on making ‘human-readable’ datasets of users’ own accounts and searches available. However, services such as the Internet Archive continue to make datasets available in raw, unstructured form without attracting Twitter’s ire, probably because the average user has little use for unstructured data in a non-human-readable format such as JSON.
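To illustrate the distinction between a raw dump and a ‘human-readable’ dataset, the following is a minimal sketch (not code from any of the projects discussed) that turns a newline-delimited file of raw tweet JSON into readable summary lines. It assumes the classic v1.1-style tweet layout, with fields such as created_at, text and user.screen_name, and a hypothetical file name.

```python
import json

def summarise_dump(path):
    """Print a human-readable line for each tweet in a newline-delimited
    JSON dump (the kind of raw, unstructured file discussed above).
    Field names assume the classic v1.1-style tweet JSON layout."""
    with open(path, encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)
            user = tweet.get("user", {}).get("screen_name", "unknown")
            created = tweet.get("created_at", "unknown date")
            text = tweet.get("text", "")
            print(f"@{user} ({created}): {text}")

if __name__ == "__main__":
    summarise_dump("tweets.jsonl")  # hypothetical dump file
```

Even a few lines like these are enough to turn an opaque dump into something an ordinary user could browse, which is precisely the kind of presentation Twitter’s rules are sensitive about.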
The current situation
According to the current (2015) Twitter Developer Policy, tools and projects may gather Twitter data, but there are restrictions on what may be done with it. As has always been the case, many of these restrictions are in place to protect users’ privacy, to prevent compromise of Twitter’s product and to prevent the distribution of spam. However, section I, part 6 of the Policy places restrictions on the number of tweets a tool may gather and on how that information may be distributed. The section states:
“If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.” (Twitter Developer Policy Section I, part 6)
This statement is somewhat contradictory: the first part prohibits users of tweet-gathering tools from having open access to complete tweet data, while the second part allows that data to be accessed, but only by ‘non-automated’ means (manually downloading spreadsheets or PDFs of data) and only up to 50,000 public tweets per day. It is essentially a moderate climb-down from the 2011 rules: an acknowledgement by Twitter that social media data cannot be completely firewalled, and that, while it is in Twitter’s interest to streamline the user experience as much as possible, some third-party development against the API is still going to happen.
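In practice, the IDs-only rule gave rise to the ‘dehydrate/rehydrate’ pattern: a project publishes plain tweet IDs, and anyone who wants the full content fetches it again through the API. The following is a minimal sketch of that pattern, not an official workflow; it assumes the v1.1 statuses/lookup endpoint (which accepted up to 100 tweet IDs per request) and a valid application bearer token, and the file name ids.txt is hypothetical.

```python
import requests

# Legacy v1.1 endpoint for looking up tweets by ID (assumption: still
# reachable with an app-only bearer token, as it was at the time of writing).
LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def rehydrate(tweet_ids, bearer_token):
    """Fetch full tweet objects for a list of shared tweet IDs,
    100 IDs per request (the documented limit for statuses/lookup)."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    tweets = []
    for start in range(0, len(tweet_ids), 100):
        batch = tweet_ids[start:start + 100]
        response = requests.get(
            LOOKUP_URL,
            headers=headers,
            params={"id": ",".join(batch)},
        )
        response.raise_for_status()
        tweets.extend(response.json())
    return tweets

# Hypothetical usage: ids.txt holds one tweet ID per line, as distributed
# under the IDs-only rule; TOKEN is an app-only bearer token.
# with open("ids.txt") as f:
#     ids = [line.strip() for line in f if line.strip()]
# archive = rehydrate(ids, TOKEN)
```

One consequence of this design is that deleted or protected tweets simply drop out of the rehydrated set, which is exactly the privacy behaviour Twitter wants the IDs-only rule to preserve.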
Who made it, who didn’t
The following are some examples of tools and projects that have been banned from scraping Twitter data since 2011, and of some that survived the ‘cull’ (this list is not exhaustive). While the specific reasons for shutdown or survival vary, there are some common threads in many of these cases.
TwapperKeeper: A JISC-funded tool for searching, collating and exporting user tweets. It was designed for individual users and researchers and operated as a Twitter archiving service from 2010 to 2011. In 2011, Twitter charged it with violating the Developer Rules under the syndication of content clause: Twitter classified exporting tweets in a usable format as syndication, and the service was shut down. The product was absorbed into HootSuite, and the open source version still provides access to unstructured raw tweet data, similar to the Internet Archive.
Web Ecology Project 140kit: A project funded by Harvard University and started in 2007 with the aim of aggregating and annotating the Twitterstream for researchers to use. While it ceased operations in 2011, this may have had as much to do with its funding coming to an end as with any violation of the 2011 Developer Rules. It may also have challenged Twitter’s anti-spam rules: in 2009 the Web Ecology Project held a competition inviting developers to create a Twitter spambot.
ScraperWiki: A more recent (2014) example of a tool being shut down by Twitter for Developer Rules violations. ScraperWiki is a data crawling and harvesting tool that allows users to export and manage social media data as easy-to-use graphics. It continues to provide its service to users collecting data from sources other than Twitter, but as of 2014 it cannot scrape and aggregate Twitter data, due to violation of the syndication of content clause. ScraperWiki themselves speculated on some possible reasons for this on their blog. As well as reflecting Twitter’s increasing focus on the market, the shutdown may have been triggered by concerns about privacy, because a harvesting tool may not be able to keep pace with real-time tweet deletions. ScraperWiki also speculated that Twitter was keenly aware of the gap between business use of the Twitter firehose and the data-gathering needs of ordinary users, and may have targeted ScraperWiki because it saw the tool as filling the ‘ordinary user’ remit.
Comparing the tools and services that survived 2011, the common factor seems to be the lack of free, easy access to a human-readable presentation of the collected data. For example, ARCOMEM, an FP7-funded European Commission project geared towards using digital and web archives to enhance community memory, was not challenged, perhaps because a professional or university login was required to view the collected data and metadata. The Tweepository, a project developed by the University of Southampton as part of its ePrints digital repository, also did not fall foul of the 2011 rules, perhaps because of the ‘wall’ of a university login between the user and the data. While neither of these services charged a fee, the ‘distancing’ effect of an institutional login appeared to allow them to stay on just the right side of Twitter’s vague syndication rules.
Conclusion
Twitter data is the social archive of the early 21st century. No social media archive can afford to neglect the problems of collecting, preserving and making this data available. When it comes to scalability, projects such as the Social Repository of Ireland, with a remit limited to one country, have a better chance of developing tools to manage the data than huge-scale projects such as the Library of Congress Twitter archive; indeed, the Library of Congress has yet to devise a workable solution for providing access to its Twitter archive.
However, scalability and access mean nothing if the data cannot be archived in the first place. Because Twitter is a private company, archivists and programmers are subject to its Developer Policy. From an approach prioritising open data and shared access, the company appears to have shifted towards a more market-centred attitude in recent years. But even Twitter could see that excessively restrictive amendments to the Developer Rules were unsustainable in the long term. Occasionally, though, it still likes to exercise a little muscle, as in the case of ScraperWiki. The recency of the ScraperWiki shutdown serves as a reminder to tweet-collecting projects to remain mindful of possible restrictions on their activities.
We needn’t throw our hands up in despair. It’s a shame that the freer, open-data approach is no longer the dominant mood at Twitter, but all is not bleak for developers wishing to find ways to gather and manage Twitter data. Certain restrictions (for example, on amounts of data gathered, and on ease of access) may have to be put in place, but ultimately there is still scope for ‘bona fide’ researchers to gain access to the incredibly rich resource that is the ever-changing Twitterstream.