Many digital humanities projects begin with a well-defined and easily accessible corpus. One chooses an interesting novel, poetry collection, or other work, downloads an ebook or checks out a printed copy from the library, and gets to work on the text; nothing stands between the researcher and his or her object. Our project was different. We chose to study a corpus that did not really exist—a special collection of tweets—so we had to create it before we could begin our analysis. We accomplished this by scraping data from Twitter, a process that presented significant technical challenges and was not entirely successful. Below I describe the difficulties we encountered and our approaches to managing them.
We knew from the start that this project entailed nontrivial programming tasks. Some DH projects are digital just because they use technology to automate tedious tasks or to process information in a manner that is (apparently) more rigorous than that of the traditional humanities scholar. Other DH projects are essentially digital: digital technologies do not merely enhance them, but make them possible to begin with. Our goal of analyzing an overwhelming number of online postings obviously falls into the latter category. We planned to use our own software to collect the corpus—i.e., scrape Twitter—and convert the tweets into an XML format amenable to the analysis and transformation technologies that we learned in class (XPath, XSLT, XQuery, etc.). We chose to write this software in Python because of my familiarity with the language, its flexibility and ease of use, and its facility for manipulating text.
Getting data from Twitter is more difficult than one might think. Since Twitter offers both search and streaming application programming interfaces (APIs), one could reasonably conclude that Twitter provides third-party developers with straightforward access to past tweets, and that is what we expected to find as we began developing our scraper. We soon found that this impression was too optimistic. The streaming API provides only realtime data, and we wanted past tweets, so that was out. Similarly, the search API tantalizingly offers access to past tweets, but permits searching only a small, recent window of time, not the five years that we needed. A data vendor with close ties to Twitter (Gnip) offers an API that promises to do exactly what we needed, but it is an expensive commercial product targeted at marketing research companies; "Professor JR" was denied a free trial or academic license. Having thus exhausted Twitter's options for developers, we concluded that we needed to devise our own data-collection method.
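To make the limitation concrete, here is a minimal sketch of a historical query against the search API, written with the requests and requests_oauthlib libraries; the credentials and search term are placeholders, not our actual values:

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials; Twitter requires a registered application.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Ask the search API for tweets from before 2012. The endpoint accepts
# the request, but its index reaches back only about a week, so older
# tweets are simply never returned.
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    params={"q": "SOME_TARGET_WORD", "until": "2012-01-01", "count": 100},
    auth=auth,
)
print(len(resp.json()["statuses"]))  # 0: the past is out of reach
```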
We turned our focus to Twitter's "Advanced Search" page, which returns tweets in exactly the way we wanted. This page seemed like the key to the data we sought, but the search results displayed there are of course intended for human beings instead of computer programs, so we needed to somehow take this human-friendly display, capture the important information, and get it into a machine-friendly format. Oftentimes the best approach to this sort of task is to simply copy the desired information from the site's pages. One can easily download a page, extract the data from the HTML, and then store it for later analysis. The complexity of Twitter's site precluded this straightforward method. The search results page is an "infinite scroller" whose bare HTML contains nothing of interest; instead, content is loaded dynamically by Javascript that fetches a JSON file containing more tweets whenever the user reaches the bottom of the page.
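For a conventional, static site, that straightforward method amounts to only a few lines; the sketch below (the query and the CSS selector are illustrative) shows the approach that Twitter's dynamic page defeats:

```python
import requests
from bs4 import BeautifulSoup

# Download the search page and parse its HTML, as one would for a static site.
resp = requests.get("https://twitter.com/search", params={"q": "SOME_TARGET_WORD"})
soup = BeautifulSoup(resp.text, "html.parser")

# On a static site this would collect the content; here the selector comes
# up empty, because the tweets are injected later by Javascript.
tweets = soup.select("p.tweet-text")
print(len(tweets))
```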
To scrape the page, we targeted this JSON-fetching behavior, mimicking the page's requests in order to intercept the JSON files containing the tweets. Examining the page with Chrome's developer console, I noticed that the page's Javascript made a request each time the user scrolled to the bottom; importantly, these requests were made to URLs that followed a pattern based on the ID numbers of the tweets contained in the last file returned. Because of this pattern, one could predict the next URL in the sequence, which would in turn give one the key to the next URL, and so on, recreating the entire chain of file requests. This discovery promised a simple means of collecting the data we needed, and I wrote a program that used the URL-prediction algorithm to repeatedly download JSON files until all results had been returned and the search was over, essentially simulating a human's constantly scrolling to the bottom of the search page for countless hours. You can view our code at our Github repository.
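In outline, the scraping loop looked something like the following sketch; the endpoint, parameter names, and JSON keys are reconstructions of what the developer console showed at the time, and Twitter's internals may no longer match them:

```python
import requests

# Endpoint observed in the developer console (an assumption; see above).
SEARCH_URL = "https://twitter.com/i/search/timeline"

def scrape(query):
    """Simulate endless scrolling by chaining position tokens."""
    position = ""   # an empty position requests the newest results
    pages = []
    while True:
        resp = requests.get(SEARCH_URL,
                            params={"q": query, "max_position": position})
        data = resp.json()
        pages.append(data)
        if not data.get("has_more_items"):  # the search is exhausted
            break
        # Each response encodes where the next "scroll" should begin,
        # derived from the IDs of the oldest tweets just returned.
        position = data["min_position"]
    return pages
```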
However, we soon met another snag: as the search page is intended for the casual perusal of human viewers, it provides information about the tweets that we did not want (e.g., profile pictures) and withholds information that we did want (e.g., location). Recalling that Twitter's API does provide access to individual historical tweets by ID, even if it does not permit keyword searches over them, I devised a two-step scraping process. First, we would use one program to scrape the tweets from the search results page and discard everything but the ID numbers of the tweets, which would be stored in a file. Then, a second program would read in these IDs and use them to request each tweet's full dataset via the API. This approach worked satisfactorily until some change occurred on Twitter's end, which broke the software beyond my ability to fix (more on this below).
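The second program's core was a loop along these lines; this sketch uses the batch lookup endpoint, which accepts up to 100 IDs per call (the function name is mine, and similar code could just as well fetch tweets one at a time):

```python
import requests
from requests_oauthlib import OAuth1

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(id_file, auth):
    """Fetch the full dataset for each tweet ID saved by the first program."""
    with open(id_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    tweets = []
    # The lookup endpoint takes up to 100 comma-separated IDs per request.
    for i in range(0, len(ids), 100):
        batch = ",".join(ids[i:i + 100])
        resp = requests.get(LOOKUP_URL, params={"id": batch}, auth=auth)
        tweets.extend(resp.json())
    return tweets
```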
Now that we had a means of collecting the raw data, we needed to convert it to an XML format that our various XML-based analysis and transformation technologies could consume. This conversion was trivial. The second program mentioned above received each tweet's dataset from Twitter in the form of a Python dictionary object, so I wrote a function to go through the dictionary, make XML start and end tags from the dictionary key names, and place those tags around the content given by the corresponding dictionary values. Some of these dictionary entries were not single key/value pairs but complex nested structures; instead of using recursion, I simply hardcoded extra passes to extract and format their contents.
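A condensed sketch of the conversion (the element names follow the dictionary keys; the wrapper element is illustrative):

```python
from xml.sax.saxutils import escape

def dict_to_xml(tweet):
    """Wrap each value in start/end tags named after its dictionary key.

    A single flat pass, as in our converter; nested entries were handled
    by hardcoded extra passes rather than recursion.
    """
    parts = []
    for key, value in tweet.items():
        if isinstance(value, dict):
            continue  # left to a later, hardcoded pass
        parts.append("<{0}>{1}</{0}>".format(key, escape(str(value))))
    return "<tweet>" + "".join(parts) + "</tweet>"
```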
At this stage we also realized that some compound words on our list were not suitable objects of study because of a peculiarity of Twitter's search method. For some reason, Twitter ignores hyphens, so an adjective like "handed-down" becomes the partial verb phrase "handed down"; this change is problematic because a search would return tweets that do indeed contain the same written symbols as the target term, but not the target term itself. We responded to this issue by culling these compound terms from our list of target words.
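The cull itself was a one-line filter over the word list (the terms below are stand-ins for our actual targets):

```python
terms = ["handed-down", "photobomb", "selfie-stick", "unfriend"]
# Drop any compound whose hyphen Twitter's search would silently ignore.
terms = [t for t in terms if "-" not in t]
```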
Having reformatted our data as XML, we could process it by taking advantage of the technologies we learned in class. For example, we used XSLT to produce visualizations depicting the terms' popularity over a five-year period.
We focused on making our XSLT as flexible and extensible as possible so that we could easily drop in new datasets and get results without modifying the code. For example, I wrote a recursive XSLT template to determine the greatest y-value found in a given dataset and scale a chart's y-axis appropriately.
A paradoxical tension underlies programming. On the one hand, code is supremely logical. A computer program is fundamentally nothing but the two familiar truth values and the few binary operations that act upon them, and what could be simpler or clearer than that? And yet, on the other hand, computers can seem mysterious or even capricious, surprising us by behaving in unexpected and frustrating ways. The laws of logic control a program at the machine level, but at the human level, numerous layers of abstraction from this base make it seem like one is dealing with mysterious forces. From this perspective, programming takes on a magical or supernatural cast.
This effect was compounded in our case because we were in fact dealing with a higher power, a mysterious entity greater than ourselves—Twitter. Even though interacting with Twitter's site was integral to our data-collection methodology, we had no deep understanding of the site; indeed, we could not even hope to understand—let alone control—Twitter's protocols for searching historical data, displaying the results, and so on. Of course, people get by just fine without fully understanding anything, and we managed to develop our scraping software with only a rudimentary knowledge of Twitter.com and how it worked. But Twitter giveth, and Twitter taketh away: after this hard-won initial success, the scraper inexplicably stopped working. One day the software worked fine; the next, it would return only empty files. Even when running exactly the same searches with exactly the same code on exactly the same machines, we could not replicate the results we had found before. We were forced to conclude that something had changed on Twitter's end that complicated the scraping process. We are attempting to adapt and fix our scraper, but so far we have been unsuccessful.
Because of this unexpected obstacle, we had to revise our project goals. Instead of the original ten words that we hoped to study, we are now examining only the two words for which we were able to collect data before the change occurred.
Besides the popularity and interactivity of these words over time, we were also interested in how Twitter users of different genders used the neologisms. To explore this, we compared the first name of each tweet's author to lists of common names for each gender and thereby sorted the tweets into "male," "female," and "unknown" categories. I broke this process into simple steps to keep it relatively straightforward: I first used XSLT to transform freely available lists of men's and women's names into XML files that rank each name by popularity, then wrote a second XSLT stylesheet to create an XML file linking each tweet with a gender, and finally used a third XSLT stylesheet to create SVG charts from that file. The results can be seen on our analysis page.
We concede that matching tweets to genders in this way is rather crude and not without a host of problems. Even in the case of traditionally gendered names (e.g., "John"), a name is no sure indicator of gender, and many common names are ambiguous (e.g., "Alex"). If the name associated with a tweet appeared on both lists, we assigned the tweet the gender in which the name was more popular (i.e., higher on the list). This approach was imperfect, but serviceable.
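Although our pipeline implemented this matching in XSLT, the tie-breaking rule is easiest to see in a short Python sketch (the rank dictionaries, mapping each name to its popularity rank, are assumed inputs built from the name lists):

```python
def classify(first_name, male_ranks, female_ranks):
    """Return "male", "female", or "unknown" for a tweet author's first name.

    male_ranks and female_ranks map names to popularity ranks
    (1 = most popular). A name on both lists goes to the gender in
    which it ranks higher, as described above.
    """
    name = first_name.lower()
    if name in male_ranks and name in female_ranks:
        return "male" if male_ranks[name] < female_ranks[name] else "female"
    if name in male_ranks:
        return "male"
    if name in female_ranks:
        return "female"
    return "unknown"
```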