How I Came to Textmine Software

Yesterday, I began reading through Matthew Jockers’ Macroanalysis.  As a scholar already sold on the promise of digital tools in the study of cultural history, I have enjoyed the opportunity to learn how someone else understands many of the same techniques I have begun incorporating into my research over the past year.  I think the book also does an excellent job of situating digital methodologies within the larger scope of literary study, and I’d recommend it both to people curious about the digital humanities and to those already working in the field.  Jockers is very realistic about the limitations of these tools while at the same time describing the new possibilities they could provide, many of which I’ve found myself describing to non-DH colleagues recently.  As valuable as I’ve found his description of the potential for computational analysis, I can’t share in the optimism of the book’s early chapters:

Revolutions take time; this one is only just beginning, and it is the existence of digital libraries, of large electronic text collections, that is fomenting the revolution. . . . Though not ‘everything’ has been digitized, we have reached a tipping point, an event horizon where enough text and literature have been encoded to both allow and, indeed, force us to ask an entirely new set of questions about literature and the literary record.

You see, my field of study is located in the twentieth and twenty-first centuries.  For the most part, this revolution has left me behind.  I have to face the fact on a fairly regular basis that we don’t yet live in a time when “we can take for granted that some digital version of the text we need will be available somewhere online.”  It is true that for those working in the nineteenth century there is a continually growing number of texts available for study, but not so for those, like myself, working on more recent cultural moments.

Despite really enjoying working in areas that several have included under the “big tent” of the digital humanities (web publishing, data curation, computational analysis, and digital media studies), my own experience with trying to find ways to incorporate distant reading tools into my research has more or less matched that described in Mark Sample’s essay “Unseen and Unremarked On: Don DeLillo and the Failure of the Digital Humanities”:

Unless one is willing to infringe on copyright, break a publisher’s DRM, or wait approximately four generations when these authors’ works will likely fall into the public domain, barring some as of yet unknown though predictable act of corporate America Congress, the kind of groundbreaking work being done on seventeenth, eighteenth-, nineteenth-, and even twentieth-century literature simply will not happen for DeLillo and his contemporaries.

Both Jockers and Sample seem to conclude that there is at this moment little hope for scholars in my position.  The problem, as Jockers notes in the book’s final chapter, is not a technical one but a legal one.  Digitized versions of texts published since 1923 do exist, likely in great numbers and possibly even in far greater numbers than in earlier periods, but the general public is not allowed to access them because of the prevailing interpretations of how copyright applies to digitized texts.  We may one day have access to these texts, but for now we have to wait.  So while the legal machinery grinds away and while projects like the HathiTrust Research Center work to build technologies that mediate access to copyrighted works in ways that will satisfy copyright holders, the prevailing opinion is that there are few options for scholars like myself because the analytical wing of the digital humanities is largely “shackled in time” by copyright.

Presented with a tone of jest by Jockers, but read painfully by me, is the idea that digital humanists “are all, or are soon to become, nineteenth centuryists.”  There is more truth to this statement than most might realize.  Roughly three months ago, I attended a presentation by a visiting scholar who admitted that he was by training an American modernist but had turned away from his primary field of literature in order to pursue his interest in digital methodologies.  I am sure he is not the only one. As a junior scholar nearing the end of a dissertation, I am simply not in a position to consider drastically altering the scope of my research to include an archive of digital texts in the public domain that I could access and analyze legally.  Nor, however, do I want to abandon my interest in computational analysis.  Surely, I wondered for the better part of a year, there must be a way to work around copyright and still produce research in a cultural field that genuinely interests me (apologies to any nineteenth centuryist readers!).

While I am based in an English department, my research during graduate school has been located primarily within the field known as Software Studies, itself as nebulous as the digital humanities and now considered a part of the latter with increasing frequency.  Initially, I began my dissertation by turning to literature, popular periodicals, and film to tease out and consider the role that personal computing plays in shaping our shared understanding of larger information systems.  I spent most of my time digging around in dusty stacks while others around me, even if they had no interest in computational analysis, could access so many older texts readily through online databases.  Increasingly, I turned to software itself as a primary object of study rather than trying to read it through the mediation of other cultural forms.  Here too, I reached a conclusion similar to the one discussed above: the potential for a real criticism of software is also bound by copyright.  Endeavors like Critical Code Studies are predicated upon access and are therefore limited in scope by the accessibility of source code. Studying software in earnest would mean accessing source code that large corporations, in most cases, withhold once it has been compiled and that is further protected by the various legal instruments of intellectual property.

In short, I was forced to concede one rainy morning (at least I’d like to think it was a dreary day!) that the sort of projects I initially wanted to do, both mining publications from the early days of personal computing and closely engaging with commercial operating systems, would be almost impossible because of copyright.  At the same time, I was not content to accept defeat and proposed what seemed like a wild idea: to data-mine open source software.  My dissertation, and the posts you’ll find in this blog, have been part of my own methodological response to the problems posed by copyright.

Although digital versions of literature and periodicals after 1923 are largely unavailable, at least legally, there are plenty of ways to read late twentieth and early twenty-first century culture at scale.  The trick, as it were, is in locating available sources, such as parallel discourses, paratextual sources, or metadata, that are relevant to the study of copyrighted works. Even though there are a number of compelling reasons to examine open source software for my own project, I imagine that webscraping could also be used to pull textual and numeric data relevant to other projects studying post-1923 culture from publicly available repositories.  Tracing the parallax between those sources which can be mined and those from which we are shut out by copyright won’t necessarily satisfy everyone, but it could certainly keep scholars like myself from feeling like we’ve been excluded from all the excitement in the digital humanities.

In terms of how my solution is shaping up: Source code for most open source software is accessible in forms that would make the sorts of studies I want to conduct possible.  I know that there is an entire free and open source software ecosystem that emerged largely in response to an era of computing that I had originally wanted to analyze in popular fiction and periodicals.  In many ways, FLOSS is not even a parallel discourse to commercial software anymore, as open source applications are becoming increasingly common in commercial software ecosystems.  In short: studying the FLOSS movement fits well within the scope of my project.

I also know that it makes a lot of sense, for many of the same reasons that Jockers describes within the context of literary history, to apply technologies of scale to the cultural history of software.  Most software is too large to read closely, and most software is updated with such frequency over the course of its lifetime that it would pose a serious challenge if we were to treat a particular program as subject to the same concerns of textual scholarship that we sometimes deal with while working on literary texts.  So while I can’t yet textmine the publications I’ve been digging through by hand, I can at least use the tools I’ve built while collaborating on nineteenth century projects (with some modification) to trace patterns out of the large codebases that open source software makes accessible.  In many ways, I find the possibility more exciting than my original wish to textmine periodicals because I believe this method will expose, and force me to address, methodological problems in Software Studies that have yet to be acknowledged.

In coming entries, I hope to write on some issues I’ve encountered thus far trying to datamine source code repositories from the perspective of someone trained in literary and cultural theory.  I will also likely return to the issue of copyright.  As I mentioned above, it has also shaped critical practice in digital media studies, and many recent developments in Digital Rights Management techniques have serious consequences for those interested in studying software’s history.  Like Sample, I am greatly concerned by the idea “that nobody is talking about this disconnect” between the digital humanities and the reality of copyright.  I was really glad that Jockers devoted a chapter to discussing it, and I hope that what I’m able to share here will encourage more people to think about ways we might work around it.

EDIT: Matthew Jockers gave me some links to things worth reading for those interested in learning more about the current status of copyright and textmining.  The first is the amicus curiae brief that he contributed to in the Authors Guild v. HathiTrust case.  The second is some exciting news about a project that will allow for some 20th century datamining.
