Guessing at the Content of a Million Books

Patrick Juola

Abstract


The recent growth in digital scholarship has made literally millions of books available to readers. Paradoxically, this abundance makes reading more difficult: no human can possibly read and understand a million books. This is particularly problematic in literary scholarship, where “reading” a text involves much more than simple content extraction and may require identifying and explaining patterns of thought and expression across many different works.

We propose a new computer-mediated form of reading, based on automatic pattern extraction. A recent example of automatic research is the “Adam” robot (BBC, 2 April 2009). Other examples include Eurisko[1] and Graffiti[2], programs that perform automatic research in mathematics.

The Graffiti program, in particular, researches graph theory through the generation and testing of conjectures. The program creates random, template-based conjectures, which are then tested against a large collection of graphs. Any conjectures that survive this set of tests are published. Graffiti, it should be noted, does not prove any conjectures; it provides a list of statements that appear to be true, and mathematicians are encouraged to prove or disprove them. Since its inception, Graffiti has listed over 1000 different conjectures and inspired more than 100 published papers.
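The generate-and-test loop described above can be illustrated with a minimal sketch. This is not Graffiti's actual code: the toy graph collection, the four invariants, and the single template "a <= b" are all assumptions chosen for illustration, and the sketch enumerates every pair of invariants rather than sampling conjectures at random.

```python
from itertools import permutations

# Toy graphs as adjacency dicts (hypothetical collection, not Graffiti's).
def path(n):
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

def cycle(n):
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def complete(n):
    return {i: set(range(n)) - {i} for i in range(n)}

GRAPHS = [path(4), cycle(5), complete(4), path(2)]

# Graph invariants that the conjecture template can compare.
INVARIANTS = {
    "vertices": lambda g: len(g),
    "edges": lambda g: sum(len(nb) for nb in g.values()) // 2,
    "max_degree": lambda g: max(len(nb) for nb in g.values()),
    "min_degree": lambda g: min(len(nb) for nb in g.values()),
}

def surviving_conjectures(graphs, invariants):
    """Return every 'a <= b' statement that no graph in the collection refutes."""
    survivors = []
    for a, b in permutations(invariants, 2):
        if all(invariants[a](g) <= invariants[b](g) for g in graphs):
            survivors.append(f"{a} <= {b}")
    return survivors

conjectures = surviving_conjectures(GRAPHS, INVARIANTS)
for statement in conjectures:
    print(statement)
```

As in the description above, a surviving statement is only unrefuted by the test collection, not proven. Here "edges <= vertices" is discarded because the complete graph on four vertices is a counterexample.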

A similar paradigm allows us to conjecture the existence of patterns in writing. We know, for example, that language varies over time, over genre, and over authorial gender[3] in many specific ways. But Roget's thesaurus lists more than 1000 different semantic “categories”, most of which have never been studied in the context of gender and language. For example, we are aware of no study of the use of animal terms (Roget category III.iii.1.2/366). Does the speech of men and women differ in this regard? Once such a conjecture has been constructed, it is easy for a computer to test. If it holds, it is an interesting finding in need of explanation.
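One way such a test might look is sketched below. The word list, the invented example sentences, and the choice of a permutation test are all assumptions made for illustration; the source does not say which statistical test is used, and a real study would draw its terms from the Roget category itself and its texts from an actual corpus.

```python
import random
import re

# Hypothetical stand-in for the terms in a Roget category.
ANIMAL_TERMS = {"dog", "cat", "horse", "bird"}

def category_rate(text, terms):
    """Fraction of word tokens in `text` that belong to `terms`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in terms for t in tokens) / len(tokens) if tokens else 0.0

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(rates_a, rates_b, trials=2000, seed=0):
    """Share of random relabellings whose gap is at least the observed gap."""
    observed = abs(mean(rates_a) - mean(rates_b))
    pooled = list(rates_a) + list(rates_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:len(rates_a)], pooled[len(rates_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / trials

# Invented toy sentences, one list per group of authors (not real data).
group_a = ["The dog chased the cat.", "A horse stood by the gate."]
group_b = ["The meeting ran late again.", "Rain fell on the quiet town."]

rates_a = [category_rate(t, ANIMAL_TERMS) for t in group_a]
rates_b = [category_rate(t, ANIMAL_TERMS) for t in group_b]
p_value = permutation_p_value(rates_a, rates_b)
print(rates_a, rates_b, p_value)
```

A small p-value would mark the conjecture as worth a scholar's attention. Explaining why the difference exists remains the human part of the work.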

A prototype system to do this initial research, the Conjecturator[4], has been constructed; some sample conjectures are available at http://www.twitter.com/conjecturator. Any or all of these published conjectures could serve as the basis for an interesting explanatory paper. We offer this as an example of a new paradigm in reading and scholarship: an opportunity to separate rote reading (which can be done by computer) from the actual scholarly and intellectual work.

Keywords


reading, Graffiti, Conjecturator, prototyping, computer-mediated reading, application, text analysis, textual patterns and algorithms



New Knowledge Environments
© University of Victoria