Unravelling Emma: using data analysis to decipher literary works

One can only imagine what Emma Woodhouse would have thought of her carefully chosen words being run through a computer – but by doing just that, an Australian scholar pioneered a new field of literary studies, making it possible to identify authors of anonymous books, date written works, detect plagiarism and chart the evolution of a writer’s style.

A drawing of characters from Jane Austen's novel Emma

Finding patterns

Analysing Austen’s novels computationally in the late 1980s, John Burrows discovered strong patterns in her prose and in the dialogue of her major characters. It was not unusual or complex vocabulary that distinguished, say, Emma’s language from Mr Darcy’s, but the most mundane of words, such as “we” and “the”.

In fact, Burrows established that the thirty most common words in any text, and the frequency with which they are used, are rich in stylistic information. Simple articles, prepositions and conjunctions, long disregarded by researchers as devoid of significance, hold the key to all manner of literary enigmas.
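To make the idea concrete, here is a minimal Python sketch of how the most common words in a text and their relative frequencies might be tallied. It is an illustration only, not Burrows's own procedure: the function name top_word_frequencies and the crude tokenising rule (lowercase letters and straight apostrophes) are assumptions made for this example.

```python
import re
from collections import Counter


def top_word_frequencies(text, n=30):
    """Return the n most common words with their relative frequencies.

    Hypothetical helper for illustration. Tokenising is deliberately crude:
    runs of lowercase letters and straight apostrophes count as words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Relative frequency = occurrences of the word / total words in the text.
    return [(word, count / total) for word, count in counts.most_common(n)]


sample = "The cat and the dog and the bird"
top_two = top_word_frequencies(sample, n=2)
for word, share in top_two:
    print(f"{word}: {share:.3f}")
```

Run on a whole novel with n=30, a list like this is the kind of profile that can be compared from one character or author to another.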

Creating stylometry

A distinguished literary scholar, Burrows is internationally recognised as the creator of “stylometry” – the use of data analysis methods to quantitatively interpret written works, also known as computational stylistics or literary computing. In 2002 he also invented a new statistical procedure, called Delta, which measures the stylistic difference between texts from the frequencies of their most common words, and which remains the most widely employed methodology in the field.
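The Delta measure is usually described along these lines: take the most frequent words across a set of reference texts, convert each text's relative frequency for each word into a z-score (its distance from the set's mean, in standard deviations), and average the absolute differences in z-scores between two texts. The smaller the result, the closer the two styles. The Python sketch below follows that description under stated assumptions: the function names, the simple tokeniser, the use of the population standard deviation and the toy texts are all choices made for illustration, not details taken from Burrows's work.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenise(text):
    # Crude assumption: words are runs of lowercase letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(text):
    words = tokenise(text)
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def delta_scores(corpus, disputed, n_words=30):
    """Return a Delta-style distance from `disputed` to each reference text.

    corpus maps a label (e.g. an author's name) to a reference text.
    Illustrative sketch only; needs at least two texts that differ.
    """
    profiles = {name: relative_frequencies(t) for name, t in corpus.items()}

    # Pick the most frequent words across all reference texts combined.
    pooled = Counter()
    for text in corpus.values():
        pooled.update(tokenise(text))
    features = [word for word, _ in pooled.most_common(n_words)]

    # Mean and standard deviation of each word's frequency across the texts.
    stats = {}
    for word in features:
        values = [profile.get(word, 0.0) for profile in profiles.values()]
        spread = pstdev(values)
        if spread > 0:  # skip words that do not vary at all
            stats[word] = (mean(values), spread)

    def z(profile, word):
        mu, spread = stats[word]
        return (profile.get(word, 0.0) - mu) / spread

    target = relative_frequencies(disputed)
    # Delta = mean absolute difference in z-scores over the chosen words.
    return {
        name: mean(abs(z(target, word) - z(profile, word)) for word in stats)
        for name, profile in profiles.items()
    }


# Toy reference texts, invented for this example.
corpus = {
    "Author A": "the cat and the dog and the bird",
    "Author B": "a cat or a dog or a bird of the sky",
}
scores = delta_scores(corpus, "the cat and the dog and the bird")
likely = min(scores, key=scores.get)
print(scores, likely)
```

In practice the reference texts would be whole books, and the label with the lowest score would be treated as the likeliest candidate rather than as proof of authorship.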

Authorship attribution – determining who penned anonymously published books such as the 1996 Primary Colors about a US presidential campaign, or whether some parts of the plays we usually regard as Shakespeare’s are actually by another writer – is one of stylometry’s most popular uses. It has been applied not only to literature, but also history and philosophy, as well as forensic linguistics (the analysis of language in legal settings) and corpus linguistics (analysis of a database of language).

Stylometry – a convergence of literary studies, linguistics, statistics and computer science – is based on the observation that every author has a relatively consistent, idiosyncratic style, habitually using language in mostly unconscious ways that result in discernible similarities between their writings. And the thirty most frequently used words, such as “and” and “you” – which typically represent one-third of a given text – are the most reliable markers of stylistic difference.

From Jane Austen to the Beatles

Defying decades of conventional wisdom, Burrows pursued a wholly original line of enquiry that exposed the hidden potential of such words and gave them weight in literary studies for the first time. The revelation prompted an outburst of interest in computational approaches, opening the way for countless studies of literary style, authorship, translation, dating and genre classification. And stylometry is not confined to English; it has been successfully used to analyse works in languages ranging from Classical Greek to Mandarin.

As well as Austen, Burrows applied the methodology to the writings of Henry James, E M Forster and Virginia Woolf. Other scholars have used it to explore questions such as differences in style between female and male authors, the innovativeness of certain authors compared with their peers – and even how the lyrics of Paul McCartney and John Lennon became “less pleasant, less active, and less cheerful” over time.

Discovering Humanities series

This story is part of our Discovering Humanities series.

This series is a celebration of humanities research and discovery. Born out of our 50th anniversary in 2019, it covers just a small fraction of the many advances within the humanities since the Academy was first founded.

Acknowledgement of Country

The Australian Academy of the Humanities recognises Australia’s First Nations Peoples as the traditional owners and custodians of this land, and their continuous connection to country, community and culture.