Unlikely as it sounds, that is exactly what two academics working in very different fields—linguistics and bio-informatics—have achieved, through a remarkable cross-disciplinary collaboration.
Hugh Craig is director of the University of Newcastle’s Centre for Literary and Linguistic Computing, with a long-standing interest in the mathematical qualities of language.
Pablo Moscato is director of the university’s Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, with a passion for applying advanced computing to the diagnosis and treatment of cancer.
In 2005, Professor Moscato, inspired by Professor Craig’s use of computational stylistics to analyse literary works and seeing its potential to advance his own research, suggested they work together.
One of their joint projects, on which they published in 2013, was based on the principle that every writer—indeed, every person—uses language in a uniquely idiosyncratic way.
The pair fed all of the plays written by Shakespeare and three of his contemporaries, Ben Jonson, Thomas Middleton and John Fletcher, into supercomputing systems in Moscato’s laboratory. In total, the plays comprised about 57,000 individual words.
Out of that process emerged a group of common words that each writer used more or less frequently than the others, which enabled the two academics to establish the authorship of disputed works.
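For readers curious how word frequencies can point to an author, here is a minimal Python sketch of the general idea. It is not the researchers' actual method, which the article does not detail. The marker words, the function names and the toy texts are all invented for illustration, and the nearest-profile comparison is a simple stand-in for far more sophisticated techniques.

```python
from collections import Counter

# Hypothetical marker words, chosen for illustration only;
# the article does not list the words the researchers used.
MARKER_WORDS = ["the", "and", "of", "to", "you"]


def word_profile(text, markers=MARKER_WORDS):
    """Return the relative frequency of each marker word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in markers]


def distance(profile_a, profile_b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


def attribute(disputed_text, known_texts):
    """Return the author whose profile is closest to the disputed text."""
    target = word_profile(disputed_text)
    return min(
        known_texts,
        key=lambda author: distance(target, word_profile(known_texts[author])),
    )


# Toy placeholder texts, not real quotations from any playwright.
known = {
    "author_a": "the king and the queen of the land",
    "author_b": "you go to town and you talk to you",
}

print(attribute("the crown of the realm and the throne", known))
```

In this toy case the disputed line leans on "the", "and" and "of" in the same proportions as the first sample, so it is matched to `author_a`. A real study would draw on whole plays and many more words.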
That was impressive in itself. But for Moscato, the data set which they had assembled also mirrored the tens of thousands of biological markers found in a blood sample.
Just as the pair had used bio-informatics to identify key words, so Moscato’s team isolated key biomarkers, such as proteins and gene expressions. And just as he and Craig had studied each playwright’s use of those words, the team investigated the presence of those biomarkers in biological samples.
Moscato was then able to pinpoint a molecular ‘signature’—equivalent to a writer’s stylistic signature—for not only cancer but also Alzheimer’s and multiple sclerosis.
For another study, which they completed in 2014, he and Craig used even more powerful computing techniques to crunch 256 plays and poems, comprising millions of words, by 60 Renaissance writers including Shakespeare. They then noted how the works clustered by author and genre.
This, too, Moscato has been able to apply to his diagnostic work.
The methodology involves detecting subtle patterns of variation across very large data sets. It not only helps with initial diagnoses, but can be used to identify disease types and sub-types, which can then be treated with specifically targeted drugs.
“We started out of sheer curiosity and the intriguing sense that there was a common element here,” says Craig.
“The two-way trade is that Pablo has these beautiful bio-informatics techniques and I can supply this beautifully rich language data, which is a goldmine for statisticians and bio‑medical researchers as well as for literary people.”