As I write this I am on my way to Hamburg for DH2012. I’m very much looking forward to the conference this year, not only because of the wide variety of interesting papers and the chance to explore a city I’ve heard a lot of nice things about, but also because this year I feel like I have some substantial research of my own to contribute.
My speaking slot is on Friday morning (naturally opposite a lot of other interesting and influential speakers, but that seems to be the perpetual curse of DH.) In preparation for that, I thought I might set down the background for the project I have been working on for the last two years, and discuss a little of what I will be presenting on Friday. After all, if I can set it down in a blog post then I can present it, right?
The project is titled The Tree of Texts, and its aim is to provide a basis for empirical modelling of text transmission. It grows out of the problem of text stemmatology, and specifically the stemmatology of medieval texts that were transmitted through manual copies by scribes who were almost never the author of the original text (if, indeed, a single original text ever existed.)
It is well known that texts vary as they are copied, whether through mistakes, changes in dialect, or intentional adaptation of the text to its context; almost as long as texts have been copied, therefore, scholars have tried in one way or another to get past these variations to what they believe to be the original text. Even in cases where there was never a written original text, or where the interest of the scholar is more in the adaptation than in the starting point, there is a lot to be gained if we can understand how the text changed over time.
Stemmatology, the formal reconstruction of the genesis of a text, developed as a discipline over the course of the nineteenth century; the most common (“Lachmannian”) method is based on the principle that if two or more manuscripts share a copying error, they are likely to have been copied either one from the other or both from the same (lost) exemplar. There has been a lot of effort, scholarship, and argument on the subject of how one distinguishes ‘error’ from an original (or archetypal) reading, how one distinguishes genealogical error (e.g. the misreading of a few words in a nigh-irreversible way so that the meaning of the entire sentence is changed) from coincidental error (e.g. variation in word spelling or dialect, which probably says more about the scribe than about the manuscript being copied). The classical Lachmannian method requires the practitioner to decide in advance which variants are likely to have been in the original; more recent and computationally-based neo-Lachmannian methods allow the scholar to withhold that particular pre-judgment, but still require a distinction to be made concerning which shared variants are likely or unlikely to have been coincidental or reversible.
A method that requires the scholar to know the answer in advance was always likely to encounter opposition, and Lachmannian stemmatology has spawned entire sub-disciplines in protest at the sheer arrogance (so an anti-Lachmannian might describe it) of claiming to know in advance what is important and what is trivial. Nevertheless the problem remains: how to trace the history of a text, particularly if we begin with the assumption that we know no more, and perhaps considerably less, than the scribes who made the copies? The first credible answer was borrowed from the field of evolutionary biology, where they have a similar problem in trying to understand the order in which features of species might have evolved and the specific relationships to each other of members of a group. This is the discipline of phylogenetics, and there are several statistical methods to reconstruct likely family trees based upon nothing more than the DNA sequences of species living today. Treat a manuscript as an organism, imagine that its text is its DNA sequence, et voilà – you can create an instant family tree.
And yet phylogenetics, if you ask the Lachmannians and other text scholars besides, has its own problems. First, the phylogenetic model assumes that any species living today is by definition not an ancestor species, and therefore must appear only at the edge of the family tree; in contrast we certainly still possess manuscripts that served as the ‘parent’ of other extant manuscripts. Second, in evolutionary terms it is reasonable to model the tree as a bifurcating one – that is, a species only ever divides into two, and then as time progresses either or both of these may divide further. This also fails to match the manuscript model, where it is easy to see a single text spawning two, three, or ten direct copies. Third, where the evolutionary model is assumed to be continously branching, it is well known that a manuscript can be copied with reference to two, three, or even four exemplars. This is next to impossible to represent in a tree (and indeed is not usually handled in a Lachmannian stemma either, serving more often as a reason why a stemma was not attempted.) Fourth is the problem of significance of variants–while some scholars will insist that variants should simply not be pre-judged in terms of their significance, most will acknowledge the probable truth that some sorts of variation are more telling than other sorts. Most phylogenetic programs do not by default take variant significance into account, and most users of phylogenetic trees don’t even try.
In a recent paper, some of the luminaries of text phylogeny argue that none of these problems are insurmountable. Neighbor net diagrams can give some clues regarding multiple text parentage; some more recent and specialized algorithms such as Semstem are able to build trees so that a text can be an ancestor of another text, and so that a text can have more (or even less) than two descendants. The authors also argued that the problem of significance can be handled trivially in the phylogenetic analysis by anyone who cares to assign weighting factors to the variant sets s/he provides to the program.
While it is undoubtedly true that automated algorithms can handle assignment of significance (that is, weighting), it also remains true that there are only two options for assigning these weightings:
- Treat all variants as equal
- Assign the weights arbitrarily, according to philological ‘common sense’, personal experience, or any other criterion that takes your fancy.
This is exactly the ‘missing link’ in text stemmatology: what sorts of variants occured in medieval copying, how common were they, how commonly were they copied, and how commonly were they changed? If we can build a realistic picture of what, statistically speaking, variation actually looked like in medieval times, it will be an enormous step toward reconstructing the stemmata by whatever means the philologist chooses, be it neo-Lachmannian, phylogenetic, or a method yet to be invented.
What we have done in the Tree of Texts project is to create a model for representing text variation, and a model for representing stemmata, and methods for analyzing the text against the stemma in order to answer exactly the questions of what sort of variation occurred when and how. I’ll be presenting all of these methods on Friday, as well as some preliminary results of the number crunching. If you are at DH I hope to see you there!
I’ve just stumbled across your blog (that’s what I get for failing to follow twitter), but I look forward to future posts! I hope your talk goes well. All the buzz on twitter makes me wish I was in Hamburg!
It might be worth mentioning that DNA phylogenies aren’t really trees either. Horizontal gene transfer in bacteria is one well known counterexample; another would be polar bears and brown bears, which branch much earlier in nuclear DNA than they do in mitochondrial DNA.
Yeah, I know there are non-tree-like features in evolution, and I imagine there are some mathematical models to help cope with them, but none of these (if they do exist) seem to be making it over to text genetics yet. I’m not sure why, and I wonder if the analogies simply start to break down.