So I thought I should begin my first post on here with a nice and gentle introductory sentence, but I realise that pointing out the increased use of computational phylogenetic tools on cultural and particularly linguistic data to the avid readers of this blog is probably a pretty pointless exercise.
There is of course a lot to say about parallels between biological and cultural evolution, and some of the work using computational tools has given us new insights into yet unanswered (and even hitherto unasked!) questions regarding language and language change. But today I’d like to share some thoughts on a particular “application” of phylogenetic tools, the methodology of which I find a bit odd, even though it is arguably the simplest evolutionary analogy of them all: using computational phylogenetics to reconstruct linguistic phylogenies.
The idea looks innocent enough: given a number of taxonomic traits of individuals (languages), the methods produce one or more trees or networks representing a suggested genetic relationship between the individuals which, to use an oversimplification, maximizes overall within-family retention of traits while minimizing independent introduction of the same traits between families/individuals.
While a lot of ink has been spilled over which phylogeny estimation methods are best for language or whether trees are an adequate representation for cultural entities at all, I’d like to draw your attention to a more fundamental issue, namely the nature of the traits or ‘characters’ underlying all linguistic reconstructions. The proposed benefit of a phylogenetic reconstruction is an optimal evolutionary history over these traits. If the reconstruction is to give us any new insights at all, these traits should be strictly taxonomic, i.e. synchronic. I argue that this absolutely does not hold for language, due to the structural nature of all linguistic traits which have been used for reconstructions so far: phonological, lexical and syntactic.
Among all the sub-disciplines of linguistics, the legacy of structuralism is most noticeable in phonology, and understandably so: phonemes are generally accepted to be exemplar-based clusters which are by themselves meaningless (because non-referential) components out of which meaningful morphemes can be built up. What matters is thus not the concrete realisation of a phoneme, but rather its structural opposition to the other phonemes in a given language’s phoneme inventory. Where and how exactly a phoneme is articulated is far less crucial for the role it plays in the particular language, as long as it is sufficiently distinguishable from all other phonemes.
How is this relevant for the discussion at hand? The main point is that phonemes are within-language concepts which do not directly lend themselves to cross-linguistic comparison, synchronically or diachronically. A sound is by no means ‘the same’ or ‘unchanged’ in later (or earlier) stages of a language just because the physical articulation might not have changed much. Saying that English /d/ and Dutch /d/ are ‘the same’ is nonsensical because the very systems through which the two phonemes are defined are not only vastly different from each other, but also vastly different from the original system from which both developed. Saying that Old High German changed a Proto-Germanic /t/ to OHG. /ts/ while Low German languages ‘retained’ or ‘inherited’ the Proto-Germanic /t/ is equally misleading, because it’s Low German /t/ that we are talking about. In the strictest sense, any cross-linguistic comparison of phonemes is ruled out completely. Using biological terminology, all phonological categories are by definition not homoplastic, and all character states change at every proposed stage of a language. This is not just a matter of pedantry because the sounds will in fact differ in their articulation, leading to the uncertainty of reconstructed sounds with increasing time depth. Overenthusiastically equating phonemes with the four nucleotides found in DNA (as repeatedly done by the likes of Mark Pagel) is superficial and indicates a dramatic misunderstanding of the nature of language, as has been noted and elaborated by many others (e.g. Shanon 1978, Holm 2007).
The fact that the contrastive nature of phonemes is problematic for automated phylogenetic reconstructions based on sound changes seems to be completely ignored by virtually all computational phylogenetic reconstructions I have come across. It is either not acknowledged at all or otherwise quickly swept under the carpet, justified by the claim that phonological changes apparently ‘recur’ and are thus homoplastic (Nichols & Warnow 2008, p.765). Two questions: firstly, how can a proposed sound change, i.e. a relation between two concrete phonemes of two different languages be more cross-linguistically comparable than the phonemes themselves? And secondly, how can a sound change, i.e. a proposed diachronic relationship between two sounds which obviously implies a diachronic relationship between the two underlying languages be considered a taxonomic feature at all?
It is simply not possible to encode synchronic cross-linguistic sound correspondences without already having a particular genetic relationship for all these sounds in mind, and indeed without explicitly coding this theoretical reconstruction into the linguistic ‘characters’. This is particularly evident in the phonological characters used in (Nakhleh et al. 2005) which ‘record the occurrence of sound changes in the (pre-)history of the language’ (p.172), such as whether the Germanic consonant shifts occurred or not (P16). This is quite obviously not a synchronic feature since it is actually based on a purported earlier state of both languages which is hypothesised before the phylogenetic methods are even put to use.
But if you already have a theory of how a given set of languages are related (and you have one by definition, because otherwise you would only have a set of mutually incomparable features which will be of little use for any phylogenetic reconstruction), then what’s the added value of pretending not to have one and feeding an impoverished representation of your already fully reconstructed tree to some algorithm in order to reconstruct your theoretical reconstruction that your features are based on?
If this sounds diffuse to you, rest assured that it is. I will try to make this point more strongly (and hopefully more concisely) in the following section.
There was another good reason to start with phonology, namely the fact that, with the exception of the disputed methodology of mass comparison, virtually all cognacy relations between lexemes are based on their compliance with proposed sound laws. Perhaps even more importantly, the only way in which cognates can be distinguished from later (re-)borrowings is by the absence of otherwise regular sound changes, even when these “sound laws” are not as law-like and exceptionless as we’d like them to be.
This leads to the well-known circularity of the comparative method, where regular sound changes are established based on supposed cognates which exhibit these sound changes. Despite this methodological flaw, and despite the fact that its reliability decays with time depth and the corresponding lack of information on both the pronunciation and the meaning of words, the comparative method has been producing solid results and reconstructions since the dawn of historical linguistics.
The first point I’d like to make is consequently that, due to their entanglement with a concrete proposed history of sound changes, all phylogenetic traits based on cognate lexemes implicitly encode a specific, already reconstructed phylogenetic tree. This is not a mere question of how you selectively choose a subset of cognates, which obviously has an impact on the resulting reconstructions as well. Every single set of cognacy specifications is nothing but a sample of a particular theoretical model of the genetic relationships among these languages based on regular sound changes. Cognacy is a strictly diachronic concept, which should be evident given the etymology of the word ‘cognate’ itself. What’s even worse, in contrast to using sound changes as traits, this information is now no longer explicitly present, effectively obscuring the tautology. This makes it seem like you were actually inferring a diachronic pattern which was not visible in the ‘synchronic’ input specification, but this is an illusion. Any phylogenetic tree reconstructed from such traits which contradicts the sound changes on which the traits are based basically refutes the (diachronic) theory based on which the input characters were encoded, but likely it is no more than an artifact of the sampling of a specific set of cognates.
A second point is in order. The obvious effect of sampling different sets of cognates has led many computational phylogenetic studies to ‘address’ this problem by adopting a sampling strategy from lexicostatistics. This amounts to encoding only 100 or 200 core concepts which are known to be extremely robust against borrowing, also known as Swadesh-lists. Again it is hard to see how you can speak of cognacy in the first place without already having a hypothesis about how precisely the two forms are diachronically related, but my second point is that it is not just the form but also the nature of semantics which calls the foundations of these cross-linguistic comparisons into question.
Here, structuralism strikes back in a more subtle incarnation. While semantic concepts such as those captured by the Swadesh-lists appear to form a relatively stable set of ‘traits’, lexemes don’t organise themselves by mapping to ‘universal’ concepts, but float about, carving out areas in a highly continuous semantic space, carrying a contrastive, discriminatory function in opposition to ‘neighbouring’ concepts. The fixation on idealised prototypical concepts which do not correspond to the actual concepts of a language leads to the systematic exclusion of relevant data. The Swadesh-item ‘dog’ is normally not marked cognate between English (dog) and German (Hund) or Dutch (hond). An English cognate ‘hound’ does exist, though; it has simply attained a more specialised meaning than ‘dog’ while, ironically, ‘dog’ has also been borrowed into German (Dogge), serving an equally more specific purpose than general ‘Hund’. Such patterns will immediately catch the eye of the trained person entering the characters, who can use them to make interesting inferences about a genetic relationship, but they are by definition excluded from the data fed to the computational tools.
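To make the ‘dog’ example concrete, here is a minimal sketch in Python of how a Swadesh-style coding drops the hound/Hund connection. The data structure, the function names and the source labels `DOG` and `HUND` are all invented for illustration; they are not taken from any of the cited studies, and real character matrices are of course far larger.

```python
from itertools import combinations

# Toy lexicon: (form, gloss, etymological source). The source labels are
# illustrative assumptions. Note that German 'Dogge' is a loan from English,
# so a shared source label does not by itself mean inherited cognacy.
LEXICON = {
    "English": [("dog", "dog", "DOG"), ("hound", "hunting dog", "HUND")],
    "German": [("Hund", "dog", "HUND"), ("Dogge", "mastiff", "DOG")],
    "Dutch": [("hond", "dog", "HUND")],
}

def swadesh_sources(lexicon, concept):
    """Keep only the forms whose gloss is exactly the Swadesh concept."""
    return {
        lang: {source for _, gloss, source in entries if gloss == concept}
        for lang, entries in lexicon.items()
    }

def all_sources(lexicon):
    """Keep every form, whatever its current meaning."""
    return {
        lang: {source for _, _, source in entries}
        for lang, entries in lexicon.items()
    }

def shared(sources):
    """Source labels shared by each pair of languages."""
    return {
        (a, b): sorted(sources[a] & sources[b])
        for a, b in combinations(sources, 2)
    }

swadesh = shared(swadesh_sources(LEXICON, "dog"))
full = shared(all_sources(LEXICON))
print(swadesh[("English", "German")])  # nothing shared under the 'dog' slot
print(full[("English", "German")])     # both connections visible in the full lexicon
```

Under the concept-slot coding, English and German share nothing for ‘dog’, while the full lexicon shows two connections, one inherited and one borrowed. Telling those two apart is exactly the kind of judgement that requires a diachronic hypothesis in the first place.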
Generally, while information gained from tracking Swadesh cognates can be of interest for determining the speed of a ‘lexical clock’ (the replacement rate of standard vocabulary from a parent to a child language), it is inadequate for establishing a genetic relationship between two languages in the first place. The exact proportion of cognates has no impact on the supposed genetic relatedness of two languages because the identification of cognates itself already implies a concrete hypothesised genetic relation between the two.
Syntactic or ‘structural’ features have been the basis of some of the most interesting results of computational phylogenetics (e.g. Dunn et al. 2011), a trend likely to continue with the availability of large typological corpora such as the World Atlas of Language Structures (WALS). In contrast to fuzzily distributed items such as phonemes, many of these characters, like those concerning word order, are much more clearly delineated.
This seems to indicate that structural features would lend themselves to cross-linguistic comparison much more easily. But it is actually one of the people behind WALS who has argued against this explicitly (Haspelmath 2010) by emphasising categorial particularism, i.e. that even the most basic formal linguistic categories cannot be equated across languages because the criteria for category assignment are necessarily different in each language. Thus, unless you commit yourself to a framework such as Principles & Parameters, the characters you are dealing with are comparative concepts devised by typologists which are by definition not congruent with the actual functional categories that speakers use to master the language. (NB I think some of the confusion concerning phonological characters stems from the fact that this distinction is not made clear when the same symbol from the International Phonetic Alphabet is used to represent concrete phonetic instantiations [t], the phonological /t/ categories of different languages as well as the fuzzy universal concept of a ‘t-like sound’ in general.)
This result is actually less critical than the ones in the two previous sections. There is nothing categorically wrong about basing reconstructions on typological characters, as long as you acknowledge the fact that these features do not actually correspond to real entities within the languages, but are simple indicators of whether there is any element in the language which fulfills the criteria of the comparative concept, although with no guarantee of a 1:1 relationship. But because these features are strictly placed in the domain of linguistic description as opposed to psychological reality, this raises the question of what such characters mean for the results and interpretations of the method. I understand that something similar is done in biological reconstructions based on morphological rather than genetic features. But since the features are not real traits exhibited by the languages you cannot argue that these should be the units of (altered) replication. This reduces the ‘reconstruction’ to a synchronic clustering/similarity measure with no diachronic dimension, amplified by the fact that most structural characters have few possible values, leading to high rates of chance resemblance (Croft 2008).
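The chance-resemblance point can be illustrated with a small calculation. The sketch below assumes, as a deliberate idealisation, that two unrelated languages draw their value for a character independently from the same frequency distribution; under that assumption the probability of a match is the sum of the squared value frequencies. The function names are hypothetical.

```python
def chance_agreement(value_frequencies):
    """Probability that two independently drawn languages show the same value.

    `value_frequencies` holds the (relative or absolute) frequency of each
    possible value of one character; independence is an assumption here.
    """
    total = sum(value_frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive number")
    return sum((f / total) ** 2 for f in value_frequencies)

def expected_chance_matches(characters):
    """Expected number of matching characters between two unrelated languages."""
    return sum(chance_agreement(freqs) for freqs in characters)

# A binary structural feature with evenly distributed values matches by
# chance half of the time; ten such features give five expected matches.
print(chance_agreement([1, 1]))
print(expected_chance_matches([[1, 1]] * 10))

# A character with 100 equally likely states (closer to a lexical
# character with many cognate classes) matches by chance far less often.
print(chance_agreement([1] * 100))
```

The fewer values a character can take, the more agreement two languages show with no shared history at all, which is why a matrix of mostly binary structural features behaves more like a similarity measure than like evidence of descent.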
In contrast to reconstructions based on sound changes, it is much harder to distinguish between actual ‘inheritance’ and horizontal transfer of features like word order and other grammatical categories, which are known to diffuse easily either through the influence of a Sprachbund or linguistic substrate. While this is acknowledged in (Dunn et al. 2005), they nevertheless conclude that they have discovered evidence for genealogical clustering, a conclusion which I don’t see as warranted at all. Structural features perform badly at detecting even short-term relationships on established language families (Nichols & Warnow 2008), calling into question the validity of the results for the larger time-depths for which these features were proposed in the first place.
Even though their performance in comparison to the comparative method is questionable, there is nothing wrong with using structural features as long as you acknowledge that the methods are based on features which are not very well-understood diachronically and for which no established linguistic theory of change exists, and also that the mapping of these methods from biology onto linguistics is rather superficial. Such research should consequently also distance itself from any unfounded and misleading ‘language-and-the-genetic-code-metaphors’ – not opening your abstracts with suggestive sentences like ‘Languages, like molecules, document evolutionary history.’ would be a good start.
Other indicators for computational phylogenetics
As you can probably tell by now I am more than underwhelmed by the methodology of computational phylogenetic reconstruction applied to linguistic traits, and I hope that my arguments have managed to convince you of some of these problems. On the other hand you might still be wondering why I have put some of my arguments in a rather fierce fashion. Am I just a disgruntled historical linguist fearing for my job, afraid of being replaced by a computer? Firstly, no I’m not, secondly, I want to discuss two more indicators for applying computational phylogenetics to a field, and relate these to the current state of affairs in (historical) linguistics: abundance of data and formal models of altered replication.
One of the prime reasons for the existence and sophistication of the many phylogenetic algorithms today is the amount of data that fields such as molecular genetics have had to handle since the advent of automated DNA sequencing. Such a huge amount of data, unmanageable by human beings (and not collected by human hands in the first place), simply cries out for computational tools, because any manual attempts could never be nearly as exhaustive given the combinatorial explosion of the complexity of systematically contrasting and comparing all the data.
How does this compare to the situation in linguistics? Data collection (particularly of the rarest and consequently most interesting ‘species’) is still a tedious and painstaking task only performable by trained linguists in the field. Moreover, in the last 50 years, which have seen the biggest growth in scientific research across the board, most of linguistics has been plagued by an obsession with theory over data (Gibson and Fedorenko 2010). This shortcoming is only recently being remedied by the collection of and work with quantitative corpora, both typological and developmental. But computational methods coming in as a life-saver in the face of an unmanageable amount of linguistic data, systematically collected in a combined effort by a majority of people working in the field? Not quite.
But even the limited linguistic and typological corpora are combinatorially challenging, potentially warranting the use of computational tools. So what could be done to escape the methodological tautology I have argued for? Instead of feeding traits representing already established diachronic histories into the algorithms, the obvious solution would be to use strictly synchronic traits. As I argued, these cannot be compared cross-linguistically straight away, but this could be addressed by devising algorithms which can establish diachronic correspondences themselves, making use of formal models of probable sound changes.
Such algorithms are indeed under investigation, most commonly based on the Levenshtein distance, the minimum number of single-symbol insertions, deletions and substitutions necessary to transform one word into another. But at least in the current studies known to me this measure systematically ignores the problem of cross-linguistically comparing categories such as phonemes in the first place, simply treating phonetic categories as ‘the same’ given that they are similar enough in articulation (or, mind-bogglingly, in spelling, e.g. Serva & Petroni 2008).
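For readers who have not met the measure before, here is a minimal sketch of the textbook dynamic-programming computation of the Levenshtein distance in Python. It is the standard algorithm, not the specific implementation of any cited study, and the function name is just a label.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    `a` and `b` can be strings or any sequences of comparable symbols.
    Two symbols either count as identical or as different; there is no
    notion of which system of oppositions a symbol belongs to.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x by y, or keep it
            ))
        prev = curr
    return prev[-1]

print(levenshtein("hund", "hond"))  # one substitution
print(levenshtein("dog", "hund"))
```

The comparison `x == y` is where the objection above bites: the algorithm has to be told in advance which symbols of two different languages count as ‘the same’, and that decision is made outside the method.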
This is obviously no match for the sophisticated probabilistic models of change from molecular genetics, filling no less than six chapters in Felsenstein’s ‘Inferring phylogenies’. But these methods are not simply transferable to the linguistic domain because they exploit a very specific property of the genetic code, namely that the function of the nucleotides coincides with their material representation (Shanon 1978). All DNA copying happens locally, while linguistic characters (particularly phonemes and morphosyntactic patterns) are ‘transferred’ by repeated global inference over many linguistic utterances. Even the core phonetic ‘alphabet’ of a language is not stable but has to be re-inferred by every single language learner, completely unlike the neatly delineated inventory of the genetic code, a fact that is frequently and blissfully overlooked. I am surprised that even someone like William Croft does not point out this fundamental difference between the genetic and linguistic ‘alphabets’, treating epenthesis, deletion and metathesis as mere combinatorial problems (Croft 2008).
What’s even worse about current methods, the Levenshtein distance might even be determined independently for every word under investigation (Serva & Petroni 2008 and others). This amounts to reducing the task of establishing cognates to a simple and unsystematic ‘hm, this word does sound a bit like this other word in this other language’-routine, completely ignoring everything we know about the systematicity of sound changes as well as semantic change.
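A rough sketch of what such a word-by-word procedure amounts to: each pair of words with the same meaning is scored in isolation, the edit distance is divided by the length of the longer word, and the scores are averaged over the list. As far as I understand it, this is the general shape of the normalised measure in Serva & Petroni (2008), but the code is an illustrative reconstruction in Python, and the three-item orthographic wordlists are toy data, not theirs.

```python
def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def normalised_distance(a, b):
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def wordlist_distance(list_a, list_b):
    """Mean per-meaning distance; every word pair is scored independently."""
    meanings = list_a.keys() & list_b.keys()
    if not meanings:
        raise ValueError("the two wordlists share no meanings")
    return sum(normalised_distance(list_a[m], list_b[m])
               for m in meanings) / len(meanings)

# Toy wordlists in plain spelling (illustrative only).
english = {"dog": "dog", "water": "water", "ten": "ten"}
german = {"dog": "hund", "water": "wasser", "ten": "zehn"}
dutch = {"dog": "hond", "water": "water", "ten": "tien"}

print(wordlist_distance(german, dutch))
print(wordlist_distance(english, german))
```

Nothing in this procedure connects ‘water’/‘wasser’ with ‘ten’/‘zehn’: each pair is just a handful of mismatching letters, even though a historical linguist would treat the recurrence of a correspondence across many words as the actual evidence.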
This approach is uncannily similar to the disputed methodology of mass comparison and, not unlike reconstructions using only structural features, lends itself to omnicomparatism, the trend to try to compare and relate the most distant of languages based on a minimal amount of unsystematic evidence, leading to hypothesised relationships which are not only highly disputed but usually also bordering on the non-scientific. I’m not sure that what these questions need are black box algorithms which come up with an ‘optimal’ evolutionary history, hiding all their assumptions behind anonymous numbers.
It is often stressed that computational techniques are intended to supplement traditional methods, but how can we believe this claim if none of these techniques show any serious interest in the existing methods at all? Why is there a focus on much weaker typological (comparative) features, when there is grammaticalisation theory, a sophisticated framework for investigating the origins of grammatical morphemes and categories, which I did not find a single reference to in any of the articles I looked at?
This lack of interest is complemented by a trend to force biological concepts onto linguistic ones, like Nichols & Warnow (2008, p.788) arguing that ‘copying’ would be a more ‘transparent’ term for borrowing. Quite the contrary: the term completely obfuscates the fact that loanwords are typically accommodated in the phonology of the target language. These approaches seem to pick the first vague parallels which allow stuffing linguistic data into existing algorithms instead of joining a serious and constructive debate on what the replicating units of language really are, as done by the likes of Croft and Mufwene. All this leads me to the conclusion that many extensions of these phylogenetic methods to language are not done with appropriate care but rather with outright ignorance and fundamental misunderstandings of the target domain.
In this post I have argued that computational phylogenetic ‘reconstructions’ are not just not giving us any genuinely new reconstructions, they are not even independent evidence for existing reconstructions because the traits which they use are already based on concrete hypothesised evolutionary histories.
Computational reconstruction based on typological features is one way to escape this tautology, and this can obviously complement the existing toolkit, particularly when there is a lack of better theories. By this I mean particularly more general cultural traits, such as the features of different traditions of basket weaving or carpetry design for which we have no other theories of transmission, diffusion and change. But in historical linguistics, sophisticated theories of change based on the understanding of the function of linguistic conventions such as phonemes and a large amount of expertise regarding probable phonetic changes do exist. Computational phylogenetics does not provide comparable insights in this area and, despite claims to the contrary, the approach shows no real interest in understanding or complementing the existing methods.
A more direct solution to the tautology would be the development of formal theories for automatically hypothesizing and evaluating regular sound changes, which are still in their very infancy. Most importantly they are also plagued by a confusion of descriptive categories with comparative concepts, with the crucial difference between the two completely ignored in current work, resulting in a similar methodological flaw.
Other current applications of computational methods to linguistic data, particularly if they were performed on phylogenies reconstructed by other means, are of course also untouched by my criticism: Quantifying the role of global vs. lineage-specific trends of typological features and their correlations (Dunn et al. 2011)? Absolutely! Studying replacement rates to infer something about the nature of language divergence in general (Atkinson et al. 2008) or to test some concrete theories of divergence in particular (Gray & Atkinson 2003)? If you will. But if computational phylogenetics is to produce genuinely new and well-founded linguistic reconstructions it has to (a) avoid the tautology I have tried to point out and consequently (b) develop formal models of sound changes which are a methodologically and qualitatively serious challenge for the established and well-proven methods currently employed in the field.
So why am I writing all this? I am not a stubborn skeptic of computational phylogenetics, panicking at the very thought of “putting history into boxes”. I believe that an evolutionary approach can give us valuable new insights into language change and I would of course also embrace computational phylogenetics applied to language if only the underlying methodology was sound. But every single time I come across a phylogenetic study I am in awe over the interpretation of what the encoded characters are, and what the resulting trees are supposed to tell us anew, except that the clustering algorithms developed in machine learning over the past decades do indeed work. Recognised historical linguists proclaim their interest in these methods, criticising them only for their performance rather than their methodological foundations, but I see some fundamental flaws. I am deeply puzzled by some misunderstanding that seems to be going on concerning the nature of the data encoded in these trees, and it seems to be a huge misunderstanding. All throughout this posting I have assumed and tried to show that the misunderstanding is theirs, but some of the discrepancies are so basic that I feel stupid for even pointing them out. Maybe it’s me who is misunderstanding something. Maybe the historical linguists, typologists, computer scientists and geneticists creating these studies are perfectly aware of the mismatches between the characters they investigate (although looking at some of the tables listing ‘conceptual parallels between biological and linguistic evolution’ in various papers I seriously doubt this).
But why is nobody seriously discussing these discrepancies, obvious mismatches and inconsistencies? Am I just imagining them? Are they simply not that bad? Why are they not that bad? Is there something about the original domain of these methods (biology) that I am seriously misguided about? What do comparative and historical linguists really think about them? Are they seen as a quantitative blessing to bring an ‘old-fashioned’ discipline with antiquated research methods up to cutting-edge research?
I am wondering about all these questions and I would like to invite everyone who has bothered to read this far to share their thoughts or frankly tell me where I have totally missed the point, particularly those of you who are actively developing or working with these methods, as well as comparative linguists in general. The comments section below is open for discussion or, if you prefer, I’d be grateful if you would share your thoughts or feedback by e-mail.
Atkinson, Q., Meade, A., Venditti, C., Greenhill, S., & Pagel, M. (2008). Languages evolve in punctuational bursts. Science, 319 (5863), 588-588. DOI: 10.1126/science.1149683
Croft, W. (2008). Evolutionary linguistics. Annual Review of Anthropology, 37 (1), 219-234. DOI: 10.1146/annurev.anthro.37.081407.085156
Dunn, M., Terrill, A., Reesink, G., Foley, R. A., & Levinson, S. C. (2005). Structural phylogenetics and the reconstruction of ancient language history. Science, 309 (5743), 2072-2075. DOI: 10.1126/science.1114615
Dunn, M., Greenhill, S. J., Levinson, S. C., & Gray, R. D. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473 (7345), 79-82. PMID: 21490599
Gibson, E., & Fedorenko, E. (2010). Weak quantitative standards in linguistics research. Trends in Cognitive Sciences, 14 (6), 233-234. DOI: 10.1016/j.tics.2010.03.005
Gray, R. D., & Atkinson, Q. D. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426 (6965), 435-439. DOI: 10.1038/nature02029
Haspelmath, M. (2010). Comparative concepts and descriptive categories in crosslinguistic studies. Language, 86 (3), 663-687. DOI: 10.1353/lan.2010.0021
Holm, H. J. (2007). The new arboretum of Indo-European “trees”. Can new algorithms reveal the phylogeny and even prehistory of Indo-European? Journal of Quantitative Linguistics, 14 (2-3), 167-214. DOI: 10.1080/09296170701378916
Nakhleh, L., Warnow, T., Ringe, D., & Evans, S. (2005). A comparison of phylogenetic reconstruction methods on an Indo-European dataset. Transactions of the Philological Society, 103 (2), 171-192. DOI: 10.1111/j.1467-968X.2005.00149.x
Nichols, J., & Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass, 2 (5), 760-820. DOI: 10.1111/j.1749-818X.2008.00082.x
Serva, M., & Petroni, F. (2008). Indo-European languages tree by Levenshtein distance. EPL (Europhysics Letters), 81 (6). DOI: 10.1209/0295-5075/81/68005
Shanon, B. (1978). The genetic code and human language. Synthese, 39 (3), 401-415. DOI: 10.1007/BF00869557