Systematic reviews 101: Internal and External Validity

Who remembers last summer when I started writing a series of posts on systematic literature reviews?

I apologise for neglecting it for so long, but here is a quick write-up on assessing the studies you are including in your review for internal and external validity, with special reference to experiments in artificial language learning and evolutionary linguistics (though this is relevant to any field that aspires to adopt the scientific method).

In the first post in the series, I outlined the differences between narrative and systematic reviews. One of the defining features of a systematic review is that it is not written with a specific hypothesis in mind. The literature search (which my next post will be about) is conducted with predefined inclusion criteria and, as a result, you will end up with a pile of studies to review regardless of their conclusions, or indeed regardless of their quality. Because the search itself has no filter to catch bad science, we need methods to assess the quality of a study or experiment, which is what this post is about.

(This will also help with DESIGNING a valid experiment, as well as assessing the validity of other people’s.)

What is validity?

Validity is the extent to which a conclusion is a well-founded one given the design and analysis of an experiment. It comes in two different flavours: external validity and internal validity.

External Validity

External validity is the extent to which the results of an experiment or study can be extrapolated to different situations. This is EXTREMELY important in the case of experiments in evolutionary linguistics because the whole point of experiments in evolutionary linguistics is to extrapolate your results to different situations (i.e. the emergence of linguistic structure in our ancestors), and we don’t have access to our ancestors to experiment on.

Here are some of the things that affect an experiment’s external validity (in linguistics/psychology):

  • Participant characteristics (age (especially important in language learning experiments), gender, etc.)
  • Sample size
  • Type of learning/training (important in artificial language learning experiments)
  • Characteristics of the input (e.g. the nature of the structure in an input language)
  • Modality of the artificial language (how similar to actual linguistic modalities?)
  • Modality of output measures (how the outcome was measured and analysed)
  • The task from which the output was produced (straightforward imitation or communication or some other task)

Internal Validity

Internal validity is how well an experiment reduces its own systematic error within the circumstances of the experiment being performed.

Here are some of the things that affect an experiment’s internal validity:

  • Selection bias (who’s doing the experiment and who gets put in which condition)
  • Performance bias (differences between conditions other than the ones of interest, e.g. running people in condition one in the morning and condition two in the afternoon)
  • Detection bias (how the outcome measures are coded and interpreted. Blinding the coder to which condition a participant is in is paramount for reducing the researcher’s bias towards finding a difference between conditions. A lot of retractions lately have been down to failures to act against detection bias.)
  • Attrition bias (Ignoring drop-outs, especially if one condition is especially stressful, causing high drop-out rates and therefore bias in the participants who completed it. This probably isn’t a big problem in most evolutionary linguistics research, but may be in other psychological stuff.)

Different types of bias will be relevant to different fields of research and different research questions, so it may be an idea to come up with your own scoring method for validity to subject different studies to within your review. But remember to be explicit about what your scoring methods are, and the pros and cons of the studies you are writing about.
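As a rough illustration of what an explicit scoring method could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the three-level rating ("low", "unclear", "high"), the points attached to each level, the rule that an unrated bias type counts as "unclear", and the study names. It is one possible rubric built on the four bias types listed above, not an established scale.

```python
from dataclasses import dataclass

# Hypothetical rubric: the four bias types from this post, each rated by the reviewer.
BIAS_TYPES = ("selection", "performance", "detection", "attrition")

# Assumed scoring: more points means more risk of bias.
RISK_POINTS = {"low": 0, "unclear": 1, "high": 2}


@dataclass
class StudyAssessment:
    study: str
    ratings: dict  # bias type -> "low" | "unclear" | "high"

    def risk_score(self) -> int:
        """Sum the risk points; a bias type with no rating counts as 'unclear'."""
        return sum(RISK_POINTS[self.ratings.get(b, "unclear")] for b in BIAS_TYPES)


def rank_by_risk(assessments):
    """Return the assessments ordered from lowest to highest total risk."""
    return sorted(assessments, key=lambda a: a.risk_score())


# Made-up studies, purely for illustration.
studies = [
    StudyAssessment("Study A", {"selection": "low", "performance": "low",
                                "detection": "high", "attrition": "low"}),
    StudyAssessment("Study B", {"selection": "low", "performance": "low",
                                "detection": "low"}),  # attrition not reported
]

for assessment in rank_by_risk(studies):
    print(assessment.study, assessment.risk_score())
```

Writing the rubric down like this forces you to state the scoring rules before you look at any results, and the per-study ratings can go straight into the review alongside the total.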

Hopefully this introduction will have helped you think about validity within experiments in what you’re interested in, and helped you take an objective view on assessing the quality of studies you are reviewing, or indeed conducting.


Systematic reviews 101: How to phrase your research question

Image from the JEPS Bulletin

As promised, and first things first: when writing a systematic review, how should we phrase our research question? This is useful when phrasing questions for individual studies too.

PICO is a useful mnemonic for building research questions in clinical science:

  • Patient group
  • Intervention
  • Comparison/Control group
  • Outcome measures

How does this look in practice?

What is the effect of [intervention] on [outcome measure] in [patient group] (compared to [control group])?
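The template above is really just a fill-in-the-blanks pattern, so it can be sketched as a small Python function. The function name `phrase_question` and its parameter names are my own inventions for illustration; the only thing taken from the template is the wording of the question, with the comparison treated as optional.

```python
from typing import Optional


def phrase_question(intervention: str, outcome: str, population: str,
                    comparison: Optional[str] = None) -> str:
    """Build a PICO-style research question; the comparison part is optional."""
    question = f"What is the effect of {intervention} on {outcome} in {population}"
    if comparison:
        question += f" compared to {comparison}"
    return question + "?"


# Reproduce the template itself by passing the placeholders as values.
print(phrase_question("[intervention]", "[outcome measure]",
                      "[patient group]", "[control group]"))
```

Making each slot a required argument is the point of the exercise: if you cannot fill one in, the question is probably not yet specific enough to review.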

How can we make this more applicable for language evolution?

I guess we can change the mnemonic:

  • Population (either whole language populations in large-scale studies, small sample populations in the real world or under a certain condition in a laboratory experiment, or a population of computational or mathematical agents or other population proxy)
  • Intervention
  • Comparison/Control group
  • Outcome measures

Here are some examples of what this might look like using language evolution research:

What is the effect of [L2 speakers] on [morphological complexity] in [large language populations] compared to [small language populations]?

What is the effect of [speed of cultural evolution] on [the Baldwin effect] in [a population of Bayesian agents]?

What is the effect of [iterated learning] on [the morphosyntactic structure in an artificial language] in [experimental participants]?

What is the effect of [communication] on the [distribution of vowels] in [a population of computational agents]?

All of the above are good research questions for individual studies, but I’m not sure it would be possible to do a review on any of them, simply because there are not enough studies, and even when studies have investigated the same intervention and outcome measure, they haven’t used the same type of population.

In clinical research the same studies are done again and again, with the same disease, intervention and population. This makes sense, as one study does not necessarily create enough evidence to risk people’s lives on the results. We don’t have this problem in language evolution (thank god); however, I feel we may suffer from a lack of replication of studies. There has been quite a lot of movement recently (see here) to make replication of psychological experiments encouraged, worthwhile and publishable. It is also relatively easy to replicate computational modelling work, but the tendency is to change the parameters or interventions to generate new (and therefore publishable) findings. And real-world data is a problem because we end up analysing the same database of languages over and over again. However, I suppose controlling for things like linguistic family, and therefore treating each language family as its own study, is, in a way, a sort of meta-analysis of natural replications.
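To make the "each language family as its own study" idea a bit more concrete, here is a minimal Python sketch of one standard way to combine per-study estimates: a fixed-effect, inverse-variance weighted average. This is my choice of technique for illustration, not something the analyses mentioned above necessarily use, and the numbers are invented. It also assumes each family yields an effect estimate with a standard error and that the families are independent, which is a strong assumption for real languages.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Fixed-effect inverse-variance pooling of per-study effect estimates.

    Each study is weighted by 1 / SE^2, so more precise studies count for more.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Invented numbers: one effect estimate and standard error per language family.
family_estimates = [0.2, 0.4]
family_standard_errors = [0.1, 0.1]

pooled, pooled_se = pool_fixed_effect(family_estimates, family_standard_errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```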

I’m not sure there’s an immediate solution to the problems I’ve identified above, and I’m certainly not the first person to point them out. Still, thinking carefully about your research question before starting to conduct a review is excellent practice. Remember that when doing a systematic review, the narrower your research question, the easier, more thorough and more complete your review will be.

Systematic reviews 101: Systematic reviews vs. Narrative reviews

Last week I went to a workshop on writing systematic reviews run by SYRCLE. The main focus of this workshop, and indeed the main focus within most of the literature on systematic reviews, is on clinical and preclinical research. However, I think that other disciplines can really benefit from some of the principles of systematic reviewing, so I thought I’d write a quick series on how to improve the scientific rigor of writing reviews within the field of language evolution.

So first things first: what is a systematic review? A systematic review is defined (by the Centre for Reviews and Dissemination at the University of York) as “a review of the evidence on a clearly formulated question that uses systematic and explicit methods to identify, select and critically appraise relevant primary research, and to extract and analyse data from the studies that are included in the review.”

This is in contrast to narrative or literature reviews, which are more traditionally seen in non-medical disciplines. Reviews within language evolution are usually authored by the main players in the field and are generally on a very broad topic. They use informal, unsystematic and subjective methods to search for, collect and interpret information, which is often summarised with a specific hypothesis in mind, without critical appraisal, and with an accompanying convenient narrative. Though these narrative reviews are often conducted by people with expert knowledge of their field, this expertise and experience may bias the authors. Narrative reviews are, by definition, arguably not objective in assessing the literature and evidence, and therefore not good science. Some are obviously more guilty than others, and I’ll let you come up with some good examples in the comments.

So how does one go about starting a systematic review, either as a standalone paper or as part of a wider thesis?

Systematic reviews require the following steps:

From: YourHealthNet in Australia

1. Phrase the research question

2. Define inclusion and exclusion criteria (for literature search)

3. Search systematically for all original papers

4. Select relevant papers

5. Assess study quality and validity

6. Extract data

7. Analyse data (with a meta-analysis if possible)

8. Interpret and present data
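Steps 2 and 4 are the ones that most obviously separate a systematic review from a narrative one, so here is a minimal Python sketch of what "predefined criteria" might look like in practice. The record fields, the three criteria and the example records are all hypothetical and chosen only for illustration. The point is that the criteria are written down before screening, applied to every record in the same way, and that the reason for each exclusion is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    year: int
    is_primary_research: bool
    reports_outcome: bool


# Hypothetical inclusion criteria, fixed before the search is run.
# Each maps a human-readable name to a test on a record.
CRITERIA = {
    "primary research": lambda r: r.is_primary_research,
    "reports the outcome measure": lambda r: r.reports_outcome,
    "published 2000 or later": lambda r: r.year >= 2000,
}


def screen(records):
    """Split records into included ones and excluded ones with their reasons."""
    included, excluded = [], []
    for record in records:
        failed = [name for name, test in CRITERIA.items() if not test(record)]
        if failed:
            excluded.append((record, failed))
        else:
            included.append(record)
    return included, excluded


# Made-up search results.
records = [
    Record("Experiment X", 2008, True, True),
    Record("Opinion piece Y", 2011, False, False),
    Record("Experiment Z", 1995, True, True),
]

included, excluded = screen(records)
print("Included:", [r.title for r in included])
for record, reasons in excluded:
    print("Excluded:", record.title, "because it fails:", ", ".join(reasons))
```

Note that nothing in the criteria looks at what a study concluded or how good it is; quality is assessed afterwards, in step 5.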

In the coming weeks I will write posts on how to phrase the research question of your review, tips on searching systematically for relevant studies, how to assess the quality and validity of the papers and studies you wish to cover, and then maybe a post on meta-analysis (though this is going to be difficult with reference to language evolution because of its multidisciplinary nature and the diversity of the relevant evidence; I’ll have a good think about it).



Undertaking Systematic Reviews of Research on Effectiveness. CRD’s Guidance for those Carrying Out or Commissioning Reviews. CRD Report Number 4 (2nd Edition). NHS Centre for Reviews and Dissemination, University of York. March 2001.