Monday, September 29, 2014

The Hardy-Weinberg equation

This is either very easy, or very hard. The algebra is easy but the concepts underlying it are a bit slippery. Note that frequency and probability are being used interchangeably here.

Let's start observationally, with an example. Suppose there is a population of 100 individuals: a few are short, most are medium and some are tall. And suppose (contrary to fact) that there is just one gene which controls height, which comes in two alleles, A and a.

The genotype AA is short, Aa is medium and aa is tall.

For the 100 individuals, we'll say that:

  • 16 are short (AA)
  • 48 are medium (Aa)
  • 36 are tall (aa).

So for a random choice of individual in this example:

  • probability (short-genotype) = 0.16 .. call it x
  • probability (medium-genotype) = 0.48 .. call it y
  • probability (tall-genotype) = 0.36 .. call it z

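Now we ask the question: how many A alleles are there, and how many a? Just count them: each AA individual carries two A alleles, each Aa individual carries one of each, and each aa individual carries two a alleles:

  • for the A alleles we have: 16 × 2 + 48 = 80
  • for the a alleles we have: 48 + 2 × 36 = 120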

Total 200 alleles.

  • frequency of A alleles (call it p) is 80/200 = 0.4
  • frequency of a alleles (call it q) is 120/200 = 0.6.
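The counting above is easy to sketch in Python. This is only an illustration: the function name allele_frequencies and its argument names are my own invention, not standard terminology.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q), the frequencies of alleles A and a.

    The argument names are hypothetical; they are just the numbers of
    individuals observed with each genotype.
    """
    total_alleles = 2 * (n_AA + n_Aa + n_aa)  # every individual carries two alleles
    count_A = 2 * n_AA + n_Aa                 # AA contributes two A, Aa contributes one
    count_a = n_Aa + 2 * n_aa                 # Aa contributes one a, aa contributes two
    return count_A / total_alleles, count_a / total_alleles


p, q = allele_frequencies(16, 48, 36)
print(p, q)  # 0.4 0.6
```

Note that nothing here assumes Hardy-Weinberg: p and q come from plain counting, whatever the genotype numbers are.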

What's the connection between the allele frequencies (p, q) and the genotype frequencies (x, y, z)? This is the famous Hardy-Weinberg equation:

x = p², y = 2pq, z = q²

p = x + y/2, q = y/2 + z.
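As a quick check, here is a minimal Python sketch that goes from p to the genotype frequencies and back again. The function name genotype_frequencies is made up for this post, and I'm assuming q = 1 - p (there are only two alleles).

```python
def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies (x, y, z) for allele frequency p.

    Assumes exactly two alleles, so q = 1 - p.
    """
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


x, y, z = genotype_frequencies(0.4)
print(round(x, 4), round(y, 4), round(z, 4))      # 0.16 0.48 0.36
print(round(x + y / 2, 4), round(y / 2 + z, 4))   # 0.4 0.6, recovering p and q
```

The values are rounded for display only, since floating-point arithmetic leaves tiny errors in the last digits.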

Note that the underlying mechanism of a trait determined by two alleles (with frequencies p, q) severely restricts the possible values of x, y, z. In a population that obeys the Hardy-Weinberg equation we could not, for example, have observed these proportions:

  • 10 are short (AA)
  • 60 are medium (Aa)
  • 30 are tall (aa)

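as no consistent values of p and q can generate this outcome: counting alleles again gives p = 0.4 and q = 0.6, which requires 16, 48 and 36 rather than 10, 60 and 30. That is, the constraint:

  • x + y + z = 1

is considerably weaker than x = p², y = 2pq, z = q². (We're dealing with a special case of the binomial distribution: the number of A alleles in a genotype is binomial with two trials and success probability p.)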
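To make the restriction concrete, here is a rough Python sketch that compares observed genotype counts with the counts the Hardy-Weinberg equation predicts from the same alleles. The function names and the tolerance of half an individual are my own assumptions; this is a crude comparison, not a proper statistical test (in practice something like a chi-square goodness-of-fit test would be used).

```python
def expected_counts(n_AA, n_Aa, n_aa):
    """Genotype counts predicted by Hardy-Weinberg from the observed alleles."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A, by counting
    q = 1.0 - p                      # frequency of allele a
    return n * p * p, 2 * n * p * q, n * q * q


def matches_hardy_weinberg(n_AA, n_Aa, n_aa, tolerance=0.5):
    """True if every observed count is within `tolerance` of its prediction.

    The tolerance is an arbitrary illustrative choice.
    """
    observed = (n_AA, n_Aa, n_aa)
    expected = expected_counts(n_AA, n_Aa, n_aa)
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))


print(matches_hardy_weinberg(16, 48, 36))  # True
print(matches_hardy_weinberg(10, 60, 30))  # False
```

Both populations carry exactly the same alleles (80 A and 120 a), but only the first distributes them among genotypes in the Hardy-Weinberg proportions.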

The interesting thing about the Hardy-Weinberg equation is that for a population under no selection and with random mating (and, strictly, one that is large and has no mutation or migration), these proportions are invariant through the generations.
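The invariance can be seen in a small Python sketch. It assumes an idealised model in which each offspring receives one allele drawn at random from the whole population's pool from each parent; the function name next_generation is hypothetical. I start deliberately from the 10/60/30 proportions, which are not in Hardy-Weinberg form.

```python
def next_generation(x, y, z):
    """Genotype frequencies after one round of random mating.

    Idealised model: each parent passes on A with probability p, the
    current frequency of A in the population.
    """
    p = x + y / 2  # frequency of allele A in the parents
    q = y / 2 + z  # frequency of allele a in the parents
    return p * p, 2 * p * q, q * q


freqs = (0.10, 0.60, 0.30)
for generation in range(4):
    print(generation, [round(f, 4) for f in freqs])
    freqs = next_generation(*freqs)
# 0 [0.1, 0.6, 0.3]
# 1 [0.16, 0.48, 0.36]
# 2 [0.16, 0.48, 0.36]
# 3 [0.16, 0.48, 0.36]
```

In this model a single generation of random mating is enough to reach the Hardy-Weinberg proportions, and after that nothing changes: the allele frequencies p and q stay at 0.4 and 0.6 throughout.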

Not so hard once you figure out what problem is actually being addressed. Normally you start with a collection of phenotypes, identify them with the corresponding genotypes and then try to work out the allele frequencies, as in disease management. For more on this see my next post: Using the Hardy-Weinberg equation.

The full story is here (Wikipedia).