6.11.2009

Brainstorm: Logarithmic Evolution Distance

(This piece is sort of a continuation of a previous brainstorm on evolution and phylogeny. It was lots of fun to think through and write, and I hope it's fun to read even if a bit jargon-heavy.)

Exponential advances in gene sequencing technology have produced an embarrassment of riches: we're now able to almost trivially sequence an organism's DNA, yet sifting meaning from these genomes is still an incredibly intensive and haphazard task. For instance, consider the following simple questions:

How close are the genetics of dogs and humans? How does this compare to cats and humans? What about mice and cats? How different, genetically, are mice and corn?

We have all of these genomes sequenced, but we don't have particularly good and intuitive ways to answer these sorts of questions.

Whenever we can ask simple questions about empirical phenomena that don't seem to have elegant answers, it's often a sign there's a niche for a new conceptual tool. This is a stab at a tool that I believe could deal with these questions more cogently and intelligently than current approaches.


Logarithmic Evolution Distance: an intuitive approach to quantifying difference between genomes.

How do we currently compare two genomes and put a figure on how close they are? The current fashionable metrics seem to be:

- Raw % similarity in genetic code-- e.g., "Humans and dogs share 85% of their genetic sequence." Or 70%. Or 98%, depending on whom you ask. However, what does this really say? There are many ways to calculate the figure for this, depending on how one evaluates CNVs and functional parity in sequences. And this tends to grossly understate the importance of regulatory elements.

- Gene homologue analysis-- e.g., "The dog genome has gene homologues for ~99.8% of the human genome." However, this metric also involves subjectivity-- depending on how you count them, apes might have the same number of human gene homologues as dogs. This approach also involves deep ambiguities in assuming homologue function, in assessing what constitutes a similar-enough homologue, and in dealing with CNVs-- and this 'roll up your sleeves and compare the functional nuts and bolts of two genomes' approach is also extremely labor-intensive.

- Time since evolutionary divergence-- e.g., "The last common ancestor of dogs and cats lived 60 MYA, vs that of dogs and humans, which lived 95 MYA." However, though time seems a relatively good proxy for estimating how far apart two genomes are, there are many cases where this heuristic overstates or understates the difference. Selection strength and rate of genetic change can vary widely in different circumstances, and thus there are reasons to believe this heuristic is often deeply and systematically biased as a proxy for genome difference, and it breaks down very quickly for organisms with significant horizontal or inter-species gene transfer.


None of these approaches really gives wrong answers to the questions I posed, but neither do they always, or often, give helpful and intuitive answers. They fail the "OK, but what does it mean?" test.

I think it's important to note, first, that all of life is connected. And as evolution creates gaps, it could also bridge them. I say we use these facts to build an intuitive, computational metric for quantifying how far apart two genomes are.

So here's my suggestion for a new approach:
'Evolution Distance' - a rough computational/simulation estimate (useful in a relative sense) of the average number of generations of artificial selection it would take to evolve organism X into organism Y under standardized conditions, given a set of allowed types of mutations.

To back up a bit, a (rough) way to explain what this idea is about is, let's imagine we have some cats. We can breed our cats, and every generation we can take the most genetically doglike cats, and breed them together. Eventually (although it'll take a while!) you'll get a dog. What this tool would do, essentially, is computationally estimate how many generations worth of mutations it would take to go from genome A (a cat) to genome B (a dog). The number of generations is this 'evolution distance' between the genomes. You can apply it to any two genomes under the sun and get an intuitive, fairly consistent answer.

Details, details...
Now, what makes a dog a dog? We can identify several different thresholds for success-- an exact DNA match would be the gold standard, followed by a match of the DNA that codes for proteins, followed by estimated reproductive compatibility, followed by specific subsystem similarities, and so forth. The answer would be expressed as a range of X to Y generations (a 95% confidence interval), in log notation like the Richter scale, since it could vary so widely between organisms... let's call it LED, for Logarithmic Evolution Distance. Arbitrarily, an LED of 1 might be 1k generations, an LED of 2 would be 10k generations, etc.

E.g., the LED of a baboon and a chimpanzee might be 1.8-1.9 (meaning it would take roughly 6,300 to 7,900 generations of selective breeding to turn a baboon into a chimpanzee);
A giraffe and a hippo might be 3.4-3.6;
A starfish and a particular strain of E. coli might be 10.2-10.4. (That's a lot!)
I'm just throwing out some numbers, and may not be in the right ballpark... but the point is this is an intuitive, quantitative metric that can scale from comparing the genetics of parent and offspring all the way to comparing opposite branches of the tree of life.
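
To make the scale concrete, here is a minimal sketch of the conversion between generations and LED. It assumes only the arbitrary anchor suggested above (LED 1 = 1k generations, each further unit a factor of ten); the function names are made up for illustration.

```python
import math

# Anchor taken from the text: an LED of 1 corresponds to 1,000 generations.
GENERATIONS_AT_LED_1 = 1_000


def led_from_generations(generations: float) -> float:
    """Convert a generation count to the log scale (1k -> 1, 10k -> 2, ...)."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return math.log10(generations / GENERATIONS_AT_LED_1) + 1


def generations_from_led(led: float) -> float:
    """Inverse of led_from_generations."""
    return GENERATIONS_AT_LED_1 * 10 ** (led - 1)


if __name__ == "__main__":
    # The hypothetical baboon/chimpanzee range from the examples above.
    for led in (1.8, 1.9):
        print(led, round(generations_from_led(led)))
```

Because the scale is logarithmic, a difference of 1.0 in LED always means a tenfold difference in generations, wherever on the scale it occurs.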

This is intrinsically a quick and dirty estimate, very difficult to get (& prove) 'correct', but given that, it is
1. potentially very useful as a relative, quantitative metric,
2. intuitive in a way current measures of genetic similarity aren't,
3. fully computational with a relatively straightforward interpretation-- you'd set up a model, put in two genomes, and get an answer.

This estimate could, and would need to, operate with a significantly simplified model of selection. Later, the approach could slowly add in gene pools, simulation & function-aware aspects, mutation mechanics, the geometry of mutation hotspots, mutations incompatible with life, gene patterns that protect against mutations, HGT, etc. But it would start as, and be most helpful as, a very rough metric.


Selection-centric or mutation-centric?
Thus far I've used 'selection' and 'mutation' somewhat interchangeably. But I think the ideal way to set up the model is to stay away from random mutations and pruning. Instead, I would suggest setting up an algorithm to map out a shortest mutational path from genome A to genome B, given a certain amount of allowed mutation per generation. This would be less indicative of the randomness of evolution, but perhaps a tighter, more tractable, and more realistic estimate of the number of generations' worth of distance.
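
As a toy illustration of the shortest-path idea, the sketch below treats a genome as a plain string and counts the minimum number of single-base substitutions, insertions, and deletions (ordinary Levenshtein edit distance) needed to turn one into the other, then divides by an allowed number of mutations per generation. This is a stand-in under loud assumptions, not the proposed model: real genomes also change through duplications, inversions, and rearrangements, the quadratic algorithm here would not scale to full genomes, and `mutations_per_generation` is a hypothetical parameter.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # cost of building b[:j] from an empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]  # cost of deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def generations_needed(a: str, b: str, mutations_per_generation: int) -> int:
    """Generations required if each generation may carry a fixed number of edits."""
    if mutations_per_generation <= 0:
        raise ValueError("mutations_per_generation must be positive")
    return math.ceil(edit_distance(a, b) / mutations_per_generation)


if __name__ == "__main__":
    print(generations_needed("AAAA", "TTTT", mutations_per_generation=2))
```

The generation count that comes out could then be put on the log scale to give an LED.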

Practical applications (why would this be useful?):
In general, I see this as an intuitive metric to compare any two genomes that could see wide use-- after the general model is built, the beauty of this approach is that it's automated and quantitative. Just input any two arbitrary genomes, input some mutational parameters, and you get an answer. Biology is coming into an embarrassment of riches in terms of sequencing genomes. This is a tool that can hopefully help people, both scientists and laymen, make better intuitive sense of all this data.

E.g., If I wanted to compare the genomes of two prairie grasses with each other, and with corn, this tool would give me a reasonably intuitive answer to how closely each was related to the others.

A specific scientific use for this would be to compare the ratio of calculated LED to the time since evolutionary divergence while controlling for time between generations. This would presumably be a reasonable (and easy-to-do) measure to detect and compare strength of selection, perhaps helpful as a supplement to e.g., metrics such as linkage disequilibrium analysis. E.g., if the genome of two organisms' last common ancestor can be inferred, the LED of LCA's genome->genome A vs the LED of LCA's genome->genome B would presumably be an excellent quantitative indicator of relative strength of selection.
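
One way to read that comparison, sketched below: convert the LED back into generations, convert the divergence time into elapsed generations using the generation time, and take the ratio. For the last-common-ancestor case, the difference between two LEDs is the log of the ratio of their generation counts, so it converts directly into a "how many times more change" figure. The function names and the numbers in the example are invented for illustration, and the LED anchor (LED 1 = 1k generations) is the arbitrary one used earlier.

```python
def generations_from_led(led: float) -> float:
    """LED 1 = 1,000 generations; each further unit is a factor of ten."""
    return 1_000 * 10 ** (led - 1)


def pace_ratio(led: float, divergence_years: float, generation_time_years: float) -> float:
    """Generations of change implied by the LED, per generation actually elapsed."""
    if divergence_years <= 0 or generation_time_years <= 0:
        raise ValueError("times must be positive")
    elapsed_generations = divergence_years / generation_time_years
    return generations_from_led(led) / elapsed_generations


def relative_change(led_lca_to_a: float, led_lca_to_b: float) -> float:
    """How many times more change lineage A accumulated than lineage B."""
    return 10 ** (led_lca_to_a - led_lca_to_b)


if __name__ == "__main__":
    # Made-up inputs: LED 2.0, 100,000 years since divergence, 10-year generations.
    print(pace_ratio(2.0, 100_000, 10))
    print(relative_change(2.0, 1.7))
```

A ratio well above or below that of comparable lineages would be the signal worth a closer look; the absolute value means little on its own.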

This metric is by no means limited to comparisons between species; comparing Great Danes to Pitbulls with this tool, or even two Pitbulls to each other, would generate interesting results.

This tool would also be helpful in an educational context, to drive home the point that everything living really is connected to everything else, and evolution is the web that connects them. It's also educational in the sense that it'd actually simulate a simplified form of genetic evolution, and we may learn a great deal from rolling up our sleeves and seeing how well our answers compare to nature's.

The nitty gritty...

Open questions:
- This comparison as explained does not deal with the complexity of sexual recombination or of horizontal gene transfer (though to be fair, none of its competitors do either). Or, to dig a little deeper, evolution happens on gene pools, whereas this tool only treats evolution as mutation on single genomes. Does it still produce a usably unbiased result in most comparisons? (My intuition is if we're going for an absolute estimate of an 'evolution distance', no; a relative comparison, yes.)

- Would direction matter? It depends on how simple the model is, but realistically, it's very likely. E.g., the LED of a dog -> cat might be significantly different than cat -> dog. Presumably it'd matter the most in deep, structural changes such as prokaryote <-> eukaryote evolution. Loss of function/structure is usually easier to evolve than gain of function/structure.

- How realistically could one model the conditions that these evolutionary simulations would operate under? E.g., would the number of offspring need to be arbitrary for each simulation? Would the rate of mutation vary between dogs and cats? How could the model be responsive to operation under different ecosystems? How to deal with many changes in these quantities over time, if you're charting a large LED (e.g., bacteria->cat)? I guess the answer to this is, you could make things as complicated as you wanted. But you wouldn't have to.

- In theory, the impact of genetic differences between arbitrary members of the same species would be minimized by the logarithmic nature of the metric. Would this usually be the case? Presumably LED could be used to explore variation pertaining to this: e.g., species X has a mean LED of 1.4, whereas species Y has a mean LED of 1.6.

- Depending on the progress of tissue and functional domain gene expression analysis and what inherent and epistemological messiness lies therein, this could be applied to subsets of organisms: finding a provisional sort of evolution distance between organism X's immune system and organism Y's immune system, or limbs, or heart, etc. Much less conceptually elegant, but perhaps still useful.
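
On the direction question above, a small sketch shows how asymmetry could fall out of even a very simple model. It is a weighted edit distance in which inserting new sequence costs more than deleting existing sequence, as a crude proxy for "loss is easier than gain". The cost values are arbitrary assumptions chosen only to make the effect visible.

```python
def weighted_distance(
    source: str,
    target: str,
    insert_cost: float = 3.0,      # assumed: gaining sequence is expensive
    delete_cost: float = 1.0,      # assumed: losing sequence is cheap
    substitute_cost: float = 1.0,
) -> float:
    """Cheapest total cost of edits turning source into target."""
    prev = [j * insert_cost for j in range(len(target) + 1)]
    for i, cs in enumerate(source, start=1):
        curr = [i * delete_cost]
        for j, ct in enumerate(target, start=1):
            substitute = prev[j - 1] + (0.0 if cs == ct else substitute_cost)
            delete = prev[j] + delete_cost
            insert = curr[j - 1] + insert_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    long_genome, short_genome = "ACGTACGT", "ACGT"
    print(weighted_distance(long_genome, short_genome))  # shrinking
    print(weighted_distance(short_genome, long_genome))  # growing
```

With these costs, shrinking the longer sequence is three times cheaper than growing the shorter one, so the two directions would land at different LEDs.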


Anyway, this is a different way of looking at the differences between genomes. Not more or less correct than others-- but, at least in some cases, I think more helpful.

Edit, 9/27/09: Just read an important paper on the difficulty of reversing some types of molecular evolution, since neutral genetic drift accumulated after a shift in function may not be neutral in the original functional context. In the context of Logarithmic Evolution Distance, I think it underscores the point that LED can't be taken literally, since it doesn't take function or fitness into account. But then again, neither do the other tools it's competing against, and this doesn't impact its core function as an estimation-based tool with which to make relative comparisons.

2 comments:

Anonymous said...

Caveats:

1. Genome assembly is still an art as opposed to a science, especially for the organisms we care about.

2. A successful first approximation to evolution on the genome scale implicitly assumes knowledge and understanding of many processes we do not know and know not of! The concept of selective pressure applied to the genome is an extremely cutting edge field of study.

3. Validation. The problem with any such tool would be validation both on the level of accuracy (how close to the "truth" is the output?) as well as performance (how is it better than existing phylogenetic inference methods?). For the former, the proposed method obviates the convention of validation based on simulated evolution, since such a demonstration would beg the question. As for the latter question, there is no framework within which to posit an answer given the predicament we have with the former.

Mike said...

Good points. My responses:

1. I think one can make this point, but I also think it's a disappearing problem. I can't state anything too firmly here, but I'm reminded of this interview- http://radar.oreilly.com/2009/07/sequencing-a-genome-a-week.html where the tech guy in charge of Washington University's Genome Center talks about the progress of systematizing genome sequencing and reconstruction. It sure sounds like a lot of progress has been made very recently in terms of systemizing, automating, standardizing, etc, and most of the messy problems have been 'pushed down the chain' into the hands of genomicists.

2. I would say that a first approximation can be very rough, and in this context it could still be useful if one limited discussions to relative comparisons of LED. The other relevant item here is, of course, that the process of trying to simulate genomic evolution could be a very generative endeavor.

3. Both points are very valid. I think validation is a problem with this model, though (as you perhaps allude to re: performance) it may be a problem with other models to some extent as well?

Continuing on (3), I guess my argument would be this: validating accuracy is a problem rather unique to this tool because it attempts to model something empirical, which other tools do not. E.g., existing phylogenetic inference methods assert things, but the meaning of what they assert is more or less definitional. You may view the need for validation/accuracy as a weakness of this tool; I think it could also be viewed as a strength, that it goes beyond other tools in that it asserts something that /can/ be said to be empirically accurate or inaccurate.

Thanks for the intelligent comments.