(This piece is sort of a continuation of a previous brainstorm on evolution and phylogeny. It was lots of fun to think through and write, and I hope it's fun to read even if a bit jargon-heavy.)
Exponential advances in gene sequencing technology have produced an embarrassment of riches: we're now able to almost trivially sequence an organism's DNA, yet sifting meaning from these genomes is still an incredibly intensive and haphazard task. For instance, consider the following simple questions:
How close are the genetics of dogs and humans? How does this compare to cats and humans? What about mice and cats? How different, genetically, are mice and corn?
We have all the necessary genomic data to answer these questions, and we can calculate answers of a sort-- but the types of answers we can give at this point are rather sparse and definitely not intuitively satisfying.
Whenever we can ask simple questions about empirical phenomena that don't seem to have elegant answers, it's often a sign there's a niche for a new conceptual tool. This is a stab at a tool that I believe could deal with these questions more cogently and intelligently than current approaches.
Logarithmic Evolution Distance: an intuitive approach to quantifying difference between genomes.
How do we currently compare two genomes and put a figure on how close they are? The fashionable metrics seem to be:
- Raw % similarity in genetic code-- e.g., "Humans and dogs share 85% of their genetic sequence." Or 70%. Or 98%, depending on who you ask. However, what does this really say? It's a non-intuitive answer, particularly since there are so many ways to calculate the figure, depending on how one evaluates CNVs and functional parity in sequences. And this tends to grossly understate the importance of regulatory elements.
- Gene homologue analysis-- e.g., "The dog genome has gene homologues for ~99.8% of the human genome." However, neither the magnitude nor the functional meaning of the difference between two genomes having 99% homologous genes and 99.8% homologous genes is apparent. This approach also involves deep ambiguities in assuming homologue function, in assessing what constitutes a similar-enough homologue, and in dealing with CNVs-- and this 'roll up your sleeves and compare the functional nuts and bolts of two genomes' approach is also extremely labor-intensive.
- Time since evolutionary divergence-- e.g., "The last common ancestor of dogs and cats lived 60 MYA, vs that of dogs and humans, which lived 95 MYA." However, though time seems a relatively good proxy for estimating how far apart two genomes are, there are many examples of false positives and false negatives for this heuristic. Selection strength and rate of genetic change can vary widely in different circumstances, and thus there are reasons to believe this heuristic is often deeply and systematically biased as a proxy for genome difference.
None of these approaches really give wrong answers to the questions I posed, but neither do they always, or often, give helpful and intuitive answers. They fail the 'ok, but what does it mean?' test.
So here's my suggestion for a new approach: 'Evolution Distance', a rough computational/simulation estimate (useful in a relative sense) of the average number of generations of artificial selection it would take to evolve organism X into organism Y under standardized conditions, given a set of allowed types of mutations.
To back up a bit, here's a (rough) way to explain the idea: take some cats. Breed them. Every generation, take the most genetically doglike cats, and breed them together. Eventually(!) you'll get a dog. What this tool does, essentially, is computationally estimate how many generations of selection [edit: mutation] it would take to go from genome A (a cat) to genome B (a dog). That number of generations is the 'evolution distance' between the genomes.
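The breeding loop above can be sketched as a toy simulation. This is only an illustration under heavy assumptions: genomes are short, equal-length strings over A/C/G/T, the only allowed mutation is a point substitution, 'doglike' means 'matches the target at more positions', and every name and parameter here (`generations_to_evolve`, `offspring`, `rate`) is made up for the example rather than taken from any existing tool.

```python
import random

ALPHABET = "ACGT"

def similarity(genome, target):
    """Count positions where genome matches target (equal-length toy genomes)."""
    return sum(a == b for a, b in zip(genome, target))

def mutate(genome, rate, rng):
    """Copy genome, redrawing each base at random with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base
        for base in genome
    )

def generations_to_evolve(start, target, offspring=50, rate=0.01,
                          max_generations=100_000, seed=0):
    """Breed from `start`, keeping the most target-like genome each generation.

    Returns the number of generations until an exact match with `target`,
    or None if `max_generations` is reached first.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    current = start
    for generation in range(max_generations + 1):
        if current == target:
            return generation
        litter = [mutate(current, rate, rng) for _ in range(offspring)]
        litter.append(current)  # keep the parent so similarity never drops
        current = max(litter, key=lambda g: similarity(g, target))
    return None

if __name__ == "__main__":
    cat = "A" * 20        # placeholder 'cat' genome
    dog = "ACGT" * 5      # placeholder 'dog' genome
    print(generations_to_evolve(cat, dog, rate=0.05))
```

Averaging the result over many seeds would give the 'average number of generations' in the definition; a single run is just one draw.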
Now, what makes a dog a dog? We can identify several different thresholds for success-- an exact DNA match would be the gold standard, followed by a match of the DNA that codes for proteins, followed by estimated reproductive compatibility, followed by specific subsystem similarities, and so forth. The answer would be a range of X to Y generations (a 95% confidence interval), expressed in log notation like the Richter scale, since it could vary so widely between organisms... call it LED for Logarithmic Evolution Distance. Arbitrarily, an LED of 1 might be 1k generations, an LED of 2 would be 10k generations, etc.
E.g., the LED of a baboon and a chimpanzee might be 1.8-1.9;
Of a giraffe and a hippo might be 3.4-3.6;
Of a starfish and a particular strain of E. coli might be 10.2-10.4. (That's a lot!)
I'm just throwing out some numbers, and may not be in the right ballpark... but the point is this is an intuitive, quantitative metric that can scale from comparing the genetics of parent and offspring all the way to comparing opposite branches of the tree of life.
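For concreteness, the arbitrary scale above (LED 1 = 1k generations, LED 2 = 10k generations) works out to LED = log10(generations) - 2. A minimal sketch of the conversion in both directions; the function names are just illustrative:

```python
import math

def led_from_generations(generations):
    """Convert a generation count to LED, with LED 1 = 1,000 generations."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return math.log10(generations) - 2

def generations_from_led(led):
    """Inverse of led_from_generations."""
    return 10 ** (led + 2)

def led_interval(low_generations, high_generations):
    """Express a 'X to Y generations' range as an LED range."""
    return (led_from_generations(low_generations),
            led_from_generations(high_generations))

if __name__ == "__main__":
    print(led_from_generations(1_000))    # the LED-1 anchor point
    print(led_from_generations(10_000))   # the LED-2 anchor point
    print(generations_from_led(3.5))
```

Because the scale is logarithmic, a narrow-looking LED range such as 3.4-3.6 still spans a factor of about 1.6 in generations.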
This is intrinsically a quick and dirty estimate, very difficult to get (& prove) 'correct', but given that, it is
1. potentially very useful as a relative, quantitative metric,
2. intuitive in a way current measures of genetic similarity aren't,
3. fully computational with a relatively straightforward interpretation-- you'd set up a model, put in two genomes, and get an answer.
This estimate could, and would need to, operate with a significantly simplified model of selection. Later, the approach could slowly add in gene pools, simulation & function-aware aspects, mutation mechanics, the geometry of mutation hotspots, mutations incompatible with life, gene patterns that protect against mutations, HGT, etc. But it would start as, and be most helpful as, a very rough metric.
Variations:
1. Instead of being based on random mutations and pruning, perhaps the algorithm could be tuned to map out a shortest mutational path from genome A to genome B, given a certain amount of allowed mutation per generation. This would be less indicative of the randomness of evolution, but perhaps a tighter, more tractable, and more realistic estimate of the number of generations' worth of distance. [Note- I'm coming around to the idea that this is the better approach.]
2. Depending on the progress of tissue and functional domain gene expression analysis and what inherent and epistemological messiness lies therein, this could be applied to subsets of organisms: finding a provisional sort of evolution distance between organism X's immune system and organism Y's immune system, or limbs, or heart, etc. Much less conceptually elegant, but perhaps still useful.
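The shortest-path variation (1) can be sketched with a standard edit distance standing in for 'shortest mutational path'. The assumptions are loud ones: the only allowed mutations are single-base substitutions, insertions, and deletions, each costing one mutation, and a fixed budget of mutations can land per generation. Real genomes would also need duplications, inversions, and so on, and are far too long for this quadratic-time approach as written. The name `shortest_path_generations` is made up for the example.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance: fewest single-base edits turning a into b."""
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete base_a
                current[j - 1] + 1,                    # insert base_b
                previous[j - 1] + (base_a != base_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def shortest_path_generations(a, b, mutations_per_generation):
    """Generations needed if every allowed mutation lies on the shortest path."""
    if mutations_per_generation <= 0:
        raise ValueError("mutations_per_generation must be positive")
    return math.ceil(edit_distance(a, b) / mutations_per_generation)

if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3, the textbook example
    print(shortest_path_generations("ACGTACGT", "ACGT", 2))
```

This gives a lower bound of sorts: random mutation plus selection will generally take more generations than a path where no mutation is wasted.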
Practical applications (why would this be useful?):
In general, I see this as an intuitive metric to compare any two genomes that could see wide use-- after the general model is built, the beauty of this approach is that it's automated and quantitative. Just input any two arbitrary genomes, input some mutational parameters, and you get an answer. Biology is coming into an embarrassment of riches in terms of sequencing genomes. This is a tool that can hopefully help people, both scientists and laymen, make better intuitive sense of all this data.
A specific use for this would be to compare the ratio of calculated LED to the time since evolutionary divergence while controlling for time between generations. This would presumably be a reasonable (and easy-to-do) measure to detect and compare strength of selection, perhaps helpful as a supplement to metrics such as linkage disequilibrium analysis. Alternatively, if the genome of two organisms' last common ancestor (LCA) can be inferred, the LED of LCA's genome->genome A vs the LED of LCA's genome->genome B would presumably be an excellent quantitative indicator of relative strength of selection.
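One way the arithmetic for this comparison might look, as a sketch only. I'm assuming the ratio is taken on raw generation counts rather than on the log value, that 'controlling for time between generations' means dividing divergence time by an average generation time, and that the LED scale is LED = log10(generations) - 2. The function names and the numbers in the usage lines are placeholders, not measurements.

```python
def elapsed_generations(divergence_years, generation_time_years):
    """Generations available since divergence, given an average generation time."""
    if generation_time_years <= 0:
        raise ValueError("generation_time_years must be positive")
    return divergence_years / generation_time_years

def selection_ratio(simulated_generations, divergence_years,
                    generation_time_years):
    """Simulated evolution distance divided by generations actually elapsed."""
    return simulated_generations / elapsed_generations(
        divergence_years, generation_time_years)

def relative_selection(led_lca_to_a, led_lca_to_b):
    """Ratio of generations for LCA->A versus LCA->B.

    LED is a log10 scale, so a difference in LED is a ratio in generations.
    """
    return 10 ** (led_lca_to_a - led_lca_to_b)

if __name__ == "__main__":
    # Placeholder inputs purely to show the call shape.
    print(selection_ratio(2_000_000, 10_000_000, 5))
    print(relative_selection(3.0, 2.0))
```

A ratio well above or below that of comparable species pairs would be the thing to look at; the absolute value depends entirely on the simulation's parameters.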
This metric is by no means limited to comparisons between species; comparing Great Danes to Pitbulls with this tool, or even two Pitbulls to each other, would generate interesting results.
This tool would also be helpful in an educational context, to drive home the point that everything living really is connected to everything else, and evolution is the web that connects them. It's also educational in the sense that it'd actually simulate a simplified form of genetic evolution, and we may learn a great deal from rolling up our sleeves and seeing how well our answers compare to nature's.
Open questions:
- This comparison as explained does not deal with the complexity of sexual recombination or of horizontal gene transfer (though to be fair, none of its competitors do either). Or, to dig a little deeper, evolution happens on gene pools, whereas this tool only treats evolution as mutation on single genomes. Does it still produce a usably unbiased result in most comparisons? (My intuition is if we're going for an absolute estimate of an 'evolution distance', no; a relative comparison, yes.)
- Would direction matter? It depends on how simple the model is, but realistically, it's very likely. E.g., the LED of a dog -> cat might be significantly different from that of cat -> dog. Presumably it'd matter the most in deep, structural changes such as prokaryote <-> eukaryote evolution. Loss of function/structure is generally easier to evolve than new function/structure.
- How realistically could one model the conditions that these evolutionary simulations would operate under? E.g., would the number of offspring need to be arbitrary for each simulation? Would the rate of mutation vary between dogs and cats? How could the model be responsive to operation under different ecosystems? How to deal with many changes in these quantities over time, if you're charting a large LED (e.g., bacteria->cat)? I guess the answer to this is, you could make things as complicated as you wanted. But you wouldn't have to.
- In theory, the impact of genetic differences between arbitrary members of the same species would be minimized by the logarithmic nature of the metric. Would this usually be the case? Presumably LED could be used to explore variation pertaining to this: e.g., species X has a mean LED of 1.4, whereas species Y has a mean LED of 1.6.
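On the direction question above, here is a minimal sketch of how even a very simple model becomes asymmetric once losing sequence is cheaper than gaining it. It is a weighted edit distance in which deletions, insertions, and substitutions carry separate costs; the specific costs (deletion 1, insertion 3, substitution 1) are arbitrary assumptions chosen only to show the effect, and `directional_distance` is a made-up name.

```python
def directional_distance(source, target, delete_cost=1, insert_cost=3,
                         substitute_cost=1):
    """Cheapest weighted edit path from source to target.

    With delete_cost != insert_cost the result depends on direction.
    """
    # Building target from an empty source takes only insertions.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * delete_cost]  # emptying source takes only deletions
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + delete_cost,       # drop base s
                current[j - 1] + insert_cost,    # add base t
                previous[j - 1] + (0 if s == t else substitute_cost),
            ))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    big, small = "ACGTACGT", "ACGT"
    print(directional_distance(big, small))   # losing sequence
    print(directional_distance(small, big))   # gaining sequence
```

With equal costs the two directions always agree, so any asymmetry in LED would have to come from the mutation model, not from the genomes alone.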
Anyway, this is a different way of looking at the differences between genomes. Not more or less correct than others-- but, at least in some cases, I think more helpful.
Edit, 9/27/09: Just read an important paper on the difficulty of reversing some types of molecular evolution, since neutral genetic drift accumulated after a shift in function may not be neutral in the original functional context. In the context of Logarithmic Evolution Distance, I think it underscores the point that LED can't be taken literally, since it doesn't take function or fitness into account. But then again, neither do the other tools it's competing against, and this doesn't impact its core function as an estimation-based tool with which to make relative comparisons.