6.29.2009

Our broken grant system

The New York Times has a piece up highlighting some of the fundamental flaws in the cancer research grant system. In short, they find that it tends to fund unambitious, incremental research proposals that are unlikely to fail, yet also unlikely to result in significant progress toward curing cancer. I thought this passage was particularly poignant:
“Scientists don’t like talking about it publicly,” because they worry that their remarks will be viewed as lashing out at the health institutes, which supports them, said Dr. Richard D. Klausner, a former director of the National Cancer Institute.

But, Dr. Klausner added: “There is no conversation that I have ever had about the grant system that doesn’t have an incredible sense of consensus that it is not working. That is a terrible wasted opportunity for the scientists, patients, the nation and the world.”
John Hawks has some sharp commentary on the situation, bringing in evolutionary theory about search spaces and fitness peaks to support the point that yes, given our current state of knowledge, we're funding the wrong sorts of grant proposals when we go for timid, incremental projects.

A pie-in-the-sky idea

As sort of an ideal-world scenario, instead of routing all proposals through the most established and senior of scientists, I'd like to see a modest amount of future NIH funding be set aside and overseen by graduate students in seminars across the country. Essentially, students could sign up for a seminar where their coursework would be to analyze a set of grant applications pertaining to their field, learn about the science in each grant and about the grant system, and finally select the top 1-2 grants to be funded. The professor teaching the class would be in charge of the syllabus, but with the following three guidelines:

1. Attempt to choose the best grant proposals;
2. The students, not the professor, have the final say in which proposals get funded;
3. Use the class as a teaching tool for both the science involved in the grants, and the grant system itself.

The set of grant applications to evaluate could be drawn from the pool of applications the NIH has rejected but still deems interesting and scientifically sound.

There would be a million details to fill in, but I guarantee this system would be consistently fresh and open to new ideas (I don't know if anyone has noticed this, but grad students are really smart and creative!), yet would still be grounded in science and experience. It'd also be a fantastic teaching tool.

6.28.2009

Now leaving Era of the Mystery. All aboard for Era of the Tool.

Historically, there have been three ways to make progress within a scientific paradigm:
- Solve an outstanding mystery;
- Gather and publish new data;
- Construct a new tool.

Gathering and publishing new data has constituted, and will constitute for the foreseeable future, the majority of scientific publication. Science has a healthy and voracious appetite for data, and this isn't likely to change anytime soon. The interesting thing about progress in science today, though, and the topic of this post, is the balance between the first and third approaches: mystery vs. tool.

Era of the Scientific Mystery

By and large, the emphasis in science used to be on solving mysteries. Discovering the mechanism of genetic inheritance; decoding the structure of DNA; deciphering how viruses take over cells. Scientists were billed as detectives, and the height of scientific achievement was to find an "aha" insight that solved an outstanding mystery. But-- though some scientists may vehemently deny this-- we've been so successful at solving the fundamental mysteries out there that we're running out of this kind of mystery in many branches of science. In turn, science is gradually becoming less about solving foundational unknowns (like decoding the structure of DNA) and more about creating tools by which to more richly and quantitatively understand what is no longer mysterious but too complex to trust to our intuitions and simple equations.

Era of the Scientific Tool

Scientific progress has always had a strong tool component. Grind a better lens, see the stars better, and create a more accurate description of the galaxy; build a free-swinging pendulum, observe the shifting plane of motion, and conclude the Earth is not fixed but rotates. These sorts of things were not uncommon in the history of science. But there seems to be a sea change happening: modern scientific publication is beginning to center on devising and applying tools that in turn generate interesting results.

Two examples of this from my own experience are recent publications by a couple of friends who are scientists: John Hawks (UW Madison) and Bryan W. Jones (U Utah).

Hawks made waves with a recent publication, Recent Acceleration of Human Adaptive Evolution, which applied an established genetics tool (linkage disequilibrium analysis) to the human genome and came to the conclusion that not only did human evolution not stop with the advent of civilization, but that it actually sped up over a hundredfold.

Jones just published A Computational Framework for Ultrastructural Mapping of Neural Circuitry, a work that defined a new integrative workflow enabling, for the first time, the mapping of a large-scale neural connectome, and offered the first product of that workflow: a connectome map of a rabbit's retina.

Tools are absolutely central to both publications: the first is based on the novel application of an existing tool to a context it hadn't been applied in, and the second involved inventing a new tool to enable the generation of new datasets.

These examples are anecdotal, to be sure-- but it seems that although the meme of the scientific mystery will be with us for a long time, and though there are sporadic fundamental unknowns yet to discover, increasingly the really sexy, generative results in science involve creating or repurposing a tool to shed new light on some data, or generate data at an exponentially faster rate.

In short? Science is no longer about mysteries but about problems. And given the right tool, problems solve themselves.

Notes:

- Kevin Kelly's Speculations on the Future of Science is an interesting survey of possible tools science may grow into.

Edit, 5-13-11: Bryan W. Jones has a nice description of the problem his lab faced in building a connectome, and the tools they built to solve it. The research method and goal were less about solving a well-defined mystery and more about building tools, datasets, and models that allow more useful ways of understanding what happens in retinal tissue under various scenarios.

6.11.2009

Brainstorm: Logarithmic Evolution Distance

(This piece is sort of a continuation of a previous brainstorm on evolution and phylogeny-- it was lots of fun to think through and write, and I hope it's fun to read even if a bit jargon-heavy.)

Exponential advances in gene sequencing technology have produced an embarrassment of riches: we're now able to almost trivially sequence an organism's DNA, yet sifting meaning from these genomes is still an incredibly intensive and haphazard task. For instance, consider the following simple questions:

How close are the genetics of dogs and humans? How does this compare to cats and humans? What about mice and cats? How different, genetically, are mice and corn?

We have all of these genomes sequenced, but we don't have particularly good and intuitive ways to answer these sorts of questions.

Whenever we can ask simple questions about empirical phenomena that don't seem to have elegant answers, it's often a sign there's a niche for a new conceptual tool. This is a stab at a tool that I believe could deal with these questions more cogently and intelligently than current approaches.


Logarithmic Evolution Distance: an intuitive approach to quantifying difference between genomes.

How do we currently compare two genomes and put a figure on how close they are? The currently fashionable metrics seem to be:

- Raw % similarity in genetic code-- e.g., "Humans and dogs share 85% of their genetic sequence." Or 70%. Or 98%, depending on whom you ask. However, what does this really say? There are many ways to calculate this figure, depending on how one evaluates CNVs (copy number variations) and functional parity in sequences. And this tends to grossly understate the importance of regulatory elements.

- Gene homologue analysis-- e.g., "The dog genome has gene homologues for ~99.8% of the human genome." However, this metric also involves subjectivity-- depending on how you count them, apes might have the same number of human gene homologues as dogs. This approach also involves deep ambiguities in assuming homologue function, in assessing what constitutes a similar-enough homologue, and in dealing with CNVs-- and this 'roll up your sleeves and compare the functional nuts and bolts of two genomes' approach is also extremely labor-intensive.

- Time since evolutionary divergence-- e.g., "The last common ancestor of dogs and cats lived 60 MYA, vs. that of dogs and humans, which lived 95 MYA." However, though time seems a relatively good proxy for estimating how far apart two genomes are, there are many examples of false positives and false negatives for this heuristic. Selection strength and rate of genetic change can vary widely in different circumstances, so there is reason to believe this heuristic is often deeply and systemically biased as a proxy for genome difference, and it breaks down very quickly for organisms with significant horizontal or inter-species gene transfer.


None of these approaches really gives wrong answers to the questions I posed, but they don't reliably give helpful, intuitive answers either. They fail the "okay, but what does it mean?" test.

I think it's important to note, first, that all of life is connected. And just as evolution created these gaps, it could also bridge them. I say we use these facts to build an intuitive, computational metric for quantifying how far apart two genomes are.

So here's my suggestion for a new approach:
'Evolution Distance'-- a rough computational/simulation estimate (useful in a relative sense) of the average number of generations of artificial selection it would take to evolve organism X into organism Y under standardized conditions, given a set of allowed types of mutations.

To back up a bit, here's a rough way to explain the idea: let's imagine we have some cats. We can breed our cats, and every generation we can take the most genetically doglike cats and breed them together. Eventually (although it'll take a while!) we'd get a dog. What this tool would do, essentially, is computationally estimate how many generations' worth of mutations it would take to go from genome A (a cat) to genome B (a dog). The number of generations is the 'evolution distance' between the genomes. You can apply it to any two genomes under the sun and get an intuitive, fairly consistent answer.
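To make that concrete, here's a minimal toy sketch (in Python) of the 'breed the most doglike cats' version of the estimate. Everything in it is an illustrative assumption on my part-- the short sequences stand in for genomes, the population size and mutation rate are arbitrary, and real genomes would obviously need alignment, indels, CNVs, and so on:

```python
import random

BASES = "ACGT"

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mutate(seq, n_mutations=1):
    """Return a copy of seq with n_mutations random point substitutions."""
    seq = list(seq)
    for _ in range(n_mutations):
        i = random.randrange(len(seq))
        seq[i] = random.choice(BASES)
    return "".join(seq)

def generations_to_evolve(source, target, pop_size=50, muts_per_gen=1, max_gens=100000):
    """Greedy 'artificial selection' toward target; returns generations taken."""
    if source == target:
        return 0
    current = source
    for gen in range(1, max_gens + 1):
        offspring = [mutate(current, muts_per_gen) for _ in range(pop_size)]
        best = min(offspring, key=lambda s: mismatches(s, target))
        # Keep the best offspring only if it's closer to the target than the parent.
        if mismatches(best, target) < mismatches(current, target):
            current = best
        if current == target:
            return gen
    return max_gens  # didn't converge within the cap

if __name__ == "__main__":
    random.seed(0)
    cat_like = "ACGTACGTACGTACGTACGT"  # toy stand-in for genome A
    dog_like = "ACGTTCGTACGAACGTACCT"  # toy stand-in for genome B
    print(generations_to_evolve(cat_like, dog_like))
```

The shape of the computation-- mutate, select, count generations-- is the whole idea; everything else is refinement.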

Details, details...
Now, what makes a dog a dog? We can identify several different thresholds for success-- an exact DNA match would be the gold standard, followed by a match of the DNA that codes for proteins, followed by estimated reproductive compatibility, followed by specific subsystem similarities, and so forth. The answer would be given as a range of X to Y generations (a 95% confidence interval), expressed in log notation like the Richter scale, since it could vary so widely between organisms... let's call it LED, for Logarithmic Evolution Distance. Arbitrarily, an LED of 1 might be 1k generations, an LED of 2 would be 10k generations, etc.

E.g., the LED of a baboon and a chimpanzee might be 1.8-1.9 (meaning it would take ~8000 generations of selective breeding to turn a baboon into a chimpanzee);
A giraffe and a hippo might be 3.4-3.6;
A starfish and a particular strain of E. coli might be 10.2-10.4. (That's a lot!)
I'm just throwing out some numbers, and they may not be in the right ballpark... but the point is that this is an intuitive, quantitative metric that can scale from comparing the genetics of parent and offspring all the way to comparing opposite branches of the tree of life.
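For what it's worth, the arithmetic behind that scale is trivial: given the (arbitrary) anchoring above, where an LED of 1 is 1k generations and an LED of 2 is 10k, LED is just log10(generations) minus 2. A quick sketch:

```python
import math

def generations_to_led(generations):
    """Convert a generation count to the log scale above (LED 1 = 1,000 generations)."""
    return math.log10(generations) - 2

def led_to_generations(led):
    """Invert the scale: how many generations does a given LED imply?"""
    return 10 ** (led + 2)

print(round(generations_to_led(8000), 2))  # -> 1.9, matching the baboon/chimp example
print(round(led_to_generations(3.5)))      # -> ~316228 generations for an LED of 3.5
```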

This is intrinsically a quick and dirty estimate, very difficult to get (& prove) 'correct', but given that, it is:
1. potentially very useful as a relative, quantitative metric,
2. intuitive in a way current measures of genetic similarity aren't,
3. fully computational with a relatively straightforward interpretation-- you'd set up a model, put in two genomes, and get an answer.

This estimate could, and would need to, operate with a significantly simplified model of selection. Later, the approach could slowly add in gene pools, simulation & function-aware aspects, mutation mechanics, the geometry of mutation hotspots, mutations incompatible with life, gene patterns that protect against mutations, horizontal gene transfer, etc. But it would start as, and be most helpful as, a very rough metric.


Selection-centric or mutation-centric?
Thus far I've used 'selection' and 'mutation' somewhat interchangeably. But I think the ideal way to set up the model is to stay away from random mutations and pruning. Instead, I would suggest setting up an algorithm to map out a shortest mutational path from genome A to genome B, given a certain amount of allowed mutation per generation. This would be less evocative of the randomness of evolution, but it would likely yield a tighter, more tractable, and more realistic estimate of the number of generations' worth of distance.
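Under some strong simplifying assumptions of my own (only point substitutions and single-base indels, each counting as one mutation, and a fixed budget of mutations fixed per generation), the shortest-path version boils down to an edit distance divided by that budget-- a toy sketch:

```python
import math

def edit_distance(a, b):
    """Classic Levenshtein distance: minimal substitutions + single-base indels."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def led_estimate(genome_a, genome_b, mutations_per_generation=1.0):
    """Generations along the shortest mutational path, on the log scale above."""
    generations = edit_distance(genome_a, genome_b) / mutations_per_generation
    return math.log10(max(generations, 1)) - 2

# Toy sequences only -- real genomes would need alignment- and rearrangement-aware machinery.
print(round(led_estimate("ACGT" * 50, "AGGT" * 50, mutations_per_generation=0.01), 2))  # -> ~1.7
```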

Practical applications (why would this be useful?):
In general, I see this as an intuitive metric to compare any two genomes, one that could see wide use-- after the general model is built, the beauty of this approach is that it's automated and quantitative. Feed in any two arbitrary genomes, set some mutational parameters, and you get an answer. Biology is coming into an embarrassment of riches in terms of sequenced genomes. This is a tool that can hopefully help people, both scientists and laymen, make better intuitive sense of all this data.

E.g., If I wanted to compare the genomes of two prairie grasses with each other, and with corn, this tool would give me a reasonably intuitive answer to how closely each was related to the others.

A specific scientific use for this would be to compare the ratio of calculated LED to the time since evolutionary divergence, while controlling for time between generations. This would presumably be a reasonable (and easy-to-do) measure to detect and compare strength of selection, perhaps helpful as a supplement to metrics such as linkage disequilibrium analysis. E.g., if the genome of two organisms' last common ancestor can be inferred, the LED of the LCA's genome -> genome A vs. the LED of the LCA's genome -> genome B would presumably be an excellent quantitative indicator of relative strength of selection.
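A hedged sketch of that comparison, with every number below a hypothetical placeholder rather than a real estimate: convert each lineage's LED back into implied generations of change, then divide by the generations actually elapsed since the divergence.

```python
import math

def implied_generations(led):
    """Invert the log scale used above (LED 1 = 1k generations, LED 2 = 10k)."""
    return 10 ** (led + 2)

def relative_selection_index(led_lca_to_descendant, divergence_years, generation_time_years):
    """Implied generations of change per generation actually elapsed since the LCA."""
    elapsed_generations = divergence_years / generation_time_years
    return implied_generations(led_lca_to_descendant) / elapsed_generations

# Hypothetical inputs: lineage A shows more change per elapsed generation than lineage B.
print(relative_selection_index(2.1, divergence_years=6e6, generation_time_years=20))
print(relative_selection_index(1.7, divergence_years=6e6, generation_time_years=20))
```

Lineages with a higher index would, under this very rough framing, have packed more LED-implied change into the same elapsed time.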

This metric is by no means limited to comparisons between species; comparing Great Danes to pit bulls with this tool, or even two pit bulls to each other, would generate interesting results.

This tool would also be helpful in an educational context, to drive home the point that everything living really is connected to everything else, and evolution is the web that connects them. It's also educational in the sense that it'd actually simulate a simplified form of genetic evolution, and we may learn a great deal from rolling up our sleeves and seeing how well our answers compare to nature's.

The nitty gritty...

Open questions:
- This comparison as explained does not deal with the complexity of sexual recombination or of horizontal gene transfer (though to be fair, none of its competitors do either). Or, to dig a little deeper, evolution happens on gene pools, whereas this tool only treats evolution as mutation on single genomes. Does it still produce a usably unbiased result in most comparisons? (My intuition: if we're going for an absolute estimate of an 'evolution distance', no; for a relative comparison, yes.)

- Would direction matter? It depends on how simple the model is, but realistically, it's very likely. E.g., the LED of dog -> cat might be significantly different from that of cat -> dog. Presumably it'd matter the most in deep, structural changes such as prokaryote <-> eukaryote evolution. Loss of function or structure is always easier to evolve than gain.

- How realistically could one model the conditions that these evolutionary simulations would operate under? E.g., would the number of offspring need to be arbitrary for each simulation? Would the rate of mutation vary between dogs and cats? How could the model be responsive to operation under different ecosystems? How would you deal with many changes in these quantities over time, if you're charting a large LED (e.g., bacteria -> cat)? I guess the answer is: you could make things as complicated as you wanted. But you wouldn't have to.

- In theory, the impact of genetic differences between arbitrary members of the same species would be minimized by the logarithmic nature of the metric. Would this usually be the case? Presumably LED could be used to explore variation pertaining to this: e.g., species X has a mean within-species LED of 1.4, whereas species Y has a mean within-species LED of 1.6.

- Depending on the progress of tissue and functional domain gene expression analysis, and whatever inherent and epistemological messiness lies therein, this could be applied to subsets of organisms: finding a provisional sort of evolution distance between organism X's immune system and organism Y's immune system, or their limbs, or their hearts, etc. Much less conceptually elegant, but perhaps still useful.


Anyway, this is a different way of looking at the differences between genomes. Not more or less correct than others-- but, at least in some cases, I think more helpful.

Edit, 9/27/09: Just read an important paper on the difficulty of reversing some types of molecular evolution, since neutral genetic drift accumulated after a shift in function may not be neutral in the original functional context. In the context of Logarithmic Evolution Distance, I think it underscores the point that LED can't be taken literally, since it doesn't take function or fitness into account. But then again, neither do the other tools it's competing against, and this doesn't impact its core function as an estimation-based tool with which to make relative comparisons.