Automating the Art of Protein-Function Prediction
Biologists are knee-deep in protein sequences. Thanks to automation, genes that code for proteins are being found with ever-increasing speed. But deciphering what these proteins do—the logical next step in the process—remains a daunting hurdle.
When genetics was new, time-consuming lab experiments were the only way to determine what proteins did. Today, scientists routinely search giant databases for similar protein sequences that might help predict a new protein’s function.
By rights, such automated matching should make protein function prediction a snap. But the process isn’t nearly so simple. Like the organisms they come from, proteins evolve over time. As a protein’s gene mutates, that protein’s function will eventually shift, too. Compounding these issues are database errors, where protein functions were entered incorrectly, or were based on faulty database matches. Mistakes are then copied over as new proteins are added, leaving databases riddled with inaccuracies.
The problem came to a head in 1995, when the genomes of bacteria were being sequenced for the first time. Using combinations of automated searches and personal expertise, several groups of researchers claimed conflicting functions for Mycoplasma genitalium’s fewer than 500 genes.
Associate Professor of Plant and Microbial Biology Steven Brenner aims to straighten out this mess. Along with graduate student Barbara Engelhardt and computer science professor Michael Jordan, Brenner is developing a new approach to protein function prediction. It sidesteps the pitfalls of previous prediction methods with logical, thoroughly annotated, and statistically based assessments.
“We’ve taken all of the steps involved in manual phylogenomics, a method of predicting the functions of proteins based on the evolutionary history of their genes, and begun to automate them,” Brenner says. “We’re developing ways to take the detailed insight of a human expert and are beginning to apply it in a systematic way on a large scale.”
Called SIFTER (Statistical Inference of Function Through Evolutionary Relationships), the program automates phylogenomics. After searching existing databases for related genes, it arranges any hits into a family tree. Branches are arranged based on similarities in sequence and function, while each leaf shows an individual protein’s function when available, plus what that prediction was based on—a lab experiment, a statement in a scientific paper, or (in the least accurate case) an inference from a database.
By examining the tree, the program then calculates how readily proteins in this family shift functions. This information allows SIFTER to assign a confidence rating to each function prediction.
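To make the idea concrete, here is a minimal Python sketch of passing function annotations through a family tree and turning them into a confidence score. It is a toy illustration, not SIFTER's actual model or code. The function labels, the reliability numbers for each evidence type, and the fixed switch_prob (how readily function changes along a branch) are all assumptions made up for this example; the article says SIFTER works out how easily a family shifts functions from the tree itself, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical function labels for a made-up protein family.
FUNCTIONS = ["kinase", "phosphatase"]

# Assumed reliability of each evidence type (illustrative numbers only).
EVIDENCE_RELIABILITY = {"experiment": 0.95, "literature": 0.85, "database": 0.60}


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    function: Optional[str] = None   # annotated function, if any
    evidence: Optional[str] = None   # what the annotation is based on


def transition(parent_f: str, child_f: str, switch_prob: float) -> float:
    """Probability that a child has child_f given its parent has parent_f."""
    if parent_f == child_f:
        return 1.0 - switch_prob
    return switch_prob / (len(FUNCTIONS) - 1)


def likelihood(node: Node, switch_prob: float) -> Dict[str, float]:
    """P(annotations at or below node | node has function f), for each f."""
    if not node.children:
        if node.function is None:
            # Unannotated leaf: carries no information.
            return {f: 1.0 for f in FUNCTIONS}
        r = EVIDENCE_RELIABILITY[node.evidence]
        wrong = (1.0 - r) / (len(FUNCTIONS) - 1)
        return {f: (r if f == node.function else wrong) for f in FUNCTIONS}
    result = {f: 1.0 for f in FUNCTIONS}
    for child in node.children:
        child_like = likelihood(child, switch_prob)
        for f in FUNCTIONS:
            result[f] *= sum(
                transition(f, g, switch_prob) * child_like[g] for g in FUNCTIONS
            )
    return result


def predict(query_root: Node, switch_prob: float = 0.1) -> Dict[str, float]:
    """Posterior over functions for the protein at the root of the tree.

    Assumes a uniform prior and a tree drawn with the unannotated query
    protein at the root (valid here because the transition model is symmetric).
    """
    like = likelihood(query_root, switch_prob)
    total = sum(like.values())
    return {f: like[f] / total for f in FUNCTIONS}


# Toy family: the query's two relatives agree, but with different evidence.
tree = Node("query", children=[
    Node("ancestor", children=[
        Node("protA", function="kinase", evidence="experiment"),
        Node("protB", function="kinase", evidence="database"),
    ]),
])

posterior = predict(tree)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

The returned posterior plays the role of the confidence rating: an annotation backed by a lab experiment pulls the prediction harder than one inferred from a database, and a larger switch_prob weakens every relative's influence.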
“SIFTER gives traceable evidence; it tells you not just what it thinks the function is, but why and how much confidence you should have in that prediction,” Brenner says.
In tests with known families of proteins, SIFTER has performed impressively well. “It turned out we did better than anything else out there, with 96 percent correct,” Brenner says. Even in more complex families, where the proteins have more functions and evolve in less predictable ways, SIFTER was correct 60 percent of the time, compared to 40 percent success for the method that’s currently the most widely used. Though SIFTER remains a work in progress, its early results forecast a prominent role in the phylogenetics of the future.