Protein function prediction
Protein function prediction methods are used to assign biological or biochemical roles to proteins that are poorly studied or that have only been predicted from genomic sequence data. These predictions are often driven by data-intensive computational procedures. Sources of information include nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interactions. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways. Generally, function can be thought of as "anything that happens to or through a protein". The Gene Ontology Consortium provides a useful classification of functions, based on a dictionary of well-defined terms divided into three main categories: molecular function, biological process, and cellular component. Researchers can query this database with a protein name or accession number to retrieve associated Gene Ontology (GO) terms or annotations based on computational or experimental evidence.
While techniques such as microarray analysis, RNA interference, and the yeast two-hybrid system can be used to experimentally demonstrate the function of a protein, advances in sequencing technologies mean that new sequences now become available far faster than proteins can be experimentally characterized. Thus, new sequences are mostly annotated by computational prediction, which can often be done quickly and for many genes or proteins at once. The first such methods inferred function from homologous proteins with known functions (homology-based function prediction). The development of context-based and structure-based methods has expanded what information can be predicted, and a combination of methods can now be used to build a picture of complete cellular pathways from sequence data. The importance and prevalence of computational prediction of gene function are underlined by an analysis of the 'evidence codes' used by the GO database: as of 2010, 98% of annotations were listed under the code IEA (inferred from electronic annotation), while only 0.6% were based on experimental evidence.
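The kind of evidence-code tally described above can be sketched in a few lines. This is only an illustration: the record layout (protein, GO term, evidence code), the protein names, and the function name `evidence_summary` are assumptions made for the example, and the set of experimental codes is limited to the classic GO experimental codes.

```python
from collections import Counter

# Classic GO experimental evidence codes (high-throughput codes are left out here).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def evidence_summary(annotations):
    """Return the fraction of electronic (IEA) and experimental annotations.

    `annotations` is an iterable of (protein, go_term, evidence_code) tuples;
    this layout is a simplification chosen for the example.
    """
    codes = Counter(code for _, _, code in annotations)
    total = sum(codes.values())
    if total == 0:
        return {"electronic": 0.0, "experimental": 0.0}
    experimental = sum(n for code, n in codes.items() if code in EXPERIMENTAL_CODES)
    return {"electronic": codes["IEA"] / total, "experimental": experimental / total}


# Hypothetical proteins annotated with real GO term identifiers.
example = [
    ("protein1", "GO:0003824", "IEA"),
    ("protein2", "GO:0005515", "IEA"),
    ("protein3", "GO:0016301", "IEA"),
    ("protein4", "GO:0005634", "IDA"),
]
print(evidence_summary(example))
```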
Homology-based Function Prediction
Homologous genes or proteins are derived from a common ancestral sequence. Homologous proteins within or among species are similar in sequence and are likely, but not guaranteed, to have a similar function. Comparison of an unannotated sequence to known homologous sequences is therefore a good starting point for predicting function. Typically, a new sequence is used to query protein databases with BLAST to find proteins of known function with high sequence similarity, and their function is transferred to the query sequence. However, not all new protein sequences will have known homologs in other species, and the assumption that closely related proteins share the same function is not always correct. For example, genes that are paralogs (diverged from a common ancestor after a gene duplication event) may be highly similar in sequence but have evolved different functions. Therefore, other approaches are often necessary.
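The annotation-transfer idea can be illustrated with a toy sketch. It assumes the alignments have already been computed (for instance by a BLAST search) and are supplied as gapped strings; the 40% identity cutoff, the function names, and the example sequences are arbitrary assumptions for illustration, not recommended values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def transfer_annotation(hits, min_identity=40.0):
    """Return the annotation of the most similar hit at or above the cutoff.

    `hits` is a list of (annotation, aligned_query, aligned_subject) tuples.
    Returns None when no hit is similar enough to justify a transfer.
    """
    best, best_identity = None, min_identity
    for annotation, query, subject in hits:
        identity = percent_identity(query, subject)
        if identity >= best_identity:
            best, best_identity = annotation, identity
    return best


# Made-up alignments of one query against two annotated proteins.
hits = [
    ("kinase", "MKTAYIAK", "MKTAYLAK"),
    ("transporter", "MKTAYIAK", "MRS-YLGK"),
]
print(transfer_annotation(hits))  # -> kinase
```

As the text notes, a high-identity hit may still be a paralog with a different function, so a transfer of this kind is a hypothesis rather than a confirmed annotation.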
Function Prediction Using Sequence Motifs
The development of protein domain databases such as Pfam (Protein Families Database) allows known domains to be found within a query sequence, providing evidence for likely functions. The dcGO database contains annotations for both individual domains and supra-domains (i.e., combinations of two or more successive domains), so that its dcGO Predictor can make function predictions in a more realistic manner. Within protein domains, shorter signatures known as 'motifs' are associated with particular functions, and motif databases such as PROSITE ('database of protein domains, families and functional sites') can be searched using a query sequence. Motifs can, for example, be used to predict the subcellular localization of a protein (where in the cell the protein is sent after synthesis). Short signal peptides direct certain proteins to a particular location such as the mitochondria, and various tools exist for predicting these signals in a protein sequence. One example is SignalP, which has been updated several times as methods have improved. Thus, aspects of a protein's function can be predicted without comparison to other full-length homologous protein sequences.
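A motif search of this kind can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below handles only the core pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` as terminal anchors); the function names and the test sequence are assumptions for this example, and the sketch is not a replacement for the PROSITE scanning tools.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# N-{P}-[ST]-{P} is the widely cited N-glycosylation site pattern.
print(prosite_to_regex("N-{P}-[ST]-{P}."))           # -> N[^P][ST][^P]
print(find_motif("N-{P}-[ST]-{P}.", "MKNGTAANPSL"))  # -> [(3, 'NGTA')]
```

Short patterns like this one match by chance quite often, so a hit is weak evidence on its own and is usually combined with other information.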
Structure-based Function Prediction
Because 3D protein structure is generally better conserved than protein sequence, structural similarity is a good indicator of similar function in two or more proteins. Many programs have been developed to screen an unknown protein structure against the Protein Data Bank (PDB) and report similar structures, for example FATCAT (Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists), CE (combinatorial extension), and DeepAlign (protein structure alignment beyond spatial proximity). Because many protein sequences have no solved structure, some function prediction servers such as RaptorX first predict a 3D model of the sequence and then apply structure-based methods to that predicted model. In many cases, instead of the whole protein structure, the 3D structure of a particular motif representing an active site or binding site can be targeted. Databases such as the Catalytic Site Atlas have been developed, which can be searched using novel protein sequences to predict specific functional sites.
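To give a feel for what "structural similarity" means numerically, the sketch below computes a distance RMSD (dRMSD) between two sets of coordinates. This is a deliberately simple, swapped-in measure and not the algorithm used by FATCAT, CE, or DeepAlign: it assumes the residue-to-residue correspondence is already known, whereas the real tools must find that alignment themselves. The coordinates are invented for the example.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Root-mean-square difference between corresponding intra-structure distances.

    Each argument is a list of (x, y, z) points, one per residue, in matching
    order. The measure is unaffected by rotating or translating either
    structure, but it cannot tell a structure from its mirror image.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    if len(coords_a) < 2:
        raise ValueError("at least two residues are required")
    squared = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Invented three-residue fragments: a shifted copy and a stretched variant.
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in fragment]
stretched = [(0.0, 0.0, 0.0), (4.8, 0.0, 0.0), (4.8, 3.8, 0.0)]
print(distance_rmsd(fragment, shifted))    # close to zero
print(distance_rmsd(fragment, stretched))  # clearly larger
```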
Genomic Context-based Function Prediction
Many of the newer methods for protein function prediction are not based on comparison of sequence or structure as above, but on some type of correlation between novel genes or proteins and those that already have annotations. Also known as phylogenomic profiling, these genomic context-based methods rest on the observation that two or more proteins with the same pattern of presence or absence across many different genomes most likely have a functional link. Whereas homology-based methods can often be used to identify the molecular function of a protein, context-based approaches can be used to predict cellular function, or the biological process in which a protein acts. For example, proteins involved in the same signal transduction pathway are likely to share a genomic context across all species.
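The presence/absence comparison can be sketched as follows. Each protein is represented by a tuple of 1s and 0s, one entry per genome; the similarity measure (fraction of genomes in which two profiles agree), the 0.8 cutoff, and all names are assumptions chosen for illustration.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two presence/absence profiles agree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    if not profile_a:
        return 0.0
    agree = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return agree / len(profile_a)


def linked_partners(query, profiles, min_similarity=0.8):
    """List proteins whose profile matches the query's, most similar first."""
    scored = [
        (name, profile_similarity(profiles[query], profile))
        for name, profile in profiles.items()
        if name != query
    ]
    kept = [(name, score) for name, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical profiles across six genomes (1 = present, 0 = absent).
profiles = {
    "protA": (1, 0, 1, 1, 0, 1),
    "protB": (1, 0, 1, 1, 0, 1),
    "protC": (0, 1, 0, 0, 1, 0),
    "protD": (1, 0, 1, 0, 0, 1),
}
print(linked_partners("protA", profiles))
```

One limitation of this simple measure is that proteins present in every genome trivially agree with each other, so practical analyses typically favor profiles that actually vary across genomes.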
Gene fusion is observed when two or more genes that encode separate proteins in one organism have, through evolution, combined into a single gene in another organism (or vice versa for gene fission). This concept has been used, for example, to search all E. coli protein sequences for homology in other genomes, finding over 6000 pairs of sequences with shared homology to single proteins in another genome, which indicates a potential interaction between the members of each pair. Because the two sequences in each pair are non-homologous, these interactions could not be predicted using homology-based methods.
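A minimal sketch of this fusion-based search follows. It assumes homology hits have already been computed and are given as (query protein, fused protein, start, end) tuples, where start and end mark the region of the fused protein that the query matches. Two different queries that match non-overlapping regions of the same fused protein are reported as a candidate pair. All names and coordinates are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fusion_linked_pairs(hits):
    """Find query pairs that match separate regions of one fused protein.

    `hits` is a list of (query, fused_protein, start, end) tuples.
    Returns a sorted list of (query_1, query_2, fused_protein) tuples.
    """
    by_fused = defaultdict(list)
    for query, fused, start, end in hits:
        by_fused[fused].append((query, start, end))

    pairs = set()
    for fused, regions in by_fused.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                first, second = sorted((q1, q2))
                pairs.add((first, second, fused))
    return sorted(pairs)


hits = [
    ("geneA", "fusedX", 1, 200),
    ("geneB", "fusedX", 220, 450),
    ("geneC", "fusedX", 150, 300),  # overlaps both geneA and geneB regions
    ("geneD", "fusedY", 1, 100),
]
print(fusion_linked_pairs(hits))  # -> [('geneA', 'geneB', 'fusedX')]
```

The non-overlap check matters: queries that match the same region of the fused protein are likely homologous to each other, which is exactly the case this method is not meant to capture.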
In prokaryotes, clusters of genes that are physically close together in the genome are often conserved together through evolution, and tend to encode proteins that interact or are part of the same operon. Thus, chromosomal proximity, also called the gene neighbor method, can be used to predict functional similarity between proteins, at least in prokaryotes. Chromosomal proximity has also been seen to apply to some pathways in selected eukaryotic genomes, including Homo sapiens, and with further development gene neighbor methods may be valuable for studying protein interactions in eukaryotes.
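The gene neighbor idea can be sketched by counting how many genomes keep two genes next to each other. In this toy version each genome is an ordered list of ortholog-group labels; the labels, the `max_gap` and `min_genomes` parameters, and the function name are assumptions for the example, and the sketch ignores strand and chromosome circularity.

```python
from collections import Counter


def conserved_neighbors(genomes, max_gap=1, min_genomes=2):
    """Count gene pairs that lie within `max_gap` positions in several genomes.

    `genomes` maps a genome name to its gene order (a list of ortholog-group
    labels). Returns {(gene_1, gene_2): number_of_genomes} for pairs found
    close together in at least `min_genomes` genomes.
    """
    counts = Counter()
    for order in genomes.values():
        pairs_here = set()  # count each pair at most once per genome
        for i, gene in enumerate(order):
            for other in order[i + 1 : i + 1 + max_gap]:
                if other != gene:
                    pairs_here.add(tuple(sorted((gene, other))))
        counts.update(pairs_here)
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders; og_a and og_b stay adjacent in all three genomes.
genomes = {
    "genome1": ["og_a", "og_b", "og_c", "og_d"],
    "genome2": ["og_d", "og_e", "og_b", "og_a"],
    "genome3": ["og_c", "og_a", "og_b", "og_e"],
}
print(conserved_neighbors(genomes, min_genomes=3))  # -> {('og_a', 'og_b'): 3}
```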
Genes involved in similar functions are also often co-transcribed, so an unannotated protein can often be predicted to have a function related to that of the proteins with which it is co-expressed. The guilt-by-association algorithms developed from this approach can be used to analyze large amounts of sequence data and identify genes with expression patterns similar to those of known genes. A guilt-by-association study will often compare a group of candidate genes (of unknown function) to a target group (for example, a group of genes known to be associated with a particular disease), and rank the candidate genes by their likelihood of belonging to the target group based on the data. Recent studies, however, suggest that this type of analysis has some problems. For example, because many proteins are multifunctional, the genes encoding them may belong to several target groups. It is argued that such genes are more likely to be identified in guilt-by-association studies, and thus that the predictions are not specific.
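The candidate-ranking step can be illustrated with a small sketch that scores each candidate by its mean Pearson correlation with the expression profiles of the target group. The choice of Pearson correlation, the scoring rule, and the expression values are all assumptions for this example; published guilt-by-association methods differ in how they score and normalize.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidates(candidates, targets):
    """Rank candidate genes by mean correlation with the target group."""
    scores = {
        name: sum(pearson(profile, t) for t in targets.values()) / len(targets)
        for name, profile in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented expression values across five conditions.
targets = {"known1": [1, 2, 3, 4, 5], "known2": [2, 4, 6, 8, 11]}
candidates = {
    "cand_down": [5, 4, 3, 2, 1],
    "cand_up": [1.1, 2.0, 2.9, 4.2, 5.0],
    "cand_flat": [3, 3, 3, 3, 3],
}
for name, score in rank_candidates(candidates, targets):
    print(name, round(score, 3))
```

The specificity concern raised in the text applies directly here: a multifunctional gene that correlates moderately with many groups would rank well against many different target sets.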
Function Prediction Using Functional Association Networks
Guilt-by-association algorithms may be used to produce a functional association network for a given target group of genes or proteins. These networks represent the evidence for shared or similar function within a group of genes: nodes represent genes or proteins, and edges between them represent evidence of shared function.
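Such a network, and one simple way of using it, can be sketched as a weighted adjacency map with neighbor voting: each annotated neighbor votes for its own functions, weighted by the strength of the edge. Neighbor voting is a basic technique swapped in here for illustration; the gene names, weights, and annotations are hypothetical.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected weighted network from (gene_a, gene_b, weight) tuples."""
    network = defaultdict(dict)
    for a, b, weight in edges:
        network[a][b] = weight
        network[b][a] = weight
    return network


def predict_function(gene, network, annotations):
    """Score candidate functions for `gene` by summing weights of annotated neighbors."""
    votes = defaultdict(float)
    for neighbor, weight in network.get(gene, {}).items():
        for term in annotations.get(neighbor, ()):
            votes[term] += weight
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


edges = [
    ("geneX", "geneA", 0.9),
    ("geneX", "geneB", 0.7),
    ("geneX", "geneC", 0.2),
    ("geneA", "geneB", 0.8),
]
annotations = {
    "geneA": {"DNA repair"},
    "geneB": {"DNA repair"},
    "geneC": {"transport"},
}
network = build_network(edges)
print(predict_function("geneX", network, annotations))
```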
Integration of Data Sources
Several networks based on different data sources can be combined into a composite network, which can then be used by a prediction algorithm to annotate candidate genes or proteins. For example, the developers of the bioPIXIE system used a wide variety of Saccharomyces cerevisiae (yeast) genomic data to produce a composite functional network for that species. This resource allows the visualization of known networks representing biological processes, as well as the prediction of novel components of those networks. Many algorithms have been developed to predict function based on the integration of several data sources (e.g., genomic, proteomic, and protein interaction data), and testing on previously annotated genes indicates a high level of accuracy. Disadvantages of some function prediction algorithms have included a lack of accessibility and the time required for analysis. Faster, more accurate algorithms such as GeneMANIA (Multiple Association Network Integration Algorithm) have, however, been developed in recent years and are publicly available on the web, indicating the future direction of function prediction.
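The combination step can be sketched as a weighted average of edge scores across networks. This is only an illustration under stated assumptions: the per-network weights are fixed by hand here, whereas integration systems may derive them from the data, and the network names, gene names, and scores are hypothetical. It is not a description of how bioPIXIE or GeneMANIA work internally.

```python
def combine_networks(networks, weights):
    """Merge several undirected networks into one composite network.

    `networks` maps a data-source name to {(gene_a, gene_b): score}.
    `weights` maps each data-source name to its relative weight.
    Returns {frozenset({gene_a, gene_b}): weighted average score}, where an
    edge missing from a network contributes a score of zero for that source.
    """
    total = sum(weights[name] for name in networks)
    if total == 0:
        raise ValueError("at least one network must have a non-zero weight")
    combined = {}
    for name, edges in networks.items():
        for pair, score in edges.items():
            key = frozenset(pair)  # (a, b) and (b, a) are the same edge
            combined[key] = combined.get(key, 0.0) + weights[name] * score / total
    return combined


networks = {
    "coexpression": {("g1", "g2"): 0.8, ("g2", "g3"): 0.4},
    "interaction": {("g2", "g1"): 1.0},
}
weights = {"coexpression": 1.0, "interaction": 3.0}
composite = combine_networks(networks, weights)
for pair, score in composite.items():
    print(sorted(pair), round(score, 3))
```

The composite scores can then be passed to any network-based predictor in place of a single-source network.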