Proteins carry out almost every activity of cellular life, such as signal transduction and the catalysis of biochemical reactions, so elucidating the function of each protein is crucial for decoding the secrets of life [1]. Protein functions are usually classified using the Gene Ontology (GO), one of the best-known biomedical ontologies [2]. This ontology already includes more than 40,000 concepts across three domains: Molecular Function Ontology (MFO), Biological Process Ontology (BPO), and Cellular Component Ontology (CCO). Owing to the rapid development of sequencing technology, new biological sequences (DNA, RNA, and protein sequences) are becoming available at an unprecedented rate, which poses a great challenge for biologists trying to understand the functions of these proteins. Because biochemical experiments are costly, very few proteins have experimental GO annotations. Although UniProtKB contained more than 115 million protein sequences as of April 2018, less than 0.15% (about 0.15 million) of these proteins have experimental GO annotations [3], [4]. Automatic protein function prediction (AFP) is therefore becoming increasingly important for narrowing this huge gap [5], [6], [7].
To facilitate a fair and extensive evaluation of different large-scale AFP methods, the Critical Assessment of protein Function Annotation algorithms (CAFA) challenge was initiated in 2010 as CAFA1 (2010–2011) and continued as CAFA2 (2013–2014) and CAFA3 (2016–2017). These challenges use a time-delayed evaluation to assess the accuracy of protein function prediction. Specifically, a large set of target proteins is first released to the participants, who must submit their function predictions by a deadline (T0). A few months later, at a second time point (T1), the target proteins that have accumulated experimental annotations in the meantime are used as a benchmark for performance evaluation. For a specific GO domain (MFO, BPO, or CCO), the benchmark data of CAFA2 was further divided into two categories: no-knowledge proteins and limited-knowledge proteins [6]. Both categories consist of proteins that have no experimental annotations in the target domain at T0 but have accumulated at least one experimental annotation in that domain by T1. The two categories differ as follows: no-knowledge proteins have no experimental annotations in any of the three domains at T0, whereas limited-knowledge proteins have at least one experimental annotation in one or both of the other domains at T0. Since very few proteins (<0.15%) have experimental annotations, AFP for no-knowledge proteins is particularly important in practice. In this work, we address this challenging problem of AFP for no-knowledge proteins.
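The benchmark categories described above can be expressed as a short decision rule. The following is a minimal sketch, not CAFA's actual tooling: the data layout (protein ID mapped to a per-domain set of experimentally annotated GO terms) and the function name `benchmark_category` are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

DOMAINS = ("MFO", "BPO", "CCO")

# Assumed layout: protein ID -> GO domain -> set of experimental GO terms.
Annotations = Dict[str, Dict[str, Set[str]]]


def benchmark_category(
    protein: str,
    domain: str,
    at_t0: Annotations,
    at_t1: Annotations,
) -> Optional[str]:
    """Classify a protein for one target domain.

    Returns "no-knowledge", "limited-knowledge", or None when the protein
    does not qualify as a benchmark protein for this domain.
    """
    t0 = at_t0.get(protein, {})
    t1 = at_t1.get(protein, {})

    # A benchmark protein has no experimental annotation in the target
    # domain at T0 ...
    if t0.get(domain):
        return None
    # ... and at least one experimental annotation in that domain at T1.
    if not t1.get(domain):
        return None

    # Any experimental annotation in another domain at T0 makes the
    # protein limited-knowledge; otherwise it is no-knowledge.
    annotated_elsewhere = any(t0.get(d) for d in DOMAINS if d != domain)
    return "limited-knowledge" if annotated_elsewhere else "no-knowledge"


# Toy snapshots (hypothetical protein IDs).
t0_snapshot: Annotations = {
    "P1": {},
    "P2": {"CCO": {"GO:0005634"}},
    "P3": {"MFO": {"GO:0003677"}},
}
t1_snapshot: Annotations = {
    "P1": {"MFO": {"GO:0003677"}},
    "P2": {"MFO": {"GO:0003677"}, "CCO": {"GO:0005634"}},
    "P3": {"MFO": {"GO:0003677", "GO:0005515"}},
}

for pid in ("P1", "P2", "P3"):
    print(pid, benchmark_category(pid, "MFO", t0_snapshot, t1_snapshot))
```

In this toy data, P3 already has an MFO annotation at T0, so it is excluded from the MFO benchmark even though it gains a further annotation by T1.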
The available results on the CAFA benchmark of no-knowledge proteins show that sequence homology-based methods such as BLAST and PSI-BLAST are highly competitive, which underlines the importance of sequence information [8]. However, homology-based methods cannot make reliable predictions for proteins that have no homologous proteins in the training data. According to the global sequence identity between a target protein and its most similar protein in the training data, no-knowledge proteins can be further divided into two groups: easy and difficult [6]. Easy proteins are those with a global sequence identity of at least 60%; all other proteins are called difficult. In addition, mining motif, domain, and functional information from sequences has been found to be very useful for AFP [9]. A pressing issue at present is how to make full use of information sources other than the sequence for protein function prediction, especially for difficult proteins [6]. Human curators usually annotate protein functions with the help of relevant citations, assigning evidence codes after reading the full text of those citations [10]. It is therefore worth examining whether the performance of AFP can be improved by utilizing these relevant citations. As the largest biomedical literature database, MEDLINE contains more than 24 million citations [11]. Unfortunately, full text is not available for most of these citations, so we use their abstracts for AFP instead. Motivated by these observations, in this paper we propose a new method called DeepText2GO for large-scale function prediction on no-knowledge proteins. The basic idea of DeepText2GO is to seamlessly integrate text-based classifiers with sequence-based classifiers in a probabilistic (consensus) manner.
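To make the idea of a consensus over text-based and sequence-based predictions concrete, the sketch below combines two sets of per-term scores with a weighted average. This is only an illustration: the combination rule actually used by DeepText2GO is not specified in this section, and the function name `consensus`, the `text_weight` parameter, and the convention that a missing term scores 0 are all assumptions.

```python
from typing import Dict

# GO term -> confidence score in [0, 1] for a single protein.
Scores = Dict[str, float]


def consensus(text_scores: Scores, seq_scores: Scores,
              text_weight: float = 0.5) -> Scores:
    """Weighted average of two component predictors (assumed rule).

    A GO term missing from one predictor is treated as having score 0
    there, which is an assumption of this sketch.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    terms = set(text_scores) | set(seq_scores)
    return {
        term: text_weight * text_scores.get(term, 0.0)
        + (1.0 - text_weight) * seq_scores.get(term, 0.0)
        for term in terms
    }


# Toy scores for one protein (hypothetical values).
text_pred = {"GO:0003677": 0.8}
seq_pred = {"GO:0003677": 0.4, "GO:0005515": 0.6}
combined = consensus(text_pred, seq_pred)
print(sorted(combined.items()))
```

A term supported by both predictors keeps a high combined score, while a term proposed by only one is down-weighted rather than discarded.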
Instead of the shallow bag-of-words (BOW) representation used by traditional text-based classifiers, DeepText2GO uses a deep semantic representation of texts to improve the performance of protein function prediction. Furthermore, we make full use of different kinds of available protein information, such as sequence homology, families, citations, domains, and motifs. The performance of DeepText2GO has been extensively validated on a large-scale benchmark dataset of no-knowledge proteins. The dataset was derived from the UniProtKB/Swiss-Prot database by following the procedure of CAFA. In our experiments, we first examined the performance of three text-based classifiers and found that using deep semantic representation in AFP improves prediction accuracy significantly. We then compared the performance of text-based methods with that of sequence-based methods. Our experimental results show that text-based methods were effective for BPO and CCO, while homology-based methods were effective for MFO. Finally, by integrating text-based and sequence-based methods through a consensus approach, DeepText2GO significantly outperformed both in all three GO domains. For example, DeepText2GO achieved an F-max score of 0.442 on BPO, followed by text-based methods (0.424) and sequence-based methods (0.366).
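For readers unfamiliar with the F-max score reported above, the sketch below computes it following the protein-centric definition commonly used in CAFA: the maximum, over score thresholds, of the harmonic mean of average precision and average recall. It is a simplified illustration under stated assumptions: propagation of GO terms to their ancestors is omitted, every benchmark protein is assumed to have at least one true term, and the function name `f_max` and the threshold grid are choices of this sketch.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Protein-centric F-max over thresholds 1/steps, 2/steps, ..., 1.

    predictions: protein ID -> GO term -> score in [0, 1]
    truth:       protein ID -> non-empty set of true GO terms
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []   # only proteins with at least one prediction
        recall_sum = 0.0  # accumulated over all benchmark proteins
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
    return best


# Toy example with two benchmark proteins (hypothetical terms A-D).
toy_predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"C": 0.8}}
toy_truth = {"P1": {"A"}, "P2": {"C", "D"}}
print(round(f_max(toy_predictions, toy_truth), 3))
```

Note that precision is averaged only over proteins with at least one prediction at the given threshold, whereas recall is averaged over all benchmark proteins, so a method that abstains on many proteins is penalized through recall.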