Methods

Volume 145, 1 August 2018, Pages 82-90

DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation

https://doi.org/10.1016/j.ymeth.2018.05.026

Highlights

  • Automatic protein function prediction (AFP) is increasingly important for narrowing the huge gap between the explosive growth of protein sequences and the very few experimental annotations.
  • The available results on the CAFA benchmark of no-knowledge proteins show that sequence homology based methods such as BLAST and PSI-BLAST are highly competitive.
  • Currently, an imperative issue is how to make full use of information sources other than the sequence for protein function prediction.
  • The basic idea of our DeepText2GO is to integrate text-based classifiers into sequence-based classifiers in a probabilistic (consensus) manner.
  • Instead of the shallow BOW (bag of words) representation used by traditional text-based classifiers, DeepText2GO uses a deep semantic representation of texts to improve the performance of protein function prediction.
  • Furthermore, we make full use of different kinds of available protein information such as sequence homology, families, domains, motifs and citations.
  • The performance of DeepText2GO has been extensively validated on a large-scale benchmark dataset of no-knowledge proteins.
  • By integrating the text-based method with the sequence-based method through a consensus approach, DeepText2GO significantly outperformed both of them in all three GO domains on the benchmark dataset.

Abstract

As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. Previous studies conclude that sequence homology based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. However, information sources other than sequences, such as text, may be useful for AFP as well. Instead of the BOW (bag of words) representation used in traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on a benchmark dataset extracted from UniProt/SwissProt demonstrate that DeepText2GO significantly outperformed both text-based and sequence-based methods.

Introduction

As proteins account for almost every activity of cellular life, such as signal transduction and the catalysis of biochemical reactions, elucidating the functions of each protein is crucial for decoding the secrets of life [1]. Protein functions are usually classified based on the Gene Ontology (GO), which is one of the best-known biomedical ontologies [2]. This ontology already includes more than 40,000 concepts across three domains: Molecular Function Ontology (MFO), Biological Process Ontology (BPO), and Cellular Component Ontology (CCO). New biological sequences (DNA, RNA, and protein sequences) are still becoming available at an unprecedented rate as a result of the rapid development of sequencing technology. This poses a great challenge for biologists trying to understand the functions of these proteins. Due to the high cost of biochemical experiments, very few proteins are associated with experimental GO annotations. Although there are more than 115 million protein sequences in UniProtKB as of April 2018, less than 0.15% (0.15 million) of these proteins have experimental GO annotations [3], [4]. Therefore, automatic protein function prediction (AFP) becomes increasingly important in reducing this huge gap [5], [6], [7].
To facilitate a fair and extensive evaluation of different large-scale AFP methods, the Critical Assessment of protein Function Annotation algorithms (CAFA) challenge was initiated in 2010 as CAFA1 (2010–2011) and continued as CAFA2 (2013–2014) and CAFA3 (2016–2017). A time-delayed evaluation is employed to assess the accuracy of protein function prediction in these challenges. Specifically, a large set of target proteins is first made available to the participants for function prediction with a submission deadline (T0). After a few months, at a second time point (T1), the target proteins that have accumulated experimental annotations are used as a benchmark for performance evaluation. For a specific GO domain (MFO, BPO or CCO), the benchmark data of CAFA2 was divided further into two categories: no-knowledge proteins and limited-knowledge proteins [6]. Both categories consist of proteins that have no experimental annotations in the target domain at T0 but have at least one accumulated experimental annotation at T1. The difference between the two categories is as follows: no-knowledge proteins do not have any experimental annotations in any of the three domains at T0, while limited-knowledge proteins have at least one experimental annotation in one or both of the other domains at T0. Since very few proteins (<0.15%) have experimental annotations, AFP for no-knowledge proteins is particularly important in practice. In this work, we address the challenging problem of AFP for no-knowledge proteins.
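The benchmark categories described above can be expressed as a small decision rule. The following is a minimal sketch, assuming each protein is summarized by the set of GO domains in which it has experimental annotations at T0 and at T1; the function name and data layout are illustrative and are not taken from CAFA or from the paper.

```python
from typing import Optional, Set

DOMAINS = ("MFO", "BPO", "CCO")


def benchmark_category(
    annotated_at_t0: Set[str],
    annotated_at_t1: Set[str],
    target_domain: str,
) -> Optional[str]:
    """Classify one protein for one target GO domain.

    Each input set holds the GO domains in which the protein has at least
    one experimental annotation at that time point (hypothetical layout).
    Returns 'no-knowledge', 'limited-knowledge', or None if the protein is
    not a benchmark protein for the target domain.
    """
    if target_domain not in DOMAINS:
        raise ValueError(f"unknown GO domain: {target_domain}")
    # A benchmark protein has no annotation in the target domain at T0
    # and at least one annotation in that domain at T1.
    if target_domain in annotated_at_t0 or target_domain not in annotated_at_t1:
        return None
    # No experimental annotation in any domain at T0: no-knowledge.
    if not annotated_at_t0:
        return "no-knowledge"
    # Otherwise the protein was annotated in another domain at T0.
    return "limited-knowledge"
```

For example, a protein with no annotations at T0 that gains an MFO annotation by T1 is a no-knowledge protein for MFO, whereas a protein that already had a CCO annotation at T0 is a limited-knowledge protein for MFO.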
The available results on the CAFA benchmark of no-knowledge proteins show that sequence homology based methods such as BLAST and PSI-BLAST are highly competitive, which implies the importance of sequence information [8]. However, homology-based methods cannot make reliable predictions for proteins that have no homologous proteins in the training data. According to the global sequence identity between a target protein and its most similar protein in the training data, no-knowledge proteins can further be divided into two groups: easy and difficult [6]. Easy proteins are those that have a global sequence identity of no less than 60%, while all other proteins are called difficult. Also, mining motif, domain and functional information from sequences has been found very useful for AFP [9]. Currently, an imperative issue is how to make full use of information sources other than the sequence for protein function prediction, especially for difficult proteins [6]. Human curators usually annotate protein functions with the help of relevant citation information, assigning an evidence code after reading the full text of the citations [10]. As such, it is interesting to examine whether the performance of AFP can be improved by utilizing these relevant citations. As the largest biomedical literature database, MEDLINE contains more than 24 million citations [11]. Unfortunately, most of these citations do not have full texts available, so we use their abstracts for AFP instead. Inspired by these facts, in this paper we propose a new method called DeepText2GO for large-scale function prediction on no-knowledge proteins. The basic idea of DeepText2GO is to seamlessly integrate text-based classifiers into sequence-based classifiers in a probabilistic (consensus) manner.
Instead of the shallow BOW (bag of words) representation used by traditional text-based classifiers, DeepText2GO uses a deep semantic representation of texts to improve the performance of protein function prediction. Furthermore, we make full use of different kinds of available protein information such as sequence homology, families, citations, domains, and motifs. The performance of DeepText2GO has been extensively validated on a large-scale benchmark dataset of no-knowledge proteins. The dataset was derived from the UniProtKB/Swiss-Prot database by following the procedure of CAFA. In our experiments, we first examined the performance of three text-based classifiers and found that the use of deep semantic representation in AFP improves the prediction accuracy significantly. Then we compared the performance of text-based methods with that of sequence-based methods. Our experimental results show that text-based methods were effective for BPO and CCO, while homology-based methods were effective for MFO. Finally, by integrating text-based with sequence-based methods using a consensus approach, DeepText2GO significantly outperformed both of them in all three GO domains. As an example, DeepText2GO achieved an F-max score of 0.442 on BPO, followed by text-based methods (0.424) and sequence-based methods (0.366).
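The F-max scores quoted above refer to the protein-centric measure used in CAFA: the maximum, over all score thresholds, of the harmonic mean of average precision and average recall. The following is a minimal sketch of that measure, assuming predictions are given as per-protein dictionaries of GO term scores in [0, 1] and that every benchmark protein has at least one true term. The function name, the data layout and the 0.01 threshold grid are assumptions for illustration; this is not the evaluation code used in the paper.

```python
from typing import Dict, Set


def f_max(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    steps: int = 100,
) -> float:
    """Protein-centric F-max over thresholds 1/steps, 2/steps, ..., 1.

    predictions: protein id -> {GO term: score in [0, 1]} (hypothetical layout)
    truth:       protein id -> non-empty set of true GO terms
    """
    best = 0.0
    for i in range(1, steps + 1):
        tau = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= tau}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at tau.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all benchmark proteins.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

Averaging precision only over proteins that receive at least one prediction at a given threshold, while averaging recall over all benchmark proteins, means a method that abstains on many proteins is penalized through recall rather than precision.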

Section snippets

Text-based methods for AFP

Many studies on AFP have been done by using text information [12], [13], [14], [15], [16], [17], [18]. These studies can be divided into two categories: information extraction and text categorization [16], [17]. Information extraction approaches extract and identify terms and phrases characterizing the functions of target proteins by relying on natural language processing (NLP) techniques and machine learning models. Text categorization approaches treat AFP as a classification problem, where

Overview

Fig. 1 illustrates the main workflow of DeepText2GO. Given a target protein, we first retrieve the relevant MEDLINE citations and its sequence. With different text representations, text-based classifiers using logistic regression (LR) return a score for a GO term. On the other hand, the sequence homology based method (BLAST-KNN) and sequence functional sites (such as domains and motifs) based method (LRInterPro) also assign a respective score for the same term. By integrating both text-based
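To make the idea of per-term component scores concrete, the sketch below shows one common formulation of a BLAST-KNN score (the bit-score-weighted fraction of BLAST hits annotated with a GO term) and a simple weighted average as a stand-in for the consensus step. Both are assumptions for illustration: the snippet above does not give the scoring formula or the integration rule, and the function names, weights and input layout here are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def blast_knn_scores(
    hits: List[Tuple[str, float]],
    annotations: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Score GO terms for one target from its BLAST hits.

    hits:        (training protein id, bit score) pairs for the target
    annotations: training protein id -> set of annotated GO terms
    Each term's score is the share of total bit score carried by hits
    annotated with that term (assumed formulation).
    """
    total = sum(bits for _, bits in hits)
    if total <= 0:
        return {}
    scores: Dict[str, float] = {}
    for protein, bits in hits:
        for term in annotations.get(protein, set()):
            scores[term] = scores.get(term, 0.0) + bits / total
    return scores


def consensus(
    component_scores: List[Dict[str, float]],
    weights: List[float],
) -> Dict[str, float]:
    """Weighted average of per-term scores from several classifiers.

    A term missing from a component counts as score 0 for that component.
    The weights are hypothetical and would need tuning on held-out data.
    """
    if len(component_scores) != len(weights):
        raise ValueError("one weight is required per component")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    terms = set().union(*component_scores)
    return {
        term: sum(w * s.get(term, 0.0) for s, w in zip(component_scores, weights))
        / total_weight
        for term in terms
    }
```

In this sketch, the output of a text-based classifier, `blast_knn_scores`, and an InterPro-based classifier would each be passed to `consensus` as one dictionary of term scores.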

GO annotations and GO ontology

We downloaded GO annotations from UniProt/Swiss-Prot in January 2016 (T0) and October 2016 (T1). All annotations before January 2016 were used as our training data, and all no-knowledge proteins in October 2016 as our test data. Following the CAFA settings, we kept only experimental annotations as our training and test data, i.e., those with the following evidence codes: ‘EXP’, ‘IDA’, ‘IPI’, ‘IMP’, ‘IGI’, ‘IEP’, ‘TAS’, or ‘IC’. We used the same 23 target species as CAFA3 for the test data.
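The evidence-code filter and the T0/T1 split can be sketched as follows. This is a minimal illustration, assuming annotations have already been parsed into (protein, GO term, evidence code) records for each snapshot; the record type and function names are hypothetical, and the species filter and per-domain handling are omitted.

```python
from typing import Iterable, List, NamedTuple, Set

# Evidence codes treated as experimental, as listed in the text above.
EXPERIMENTAL_CODES = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}
)


class Annotation(NamedTuple):
    protein: str
    go_term: str
    evidence: str


def keep_experimental(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Keep only annotations whose evidence code is experimental."""
    return [a for a in annotations if a.evidence in EXPERIMENTAL_CODES]


def no_knowledge_targets(
    t0: Iterable[Annotation],
    t1: Iterable[Annotation],
) -> Set[str]:
    """Proteins with experimental annotations at T1 but none at T0."""
    known_at_t0 = {a.protein for a in keep_experimental(t0)}
    annotated_at_t1 = {a.protein for a in keep_experimental(t1)}
    return annotated_at_t1 - known_at_t0
```

Under this sketch, a protein that carries only electronically inferred annotations (for example ‘IEA’) at T0 still counts as a no-knowledge protein if it gains an experimental annotation by T1.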

Discussion and conclusion

Utilization of text information for protein function prediction and characterization has widely been discussed [12], [13], [14], [15], [16], [17], [18]. However, previous studies focus mainly on a small number of functional terms and proteins. This means that existing methods may not work well for the large number of GO terms and proteins in practice. In this study, we have carried out extensive experiments on a large-scale protein function prediction by using text based features. Using about

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 61572139).

References (31)

  • T. Hamp. Homology-based inference sets the bar high for protein function prediction. BMC Bioinf. (2013)
  • R. You et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics (2018)
  • R.P. Huntley. Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt. GigaScience (2014)
  • NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucl. Acids Res. (2017)
  • A. Pérez. Gene annotation from scientific literature using mappings between keyword systems. Bioinformatics (2004)