DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation

doi:10.1016/j.ymeth.2018.05.026

Methods

Volume 145, 1 August 2018, Pages 82-90

https://doi.org/10.1016/j.ymeth.2018.05.026

Highlights

• Automatic protein function prediction (AFP) is increasingly important for narrowing the huge gap between the explosive growth of protein sequences and the very few experimental annotations.
• The available results on the CAFA benchmark of no-knowledge proteins show that sequence homology based methods such as BLAST and PSI-BLAST are highly competitive.
• Currently, an imperative issue is how to make full use of information sources other than the sequence for protein function prediction.
• The basic idea of DeepText2GO is to integrate text-based classifiers with sequence-based classifiers in a probabilistic (consensus) manner.
• Instead of the shallow BOW (bag of words) representation used by traditional text-based classifiers, DeepText2GO uses a deep semantic representation of texts to improve the performance of protein function prediction.
• Furthermore, we make full use of different kinds of available protein information such as sequence homology, families, domains, motifs and citations.
• The performance of DeepText2GO has been extensively validated at large scale against a benchmark dataset of no-knowledge proteins.
• By integrating the text-based method with the sequence-based method through a consensus approach, DeepText2GO significantly outperformed both of them in all three GO domains on the benchmark dataset.

Abstract

As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. Previous studies conclude that sequence homology based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. However, information sources other than sequences, such as text, may be useful for AFP as well. Instead of the BOW (bag of words) representation used in traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on a benchmark dataset extracted from UniProt/SwissProt demonstrate that DeepText2GO significantly outperformed both text-based and sequence-based methods.

Introduction

As proteins account for almost every activity of cellular life, such as signal transduction and the catalysis of biochemical reactions, elucidating the function of each protein is crucial for decoding the secrets of life [1]. Protein functions are usually classified according to the Gene Ontology (GO), one of the best-known biomedical ontologies [2]. This ontology already includes more than 40,000 concepts across three domains: Molecular Function Ontology (MFO), Biological Process Ontology (BPO), and Cellular Component Ontology (CCO). New biological sequences (DNA, RNA, and protein sequences) continue to become available at an unprecedented rate as a result of the rapid development of sequencing technology, which poses a great challenge for biologists trying to understand the functions of these proteins. Due to the high cost of biochemical experiments, very few proteins are associated with experimental GO annotations. Although there are more than 115 million protein sequences in UniProtKB as of April 2018,¹ less than 0.15% (0.15 million) of these proteins have experimental GO annotations [3], [4]. Therefore, automatic protein function prediction (AFP) becomes increasingly important in reducing this huge gap [5], [6], [7].

To facilitate a fair and extensive evaluation of different large-scale AFP methods, the Critical Assessment of protein Function Annotation algorithms (CAFA) challenge was initiated in 2010 as CAFA1 (2010–2011), and continued as CAFA2 (2013–2014) and CAFA3 (2016–2017).² These challenges use a time-delayed evaluation to assess the accuracy of protein function prediction. Specifically, a large set of target proteins is first made available to the participants for function prediction, with a submission deadline (T0). A few months later, at a second deadline (T1), the target proteins that have accumulated experimental annotations in the meantime are used as a benchmark for performance evaluation. For a specific GO domain (MFO, BPO or CCO), the benchmark data of CAFA2 was further divided into two categories: no-knowledge proteins and limited-knowledge proteins [6]. Both categories cover proteins that have no experimental annotations in the target domain at T0, but have at least one accumulated experimental annotation in that domain at T1. They differ as follows: the no-knowledge proteins do not have any experimental annotations in any of the three domains at T0, while the limited-knowledge proteins have at least one experimental annotation in one or both of the other domains at T0. Because very few proteins (<0.15%) have experimental annotations, AFP for no-knowledge proteins is particularly important in practice. In this work, we also address the challenging problem of AFP for no-knowledge proteins.
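The category definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the CAFA organizers' code: the function name `benchmark_category` and the data layout (a dictionary mapping each domain to the set of experimentally annotated GO terms) are assumptions made for the example.

```python
from typing import Dict, Optional, Set

DOMAINS = ("MFO", "BPO", "CCO")


def benchmark_category(
    t0: Dict[str, Set[str]],
    t1: Dict[str, Set[str]],
    target_domain: str,
) -> Optional[str]:
    """Classify one protein for one target domain.

    t0 and t1 map a domain name to the set of experimentally annotated
    GO terms known at the submission deadline (T0) and at evaluation (T1).
    Returns "no-knowledge", "limited-knowledge", or None when the protein
    is not a benchmark protein for this domain.
    """
    # Benchmark proteins have no annotation in the target domain at T0 ...
    if t0.get(target_domain):
        return None
    # ... and gained at least one experimental annotation there by T1.
    if not t1.get(target_domain):
        return None
    # No experimental annotation in any domain at T0: no-knowledge.
    if not any(t0.get(d) for d in DOMAINS):
        return "no-knowledge"
    # Otherwise some other domain was already annotated at T0.
    return "limited-knowledge"
```

For instance, a protein with an empty T0 record and one MFO term at T1 is a no-knowledge benchmark protein for MFO, whereas the same protein with a CCO term already present at T0 would be limited-knowledge.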

The available results on the CAFA benchmark of no-knowledge proteins show that sequence homology based methods such as BLAST and PSI-BLAST are highly competitive, which implies the importance of sequence information [8]. However, homology-based methods cannot make reliable predictions for proteins that have no homologous proteins in the training data. According to the global sequence identity between a target protein and its most similar protein in the training data, no-knowledge proteins can be further divided into two groups: easy and difficult [6]. Easy proteins are those with a global sequence identity of at least 60%, while all other proteins are called difficult. In addition, mining motif, domain and functional information from sequences has been found very useful for AFP [9]. Currently, an imperative issue is how to make full use of information sources other than the sequence for protein function prediction, especially for difficult proteins [6]. Human curators usually annotate protein functions with the help of relevant citation information, assigning an evidence code after reading the full text of citations [10]. It is therefore interesting to examine whether the performance of AFP can be improved by using these relevant citations. As the largest biomedical literature database, MEDLINE contains more than 24 million citations [11]. Unfortunately, most of these citations do not have full texts available, so we use their abstracts for AFP instead. Inspired by these facts, we propose a new method called DeepText2GO for large-scale function prediction on no-knowledge proteins. The basic idea of DeepText2GO is to integrate text-based classifiers with sequence-based classifiers in a probabilistic (consensus) manner.
Instead of the shallow BOW (bag of words) representation used by traditional text-based classifiers, DeepText2GO uses a deep semantic representation of texts to improve the performance of protein function prediction. Furthermore, we make full use of different kinds of available protein information such as sequence homology, families, citations, domains, and motifs. The performance of DeepText2GO has been extensively validated at large scale against a benchmark dataset of no-knowledge proteins. The dataset was derived from the UniProtKB/Swiss-Prot³ database by following the procedure of CAFA. In our experiments, we first examined the performance of three text-based classifiers and found that the use of deep semantic representation in AFP improves the prediction accuracy significantly. We then compared the performance of text-based methods with sequence-based methods. Our experimental results show that text-based methods were effective for BPO and CCO, while homology-based methods were effective for MFO. Finally, by integrating text-based with sequence-based methods through a consensus approach, DeepText2GO significantly outperformed both of them in all three GO domains. As an example, DeepText2GO achieved an F-max score of 0.442 on BPO, compared with 0.424 for text-based methods and 0.366 for sequence-based methods.
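The F-max scores quoted above follow the protein-centric measure used in the CAFA evaluations: the maximum, over all score thresholds, of the harmonic mean of average precision and average recall. The sketch below is an illustration under simplifying assumptions, not the evaluation code used in the paper: the function name `f_max` and the dictionary inputs are hypothetical, thresholds are sampled in steps of 0.01, every benchmark protein is assumed to have at least one true term, and the propagation of GO terms to their ancestors is omitted.

```python
from typing import Dict, Set, Tuple


def f_max(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    steps: int = 100,
) -> Tuple[float, float]:
    """Return (best F-score, threshold) over thresholds 1/steps .. 1.

    predictions maps protein -> {GO term: score in [0, 1]};
    truth maps protein -> non-empty set of true GO terms.
    """
    best_f, best_t = 0.0, 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {go for go, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all benchmark proteins.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t
```

Averaging precision only over proteins that receive at least one prediction at a given threshold, while averaging recall over every benchmark protein, keeps a method from inflating its score by abstaining on hard targets.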

Section snippets

Text-based methods for AFP

Many studies on AFP have been done by using text information [12], [13], [14], [15], [16], [17], [18]. These studies can be divided into two categories: information extraction and text categorization [16], [17]. Information extraction approaches extract and identify terms and phrases characterizing the functions of target proteins by relying on natural language processing (NLP) techniques and machine learning models. Text categorization approaches treat AFP as a classification problem, where

Overview

Fig. 1 illustrates the main workflow of DeepText2GO. Given a target protein, we first retrieve the relevant MEDLINE citations and its sequence. With different text representations, text-based classifiers using logistic regression (LR) return a score for a GO term. On the other hand, the sequence homology based method (BLAST-KNN) and sequence functional sites (such as domains and motifs) based method ($LR_{InterPro}$) also assign a respective score for the same term. By integrating both text-based
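To make the integration step concrete, the sketch below combines per-term scores from several component classifiers with a noisy-OR rule, a common way to form a probabilistic consensus. The snippet above does not state the exact formula used by DeepText2GO, so the rule itself, the function name `consensus_scores`, and the dictionary layout are all assumptions made for illustration.

```python
from typing import Dict, Iterable


def consensus_scores(
    component_scores: Iterable[Dict[str, float]],
) -> Dict[str, float]:
    """Combine per-term scores from several classifiers with a noisy-OR rule.

    Each element of component_scores maps a GO term to a score in [0, 1]
    produced by one component (hypothetically: a text-based LR classifier,
    BLAST-KNN, or an InterPro-based LR classifier). A term missing from a
    component is treated as having score 0 for that component.
    """
    miss_prob: Dict[str, float] = {}
    for scores in component_scores:
        for go_term, s in scores.items():
            if not 0.0 <= s <= 1.0:
                raise ValueError(f"score for {go_term} outside [0, 1]: {s}")
            # Running probability that every component "misses" the term.
            miss_prob[go_term] = miss_prob.get(go_term, 1.0) * (1.0 - s)
    # Consensus score: probability that at least one component is right.
    return {go_term: 1.0 - p for go_term, p in miss_prob.items()}
```

Under this rule, two components that each assign 0.5 to the same term yield a consensus of 0.75, so agreement between text-based and sequence-based evidence raises the combined score above either one alone.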

GO annotations and GO ontology

We downloaded GO annotations from UniProt/Swiss-Prot⁴ in January (T0) and October 2016 (T1). All annotations before January 2016 were used as our training data, and all no-knowledge proteins in October 2016 as our test data. Following the CAFA settings, we kept only experimental annotations as training and test data, that is, those with one of the following evidence codes: ‘EXP’, ‘IDA’, ‘IPI’, ‘IMP’, ‘IGI’, ‘IEP’, ‘TAS’, or ‘IC’. We used the same 23 target species as CAFA3 for test data.
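The filtering described above can be sketched as follows. The evidence codes are those listed in the text; everything else is an assumption made for the example, including the `Annotation` record type, the function name `experimental_before`, and the use of 1 January 2016 as the exact cutoff (the text gives only the month).

```python
from datetime import date
from typing import Iterable, List, NamedTuple

# Experimental evidence codes listed in the text.
EXPERIMENTAL_CODES = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}
)


class Annotation(NamedTuple):
    """Hypothetical minimal record for one GO annotation."""
    protein: str
    go_term: str
    evidence: str
    annotated_on: date


def experimental_before(
    annotations: Iterable[Annotation], cutoff: date
) -> List[Annotation]:
    """Keep experimental annotations dated strictly before the cutoff."""
    return [
        a
        for a in annotations
        if a.evidence in EXPERIMENTAL_CODES and a.annotated_on < cutoff
    ]


records = [
    Annotation("P1", "GO:0003677", "IDA", date(2015, 6, 1)),
    Annotation("P1", "GO:0005634", "IEA", date(2015, 6, 1)),  # electronic, dropped
    Annotation("P2", "GO:0006355", "IMP", date(2016, 3, 1)),  # after T0, dropped
]
# Assumed T0 cutoff: 1 January 2016.
training = experimental_before(records, date(2016, 1, 1))
```

Here only the first record survives: the second carries a non-experimental code and the third postdates the assumed cutoff.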

Discussion and conclusion

The use of text information for protein function prediction and characterization has been widely discussed [12], [13], [14], [15], [16], [17], [18]. However, previous studies focus mainly on a small number of functional terms and proteins. This means that existing methods may not work well for the large number of GO terms and proteins encountered in practice. In this study, we have carried out extensive experiments on large-scale protein function prediction by using text-based features. Using about

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 61572139).

References (31)

Y. Jiang, An expanded evaluation of protein function prediction methods shows an improvement in accuracy, Genome Biol. (2016)
H. Shatkay, Text as data: using text-based features for proteins representation and for computational prediction of their characteristics, Methods (2015)
I. Khan, The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches, GigaScience (2015)
R. Cao et al., Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks, Methods (2016)
R.F. Weaver, Molecular Biology (WCB Cell & Molecular Biology) (2011)
M. Ashburner, Gene ontology: tool for the unification of biology, Nat. Genet. (2000)
The UniProt Consortium, Uniprot: a hub for protein information, Nucl. Acids Res. (2015)
R. Huntley, The GOA database: gene ontology annotation updates for 2015, Nucl. Acids Res. (2015)
P. Radivojac, A large-scale evaluation of computational protein function prediction, Nat. Methods (2013)
A. Shehu et al., A survey of computational methods for protein function prediction
T. Hamp, Homology-based inference sets the bar high for protein function prediction, BMC Bioinf. (2013)
R. You et al., GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank, Bioinformatics (2018)
R.P. Huntley, Understanding how and why the gene ontology and its annotations evolve: the go within uniprot, GigaScience (2014)
NCBI Resource Coordinators, Database resources of the national center for biotechnology information, Nucl. Acids Res. (2017)
A. Pérez, Gene annotation from scientific literature using mappings between keyword systems, Bioinformatics (2004)
