RNA-binding proteins (RBPs) interact with their cognate RNAs to form biomolecular assemblies called ribonucleoprotein (RNP) complexes, which may be transient (such as the exon junction complex) or stable (such as the ribosome). The biological functions of proteins can be better understood by grouping them into domain families based on the analysis of their structural features [1, 2]. Recognising connections to structural domains of known function can help to predict the mechanism(s) of RNA binding in RBPs and also the type of cognate RNA. The number of members in a structural domain family reflects the diversity and evolutionary ability of that family to adapt to biological contexts [3]. This, however, cannot be generalised, since certain protein structures are more difficult to solve than others.
A comprehensive analysis of RNA-protein interactions at the atomic and residue levels was performed by Jones and coworkers in 2001, with a dataset of 32 RNA-protein complexes (solved by either X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy) that were available in the Nucleic Acid Database (NDB) [4] in December 1999. This led to a classification of RBPs into 14 structural families [5]. In 2004, Han and coworkers trained a Support Vector Machine (SVM) system to recognise RBPs directly from their primary sequence on the basis of known RBPs and non-RBPs [6].
The BindN web tool, introduced in 2006, employed SVM models to predict potential DNA-binding and RNA-binding residues from amino acid sequence [7]. In 2008, Shazman and coworkers classified RBPs on the basis of their three-dimensional structures by using an SVM approach [8]. Their dataset comprised 76 RNA-protein complexes (solved by either X-ray crystallography or NMR) that were then available in the PDB. The method, which was based on the characterisation of the unique properties of electrostatic patches in these proteins, achieved 88 % accuracy in classifying RBPs but could not distinguish them from DNA-binding proteins (DBPs). Shazman and coworkers trained the multi-class SVM classifier on transfer RNA (tRNA)-, ribosomal RNA (rRNA)- and messenger RNA (mRNA)-binding proteins only.
In 2010, Kazan and coworkers introduced a motif-finding algorithm named RNAcontext, which was designed to elucidate RBP-specific sequence and structural preferences with high accuracy [9]. Two years later, Jahandideh and coworkers used the Gene Ontology Annotated (GOA) database (available at http://www.ebi.ac.uk/GOA) and the Structural Classification of Proteins (SCOP) database [10] to design a machine learning approach for classifying structurally solved RNA-binding domains (RBDs) into different subclasses [11].
The catRAPID omics web server, introduced in 2013, calculated ribonucleoprotein associations, such as the analysis of nucleic acid-binding regions in proteins and the identification of RNA motifs involved in protein recognition, in different model organisms [12]. It included binding residues and evolutionary information for the prediction of RBPs. In 2014, Fukunaga and coworkers proposed the CapR algorithm for studying RNA-protein interactions using CLIP-seq data [13]. The authors showed that several RBPs bind RNA in specific structural contexts. RBPmap, the newest of the above-mentioned methods, was used for the prediction and mapping of RBP-binding sites on RNA [14].
In 2011, a collection of RNA-binding sites on the basis of RBDs was made available in a database named RBPDB (RNA-binding protein database) [15]. Two of the recent repositories, RAID (RNA-associated interaction database) [16] and ViRBase (virus–host ncRNA-associated interaction database) [17], described RNA-associated (RNA-RNA/RNA-protein) interactions and virus-host ncRNA-associated interactions, respectively. The NPIDB (Nucleic acid-Protein interaction database) [18] and BIPA (Biological interaction database for protein-nucleic acid) [19] are also well-known databases on the structural front. However, these repositories can offer information only about those interactions for which structural data are available.
Since an increasing number of protein structures are being solved every day, there is a need for an automated protocol that classifies new structures into families, which will in turn provide insight into the putative functions of these newer proteins. Most of the previous studies employed machine learning algorithms to predict or classify RBPs [6–8, 11, 20, 21]. Electrostatic properties of the solvent-accessible surface were used as one of the primary features in such machine learning algorithms. This property, however, was very different even among proteins with very similar structures and functions [22].
Here, we report a web server, RStrucFam, which, to the best of our knowledge, is the first of its kind to exploit structurally conserved features, derived from family members with known structures and imprinted in mathematical profiles, to predict the structure, the type of cognate RNA(s) (not only tRNA, rRNA or mRNA but also the other kinds of RNA that are currently known) and the function(s) of proteins from sequence information alone. The user-supplied protein sequence is searched against the Hidden Markov Models of RBP families (HMMRBP) database, which comprises 437 HMMs of RBP structural families that have been generated using structure-based sequence alignments of RBPs with known structures. Proteins that fail to associate with such structure-centric families are further queried against the 746 sequence-centric RBP family HMMs in the HMMRBP database. The search protocol has been previously employed in the lab for the prediction of RBPs in humans on a genome-wide scale [23]. Users can browse the HMMRBP database for details pertaining to each family, protein or RNA and their related information, based on keyword search or RNA motif search. The RStrucFam web server is distinct from searches possible within the PDB, Structural Classification of Proteins (SCOP) [10], SCOP extended (SCOPe) [24] and the Protein Alignments organised as Structural Superfamilies 2 (PASS2) [25] resources, in being able to identify or classify RBPs even without a known structure, as well as to predict the cognate RNA(s) and function(s) of the protein from sequence information alone. RStrucFam can be accessed at http://caps.ncbs.res.in/rstrucfam/.
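The two-tier search described above, in which a query is first compared against the structure-centric family HMMs and only then against the sequence-centric ones, can be illustrated with a minimal Python sketch. This is not the RStrucFam implementation. The names `Hit`, `best_hit`, `cascade_search` and `toy_search`, the tier labels and the E-value cutoff of 1e-3 are all assumptions made for illustration; the actual acceptance criteria are not stated here. The profile search itself is passed in as a function, so that any HMM search tool could be substituted for the toy lookup table.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """One query-to-family match (hypothetical record for this sketch)."""
    family: str
    evalue: float


# A search function maps (sequence, tier name) to a list of hits.
SearchFn = Callable[[str, str], List[Hit]]


def best_hit(hits: List[Hit], evalue_cutoff: float) -> Optional[Hit]:
    """Return the lowest E-value hit at or below the cutoff, or None."""
    significant = [h for h in hits if h.evalue <= evalue_cutoff]
    return min(significant, key=lambda h: h.evalue) if significant else None


def cascade_search(
    sequence: str,
    search: SearchFn,
    evalue_cutoff: float = 1e-3,  # assumed threshold, not taken from the paper
    tiers: Tuple[str, ...] = ("structure_centric", "sequence_centric"),
) -> Optional[Tuple[str, Hit]]:
    """Query each tier in order and stop at the first significant hit."""
    for tier in tiers:
        hit = best_hit(search(sequence, tier), evalue_cutoff)
        if hit is not None:
            return tier, hit
    return None  # no association with any family


def toy_search(sequence: str, tier: str) -> List[Hit]:
    """Stand-in for a real profile HMM search; returns canned hits."""
    table = {
        ("MKV", "structure_centric"): [Hit("struct_fam_A", 1e-20)],
        ("MAA", "structure_centric"): [Hit("struct_fam_B", 0.5)],
        ("MAA", "sequence_centric"): [Hit("seq_fam_C", 1e-8)],
    }
    return table.get((sequence, tier), [])


if __name__ == "__main__":
    for query in ("MKV", "MAA", "XXX"):
        print(query, cascade_search(query, toy_search))
```

In this toy example, the query `MKV` is assigned in the first tier, `MAA` falls through to the sequence-centric tier because its only structure-centric hit exceeds the cutoff, and `XXX` remains unassigned. Ordering the tiers in this way means that a family with structural information takes precedence whenever a significant match to it exists.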