Plant Co-expression Annotation Resource: a web server for identifying targets for genetically modified crop breeding pipelines

The development of genetically modified crops (GM) includes the discovery of candidate genes through bioinformatics analysis using genomics data, gene expression, and others. Proteins of unknown function (PUFs) are interesting targets for GM crops breeding pipelines for the novelty associated with such targets and also to avoid copyright protection. One method of inferring the putative function of PUFs is by relating them to factors of interest such as abiotic stresses using orthology and co-expression networks, in a guilt-by-association manner. In this regard, we have downloaded, analyzed, and processed genomics data of 53 angiosperms, totaling 1,862,010 genes and 2,332,974 RNA. Diamond and InterproScan were used to discover 72,266 PUFs for all organisms. RNA-seq datasets related to abiotic stresses were downloaded from NCBI/GEO. The RNA-seq data was used as input to the LSTrAP software to construct co-expression networks. LSTrAP also created clusters of transcripts with correlated expression, whose members are more probably related to the molecular mechanisms associated with abiotic stresses in the plants. Orthologous groups were created (OrhtoMCL) using all 2,332,974 proteins in order to associate PUFs to abiotic stress-related clusters of co-expression and therefore infer their function in a guilt-by-association manner. A freely available web resource named “Plant Co-expression Annotation Resource” (https://www.machado.cnptia.embrapa.br/plantannot), Plantannot, was created to provide indexed queries to search for PUF putatively associated with abiotic stresses. The web interface also allows browsing, querying, and retrieving of public genomics data from 53 plants. We hope Plantannot to be useful for researchers trying to obtain novel GM crops resistant to climate change hazards.

the accuracy and speed of genome modifications in plants, such as the CRISPR/CAS system [3,4].
Regarding to climate change and environmental factors, plants are being genetically modified to become resilient to abiotic stresses, such as drought, high temperature, rising atmospheric CO2, in order to potentially overcome the yield losses due to these factors [5,6].
Consequently, over the last years, many patent applications for genetically improved crops regarding stress tolerance were filled [8]. Intellectual property rights (IPR) are vastly used by biotechnology enterprises for their GM plants to allow exclusive rights and yield better returns for the high investments in research and development [7]. To avoid selecting patented genes, it's possible to start researching genes and proteins with no function yet described.
The first phase for creating GM crops is the candidate gene discovery, which relies on bioinformatics analyses of huge volumes of genomics data available on public resources [8,9]. These proteins of unknown function (PUF) are very prevalent in eukaryotic genomes and may play a role in determining the differences between species [10] and also may be related to resistance to abiotic stresses [11].
The resistance to abiotic stresses is a complex and multigenic trait. Computational analyses related to QTL, GWAS, gene expression and regulatory networks can be employed to identify genes and molecular mechanisms that may play a role in these conditions [12][13][14], and successful results were already published [6,15,16].
It is known that differences in the gene expression patterns, allied to environmental influences, lead to differences in the morphology and phenotype of animals and plants [17]. It is also well established that organs and tissues with the same evolutionary origin have correlated gene expression patterns [18]. To perform molecular comparisons between different species, it's necessary to focus on genes with the same evolutionary origin and, therefore, with homolog functions, i.e. orthologs [19]. One approach for studying the regulatory functions of a network of genes over different species is to align the co-expression networks using ortholog genes [20].
In the present work, we present a web resource named "Plant co-expression annotation resource" (https ://www.macha do.cnpti a.embra pa.br/plant annot ) which uses plant genomics data, RNA sequencing data, orthology, and co-expression networks to enable the identification of PUFs as abiotic stress-related candidates to enter GM crop breeding pipelines.

Raw data
Genome data (sequence assembly in FASTA formatted files and annotation in GFF files) for 53 angiosperms (Table 1), including Glycine max (Gma), Zea mays (Zma), Arabidopsis thaliana (Ath), and Oryza sativa (Osa), were obtained from Phytozome v12 [21] and one from NCBI (Boea hygrometrica). The total number of genes and mRNA stored was 1,862,010 and 2,332,974, respectively, together with their translated proteins.
RNA-seq data related to abiotic stresses (heat, drought, dehydration, and osmotic stress) were downloaded from NCBI/GEO in a total of 17 different GEO Series, 53 GEO Samples and 60 SRA short read files only for Gma, Zma, Gma and Ath ( Table 2). The data was obtained by searching GEO datasets for the given organisms using the keywords "stress" and filtering the study type by "Expression profiling by high throughput sequencing". The raw reads, corresponding to the GEO Samples, were obtained from NCBI/SRA automatically using the sratoolkit v2.9.2 [22].

Analyses
The RNA-seq data was used as input to the LSTrAP v1.3 software [14] to construct coexpression networks. Only leaf tissue expression data was used to obtain the networks, to avoid adding noise to the data. LSTrAP was also used to create groups of co-expression, that are clusters of transcripts with correlated expression by using the software MCL version 14-137. In order to characterize PUFs, Diamond v0.9.24 [23] was used to align all proteins against the NCBI's nr database (downloaded in January 2018). Diamond BLAST was run with the flag-max-target-seqs 5 and the best hit was selected. InterproScan v5.26-65.0 [24] was used to annotate the proteins from the 53 genomes. All other software were run using default parameters. Homolog groups were created using OrhtoMCL v2.0.9 [25] and the 53 genome's proteins as input, with default options.

Framework interface
The Machado software [26] was used to store all data and results, and also provide a web server as an interface for fast data browsing.

Filter protocols
The Plantannot software provides several filters and a text search box that allows searching for molecules by its desired annotation features. These filters are needed to obtain PUFs and to try to relate them to abiotic stresses using RNA-seq expression data and co-expression networks. The Filters menu is separated in 8 fields, of those we are going to use only five: "Organism", "Feature type", "Orthology", "Orthologs_coexpression" and "Analyses". The "Feature Type" filter has three molecule types, from those the polypeptide box is the only that is going to be always checked and the others blank. By using the other 4 remaining filters, 6 protocols were created ( Table 3) as examples of different ways of selecting PUFs. Protocol A [27]: using a lack of both homology and protein domain signatures. Protocol B [28]: using lack of homology, presence of domain signatures-trying to select Domains of Unknown Function (DUF) from PFAM, and the text search "Unknown function". Protocol C [29]: using homology, lack of protein domain signatures, and the text search "Unknown function". Protocol D-F [30][31][32]: same protocols of A-C but using ortholog groups to find homolog proteins with co-expression data related to abiotic stress. The protocols are explained in Table 3.

Overview
An overview of the component processes of the system covering all data and analysis results used as input to the Machado framework can be found in Fig. 1a.

Homolog groups
The 2,332,974 proteins were used as input to the OrhtoMCL software to produce 164,267 clusters or groups of homolog proteins (putative orthologs). All groups comprise 1,900,313 proteins, and the mean cluster size was 11.57 protein members, ranging from 1 to 4587 members. It is worth mentioning that 8535 clusters (5.19%) were left with only 1 protein and 75% of all clusters are composed of up to 6 proteins. The ortholog groups are automatically shown in the "Results" frame of the software.

Co-expression networks
To construct co-expression networks, the 53 GEO Samples (Table 2) were filtered to get expression data only from "leaf " tissue (17,8,13, and 15 for Ath, Zma, Gma, and Osa respectively). Four co-expression networks were constructed for each of the four organisms (Ath, Zma, Gma, and Osa), using the default filters and options of LSTrAP. Groups of co-expression were created using the MCL software following the default instructions in LSTrAP. The MCL software clusters the transcripts with correlated expression. Therefore, the groups of co-expression are supposedly correlated to the molecular mechanisms regarding abiotic stress. 524 groups were obtained (169, 36, 177 and 142 for Ath, Zma, Gma and Osa respectively), with mean size of 140, 113, 282 and 225 for Ath, Zma, Gma, and Osa transcript members each, ranging from 1 to 7097 members for Ath, 1 to 4786 for Zma, 1 to 6927 for Gma and 1 to 6636 for Osa.

PUF characterization
After analyzing all 2,332,974 proteins with Diamond and InterproScan, 72,266 PUFs were characterized (Table 1-Protocol A) as sequences with no annotation using either Diamond or InterproScan. Another less sensitive way to find PUFs is to text search for "Unknown proteins" and filter for InterproScan matches (e.g.: trying to select PFAM's DUF domains) only or Diamond matches only (e.g.: trying to find proteins with uninformative function annotations), which leads to 2409 and 6569 PUFs respectively (Table 1-Protocols B and C respectively).

PUF annotation
As there is no information regarding the function of PUFs, one way to infer function is to link PUFs to other molecules by using orthology groups using a guilt-by-association algorithm (Fig. 1b). Therefore, members from a given ortholog group which already have annotation and/or have protein domains characterized, can be used as a proxy to infer function for the PUF proteins by association. There are 21,895 PUFs as members of ortholog groups which could be a source of functional information and annotation (Protocol A, plus adding the filter "Orthology: orthologs"). Furthermore, whenever a given PUF is part of an ortholog group in which some member, necessarily one of Ath, Gma, Osa, or Zma, have its mRNA composing a co-expression group, then by association, the initial PUF is supposedly also related to response to abiotic stresses in plants by inference (see Fig. 2). 1364 PUFs were related to coexpression groups using filters that were created to automate this selection (Table 3, Protocol D). This method of searching for PUFs was found to be very strict, since it only retrieves proteins that have no annotations whatsoever. However, there are many cases in which PUFs have uninformative annotations, such as: "protein with unknown function", "putative" or "hypothetical" for example. By modifying Protocol D and text searching for "Unknown function" plus filtering for InterproScan matches only or Diamond matches only, we could annotate 13 and 2,280 PUFs respectively (Table 3, Protocols E and F respectively).

Utility and discussion
Many web servers and online tools available allow navigation and comparative search of expression and co-expression data in plants. Some tools only work online and are not open source like PLAZA 3.0 [33], others are generic and seek any type of annotation such as CoNeKT [34] or use microarray data like the Genevestigator [35]. Plantannot has a very specific role of surveying proteins with unknown function possibly related to abiotic stresses in plants by comparing genomics data of a large number of organisms (53 angiosperm species). Also, the algorithm used to search for PUF annotation includes meta-analyses and data relations that involve searches for similarities of sequences, orthology, and networks of gene co-expression that are specific and unique.
To demonstrate the potential of Plantannot we devised 6 protocols for filtering sequences of interest.
From all the 6 protocols, Protocol A was the most permissive, as it seems that most of the organisms have many proteins that do not return as Diamond best hits against the "nr" database. These sequences were selected by the "no diamond matches" filter and could be retrieved (see Table 1). By modifying protocol A and inserting the textual search filter "Unknown function", led to Protocols B and C.
It is important to mention that genome projects end up having proteins of unknown function annotated in several different ways, by using terms like "hypothetical", "putative", "unknown protein", etc. Therefore, there should be specific text searches for each organism to obtain the best results for selecting PUFs. For example, we needed to adapt the filtering protocols for Boea hygrometrica, whose PUFs were best retrieved using the text search "hypothetical". Other examples can be cited, such as the text search "putative protein" used more efficiently to select PUFs from the organism Ricinus communis.
Protocol B uses InterproScan results to search for "Domains of Unknown Function", or DUFs, from PFAM, which are annotations that could result in more PUFs selected. Protocol C uses the text search to filter Diamond hits and also the original sequence annotations to filter out more PUFs.
The Protocols D-F are more complex protocols that refer to modifications of the Protocols A-C, respectively. They were created by adding filters that could retrieve PUFs that were in the same group of homologous proteins, whose mRNA participate in coexpression network clusters, related do abiotic stresses. This guilt-by-association algorithm explained in Fig. 2 led to filtering of many interesting PUFs that would not be highlighted using protocols A-C, such as those described in the study case section.
Protocol D is quite stringent and after applying it, 15 organisms out of 53 involved did not show any results. The reason for this result is that many organisms already have their proteins deposited in the "nr" database and the Diamond best hits would retrieve their own sequence leading them to be filtered out. This occurred with Boea hygrometrica but did not occur with Oropetium thomaeum, both described in our case studies above.
Many other protocols can still be created, for example, modifying Protocols D-F filtering only by groups of orthologs (filter "Orthology: orthology") and not by co-expression. This filter selected 21,895 PUFs that belonged to any group of orthologs. This simpler filter could allow one to infer possible functions to these PUFs by just relating them to the annotations found in the members of their common groups of orthologs. Similarly, after applying Protocol D for all organisms, we could manually curate the 1364 PUFs selected, supposedly related to abiotic stress. By conducting a manual search in the groups of orthologs that these PUFs belong, we were able to confirm 159 PUFs with functions possibly related to abiotic stress, found in annotations of ortholog co-members of these PUFs. This result equals 11.6% of the initial PUFs (check the Additional file 2 for a complete list of PUFs and annotations for all organisms using this methodology).

Case Study: PUF annotations of desiccation-tolerant species
We used two species known to be tolerant to desiccation as a pilot study for Plantannot as we believe there can be interesting target PUFs related to abiotic stresses to be encountered in these organisms.

Oropetium thomaeum
Recently added to the Phytozome database, Oropetium thomaeum [36] is a good candidate to discover genes related to abiotic stress. This grass is resilient to extreme and prolonged drying and must have genes involved in the molecular mechanisms related to the control of this phenotype. To find PUFs for Oropetium thomaeum one could use Protocol D as described in Table 1. By doing this one will see 10 PUFs in the "Results" page. As there is no annotation for these proteins (although there is one protein that was already annotated as "PTHR13020:SF36-EXPRESSED PROTEIN (1 of 1" that is not much informative of a function), one can survey the homologous sequences present in the orthologous groups to check for other annotations. In this regard, one can click, for example, on the first member of the "Plantannot22668" group ID, in the "Orthologous Group" column of which the PUF "Oropetium_20150105_06293A.v1.0" is a member. By doing this a new "Results" page will show all members of the "plantannot22668" group. Interestingly the majority of the members are annotated as having an "AP2 domain (PFAM-PF00847)". By investigating the function of this PFAM domain PF00847, one can discover that AP2 is a transcription factor that has a major role in hormone regulation [37] and one study shows that there is a binding factor DBF1 that binds AP2 and is related to osmotic stress tolerance and abiotic stress responses in Arabidopsis thaliana [38]. By association, it is possible to infer that the PUF "Oropetium_20150105_06293A. v1.0" have a function possibly related to "AP2", and that orthology could be useful to give novel information for the PUFs. Going further, the "Orthologs_coexpression" box checked before, filtered for orthologous groups of which at least one member participates in a co-expression group. Therefore, this adds up more evidence that the PUF "Oropetium_20150105_06293A.v1.0" is a good candidate to be related to abiotic stresses and should be further investigated. To check for the co-expression group related to this PUF, one can follow the procedure in Fig. 2 showing that one member of the ortholog group "Plantannot22668" is a protein from Ath, Osa, Zma or Gma, and whose respective mRNA participate in a co-expression group (in this case, the protein from Gma and its mRNA with the same ID: Glyma.19G163900.1.Wm82.a2.v1). This case study can be performed by checking the tutorial session in Plantannot's initial page.

Boea hygrometrica (Dorcoceras hygrometricum)
"Drying without dying" is an essential feature in the evolution of earthly plants and Boea hygrometrica is an important model of resurrection plant that survives the drying of its leaves and roots without dying [39]. By using a modified version of Protocol F from Table 3 in which we used the text search word "hypothetical", we recovered 414 PUFs. From these, we obtained possible annotations for 199 PUFs (48% of the total) by surveying the orthologous group members as described above. By manually inspecting all 193 annotations we found that 153 (36.95% of the total) had references to abiotic stresses. From these, we chose 3 interesting PUFs to describe the possible efficiency of our protocol. The first is the protein KZV45975.1, a member of the ortholog group "plantan-not11681", which had members related to "E3 ubiquitin ligase family of proteins". This family of proteins seems to enhance drought tolerance in Arabidopsis thaliana [40]. Another interesting example is the KZV43328.1 protein, a member of "plantannot19415" ortholog group, which has 5 members with the PFAM domain "PF00642-Zinc finger C-× 8-C-× 5-C-× 3-H type (and similar) (zf-CCCH)". This domain apparently plays roles in abiotic stress response in maize [41]. The final example is the KZV34923.1 protein, who is member of the "plantannot11601" ortholog group which has 17 members that have the PFAM domain "PF05349-GATA-type transcription activator, N-terminal (GATA-N) (1 of 1)". It is has been shown that GATA like transcription factors are related to abiotic stress responses in rice [42]. It is worth mentioning that some annotations found refer to abiotic stress that were not part of our RNA-seq data set experimental conditions, like resistance to Aluminum and Cadmium. This could be due to the fact that drought and desiccation tolerance involves a complex process to avoid oxidative damage [43] and we speculate if it may share molecular mechanisms with other kinds of abiotic stresses. The full Boea's PUF survey can be retrieved from the Additional file 1.

Conclusion
We believe that the Plant Co-expression Annotation Resource can be a valuable bioinformatics tool to be used for the search of proof of concept targets to enter pipelines for the creation of genetic modified crops resistant to abiotic stresses and adapted to climate change.
Additional file 1. Complete PUF annotation list for Boea hygrometrica obtained using a modified version of protocol F.

Additional file 2.
Complete PUF annotation list for all species using protocol D.