LimiTT: link miRNAs to targets

Background
MicroRNAs (miRNAs) impact various biological processes within animals and plants. They bind target mRNAs by sequence complementarity, exerting negative post-transcriptional regulation at the mRNA level. The investigation of miRNA target interactions (MTIs) by high throughput screenings is challenging, as frequently used in silico target prediction tools are prone to emit false positives. This issue is aggravated for niche model organisms, where validated miRNAs and MTIs both have to be transferred from well described model organisms. Even though databases (DBs) exist that contain experimentally validated MTIs, they are limited in their search options and they utilize different miRNA and target identifiers.

Results
The implemented pipeline LimiTT integrates four existing DBs containing experimentally validated MTIs. In contrast to other cumulative DBs, LimiTT includes MTI data of 26 species. Additionally, the pipeline enables the identification and enrichment analysis of MTIs with and without species specificity based on dynamic quality criteria. Multiple tabular and graphical outputs are generated to permit the detailed assessment of results.

Conclusion
Our freely available web-based pipeline LimiTT (https://bioinformatics.mpi-bn.mpg.de/) is optimized to determine MTIs with and without species specification. It links miRNAs and/or putative targets with high granularity. The integrated mapping to homologous target identifiers enables the identification of MTIs not only for standard models, but for niche model organisms as well.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1070-1) contains supplementary material, which is available to authorized users.


Pre-processing

a. MTI database data
The contents of all four MTI DBs (TarBase [1], miRTarBase [2], miRecords [3] and starBase [4]) were downloaded and pre-processed so that the MTIs can be easily accessed and compared. Because the miRNA information within the TarBase data consisted solely of miRBase [5] accession numbers, a method was implemented to map the accessions onto full miRNA identifiers. To this end, a flat file containing the required information was downloaded from miRBase and converted to a Python dictionary with miRBase miRNA accessions as keys and the corresponding miRNA identifiers as values, which is used to convert miRBase accession numbers at runtime. For local execution, all pre-processed database files are included within the download archive. The database pre-processing is performed regularly.
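The accession lookup described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes a hypothetical two-column, tab-delimited flat file (accession, identifier), and the function names `load_accession_map` and `resolve` are invented for this example. The two records are illustrative and should be checked against the current miRBase release.

```python
import csv
import io


def load_accession_map(handle):
    """Build {miRBase accession: miRNA identifier} from a tab-delimited file.

    Assumed layout (hypothetical): column 1 = accession, column 2 = identifier.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed and comment lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


# Illustrative excerpt standing in for the downloaded flat file.
example_file = io.StringIO(
    "# accession\tidentifier\n"
    "MIMAT0000062\thsa-let-7a-5p\n"
    "MIMAT0000063\thsa-let-7b-5p\n"
)
ACCESSION_TO_ID = load_accession_map(example_file)


def resolve(accession, mapping=ACCESSION_TO_ID):
    """Return the full identifier, or the accession itself if it is unknown."""
    return mapping.get(accession, accession)
```

Falling back to the unchanged accession for unknown keys is a design choice of this sketch; a stricter variant could raise an error instead.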
After the combination of all MTIs originating from the four MTI DBs, all species abbreviations were assigned to their full scientific names and vice versa, and the assignments were saved within a dictionary. Another dictionary was created for assigning the species to the categories Animals, Plants, Fungi, Protozoa and Viruses and vice versa. A third dictionary is required for the experimental methods: While TarBase uses consistent categories of experimental methods for its MTIs, miRTarBase and miRecords use a wide range of different terms for the methods, probably adopted from the original source. Because this inconsistency prevents a reliable identification of MTIs validated by a specific experimental method, each method was assigned to one of the six categories Reporter assay (e.g. GFP and luciferase assay), Western blot, qPCR (qPCR and RT-PCR), Microarray, Next generation sequencing (NGS; e.g. ChIP-Seq, CLASH and Degradome) and Other (e.g. 5'-RACE, ELISA, Northern blot and Proteomics). These three dictionaries are used either for mapping gene symbols onto UniProt accessions (UniProtAccs) or for selecting MTIs by species and experimental method.
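The assignment of free-text method terms to the six categories could look like the sketch below. The keyword patterns are assumptions chosen to cover the examples named in the text; the curated dictionary used by LimiTT is not reproduced here.

```python
import re

# Ordered (category, pattern) pairs; the first matching pattern wins.
# All patterns are illustrative assumptions, not the curated assignments.
CATEGORY_PATTERNS = [
    ("Reporter assay", re.compile(r"luciferase|gfp|reporter", re.IGNORECASE)),
    ("Western blot", re.compile(r"western", re.IGNORECASE)),
    ("qPCR", re.compile(r"q\s?pcr|rt[- ]?pcr", re.IGNORECASE)),
    ("Microarray", re.compile(r"microarray", re.IGNORECASE)),
    ("NGS", re.compile(r"chip[- ]?seq|clash|degradome|sequencing", re.IGNORECASE)),
]


def categorize_method(term):
    """Map a free-text experimental method onto one of six categories."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(term):
            return category
    return "Other"  # e.g. 5'-RACE, ELISA, Northern blot, Proteomics
```

A term naming several methods at once would be assigned to the first matching category only; handling such compound terms is left out of this sketch.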
Because each MTI DB update could change at least one of the three dictionaries, a script compares the dictionaries' content with the updated datasets and initiates the assignment of any data not yet present in the dictionaries.

b. UniProtKB data
All available data from UniProtKB [6] was downloaded in text format and pre-processed to retrieve a trimmed list of reviewed (Swiss-Prot) and unreviewed (TrEMBL) UniProtKB entries, including information about UniProt accessions, species, gene/protein names and synonyms, and cross-references to the databases RefSeq [7], UniGene [8], Ensembl [9], GeneID [10] and KEGG [11]. Subsequently, a database-like structure with three dictionaries was created, containing the trimmed information limited to UniProtKB entries that could be associated with symbols from the MTI DBs and that possess a UniProtAcc. The first dictionary contains unique identifier numbers linked to each of the 2,170,850 trimmed entries. In the second dictionary, all UniProtAccs are listed with their corresponding entry number. The third dictionary is filled with all 21,433 gene names, synonyms and cross-reference symbols, each linked to a list of the corresponding entry numbers. This procedure enables easy and fast access to information about each UniProtAcc and each stored symbol during the processing of the pipeline.
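The three-dictionary structure can be sketched as shown below. The entry fields (`accessions`, `species`, `symbols`), the function name `build_index`, and the second example record are hypothetical; upper-casing symbols for lookup is also an assumption of this sketch rather than a documented behaviour of LimiTT.

```python
from collections import defaultdict


def build_index(entries):
    """Create the three lookup dictionaries from trimmed UniProtKB entries."""
    entry_by_number = {}                   # entry number -> trimmed entry
    number_by_accession = {}               # UniProtAcc -> entry number
    numbers_by_symbol = defaultdict(list)  # symbol -> list of entry numbers
    for number, entry in enumerate(entries, start=1):
        entry_by_number[number] = entry
        for accession in entry["accessions"]:
            number_by_accession[accession] = number
        for symbol in entry["symbols"]:
            numbers_by_symbol[symbol.upper()].append(number)
    return entry_by_number, number_by_accession, dict(numbers_by_symbol)


# Illustrative records; the second one is entirely made up.
entries = [
    {"accessions": ["Q9XS59"], "species": "Bos taurus", "symbols": ["SLC6A15"]},
    {"accessions": ["X00001"], "species": "Example species", "symbols": ["slc6a15", "GENE2"]},
]
entry_by_number, number_by_accession, numbers_by_symbol = build_index(entries)

# Symbol -> entry numbers -> entries: all candidates for a symbol across species.
candidates = [entry_by_number[n] for n in numbers_by_symbol["SLC6A15"]]
```

Storing entry numbers rather than full entries in the second and third dictionaries keeps each entry in memory only once, which matters for millions of entries.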

Input
All input files (miRNA file, annotated transcriptome/proteome, gene/protein expression file) can be uploaded either using our Upload Tool on the website (Helpful Tools / Get Data / Upload Files) or using an account on our FTP server. The latter is only possible after user registration, which automatically creates an account with the same username (=email) and password on the FTP server (ftp://bioinformatics.mpi-bn.mpg.de/). After uploading to the FTP server, the files still have to be transferred to the pipeline server via the Upload Tool (Helpful Tools / Get Data / Upload Files). The data will be deleted from the server after two weeks. The website supports user login, which provides workspaces that can be reused on return to the webpage without uploading the files again.

a. Annotation File
File Type: Tab delimited

Header: No
Required content: One UniProt accession per line, or several accessions separated by commas.

Allowed content:
Several columns, empty content and accessions with attached information (for example about the underlying database), delimited by a pipe ( | ) symbol (e.g. sp|Q9XS59|S6A15_BOVIN). In this case, only the identifier that follows the first pipe symbol is saved (e.g. sp|Q9XS59|S6A15_BOVIN > Q9XS59). Identifiers from other databases are ignored.

File Examples "Required":

The user has to define the column that contains the UniProt accessions. Optionally, one column with additional information can be chosen (e.g. column 1 for the transcript identifiers), together with a description of this information (e.g. "Transcript"), which will be included in the output of LimiTT.

File
The uploaded annotation file -

Column of UniProt Accessions
Number of the column which contains UniProt accessions. 3

Column of additional information
Number of the column with additional information to save for the corresponding UniProt accession. 1

Description of additional information
A keyword describing the additional information. Transcript
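The handling of the annotation file can be illustrated with the following sketch. It assumes 1-based column numbers as in the parameter examples above; the function names are hypothetical, and the filtering of identifiers from other databases is not implemented here.

```python
def extract_accessions(field):
    """Return the UniProt accessions found in one annotation field.

    Comma-separated entries are split, and for pipe-delimited entries only
    the identifier after the first pipe symbol is kept.
    """
    accessions = []
    for token in field.split(","):
        token = token.strip()
        if not token:
            continue  # empty content is allowed and skipped
        parts = token.split("|")
        accession = parts[1] if len(parts) > 1 else parts[0]
        if accession:
            accessions.append(accession)
    return accessions


def read_annotation(lines, acc_col, info_col=None):
    """Map each accession to the additional information of its lines.

    acc_col and info_col are 1-based column numbers (assumption).
    """
    annotation = {}
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < acc_col:
            continue  # line has no accession column
        info = None
        if info_col is not None and len(columns) >= info_col:
            info = columns[info_col - 1]
        for accession in extract_accessions(columns[acc_col - 1]):
            annotation.setdefault(accession, []).append(info)
    return annotation
```

With the example parameters (accession column 3, additional information column 1), a line such as `transcript_1<TAB>...<TAB>sp|Q9XS59|S6A15_BOVIN` would yield `Q9XS59` linked to `transcript_1`.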

b. miRNA File
File Type: Tab delimited

Allowed content:
Several columns, and shortened miRNA identifiers. Shortened miRNA identifiers have to consist at least of the prefix miR, lin or let, the identification number and, if present, the lettered suffix showing sequence similarity (e.g. miR-17a).

File Example "Required":
miR-93b
miR-36f
miR-29d
miR-29c

File Example "Allowed":
The example is part of an original output of the MIRPIPE pipeline [12], to which the parameters are adjusted. MiRNA identifiers have to be listed in the first column. Additionally, it is possible to choose one column with additional information (e.g. column 5 for the miRNA sequences) and a description of this information (e.g. "miRNA sequence"), which will be included in the output of LimiTT.

Parameters:

File
The uploaded miRNA file -

Column of additional information
Number of the column with additional information to save for the corresponding miRNA. 5

Description of additional information
A keyword describing the additional information. miRNA sequence

c. Ranking File
File Type: Tab delimited

Required content:
UniProt accessions in column one, corresponding ranking value in column 2.

Allowed content:
The content does not need to be sorted by the ranking values.

Parameter Selection

a. miRNAs
For each miRNA listed in the optional miRNA file, MTIs with a matching miRNA notation are selected. MiRNAs can be chosen not only by their full identifiers, but also by shortened identifiers, which need to consist solely of the prefix miR, lin or let, the number and, if present, the lettered suffix showing sequence similarity. As a result, all miRNAs matching this core identifier are clustered under the shortened name, ignoring species and hairpin arm information. If, for example, miR-123a is passed to the pipeline, LimiTT groups all MTIs with miRNA identifiers like miR-123a-5p, miR-123a-3p and miR-123a* of all species under the passed miRNA identifier. If no lettered suffix is given (e.g. miR-123), LimiTT only clusters miRNAs that differ by the additional suffixes -5p, -3p or * (e.g. miR-123-5p, miR-123-3p, miR-123*).
Clustering miRNAs under their shortened identifiers is also possible if no list of miRNAs was passed to the pipeline.
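The clustering under shortened identifiers can be approximated with a regular expression, as in the sketch below. The pattern and the function names are assumptions for illustration; identifiers that carry further components (such as a numbered precursor suffix) would also be reduced to the core here, which may be broader than the behaviour of LimiTT.

```python
import re
from collections import defaultdict

# Core identifier = prefix (miR, lin or let), number, optional lettered suffix.
CORE_PATTERN = re.compile(r"(miR|lin|let)-?(\d+)([a-z]?)", re.IGNORECASE)


def core_identifier(mirna):
    """Reduce a miRNA identifier to its shortened core, or None if no match."""
    match = CORE_PATTERN.search(mirna)
    if match is None:
        return None
    prefix, number, letter = match.groups()
    prefix = "miR" if prefix.lower() == "mir" else prefix.lower()
    return f"{prefix}-{number}{letter.lower()}"


def cluster_by_core(identifiers):
    """Group full identifiers under their shortened core identifier."""
    clusters = defaultdict(list)
    for identifier in identifiers:
        core = core_identifier(identifier)
        if core is not None:
            clusters[core].append(identifier)
    return dict(clusters)


clusters = cluster_by_core(
    ["hsa-miR-123a-5p", "mmu-miR-123a-3p", "miR-123a*", "miR-123-5p", "miR-123*"]
)
```

Because the lettered suffix is part of the core, miR-123 and miR-123a end up in separate clusters, matching the behaviour described in the text.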

b. MTI databases and occurrence
By default, LimiTT uses MTIs from all four MTI DBs TarBase, miRTarBase, miRecords and starBase.
However, it is possible to use the information just of some of the DBs and ignore others. If more than one DB was selected, the parameter "Occurrence over DBs" can be used to define the minimum

Output

a. Bar Graphs
The bar graphs (Figure 3

b. MTI matrix
Within the MTI matrix file (Table 2), each interaction between an identified miRNA and a target UniProtAcc is marked by a binary string, which represents the occurrence of the MTI across the chosen MTI DBs.
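One way to picture the binary string is one digit per selected DB, as in the following sketch. The order of the DBs within the string and the input layout of (miRNA, UniProtAcc, DB) triples are assumptions; the actual encoding used in Table 2 may differ.

```python
from collections import defaultdict

# Assumed digit order; the order used by LimiTT is not specified here.
DB_ORDER = ["TarBase", "miRTarBase", "miRecords", "starBase"]


def occurrence_string(found_in, selected=DB_ORDER):
    """Encode in which of the selected DBs an MTI occurs, one digit per DB."""
    return "".join("1" if db in found_in else "0" for db in selected)


def build_matrix(mtis, selected=DB_ORDER):
    """Collapse (miRNA, UniProtAcc, DB) records into one string per MTI."""
    seen = defaultdict(set)
    for mirna, accession, db in mtis:
        if db in selected:
            seen[(mirna, accession)].add(db)
    return {pair: occurrence_string(dbs, selected) for pair, dbs in seen.items()}


# Hypothetical records for illustration.
records = [
    ("miR-1", "P1", "TarBase"),
    ("miR-1", "P1", "miRecords"),
    ("miR-2", "P1", "starBase"),
]
matrix = build_matrix(records)
```

Under these assumptions, the string `1010` would mean that the MTI occurs in the first and third of the selected DBs.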

c. MTI information file
The MTI information file is a list of all identified target UniProtAccs combined with further information about them, collected during the process. This additional information consists of all identified interacting miRNAs, the original gene symbol within the MTI DBs, further gene synonyms, the target species, protein names, the UniProt review status, the Enzyme Commission number (EC number; [13]) and a list of Gene Ontology (GO; [14]) identifiers. If additional information from the annotation file and/or the miRNA list was specified at the beginning of the process, it is also included in the MTI information file.

d. MTI set overlap Heatmap
Based on the idea that each identified miRNA interacts with a set of target genes, the Heatmap (Figure 4) depicts the ratio of overlapping UniProtAcc targets between each of these MTI sets. If the MTI set enrichment analysis was used, the Heatmap output will depict for each MTI set the ratio of overlapping target genes which are part of the leading edge sets of the corresponding MTI sets.

Figure 4: Heatmap output of LimiTT
Shown is an example of the symmetrical Heatmap output of LimiTT after the leading edge analysis. Depicted is the ratio of overlapping leading edge target genes for each identified miRNA interacting with target genes (MTI Set). The corresponding ratio is coloured based on the colour key on the right.
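The pairwise overlap underlying such a heatmap can be sketched as follows. The text does not state how the ratio is normalized; since the heatmap is described as symmetrical, this sketch assumes the Jaccard index (intersection divided by union), which is only one of several symmetric choices. The set names and contents are hypothetical.

```python
def overlap_ratio(targets_a, targets_b):
    """Jaccard index of two target sets (assumed normalization)."""
    a, b = set(targets_a), set(targets_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_matrix(mti_sets):
    """Return sorted set names and the symmetric matrix of overlap ratios."""
    names = sorted(mti_sets)
    matrix = [
        [overlap_ratio(mti_sets[row], mti_sets[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Hypothetical MTI sets (or leading edge sets after the enrichment analysis).
mti_sets = {
    "miR-1": {"A", "B", "C"},
    "miR-2": {"B", "C", "D"},
    "miR-3": {"E"},
}
names, matrix = overlap_matrix(mti_sets)
```

The resulting nested list can be passed to any plotting library to draw the heatmap; the diagonal is 1.0 for every non-empty set.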

e. Ranked MTI sets file
After the MTI set enrichment analysis (MTISEA), the ranked MTI sets file contains the results of the analysis for each set of miRNA targets (Table 3). For each MTI set, the size of the set is given, which is the number of UniProtAccs shared by the MTI set and the ranked list. Furthermore, the calculated enrichment score (ES), normalized enrichment score (NES) and false discovery rate (FDR) q-value are listed, together with the index of the element in the ranked file at which the running sum statistic reaches the maximal ES. Additionally, the results of the leading edge analysis are given, which proceeds as follows: Depending on whether the ES of an MTI set is positive or negative, the set of leading edge targets consists of the MTI set targets either before or after the peak in the running sum calculation.
Based on this, three statistics are calculated. "Tags" represents the ratio of leading edge targets to all targets in the given set. "List" is the ratio of UniProtAccs from the ranked dataset before/after the position of the maximal ES of the current set to all UniProtAccs in the submitted file. "Signal" is a combination of the two previous calculations, describing the distribution of the MTI set targets over the ranked dataset. Thus, signal results in 100% or more if all targets of the set can be found at the beginning of the ranked list.
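The leading edge analysis can be illustrated with a simplified running sum. This sketch assumes an unweighted statistic (every target contributes equally), whereas the weighting actually used by LimiTT is not specified in this section; NES, FDR and "Signal" are not computed. The function name and the example data are hypothetical.

```python
def leading_edge(ranked, targets):
    """Compute ES, peak index, leading edge, Tags and List for one MTI set.

    Simplified, unweighted running sum: +1/n_hit for a target,
    -1/(n - n_hit) for any other element of the ranked list.
    """
    hits = set(targets) & set(ranked)
    n, n_hit = len(ranked), len(hits)
    if n_hit == 0 or n_hit == n:
        raise ValueError("set must overlap the ranked list without covering it")
    running, es, peak = 0.0, 0.0, 0
    for index, accession in enumerate(ranked):
        running += 1 / n_hit if accession in hits else -1 / (n - n_hit)
        if abs(running) > abs(es):
            es, peak = running, index  # largest deviation from zero so far
    if es >= 0:
        edge = [a for a in ranked[: peak + 1] if a in hits]  # up to the peak
        list_ratio = (peak + 1) / n
    else:
        edge = [a for a in ranked[peak + 1 :] if a in hits]  # after the peak
        list_ratio = (n - peak - 1) / n
    return {
        "ES": es,
        "peak": peak,
        "leading_edge": edge,
        "tags": len(edge) / n_hit,
        "list": list_ratio,
    }


ranked = ["A", "B", "C", "D", "E"]       # already sorted by ranking value
top = leading_edge(ranked, {"A", "B"})   # targets at the top of the list
bottom = leading_edge(ranked, {"D", "E"})  # targets at the bottom
```

In the first call all targets precede the peak, so "Tags" is 1.0 while "List" is 0.4; the second call mirrors this for a negative ES.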

g. MTI set gene file
The MTI set gene file output of LimiTT is essentially a written version of all enrichment plots and is therefore only produced if an enrichment analysis was initiated. For each MTI set, the file lists the targets which overlap with the ranked list of UniProtAccs, the index of each of these targets within the ranked list, the running ES for this target and whether it is a member of the leading edge set or not (Table 4).