Description of Ebbie
Ebbie's dynamic web pages are platform and browser independent (browsers tested: Mozilla Firefox 1.0.4 on Linux and Windows XP, Safari 2.0 on Macintosh, and MS Internet Explorer 6.0 on Windows XP). A tutorial for Ebbie is available in the online supplements [see Additional file 2]. The front page asks the user to select a database before entering the program. Once selected, Ebbie works with this database until the selection is changed on the front page. As seen in Figure 3, Ebbie's main page has four basic functions. First, it retrieves the current 5'- and 3'-cloning primers (5-CP, 3-CP) from the selected MySQL database and displays their sense and antisense sequences. The user can edit these sequences as desired; all sequences entered into these text fields are capitalized. Second, the user can browse the local file system to select and analyze a DNA sequencing text file. Third, Ebbie maintains a log of all processed files, which can be reviewed by clicking on the 'Display logbook' function. Fourth, three online database management tools are available through the main page: Annotation Change, View All entries of database and Lost & Found.
The uploaded file name serves as the primary id for the MySQL database entry. If no file is selected, or if the id/filename already exists in the database, an error message is displayed and the process is aborted. If the file is valid (i.e. it is new and unique), the DNA sequence data is converted into a string in which the characters a, c, g and t are capitalized; all other characters remain unchanged. Perl's index function is used to confirm that at least one 5-CP and 3-CP pair exists. If this condition is not met, or if the index function identifies an uneven number of 5-CPs and 3-CPs, an appropriate error message is written to the logbook. The algorithm starts at the 5' end of the DNA sequence and finds the first occurrence of a 5-CP (or antisense 3-CP). Moving in the 3' direction, it then locates the next 3-CP (or antisense 5-CP). If a sequence of length > 0 lies between the two primers, it is deposited into the MySQL database as an insert. If no insert is found in the DNA sequencing text file, a message is displayed and recorded in the logbook.
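The excision procedure described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used by Ebbie, and it is not Ebbie's source code: the function names, the removal of whitespace from the read, and the handling of reads whose two opening primers coincide are assumptions made for illustration. The preliminary primer-count check and the logbook entries are omitted.

```python
def normalize(raw: str) -> str:
    """Join the read into one string and capitalize only a, c, g and t.

    Dropping whitespace and line breaks is an assumption of this sketch.
    """
    joined = "".join(raw.split())
    return joined.translate(str.maketrans("acgt", "ACGT"))


def reverse_complement(seq: str) -> str:
    """Reverse complement; characters other than A, C, G, T are kept as-is."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def excise_inserts(read: str, cp5: str, cp3: str) -> list[str]:
    """Return all non-empty inserts in sense orientation, scanning 5' to 3'."""
    # Each opening primer maps to (closing primer, insert must be flipped).
    openers = {
        cp5: (cp3, False),
        reverse_complement(cp3): (reverse_complement(cp5), True),
    }
    inserts = []
    pos = 0
    while True:
        # Earliest occurrence of either opening primer at or after pos.
        hits = [(read.find(p, pos), p) for p in openers]
        hits = [(i, p) for i, p in hits if i != -1]
        if not hits:
            break
        start, opener = min(hits)
        closer, flipped = openers[opener]
        begin = start + len(opener)
        end = read.find(closer, begin)
        if end == -1:
            break  # opening primer without a matching closing primer
        insert = read[begin:end]
        if insert:  # only sequences of length > 0 count as inserts
            inserts.append(reverse_complement(insert) if flipped else insert)
        pos = end + len(closer)
    return inserts
```

With the hypothetical primers 5-CP = ACGGA and 3-CP = TTCAG, a read containing one sense-oriented and one antisense-oriented primer pair yields both inserts in sense orientation, and a read in which the two primers are directly adjacent yields no insert.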
Database selection
On the front page of Ebbie, the user can choose between different databases. The database names correspond to the names used to set up the MySQL databases on a given installation of Ebbie. The list of databases can be customized by editing the file /ebbie/lib/mysql.lib#sub:mysqldb and Ebbie's front page (index.html). Once a database is selected, the user works with it until another database is chosen by returning to the front page. BlastN flat files are kept for each database to allow continually updated BlastN comparisons against the growing MySQL database.
The 'Database Management Tool: Annotation Change' allows the user to change only two fields of each insert: 'annotation' and 'group'. All other fields (no, id, sequence, length, orientation and sample source) cannot be edited; this restriction was deliberately chosen to preserve the integrity of the primary data.
Analysis of inserts
Once an insert is found, Ebbie:
- Deposits the id and insert sequence into the dynamic BlastN database,
- Deposits the insert into the MySQL database, in the correct sense specified by the orientation of 5-CP and 3-CP,
- Determines the insert's length,
- Determines its id based on the file name, and
- Determines its sample source, which is inferred from the first character of the file name.
The last function relies on grep to determine the initial character and then assigns the sample source by referencing an external text file. The assignment can be changed by editing this text file (/ebbie/mod/source.nt). Currently, file names starting with 1, 2, ... 9 are assigned a sample source automatically; all other file names result in an 'unknown' sample source.
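The sample source lookup can be sketched as follows. The layout of /ebbie/mod/source.nt is not described here, so the tab-separated "character, label" format, the function names and the example labels below are all hypothetical; the sketch is in Python, not Ebbie's Perl.

```python
def load_sources(lines):
    """Build a lookup table from lines of the assumed form 'character<TAB>label'."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, label = line.partition("\t")
        table[key] = label
    return table


def sample_source(filename, table):
    """Return the label for the first character of the file name, else 'unknown'."""
    return table.get(filename[:1], "unknown")


# Hypothetical table contents, for illustration only.
table = load_sources(["1\tinfected leaf", "2\thealthy leaf", ""])
print(sample_source("1A07.seq", table))
print(sample_source("xA07.seq", table))
```

Keeping the mapping in an external file, as Ebbie does, means that new sample sources can be added without touching the program code.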
Following the automated sequence deposition, the insert sequence is subjected to BlastN searches against locally installed databases. In our case, these included the BlastN database from the Arabidopsis smRNA-cloning project[25], the genomes of Y-Satellite and its helper virus Cucumber Mosaic Virus (NCBI accession numbers for viral genomes: NC_002034.1, NC_002035.1, NC_001440.1, D10038.1) and a complete BlastN database of all previously found inserts. The latter, dynamic BlastN database grows each time a new insert is found, allowing rapid identification of new groups. Each BlastN report is scanned with grep for the phrase 'No hits found.'; if it is present, only a single line is printed to the screen, indicating that the search of that database was unsuccessful. Otherwise, the complete BlastN report is displayed on the web page to facilitate user-mediated annotation. Figure 4 shows the analysis of a clone, in this case finding a complete match against three previous entries.
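The report filtering step amounts to a simple substring test. The following Python sketch illustrates the idea; the function name and the wording of the one-line summary are assumptions, and Ebbie itself performs this step with grep.

```python
def summarize_report(db_name: str, report: str) -> str:
    """Collapse a report without hits to one line; otherwise return it in full."""
    if "No hits found" in report:
        # Hypothetical wording of the single line shown to the user.
        return f"{db_name}: no hits found"
    return report
```

Collapsing empty reports keeps the result page short when an insert matches only one of several databases.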
The user can now fill out three additional annotation fields:
- (1) Group pull-down menu: The group pull-down menu offers standard RNA types found previously during data entry and analysis. A new group can be added through the accompanying text field if a group is identified by local BlastN analysis. Once submitted, this new group description is simultaneously added to the smRNA annotation in the MySQL database, the dynamic BlastN database and the group pull-down menu. The menu is sorted alphabetically and is made available for subsequent group annotations. This form of annotation proved quite powerful in the analysis of our set of smRNAs.
- (2) Annotation field: a text field for user-generated comments based on the automatic BlastN searches or on external BlastN searches (our BlastN searches were limited by the amount of RAM available on the server).
- (3) Orientation pull-down menu: offers three categories (N/A, sense and antisense) for classifying the BlastN search results. This is important when working with smRNAs, as smRNAs are known to be produced by RNA-dependent RNA polymerases that synthesize the reverse complement of their original genomic sequence[26].
Once annotated, the insert's MySQL entry is updated by pressing the submit button. Ebbie's deposit algorithm then appends the id, group annotation (if applicable) and insert sequence in FASTA format to a BlastN flat file, which is reformatted for subsequent BlastN analysis. The newly created web page displays the MySQL entry (id, sequence, length, group and annotation) and allows the user to return to Ebbie's main page.
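A FASTA record of the kind appended to the flat file could be assembled as in the Python sketch below. The exact header layout used by Ebbie is not specified here; placing the group annotation after the id, separated by a space, is an assumption, as are the function names.

```python
def fasta_record(entry_id, sequence, group=None):
    """Format one entry as a two-line FASTA record (header line, sequence line)."""
    header = f">{entry_id} {group}" if group else f">{entry_id}"
    return f"{header}\n{sequence}\n"


def append_record(path, entry_id, sequence, group=None):
    """Append the record to the flat file at path, creating the file if needed."""
    with open(path, "a") as handle:
        handle.write(fasta_record(entry_id, sequence, group))
```

After each append, the flat file still has to be re-indexed by the BLAST formatting tool before it can be searched; that step is outside this sketch.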
An example: rRNA group 01
During our smRNA cloning project of virally infected tobacco plants[6], Ebbie identified 33 groups among 700 smRNA sequences. (We classified sequences as belonging to one group if they overlapped by 12 or more consecutive base pairs. This empirical threshold proved to be stringent in retrospect; a 16 base pair non-gapped overlap would have resulted in 32 groups. A percentage overlap was not chosen, as a BlastN search might not align the whole query sequence to any given subject, thus misleading the user about the percentage identity.) The first group Ebbie identified in infected/non-infected tobacco plants was a 24 nt long smRNA derived from the end of the small ribosomal RNA. This accumulation is intriguing and does not appear to be random, considering that 18S rRNA is approximately 1,800 nt in length. Currently, this group is under further investigation.
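The grouping criterion (a non-gapped overlap of at least 12 consecutive bases) can be made concrete with a small Python sketch. In Ebbie, groups are assigned by the user from the BlastN results; the functions below are a standalone illustration with hypothetical names, compare sequences in one orientation only, and are not part of Ebbie.

```python
def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive bases common to a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous base of both sequences.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def same_group(a: str, b: str, min_overlap: int = 12) -> bool:
    """True if a and b share at least min_overlap consecutive bases."""
    return longest_shared_run(a, b) >= min_overlap
```

Raising min_overlap to 16 corresponds to the more stringent threshold discussed above.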
Multiple inserts
If the sequencing file contains more than one insert, all inserts are automatically entered into the MySQL database in the correct 5'-3' orientation, together with their length and sample source. The id of each insert is made unique by appending an insert number to the filename. The user is notified of the number of primer pairs found and the number of inserts deposited into the MySQL database. To analyze individual sequences, a pull-down menu is created that displays all inserts found in the current sequencing file. After selecting an insert, the user can analyze it individually (as described in the previous section). As long as unannotated inserts remain in the database, the user can select them from the pull-down menu.
Logbook function
The logbook function is reached from Ebbie's main page. Each time a file is uploaded and analyzed by Ebbie, the system time is recorded, together with the filename. Once the file is analyzed, a comment is recorded depending on the outcome of the analysis: 'Sorry, there was no insert found', 'Single insert found.', 'There were x primer pairs and y inserts deposited into z' (where x is the number of primer pairs found, y the number of inserts deposited and z the database used) and 'Number of 5'- and 3'-cloning primers not even!'. The last comment is displayed in red, as this file may need manual intervention to rescue its content before subjecting it again to the insert excision algorithm.
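How the four comments could be selected is sketched below in Python. The comment texts are those quoted above; the order of the checks, the reading of 'not even' as unequal counts of 5-CPs and 3-CPs, the tab-separated line layout and the function names are assumptions of this sketch rather than a description of Ebbie's code.

```python
from datetime import datetime


def log_comment(n_5cp, n_3cp, n_inserts, database):
    """Pick the logbook comment for one analyzed file."""
    if n_5cp != n_3cp:  # assumed meaning of 'not even'
        return "Number of 5'- and 3'-cloning primers not even!"
    if n_inserts == 0:
        return "Sorry, there was no insert found"
    if n_inserts == 1:
        return "Single insert found."
    return (f"There were {n_5cp} primer pairs and {n_inserts} inserts "
            f"deposited into {database}")


def log_line(filename, comment, when=None):
    """Combine system time, file name and comment into one logbook line."""
    when = when or datetime.now()
    return f"{when:%Y-%m-%d %H:%M:%S}\t{filename}\t{comment}"
```

Checking the primer counts first ensures that a file needing manual intervention is always flagged, regardless of how many inserts could still be excised from it.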
Review database
All entries in the selected database can be reviewed using the 'View All' button and ordered by the id, length, group or number field through a pull-down menu. For each database, Ebbie remembers the last selection made in this menu. This feature is useful while generating a database and allows a quick survey of the database during data entry.
Lost & found
The Lost & Found function allows the user to query the database using one or more wildcard characters: '_' matches a single character and '%' matches any number of characters. From a pull-down menu the user selects a category, e.g. id, and enters the query in the adjacent text field, e.g. '3%'. In this example, all entries whose id starts with the character '3' would be displayed.
For more complex queries, a second pull-down menu is available, which includes the Boolean operators AND and OR. For example, all smRNAs belonging to the class "Y-Sat RNA" AND having a length of "21" nucleotides can be selected.
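The wildcard and Boolean queries map directly onto the SQL LIKE operator. The following Python sketch uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server; the table name inserts, the column name grp (GROUP is a reserved word in SQL) and the example rows are hypothetical and do not reflect Ebbie's actual schema.

```python
import sqlite3

ALLOWED_FIELDS = {"id", "length", "grp"}


def lost_and_found(conn, field1, pattern1, operator=None, field2=None, pattern2=None):
    """Return matching ids for one LIKE condition, or two joined by AND/OR."""
    if field1 not in ALLOWED_FIELDS:
        raise ValueError("unknown field")
    # Field names come from a fixed whitelist; patterns are bound as parameters.
    sql = f"SELECT id FROM inserts WHERE {field1} LIKE ?"
    params = [pattern1]
    if operator is not None:
        if operator not in ("AND", "OR") or field2 not in ALLOWED_FIELDS:
            raise ValueError("unknown operator or field")
        sql += f" {operator} {field2} LIKE ?"
        params.append(pattern2)
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Made-up example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inserts (id TEXT PRIMARY KEY, length INTEGER, grp TEXT)")
conn.executemany(
    "INSERT INTO inserts VALUES (?, ?, ?)",
    [("3A01", 21, "Y-Sat RNA"), ("3A02", 24, "rRNA group 01"), ("1B07", 21, "Y-Sat RNA")],
)

print(lost_and_found(conn, "id", "3%"))
print(lost_and_found(conn, "grp", "Y-Sat RNA", "AND", "length", "21"))
```

The first call lists every id beginning with '3'; the second reproduces the "Y-Sat RNA" AND length "21" example from the text.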
Change annotation
To update the annotation of individual inserts, a change annotation script was implemented. The script searches for either the id or the number of the insert. The id is useful once a new group is identified in, for example, a BlastN search result. The number is convenient after reviewing the database. Once a number or id has been submitted, the corresponding record is retrieved from the database (no, id, sequence, length, sample source, group and annotation). The user can then choose a standard group description from the group pull-down menu or add a new group. The 'Annotation field' displays the current annotation in the text field, allowing the user to add supplementary information to it. Once adjustments are made, the new annotation can be submitted and the corresponding fields in the MySQL database are updated. Further, if the group annotation is changed, the BlastN flat file is edited to reflect the current group annotation. Wildcard characters cannot be used in the change annotation function.
Limitations of Ebbie
The algorithm will experience difficulties if low-complexity or ambiguous, repetitive 5-CP or 3-CP sequences are used; this should be avoided through careful design of the primer pairs. Similar fundamental problems are encountered when cloning RNA using poly(A)-polymerase to extend the 3' end of a sequence that may already contain poly(A) residues. Additional wet lab experiments (e.g. primer extension assays) need to be conducted in order to determine the RNA's true 3' end/length. Also, no wildcard characters are allowed when identifying 5-CPs and 3-CPs within the DNA sequence file, to ensure the quality of the DNA read. Imperfect primers can be identified by a subsequent manual examination of sequence files that are flagged in the logbook as having uneven or no primer pairs.
Comparable software
To our knowledge, no comparable software exists. Other DNA sequencing programs are concerned with automated base calling, e.g. phred[27, 28]. The closest DNA sequence analysis tools are vector-trimming programs, which remove external vector sequences from the DNA sequence. In the case of single inserts, this kind of algorithm could be used, but it would still leave the insert surrounded by the 5-CP and 3-CP. Also, once the vector is removed, there is typically no further analysis of the remaining sequence, e.g. a BlastN search. For multiple inserts, vector removal programs are unsuitable, as they would return a single sequence consisting of a concatenated set of inserts flanked by 5-CPs and 3-CPs.
Future directions
Besides local BlastN searches, it is also feasible to perform remote BlastN searches using NCBI's netblast. This requires modifying the configuration of the web server (apache2) by setting 'KeepAliveTimeout' to at least 200 seconds, which was typically the time netblast needed to return a BlastN search result; this slowed down annotation significantly. Laboratories with a faster link to NCBI might consider this option for searching very large databases.
Currently, Ebbie analyzes DNA sequencing text files. Ebbie could be extended to incorporate other DNA sequencing analysis software, e.g. the base calling software phred. The latter software is not yet available under the GNU GPL and was therefore not implemented in this version of Ebbie.
For cloning smRNAs it is desirable to display the length distribution of all smRNAs, or of groups of smRNAs, in a histogram. This function will be implemented in the near future.
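One possible shape for such a length histogram is sketched below in Python as a plain-text display. This is purely illustrative: the function name and output layout are assumptions, and it does not represent a planned Ebbie feature.

```python
from collections import Counter


def length_histogram(sequences):
    """Return a text histogram with one row per insert length, shortest first."""
    counts = Counter(len(seq) for seq in sequences)
    return "\n".join(
        f"{length:>3} nt | {'#' * counts[length]} {counts[length]}"
        for length in sorted(counts)
    )
```

Restricting the input to the sequences of a single group would give the per-group distribution mentioned above.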