SLALOM, a flexible method for the identification and statistical analysis of overlapping continuous sequence elements in sequence- and time-series data

Background Protein or nucleic acid sequences contain a multitude of associated annotations representing continuous sequence elements (CSEs). Comparing these CSEs is needed, whenever we want to match identical annotations or integrate distinctive ones. Currently, there is no ready-to-use software available that provides comprehensive statistical readout for comparing two annotations of the same type with each other, which can be adapted to the application logic of the scientific question. Results We have developed a method, SLALOM (for StatisticaL Analysis of Locus Overlap Method), to perform comparative analysis of sequence annotations in a highly flexible way. SLALOM implements six major operation modes and a number of additional options that can answer a variety of statistical questions about a pair of input annotations of a given sequence collection. We demonstrate the results of SLALOM on three different examples from biology and economics and compare our method to already existing software. We discuss the importance of carefully choosing the application logic to address specific scientific questions. Conclusion SLALOM is a highly versatile, command-line based method for comparing annotations in a collection of sequences, with a statistical read-out for performance evaluation and benchmarking of predictors and gene annotation pipelines. Abstraction from sequence content even allows SLALOM to compare other kinds of positional data including, for example, data coming from time series. Electronic supplementary material The online version of this article (10.1186/s12859-018-2020-x) contains supplementary material, which is available to authorized users.


Supplementary Material to: SLALOM, a flexible method for the identification and statistical analysis of overlapping continuous sequence elements in sequence-and timeseries data
Authors: Roman Prytuliak, Friedhelm Pfeiffer, Bianca H. Habermann Available command line options for SLALOM Supplementary Table S1. Input command line options. Used abbreviations: SID -sequence identifier, GID -group identifier, CSE -continuous sequence element, TSV -tab-separated values.

Type
Default Description Main input/output files -s, --seqlenfile string (empty) Sequence length file. Maps SIDs to corresponding sequence lengths (or, in case of time series, start and finish points). Must be in tabular format. The file shall not be provided, if the length of all sequences is specified elsewhere -m, --mapfile string (empty) Mapping file. Maps GIDs and SIDs. Must be in tabular format -a1, --anno1file string (required) First/benchmark annotation file. Lists all CSEs from the first annotation, one CSE per line. Must be in tabular format, unless the simplified GenBank mode is activated -a2, --anno2file string (required) Second (prediction) annotation file. See the option '-a1' for details -o, --outfile string (required) Output TSV file with calculated performance measures, for each group and dataset-wide Simplified modes --genbank bool false Activate the simplified GenBank mode. The annotations are expected in GenBank format. Sequence length will be read automatically --bed bool false Activate the simplified BED mode. The annotations are expected in BED format Operation mode setup -b, --benchmarking bool false Activate the benchmarking mode, i.e., treat the first annotation as benchmark and the second as prediction (default: treat the annotations symmetrically) -E, --enrichment_count integer or string 0 Switch between symbol-resolved, gross, and enrichment modes. Must be a non-negative integer or 'gross'. If 0, activates the symbol-resolved mode. If >=1, activates the enrichment mode with minimal number of occurrences in CSEs of a symbol position to count it as enriched. If 'gross', activates the gross mode. See also Figure 2 Core algorithm controls -Os, --overlap_symbols integer 1 Minimal number of symbols required for a CSE of the query overlapping with a CSE/CSEs of the compared annotation to be counted as a match. Must be a positive integer -Op, --overlap_part float 0.0 Minimal share of a query CSE required to overlap with CSE/CSEs of the other annotation to be counted as a match.
Must be in range [0,1] -Oa, --overlap_apply enum shortest The principle to apply the minimal number of symbols and the minimal part: -'shortest': to the shortest of two CSEs (one CSE from the first annotation and the other one from the second) -'longest': to the longest of two CSEs -'current': to the CSE from the annotation currently considered; CSEs from the other annotation are considered one at a time -'patched': to the CSE from the annotation currently considered; allowing multiple CSEs from the subject annotation to contribute simultaneously See also Figure 4 -On, --overlap_nature enum neutral The required overlap nature of the predicted CSEs: -'neutral': count CSE overlaps regardless of the order of their start positions -'leading': only count a predicted CSE as match, if its start position is earlier or the same as that of the respective benchmark CSE -'lagging': only count predicted CSE as match, if the begin position is later or the same as that of the respective benchmark CSE The default value can be changed only in the benchmarking mode and only for non-circular sequences -a, --averaging enum group Order of averaging when calculating performance measures: -'sequence': calculate the measures for each sequence individually, then simple-average groupwide, then simple-average dataset-wide (macromacro averaging) -'group': sum the counts group-wide, then calculate the measures, then simple-average dataset-wide (micro-macro averaging) -'dataset ': sum the counts dataset-wide and calculate the measures (micro-micro averaging); for results on individual groups, apply macro averaging -A, --adjust_for_seqlen bool false Adjust the counts for each sequence for its length, then average the adjusted values (default: sum the counts). Affects only symbol-wise measures. Not compatible with sequence-wide averaging. See also Figure 3.
If an empty string is passed, space will be used, multiple spaces will be collapsed and leading, as well as trailing spaces will be removed -md, --mapfile_delim -a1d, --anno1file_delim -a2d, --anno2file_delim (see the option '-sd') -sh, --seqlenfile_headers integer 0 Number of header rows to skip in the sequence length file. Must be a non-negative integer -mh, --mapfile_headers -a1h, --anno1file_headers -a2h, --anno2file_headers (see the option '-sh') -sc, --seqlenfile_colnumbers list of integers (adjusted) Comma-delimited list of column numbers (left-most column has number 1) in the sequence length file with SIDs and sequence length in that order. In case of time series, start and finish points must be provided instead of the length. The default value is adjusted according to the expected columns -mc, --mapfile_colnumbers list of integers 1,2 Comma-delimited list of column numbers (left-most column has number 1) in the group-mapping file with SIDs and GIDs in that order -a1c, --anno1file_colnumbers list of integers (adjusted) Comma-delimited list of column numbers (left-most column has number 1) in the first annotation file with CSE start positions, CSE end positions, SIDs, GIDs, and CSE names in that order (with those not provided skipped). The default value is adjusted according to the expected columns -a2c, --anno2file_colnumbers (see the option '-a1c') -sq, --seqlenfile_quotes bool false Treat single quotes (apostrophes) in the sequence length file literally. By default, delimiters inside quotes (both single and double) are treated as belonging to fields and quotes themselves are ignored. Double quotes can never be part of read input; their presence in the file triggers an error, if this option is true -mq, --mapfile_quotes -a1q, --anno1file_quotes -a2q, --anno2file_quotes (see the option '-sq') Alternative input options -l, --seqlen_value integer 0 Length of all sequences. Must be a non-negative integer. If 0, this option is ignored. If positive, the sequence length file shall not be provided; the length of every sequence is set to the specified value; all SIDs encountered in the annotations and the group-mapping file will be used. With time series options '-ts' and '-tf' shall be used instead -ss, --single_sequence bool false Process single sequence. SIDs shall not be provided if this option is activated -ts, --timeseries_start string (empty) Start of all time series. If empty, this option is ignored. Must be used together with the option '-tf'. For details, see the option '-l' -tf, --timeseries_finish string (empty) Finish of all time series. For details, see the option '-ts' -a1as, --anno1file_all_sequences bool false Consider all the CSEs from the first annotation as belonging to all the sequences. SIDs shall not be provided if this option is activated. This option cannot be selected for both annotations simultaneously -a2as, --anno2file_all_sequences (see the option '-a1as') -a1ag, --anno2file_all_groups bool false Consider all the CSEs from the first annotation as belonging to sequences with provided SIDs in all the groups. If this option is activated, GIDs shall not be provided, although the group mapping must be provided. This option can be selected for both annotations, but it is not compatible with considering all the CSEs as belonging to all the sequences -a2ag, --anno2file_all_groups (see the option '-a1ag') -sg, --sequences_as_groups bool false Generate respective 1-sequence groups for all the SIDs. If this option is activated, neither the group-mapping file nor GIDs shall be provided -nOg, --non_overlapping_groups bool false Define the groups provided in the group-mapping file as non-overlapping. If this option is activated, GIDs in the annotation files shall not be provided Input controls -n, --site_names bool false Also read in CSE names from the annotation files. This option does not affect calculations, but the names will be shown in the detailed and site-wise output files. The names can be duplicated, but cannot be empty or contain double quotes -t, --time_unit The rule to resolve CSE overlaps within the first annotation: -'all': leave all CSEs untouched -'first': retain only the first CSE from an overlapping group -'last': retain only the last CSE from an overlapping group -'merge': merge an overlapping group to single CSE If 'first' or 'last' is selected, only one CSE will be retained from the whole group, even if the first and last CSEs of the group do not overlap. For circular sequences, only 'all' and 'merge' are supported -a2r, --anno2file_resolve (see the option '-a1r') -a1bs, --anno1file_begin_shift integer 0 Constant shift in symbols of CSE start positions in the first annotation; this number will be added to all the start positions. If strand detection is activated, this option affects actual end positions in the reverse strand. If frame detection is activated, the frames are not affected by this value. In the simplified BED mode, the start positions are additionally incremented by 1 -a2bs, --anno2file_begin_shift (see the option '-a1bs') -a1es, --anno1file_end_shift The rule for overflowing CSE end positions, which is required if a CSE start is non-positive or a CSE end exceeds the sequence length: -'forbid': terminate program with the error message -'trim': trim the CSE to fit the sequence; ignore, if both, begin and end exceed the sequence length and the CSE is thus completely out of the sequence scope -'ignore': ignore the CSE completely -'circular': make overflowing CSEs reappear at the other end of the sequence; the difference between the end and the begin must still be nonnegative and less than the sequence length For time series, 'circular' is not supported -z, --zero_for_na bool false Treat nan values as zeros to calculate averages (default: exclude from averaging) -min, --min_group_size integer 1 Minimal number of sequences in a group. Smaller groups will be ignored. Must be a positive integer -max, --max_group_size integer 0 Maximal number of sequences of a group. Larger groups will be ignored. Must be a non-negative integer. If 0, the size is unlimited -d, --detect enum none If 'strand', detect DNA strand based on the information in the file. If 'frame', in addition detect reading frames on the basis of the remainder of division by 3 of the gene start position in the corresponding strand. The frames are not adjusted by user-specified shifts (see the option '-a1bs' for details). The default value can be changed only in the simplified GenBank or BED modes Additional output files -od, --outfile_detailed string (empty) Output file with details at the sequence and group levels -os, --outfile_sites string (empty) Output file with site-wise statistics including match information -ou, --outfile_union string (empty) Output TSV file with the union of the two input annotations -oi, --outfile_intersection string (empty) Output TSV file with the intersection of the two input annotations -oc1, --outfile_complement_1 string (empty) Output TSV file with the complement of the first/benchmark annotation -oc2, --outfile_complement_2 string (empty) Output TSV file with the complement of the second/prediction annotation -ore1, --outfile_rel_enrichment_1 string (empty) Output TSV file with the sites of relative enrichment in the first/benchmark annotation -ore2, --outfile_rel_enrichment_2 string (empty) Output TSV file with the sites of relative enrichment in the second/prediction annotation Produce cleaned output TSV without the commented lines. This file will contain a single header line with column names and will be ready for import into various data analysis software packages -sort, --sort_output bool false Sort the main output file by the group names (default: sort by the order of appearance in the input). Applicable only, if groups are defined -sum, --calculate_sums bool false Calculate sums of integer counts in addition to averages. Adds additional row in the main output table. Applicable only, if groups are defined Other options -preparse, --preparse_mapfile bool false Pre-parse the group-mapping file before parsing the sequence length file. This option improves performance, if the number of SIDs in the sequence length file is much larger than in the mapping file -w, --warning_level integer 1 Warning level: -0: switch off warnings -1: show all warnings -q, --quiet bool false Do not print the progress in the command line

Case study 3. Influence of events on exchange rates
Sequences: time series of all minutes in a calendar year.
Groups: categories of economic events (e.g., countries of origin); each group contains a single sequence.
Benchmark annotation: directional trends in EURUSD (the underlying open-high-low-close (OHLC) data were downloaded from HistData.com).
Predictor annotation: time intervals of fixed duration from the beginning of each event (the event database was downloaded from FXStreet.com).
In this example, we treated the calendar year 2015 as a sequence of minutes (i.e., the sequence of 60*24*365=525,600 symbols). First, we marked the trends in EURUSD according to the following definition: a bullish trend is an interval encompassing as many as possible, but at least 5 consecutive M5 bars, which close higher than the previous bars, and similarly for a bearish trend. Trends that encompassed missing bars or gaps were excluded.
That means, a trend of N bars has a duration of 5*N minutes (i.e., the length of 5*N symbols). There were in total 1801 trends marked. The trend time intervals were used as the benchmark. With this benchmark, we were testing, if certain events could be good predictors of approaching trends. For this purpose, we marked the time intervals of fixed duration of 30 minutes (i.e., the length of 30 symbols) following each selected event as the predictor annotation. We concentrated on site-wise TPR and FPR with the requirement of at least 50% overlap of the current CSE (event or trend) to match. With this logic, TPR is the share of trends, for which at least half of their duration was within the half-hour following an event, while PPV is the share of events, which for at least 15 minutes within the first halfhour after the announcement coincided with a trend. We set the groups for countries and focused on seven major economies, while events originating from other countries were implicitly filtered out. In this example, one should consider the two following issues: (1) events may occur with less than half-hour interval, and thus 'overlap' in our representation, and (2) trends may start before the event announcements. To address the former, we resolved the overlaps within the individual annotations (i.e., events in particular countries; the trends in our definition never overlap), while retaining only the last event for each overlapping group. To address the latter, we marked our predictor as 'leading', which considers a benchmark-to-prediction CSE match only, if the start position of the predicted CSE (the event) is no greater than the start position of the benchmark CSE (the trend). As can be seen from the Supplementary Table S2, the latter requirement essentially halves the resulting PPV values. To be more precise, it multiplies them with the expected factor of 4/7.
Given that the average duration of a trend is 29 minutes and therefore almost equal to the duration of an event, it can be interpreted such that events and trends are essentially independent.

Country
Number In addition to durable trends, there are also 'spikes' observed on FX markets. Spikes are very rapid and intensive movements; although their magnitude can be quite large, they only last for a few minutes or even seconds. The spikes are often associated with events (like news releases), and, as traders often end up having bigger than expected losses at the spikes (owing to slippage in stop orders), it is a common recommendation (e.g., [1]) not to trade in the wake of news releases. But how grounded are such recommendations? To shed some light on this question, we decided to look at time coincidences of events and spikes. We defined spikes as directional movements of close-to-close magnitude no lower than 0.002 USD with close-to-close on each bar no lower than 0.0005 USD in the same direction. We used M1 bars and posed no restrictions on the number of bars (i.e., it could be just one bar).
We detected overall 426 spikes (or roughly 2 per trading day) with an average duration of roughly 2 minutes. We set the 'duration' of an event to 2 minutes accordingly and recorded a match on any overlap. As a result, we found almost no time correlation of spikes and events (see Supplementary Table S3).

Country
Number of  The detailed information on used command line options can be found further down in this Supplementary Material.

Supplement to the case study 1: option sets, files, and explanations.
Input files: • elm_seq_len_db.tsv: database with protein sequence lengths • elm_instances_2016.tsv: an ELM database dump (as of 14.03.2016) • hhmotif_prediction.csv: results of the HH-MOTiF prediction, as reported in [2] The lengths database file is a two-column TSV containing the UniprotKB identifiers and corresponding sequence lengths; if an alternative isoform is considered, the identifier is followed by dash and the isoform number. A five-line snippet from the file is shown below.
The ELM dump is a 13-column TSV. This file was used both for grouping and as the benchmark annotation. Column 3 contains ELM class names (used as group names), column 5 contains UniprotKB identifiers, while columns 7 and 8 contain begin and end positions of the SLiMs. The rest of the columns were not used in the current analysis. The file also contains six header lines that must be skipped. The newest version of this file can be downloaded from the ELM website (elm.eu.org).
The prediction results file was generated by the command line version of the HH-MOTiF. It is a 4-column CSV. Semicolon is used as separator. The columns contain UniprotKB identifiers, ELM class names, begin and end residues of the predicted SLiMs (motif roots and motif leaves treated equally) in this order. The first five lines of the file are shown below. This command will start the SLALOM analysis in the symbol-resolved mode, with group-wide averaging, no adjustment for sequence length, and without treating nans as zeros. Supplying the additional '-E gross' will switch to the gross mode; the option '-A' switches on the adjustment for sequence length; the option '-z' will switch to the treating nans as zeros. Table 3 in the main text contains the resulting data of all eight combinations from the corresponding columns at the bottom row of the output file.
The option set A1 is also the basis for producing the results shown in Table 4 The file elm_seq_len_db.tsv contains SIDs and lengths in the columns 1 and 2 respectively -m elm_instances_2016.tsv Read group mapping from the file elm_instances_2016.tsv -mc 5,3 The file elm_seq_len_db.tsv contains SIDs and GIDs for mapping in the columns 5 and 3 respectively -mh 6 Skip the first 6 (header) rows in the group mapping file -a1 elm_instances_2016.tsv Read the first annotation (the benchmark) from the file elm_instances_2016.tsv -a1c 7,8,5,3 The file elm_seq_len_db.tsv as the first annotation contains begin residues, end residues, SIDs, and GIDs in the columns 7, 8, 5, and 3 respectively -a1h 6 Skip the first 6 (header) rows in the first annotation file -a2d ';' The second annotation file contains semicolon as delimiter -min 3 The minimal size of a group is 3 (i.e., discard GIDs that are mapped to less then 3 unique SIDs)

-b
Treat the annotations as predicted and benchmark set: the first one is the benchmark, while the second one is the prediction Input files: • N_pharaonis_annotation1.gb: genome annotation from GenBank Output sums in addition to averages for integer counts -c Write the 'clean' output scores file (without comments) As the group mapping is not provided and the '-c' option is used, the main output files for the case without frame separation are 2-line TSVs (with one line being header), and 9-line TSVs otherwise (including both averages and sums). The column 'Nseq' contains the number 1, if there is no division by frame, and 6 otherwise. The site-wise statistics are identical for both output files in the symbol-resolved and gross modes, but show differences upon frame separation. A snippet from the file mapping_6frames.txt is shown below.  The country group mapping is in the file countries.txt -mc 1,1 Let each country be a group on its own by reading the country and the group names from the same column (column 1) of the file -a1 EURUSD_trends.csv Read the first annotation (trends) from the file EURUSD_trends.csv -a1c 1,2 Read the starts and ends of the trends from the columns 1 and 2 of the file, respectively; SIDs and GIDs are not provided because the option '-a1as' is used -a1d ',' Column delimiter in the first annotation file is comma -a1h 1 Skip 1 (header) row at the beginning of the first annotation file -a1as All time intervals in the first annotation belong to all sequences (i.e., considered for every country); do not read SIDs and GIDs from the annotation file itself -a2 eventdates2015.csv Read the second annotation (events) from the file eventdates2015.csv -a2c 1,1,3 Read the starts and 'ends' of the events from the same column 1 of the file; read SIDs (country names) from column 3; GIDs are not provided because of the option '-a2ag' -a2q Read in single quotes from the second annotation file literally -a2ag All time intervals in the second annotation belong to all groups, which contain the corresponding SID (i.e., to all country groups that contain this country) -a2es 30 The ends' shift in the second annotation is 30 symbols forward (i.e., events 'end' 30 minutes after they are announced) -a2r last Resolve 'overlaps' within the second annotation by considering only the last event from an 'overlapping' group -b Treat the first annotation (trends) as the benchmark, while the second (events) -as prediction i.e., number of events considered. To obtain the total number of events, including the overlapping ones, one should not provide the option '-a2r last'. Similarly, the column 'SiteNB' contains the number of trends or spikes. Its value is equal for all countries, as all movements are always considered. The columns 'SiteTPR' and 'SitePPV' were used to produce the data for Tables 5 and 6. The column 'Nseq' contains the number 1, because there is a single country in each country 'group'. Alternatively, one could group the countries by region: Germany, France, and Italy under 'EU', and China and Japan under 'Asia' (as in the provided file country_groups.tsv, which is to be used with the option '-mc 2,1'). In this case one gets only 4 lines of data instead of 7, each line containing the statistics on the whole group.
Furthermore, as some Uniprot identifiers mentioned in ELM were already phased out by Uniprot, we also ignored motifs from those sequences.
The choices we made on micro or macro averaging for comparison between SLALOM and BioConductor are for demonstration purposes only and are not the only ones possible. We rather demonstrate that both BioConductor and SLALOM are flexible in this regard and can be adjusted for the averaging method required.
As the benchmarking mode is used, motif residues in the first annotation are viewed as golden standard (relevant) elements, and the ones in the second annotation are viewed as predicted (selected) elements. Consequently, the column 'Rel.' ('relevant') of the output file 'output_comparison_questions1_2.tsv' contains the macro-averaged share of total protein length covered by the motif instances for all the proteins marked as containing motifs from a specific ELM class. The average (the bottom row) in this column is the simple average for all classes. Similarly, the column 'Sel.' ('selected') contains the corresponding share for the second annotation. These averages are the answer for question 1 from the Main Text. In the provided BioConductor R script, only the former value is calculated.
The options '-Op 1.0 -Oa longest' literally register a matching motif between the two annotations, if the overlap is 100% of the length of the longest motif. This is only possible, if the match is exact on residue level. On the other hand, the options '-Op 0.5 -Oa current' are more relaxed, as they require only 50% coverage for the currently considered motif instance, which may be the shorter one. The share of motif instances in the first annotation having an exact match in the second one -the answer for question 2 from Main Text -can be found in the bottom row ('Average') and the column 'SiteTPR' of the output file 'output_comparison_questions1_2.tsv'. The corresponding share of motif instances covered by at least 50% of their length by a specific instance form the second annotationthe answer to the question 3 -can be found at the same place in the output file 'output_comparison_question3.tsv'. In this example, micro-averaging is done at all levels using the option '-a dataset'.