Selected abstracts of “Bioinformatics: from Algorithms to Applications 2020” conference


• Sequencing technologies • Molecular sequence analysis • Computational genomics • Genome assembly • Transcriptomics • Metagenomics • Agrigenomics • Viromics • Natural products discovery

The Fourth International Conference "Bioinformatics: from Algorithms to Applications" was held on July 27-28, 2020, and was accompanied by a two-day online workshop that included metagenomic data analysis and annotation using MGnify [1].
to promote the elucidation of the biological mechanisms that associate this bacterial genus with public health risks.

O4
Do multiple long-distance transfers shape TBEV spread pattern?
Andrei A. Deviatkin 1,2* , Yulia A. Vakulenko 3,4 , Ivan S. Kholodilov 5 , Galina G. Karganova 5,6

Tick-borne encephalitis (TBE) is a viral zoonosis transmitted by the bite of infected ticks. In 1999, phylogenetic analysis demonstrated a clear separation of TBE viruses into three subtypes, named after their presumed distribution: European, Siberian, and Far-Eastern. It is now becoming apparent that the actual spread of these viruses may differ from the nominal one. Herein, 848 TBEV sequences (1028-nt E-gene fragments) were analyzed to identify all long-distance virus transfers that could be revealed from the sequence data. A threshold of 500 km was used to select long-distance transfers; notably, ticks are not able to spread the infection on their own over such distances, so these long-distance transmissions were presumably assisted by hosts transporting infected ticks. Bayesian evolutionary analysis revealed many recent long-distance transfers in all subtypes and in most of the smaller groups within them. Moreover, this appears to be a systematic pattern rather than a set of anecdotal events. For example, 19 out of 125 known sequences of the Far-Eastern subtype were obtained in Japan, and the genetic diversity of the viruses found within this country was comparable with that of the whole subtype; at the same time, the subtype is distributed throughout Japan, China, South Korea, Russia, Estonia and Latvia. These arguments allow us to state that long-distance transfers may be considered a normal and abundant pattern in TBEV spread.
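The 500 km threshold above implies computing great-circle distances between sampling locations. A minimal sketch of such a check (our illustration, not the authors' code; the coordinates and function names are hypothetical) could look like:

```python
# Illustrative sketch: flag pairs of sampling sites farther apart than a
# threshold (500 km in the abstract above) using the haversine formula.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

def is_long_distance(site_a, site_b, threshold_km=500):
    """True if two sampling sites are farther apart than the threshold."""
    return haversine_km(*site_a, *site_b) > threshold_km
```

Applied to all pairs of sequence sampling sites, this yields the candidate long-distance transfer events that phylogenetic analysis can then confirm or reject.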
Background Gene duplication resulting in paralogues is an imperfectly understood process in terms of the order of duplications facilitating the evolution of new functions. An example is the Squalene Synthase-Like (SSL) gene family of the oil-bearing green alga Botryococcus braunii race B. Squalene synthases combine two half-reactions of catalysis: the formation of presqualene diphosphate (PSPP) and the subsequent synthesis of a triterpenoid, in this case squalene. It was previously established that the SSL paralogues have separated these two half-reactions. Materials and methods The squalene synthase (SS) and SSL genes of the organism were sequenced using Illumina reads with SOAP and Velvet assembly, in such a way that the rest of the genome was essentially ignored but this gene family was retrieved in complete detail. By this means the full set of paralogues was determined. Secondly, the genetic "distances" between the four genes (as in silico proteins) so recovered were compared pairwise with each other, with the same set of genes published by a previous group, and with other green algal SS genes. Finally, the pairwise distance comparisons were input into a novel algorithm for Multi-Dimensional Scaling (MDS; 5 dimensions were used), which, in combination with standard substitution matrices and a simple averaging model for the evolutionary rate of the genes, enabled a tree to be derived. Results The order of evolution SS → SSL2 → SSL1/SSL3 was determined, and an evolutionary scenario was inferred. Further, the alignment process necessitated reannotation of key squalene synthases from the green algal model organisms Chlamydomonas and Volvox, with support also obtained from the homologous proteins of the non-model green algae Micromonas and Ostreococcus. Conclusions The B. braunii SS gene, also known as BSS, has diverged further than a typical green algal SS, because selection on BSS was relaxed through duplication to create SSL2, which has all the functions of BSS except that PSPP synthesis is down-regulated without disappearing. C-terminal analysis indicated that green algal SSs may be membrane-associated via two transmembrane alpha-helices plus an additional putative anchoring region. By this C-terminal indicator, the BSS and SSL2 proteins may be in the same compartment/organelle as each other, explaining the somewhat relaxed selection on each. By contrast, SSL1 and SSL3, which together generate a triterpenoid isomer named botryococcene, lack the C-terminal transmembrane alpha-helices when analysed bioinformatically. This may indicate that they have migrated to a different compartment or are in some way separated from the site of squalene synthesis, which has facilitated their separate evolution. SSL1 and SSL3 have stochastically recombined with each other, which may also have facilitated the evolution of their new combined function, botryococcene synthesis, and they are predicted to exist as a heterodimer.

Detection of hidden viral diversity is a challenging task that goes beyond the standard protocol for processing metagenomic data. Meanwhile, publicly available databases contain a large amount of metagenomic data, a promising source of novel viral genomes that remains largely understudied. Here we present a new pipeline for detecting full-length viral genomes in assembled metagenomes. Viral genomes are cyclic or linear molecules whose ends contain repeated sequences; both types can be recognized as cyclic sequences. We detect such contigs by searching for repeats ranging from 50 to 200 bp using the Knuth-Morris-Pratt algorithm. The algorithm runs in time linear in the maximum allowed repeat length, which makes it possible to process large amounts of data and to reduce their dimensionality.
We classify cyclic sequences as viral or non-viral based on predicted gene content using the viralVerify tool. For each selected viral contig we identify the capsid and terminase genes based on HMM profiles, and align the protein sequences found against the NCBI nr database with Diamond. The protein sequences, both queries and hits, belonging to each HMM profile were clustered with CD-HIT v4.8.1 (span 80%, identity 50%). The resulting centroid sequences were aligned using MAFFT v7.310 with default parameters, followed by phylogeny reconstruction using UPGMA and RAxML v8.2.11 separately. Clusters that do not contain any hits were classified as previously unknown. The completeness of viral contigs was inspected with viralComplete and CheckV. We tested our pipeline on assembled metagenomes from the NCBI Assembly database: more than 170 Gb of data representing about 1300 metagenomes derived from seawater, soil and biofilm habitats were analyzed. Our analysis revealed that the diversity of viruses is much greater than currently known: hundreds of new virus clusters were detected. For example, we identified 3 new representatives of the Siphoviridae and Podoviridae bacteriophage families from 10 biofilm-derived metagenomes. Our approach allows us to detect full-length viral genomes with a lower chance of false-positive results. In the future, the user of our pipeline will be able to submit metagenome assemblies or raw reads as input and receive annotated viral genomes from the data. Further analysis of metagenomes from other habitats is indispensable. The project is available on GitHub: https://github.com/Yulia-Yakovleva/metavirome.
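The circularity test described above, recognizing contigs whose start is repeated at their end with a terminal repeat of 50-200 bp, can be sketched with the KMP failure function. This is an assumed reimplementation for illustration, not the published pipeline code:

```python
# Sketch: detect a terminal repeat (assumed 50-200 bp, as in the abstract)
# that marks a contig as circularizable.
def kmp_failure(pattern):
    """Classic Knuth-Morris-Pratt failure (prefix) function."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def terminal_repeat_length(contig, min_len=50, max_len=200):
    """Length of the longest prefix (capped at max_len) that also ends the
    contig; 0 if shorter than min_len. The '#' separator keeps the match
    from crossing the prefix/suffix boundary."""
    window = contig[:max_len] + "#" + contig[-max_len:]
    overlap = kmp_failure(window)[-1]
    return overlap if overlap >= min_len else 0
```

Because only a bounded window at each contig end is examined, the cost per contig is linear in the maximum repeat length, matching the scalability argument above.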

Acknowledgments
This work was supported by Saint Petersburg State University (project ID 51555639).

O7
The 3C criterion: Contiguity, Completeness and Correctness to assess de novo genome assemblies

De novo genome assembly is an open challenge in bioinformatic analyses. In order to select the reconstructed sequence closest to the real genome, different approaches to evaluating and selecting assemblies have been implemented. First, various metrics relate the number and size of the pieces obtained to the expected sequence: the Contiguity. Other comparison strategies have focused on the ability to reconstruct essential genes and known elements: the Completeness (how much of the genome is represented by the pieces of the assembly). The accuracy of the sequenced bases relative to the expected ones, which depends on the DNA sequencing technology, has also been a matter of discussion; this can be referred to as Correctness: how accurately those pieces represent the sequenced genome. In our previous work we conceptualized the 3C criterion (contiguity, completeness and correctness) as a set of metrics that can be used to benchmark genome assemblies, and assessed it using the Costa Rican Pseudomonas aeruginosa AG1 isolate as a model [1].
For the current study, two new clones of P. aeruginosa AG1 were obtained after culturing in media with a high ciprofloxacin concentration, in which these bacteria normally do not grow. In comparison to the Reference genome (P. aeruginosa PAO1), it was estimated that P. aeruginosa AG1 and the two new clones carry ~ 1 Mb of additional DNA sequence in their genomes, justifying a de novo assembly. All genomes were sequenced using short-read (Illumina) and long-read (Nanopore) technologies. A benchmark of 10 approaches was performed for each strain, considering different algorithms (assemblers) and DNA sequencing technologies in hybrid and non-hybrid models. The 3C criterion was used to select the best assembly for each strain.
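As a reminder of the contiguity component of the 3C criterion, the standard N50 statistic (the contig length at which half of the total assembly size is reached) can be computed as in this short, illustrative sketch:

```python
# Sketch of the N50 contiguity metric: sort contigs by length, accumulate
# until half the assembly size is covered, and report that contig's length.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    total = 0
    for length in lengths:
        total += length
        if total >= half:
            return length
    return 0  # empty assembly
```

Higher N50 indicates fewer, longer pieces; completeness and correctness require comparison against gene sets and read alignments, respectively, and are not captured by this statistic alone.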
From the benchmarking results, the better performance of long-read technologies in resolving repeated regions (impacting contiguity) and the fidelity (correctness) obtained with short-read technology stand out. Although some assembly algorithms achieved a single contig as expected, surprisingly, a large number of fragmented genes (frameshifts) were identified in long-read assemblies (affecting correctness and completeness). Thus, assessment using the 3C criterion showed improved performance for a hybrid assembly approach, combining the best advantages of each sequencing technology. These steps are critical not only for understanding the genome architecture of these strains, but also for further studies at other omic levels, as we recently published for the transcriptomic response to ciprofloxacin in P. aeruginosa AG1 [2].

The emergence of a new generation of sequencing technologies has, for the first time, made it possible to significantly accelerate and reduce the cost of determining the complete genome sequences of millions of organisms, from bacteria to human. Scientists are now looking to miniaturize and automate the sequencing process, increase the amount of data obtained, and reduce its cost. From a bioinformatics perspective, it is clear that as the cost of sequencing decreases, the amount of data to be processed will increase, so it is necessary to identify and automate the routine parts of the analysis. We created Pannopi, a scalable, easy-to-use assembly and annotation pipeline based on a hierarchical pangenome graph. The program performs large-scale analysis of bacterial nucleic acid sequences from preparation to functional annotation: the process runs from the preparation of sequence reads through genome assembly and the cleaning of external contamination to structural and functional annotation. Quality control is carried out throughout the process.
Pannopi includes tests and benchmarks not only for genome assembly but also for annotation, using eight genomes from different taxonomic groups. This allows new annotation methods to be tested and benchmarked quickly. The pipeline includes the most advanced and effective tools for genomic annotation and allows flexible customization of their use, so that the user can select between several genome assemblers and annotators, or even run all of the tools for subsequent comparison. Pannopi also allows users to select the taxa for pan-genome comparative genomics and the required modules; it can be used as a stand-alone command-line program or through a web interface. Pannopi output includes: raw data quality control; the assembly; the assembly cleaned of contamination; assembly quality control; structural annotation; functional annotation; pangenome-based comparative annotation; and lists of antibiotic resistance and virulence genes, plasmids, phages, IS elements, tandem repeats, MLST type, and serotype.

As sequencing technologies advance and costs decrease, it is becoming possible to produce individual Reference genomes that rival the quality of the well-established References. In this project we produced a new population-specific human Reference genome. For the initial assembly we used publicly available Illumina, Oxford Nanopore ultralong and PacBio HiFi data for the Human Genome Project HG002 individual from the Ashkenazi Jewish trio; these data are available from the Genome In A Bottle (GIAB) project. We assembled the reads with the MaSuRCA genome assembler version 3.3.1, and then used the MaSuRCA chromosome scaffolder to validate, order and orient the assembled contigs based on their alignments to the human Reference genome GRCh38.p12. The GRCh38 Reference contains many repetitive and therefore hard-to-assemble regions, some of which have been put together using labor-intensive manual curation.
Even long reads from Nanopore and PacBio are still unable to properly resolve many of these regions. Thus, rather than leaving gaps in the chromosome sequences, wherever possible we filled them using the GRCh38.p12 sequence in lowercase letters.

O9
The new Reference, which we call Ash1, has more sequence placed on the chromosomes: 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. While GRCh38 is a mosaic of many different individual genomes, our Reference represents a single, traditional haplotype-merged individual genome. We annotated the genome by transferring the CHESS 2.0 annotation from the GRCh38.p12 Reference using a novel tool, Liftoff. The Ash1 annotation identified 20,157 protein-coding genes, of which 19,563 are > 99% identical to their counterparts on GRCh38.p12. Forty of the protein-coding genes present in GRCh38.p12 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies, and we found no case in which a gene present in GRCh38.p12 was entirely missing from Ash1. Alignment of Illumina reads from an unrelated, part-Ashkenazi (~ 70%) individual, PGP17 from the Personal Genome Project, to Ash1 identified about 1 million fewer homozygous SNPs than alignment of the same reads to the more distant GRCh38 Reference, illustrating one of the benefits of a population-specific Reference genome.

Red Sea brine SMGCs were detected and found to potentially encode natural products pertaining to 28 classes, which were functionally grouped into three main categories comprising the following diverse chemistries, in addition to hybrid clusters: (1) saccharides, fatty acids, aryl polyenes and acyl-homoserine lactones; (2) terpenes, ribosomal peptides, non-ribosomal peptides, polyketides and phosphonates; and (3) polyunsaturated fatty acids, ectoine, ladderane and others. We recently reported our findings; here we will focus on the specific methodology of SMGC detection in metagenomic samples and on a particular selected group of natural products, the ribosomally synthesized and post-translationally modified peptides (RiPPs), in particular the Red Sea brine RiPPs.
In addition to our earlier reported results, here we will focus more on the methodology and give recommendations for optimally mining microbial metagenomes for SMGCs; furthermore, we prioritize an additional selected group (RiPPs) for recommendation to experimental work, to validate and highlight the importance of the implemented methodology.

O11
Metagenomic analysis using k-mer-based tools reveals cyanobacteria and heavy metal response genes in a copper mining site in Benguet Province, Philippines

Kraken2 produced varying identifications of cyanobacteria across sites, while CLARK consistently identified the same cyanobacterial species in all sites. Data sets were deposited at DDBJ/ENA/GenBank under BioProject ID PRJNA504923, accessions VFQP0000000000, VFQQ000000000, VFQR000000000, VFQS000000000, VFQT000000000 and VFQU0000000. Protein-coding sequences output from Prokka [7] and evaluated using eggNOG [8] revealed genes conferring stress responses to Cu2+, Zn2+, Pb2+, Cd2+ and Ca2+ metal ions, as well as the smt metallothionein. These genes are reported to be responsible for efflux/transport functions and heavy metal resistance, which can be major attributes enabling cyanobacterial species to survive extreme metal conditions. This is the first report of cyanobacteria in a copper mining site in Benguet Province analyzed using shotgun metagenomics. BMC Bioinformatics 2020, 21(Suppl 20):567

O12
The search for genetic risk factors of ischemic stroke with the genome-wide association study and machine learning methods

Motivation Since BLAST introduced the seed-and-extend paradigm, indexing fixed-length words (k-mers) from a set of sequences has been the bread and butter of most algorithms and methods relying on sequence similarity. Due to the ever-increasing number of available Reference genomes, there is a growing interest in global approaches able to take into account a very broad sequence range. Ambitious applications such as pangenomics or metagenomics require indexing billions of distinct k-mers and would benefit from incorporating as many Reference genomes as possible. Recently, the problem of representing massive k-mer sets with low memory usage and high throughput has caught the community's interest. In the last few years, several efficient methods (Pufferfish [1], Bifrost [2], BLight [3], REINDEER [4], Kallisto [5], Jellyfish [6], SRC [7]) were proposed for various applications: k-mer counting, quantification, assembly, and more. Some implementations are specific to their main application; others are generic libraries that can serve various purposes. Jellyfish indexes k-mers using an efficient lock-free dynamic hash table scheme to enable fast k-mer counting. Such a scheme needs to store each k-mer in memory, at a cost of several bytes per k-mer (4 bytes for 31-mers). Probabilistic dictionaries [7] can use less than 2 bytes per k-mer at the expense of a low false-positive rate. Recent improvements provided efficient deterministic k-mer set representations, exploiting nucleotide redundancy in k-mer sets to lower the memory cost [1], and k-mer partitioning to further reduce the storage cost and raise cache coherency [3]. However, the efficiency of some of those methods relies on their static nature.
Large construction or update costs make them unfit for applications where insertions or deletions are required. For instance, the rapid acquisition of new data for microbial pangenomes could benefit from dynamic structures. Large-scale dynamic de Bruijn graphs [2,8] are another possible application that is gaining traction.
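To put the per-k-mer memory figures above in perspective: with the usual 2-bit nucleotide encoding, any k-mer with k ≤ 32 fits in a single 64-bit word, so explicit storage costs at least one machine word per k-mer before any hashing or bookkeeping overhead. A minimal sketch of this encoding (an illustration, not code from any of the cited tools):

```python
# Sketch: pack a DNA k-mer (k <= 32) into one integer, 2 bits per base.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Return the 2-bit packed integer code of a DNA k-mer."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code
```

A 31-mer therefore occupies 62 bits; the savings reported by the deterministic representations above come from sharing these bits between overlapping k-mers rather than storing each code independently.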

Results
We present BRISK (Brisk Reduced Index for Sequences of K-mers), a resource-efficient dynamic dictionary able to associate values with k-mers without false positives. It relies on three main ideas. First, instead of storing k-mers independently, we store super-k-mers, sequences of k-mers sharing the same minimizer, to reduce the number of nucleotides required to encode overlapping k-mers; we partition super-k-mers according to their minimizers, which allows us to work on smaller structures and improves cache coherence. Second, we represent each partition as a sorted list of super-k-mers to ensure fast retrieval of k-mers. Lastly, we use fewer nucleotides by encoding only the prefix and the suffix of a super-k-mer, without its minimizer. In practice, using this scheme we can encode on average [9] eight 31-mers into a single super-k-mer that fits in a 64-bit integer. The larger the minimizer size, the faster the queries but the larger the space overhead, meaning that queries can be adapted to different space/time trade-offs. Furthermore, index usage is highly cache-coherent, as querying several k-mers sharing the same minimizer requires only one random memory access.

Secondary structure is known to have a crucial impact on the functioning of RNA molecules; therefore, the development of algorithms for secondary structure modeling and prediction is a fundamental task in computational genomics. Among other methods, secondary structure can be described theoretically by means of formal grammars [1,2]. An approach to the secondary structure analysis of sequences combining formal grammars and neural networks was proposed in [3,4]; in this work, we apply it to RNA secondary structure prediction. Secondary structure can be described as a composition of stems having different heights and loop sizes [5]. We use the context-free grammar from [3] to encode the most common kinds of stems, and the parsing algorithm from [6] to find such stems in sequences.
Note that this grammar describes only classical base pairs and cannot express pseudoknots. The result of the matrix-based parsing algorithm for a sequence is a Boolean matrix representing all stems that are theoretically possible in terms of the grammar; however, the real secondary structure cannot contain all of them at once and, besides, may contain more complex elements not expressible in the grammar. Therefore, parsing matrices require further processing, and we propose to use a neural network to handle them in order to generate the actual secondary structure. For the experiments we took sequences from the RNAcentral [7] database, and as reference data for network training we used the output of the CentroidFold tool [8]: contact matrices that represent connections between nucleotides in the secondary structure. We transformed parsing matrices and contact maps into black-and-white images. These images were used to train a generative neural network that takes a parsing-derived image as input and transforms it into the best approximation of the corresponding contact map. We applied deep residual networks with a local alignment algorithm at the end of the sequence of layers. We trained models with and without alignment on several datasets with fixed sequence-length intervals, and evaluated them by precision, recall and F1 score calculated from the numbers of correctly and incorrectly guessed contacts in each image. All models showed F1 scores of up to 70%, and we found that the smaller the window size, the more accurate the model; moreover, alignment significantly improves the precision of the neural networks by removing contacts that would break the secondary structure.
To conclude, the set of experiments confirmed that the proposed approach is applicable to the secondary structure prediction problem, and further research is required.
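The evaluation described above, comparing predicted and reference base-pair contacts, can be sketched as follows (our notation, not the authors' code; contacts are assumed to be given as sets of (i, j) index pairs):

```python
# Sketch: precision, recall and F1 over predicted vs. reference contacts,
# each represented as a set of (i, j) nucleotide index pairs.
def contact_f1(predicted, reference):
    tp = len(predicted & reference)  # correctly guessed contacts
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Removing contacts that violate secondary-structure constraints shrinks the predicted set without losing true pairs, which is one way the alignment step can raise precision.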

Dissecting the evolutionary mechanisms of the 3-domain Cry toxins diversity
Anton E. Shikov 1,2* , Yury V. Malovichko

Biologicals based on the entomopathogenic gram-positive bacterium Bacillus thuringiensis represent some of the most widespread biopesticides. The potency and specificity of insecticidal action are determined mostly by insecticidal moieties produced mainly as crystalline inclusions during the sporulation growth phase of the bacterium. Although B. thuringiensis produces diverse virulence factors, Cry toxins are the most useful and agriculturally applicable biopesticides. Cry toxins and their subset, the 3-D (three-domain) Cry toxins, exhibit a wide range of affected hosts together with high specificity. Unfortunately, emerging resistance of insects to these toxins, caused by mutations in host receptors, hampers efficient pest management. Two strategies that could contribute to solving this issue are the search for novel toxins and the construction of artificial toxins through domain shuffling. To discover new 3-D Cry toxins in genomic data, we recently developed an HMM-based tool called CryProcessor, which retrieves 3-D Cry toxin sequences from large datasets and provides the layout of individual domains [1]. The tool outperforms its analogs in terms of accuracy, speed, and throughput, and the domain layout it provides could accelerate the in silico construction of chimeric toxins. Considering the diversity of Cry toxins, one generally accepted yet not rigorously validated hypothesis links the diversification of the 3-D Cry toxins with domain exchanges between them. To fill this gap, we conducted a large-scale phylogenetic study of the 3-D Cry toxins. Using CryProcessor, we screened the IPG and GenBank databases and identified 600 novel toxins, which were then merged with the toxins from the Bt Nomenclature. We constructed phylogenetic trees based both on full sequences and on separate domains.
The evaluation of topological differences between the trees revealed a dissimilarity between the topology of the full-sequence tree and the domain-only trees. We then screened the sequences for signals of recombination and revealed 50 recombination events involving each of the domains. Our results indicate that recombination is a pivotal mechanism in the evolution and diversification of 3-D Cry toxins. A more in-depth look into the history of recombination events would allow us to understand the evolutionary mechanisms behind Cry toxin diversity, and to develop new toxins precisely and efficiently. This study was supported by the Russian Science Foundation (20-76-10044).

We study the genetic diversity of trees of the deciduous forests of Thailand to understand their adaptation to past environments and predict their response to future climate change. High-throughput sequencing greatly enhances population genetics studies through its power to genotype individuals at multiple loci. Though methods exist for obtaining genotypes and polymorphism data without a Reference genome, having a Reference sequence, at least for the single-copy regions of the genome, does help when one would like to compare diversity parameters across populations and species. Through k-mer frequency analysis at multiple k-mer lengths, the tree genomes (Xylia xylocarpa, Dipterocarpus tuberculatus, Gluta usitata, Dalbergia spp., Afzelia xylocarpa) are shown to be highly heterozygous. The k-mer length at which the peak frequency of homozygous k-mers equals the peak frequency of heterozygous k-mers is proposed as a reliable measure for comparing the level of polymorphism across heterozygous species.
As the individual genomes are highly heterozygous, assembly programmes struggle to deliver an acceptable Reference sequence even for the non-repetitive part of the genome. Platanus, Platanus-allee, SPAdes and Meraculous were compared for their genome assembly capabilities. None of the programmes delivered a usable Reference genome from the sequence data (approximately 12-25× genome coverage of a single library sequenced with Illumina® paired-end reads of 101 or 150 nucleotides). Assemblies are highly fragmented, generally producing a single contig in short regions of lower heterozygosity and two contigs, with breaks in between, for regions where the haplotypes are highly divergent. We are developing a "haplotype-specific k-mer walking" assembly pipeline based on identifying trusted single-copy k-mers and extending the two chromosomal copies concurrently using read pairs containing those k-mers exactly. Through this pipeline, contigs longer than 10,000 bp can be assembled, spanning multiple scaffolds produced by regular genome assemblers. Moreover, a sufficient number of read pairs contain two polymorphisms, allowing not only genome assembly but also phasing of polymorphic sites at the same time. The k-mer selection and walking process needs to be further automated and parallelized. Data from moderate- to low-coverage Illumina® genome sequencing contain sufficient information for the assembly of long contigs representing both haplotypes of heterozygous individuals.
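The k-mer frequency histograms underlying the heterozygosity analysis above can be illustrated with a naive counter such as the sketch below (for illustration only; real analyses use dedicated k-mer counters on full read sets). In a heterozygous diploid, k-mers spanning heterozygous sites form a peak at roughly half the depth of the homozygous-k-mer peak, and the relative heights of these peaks shift with k.

```python
# Sketch: build the k-mer multiplicity histogram used to compare
# homozygous and heterozygous k-mer peaks.
from collections import Counter

def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that
    many times across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))
```

Repeating this at several k-mer lengths gives the histograms from which the equal-peak k, proposed above as a polymorphism measure, can be read off.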

Metagenomic analysis of the soil microbiota associated with plant gigantism of the unique Siberian Chernevaya Taiga
Mikhail Rayko 1* , Anastasia Kulemzina 2* , Evgeny Abakumov 3 , Georgy Istigechev 4 , Evgeny Andronov 5,6 , Nikolay Lashchinsky 7 , Alla Lapidus

Chernevaya taiga can be described as a boreal forest formation limited in its spread to the hyper-humid sections of the Altai-Sayan mountainous region. It is characterized by a series of unique ecological traits, the most notable of which is the gigantism of perennial grassy plants and bushes.
The main goal of the study is to discover and parametrize the main factors behind the anomalously elevated effective fertility of Chernevaya taiga soils, with a focus on the microbial communities. We aim to establish a link between the distinct properties of Chernevaya taiga and the chemical parameters of the soil, the rate of moisturization, the unique composition of the microbiota, and/or the aggregate of all of these factors.
Based on 16S analysis of the soils from two Chernevaya taiga locations (Novosibirsk and Tomsk regions) and control soils, we found that the richness of the soil microbiota decreases significantly with increasing sampling depth. The taxonomic structure of the microbiota of the top layers (0-15 cm) has similar properties across the different geographical locations of the Chernevaya taiga. The most prevalent phyla in the top layers of the Chernevaya taiga soils are Proteobacteria, Acidobacteria and Verrucomicrobia. Differences in the microbiota composition of the rhizosphere of Crepis sibirica between Chernevaya taiga and control regions were investigated using the linear discriminant analysis effect size approach. We found bacterial taxa that are differentially abundant between the two groups: Bacteroidetes (in particular Sphingobacteria and Cytophagia) were more abundant in the control group, and Actinobacteria (mostly Thermoleophilia) and Verrucomicrobia (Chthoniobacterales) in the Chernevaya taiga samples. This may indicate the specificity of the Chernevaya taiga microbiome and its importance for the features of this biotope. The reported study was funded by the Russian Science Foundation (grant ID 19-16-00049).
obtained via Illumina MiSeq; processing was performed in R. For trimming, filtering and ASV determination followed by read merging, the standard DADA2 pipeline was used. A phylogenetic tree was constructed with the MAFFT and FastTree implementations in QIIME 2. Rarefaction to the minimum read number (12,833), bar graphs, and alpha- and beta-diversity metric calculations were performed using phyloseq (visualization with ggplot2).
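The alpha-diversity metrics mentioned above can be illustrated with a minimal sketch. The study itself used phyloseq in R; this Python version is only a didactic analogue computed from a vector of ASV counts:

```python
# Sketch: two common alpha-diversity metrics from a sample's ASV counts.
from math import log

def shannon(counts):
    """Shannon diversity index H = -sum(p_i * ln p_i) over nonzero ASVs."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def observed_richness(counts):
    """Number of ASVs observed at least once."""
    return sum(1 for c in counts if c > 0)
```

Rarefying all samples to a common depth (12,833 reads above) before computing such metrics keeps richness comparisons from being driven by sequencing effort.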
In parallel with sediment sampling we collected samples of dissolved greenhouse gases (CO2 and CH4) by the headspace method and measured CO2 and CH4 concentrations as well as δ13C-CH4 and δ13C-CO2. Metabolic pathways of methane production were revealed using the δ13C-CH4 signature in anoxic incubation experiments with sediments. As a result, we analyzed 52 bacterial and archaeal phyla, among which methanogenic Archaea constituted only 0.5-0.6% of sequences in the amplicon libraries (depending on the sample). The methanogenic community of the bottom sediments of the Yenisei River was dominated by archaea belonging to Methanosarcina, Methanosaeta and Methanoregula. The OTU abundance of these archaea was larger in sediments collected between 56°N and 61°N. Along this channel segment the δ13C-CH4 of the dissolved methane increased from −54 to −43‰ VPDB, indicating methylotrophic and acetoclastic methanogenesis. In the segment between 61°N and 64°N the OTU abundance of methanogenic Archaea decreased dramatically (5-190 times), accompanied by a sharp depletion of δ13C-CH4 down to −60 to −80‰ VPDB, indicating a shift to the hydrogenotrophic metabolic pathway of methane production. In this river area we also observed an increasing OTU abundance of anaerobic methanotrophs belonging to Candidatus Methanoperedens. Further north (64-67°N) we observed enrichment of δ13C-CH4 and an increase of archaea belonging to Methanosarcina and Methanoregula in the methanogenic community. We think that NGS sequencing data can clarify the taxonomic composition of the methanotrophic and methanogenic communities of sediments, their activity and impact on geochemical methane-driving processes, and reveal the active participants in those communities.

P9
Bioinformatics analysis of short-chain fatty acid production potential in the human gut microbiome
Conclusions CPI gives a probabilistic estimate of the fraction of community cells possessing a specific metabolic capability. The high concordance between the in silico predicted butyrate and propionate production capabilities and their in vitro measured concentrations validates our phenotype profiling approach.
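A community phenotype index of this kind can be sketched as an abundance-weighted probability; the formula and all numbers below are a generic illustration of the idea, not the authors' exact method or data:

```python
def community_phenotype_index(abundances, phenotype_prob):
    """Estimate the fraction of community cells carrying a metabolic
    capability: CPI = sum_i p_i * P(phenotype | taxon i), where p_i is
    the relative abundance of taxon i. A generic sketch, not the
    authors' exact formula."""
    total = sum(abundances.values())
    return sum((n / total) * phenotype_prob.get(taxon, 0.0)
               for taxon, n in abundances.items())

# Hypothetical gut community and per-taxon butyrate-production probabilities
abund = {"Faecalibacterium": 300, "Bacteroides": 500, "Escherichia": 200}
p_butyrate = {"Faecalibacterium": 0.95, "Bacteroides": 0.05, "Escherichia": 0.0}
cpi = community_phenotype_index(abund, p_butyrate)
print(round(cpi, 3))
```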
undergoes developmental retardation at the pre-storage phase of development, followed by developmental acceleration after 10 DAP. Consistent with this notion, transcripts associated with storage compound accumulation and the acquisition of desiccation tolerance were found to be expressed earlier in Sprint-2. The earlier transition to maturation in Sprint-2 can be partly explained by the premature activation of genes encoding the primary LAFL transcription factors, which were also found to bear strong missense substitutions compared to the two other pea lines. Moreover, at both 10 and 20 DAP, Sprint-2 demonstrated an elevated rate of mobile genetic element activity, mostly of retrotransposons of the Copia family. The promoted transposon activity may be connected to an altered expression pattern of DNA methylase genes found in Sprint-2. This includes an earlier onset of the CHROMOMETHYLASE (CMT) pathway and elevated expression of RDM1 and DRM genes at 20 DAP compared to the two normally maturing accessions. The obtained data indicate that transposable element activity may underlie, or rather accompany, seed development heterochrony in pea, but further experiments are needed to test this hypothesis.
Background Microbial genomics has seen rapid improvements in the past decade, primarily due to the development of novel algorithms capable of assembling the data generated by a variety of next-generation sequencing technologies into a high-quality genome. Depending on the sequencing technology, the type of libraries, and the complexity of the genome, this has most often resulted in the generation of draft genomes. The completion of these microbial genomes, however, has remained a challenge. Even with the advancement of technologies that produce long reads, the cost-effectiveness of short-read technologies has resulted in the deposition of 468,154 (as of December 2019) permanent-draft genomes (i.e., genomes unlikely ever to be completed) in the NCBI database, while the number of complete genomes is only 16,814. The present work aims to develop a computational workflow to improve the quality of these permanent-draft genomes using information from complete genomes of evolutionarily related microbes.

Materials and methods
The complete genome of Escherichia coli (NZ_CP027599.1) was selected as the standard assembly for our study. Short-read data sets of varying read lengths (75 bp to 250 bp) were simulated using the programs ART [1], DWGSIM [2], NEAT [3], pIRS [4], and Wgsim [5]. These reads were reference-mapped to the standard assembly (NZ_CP027599.1) using BWA [6], Bowtie [7], Novoalign (https://novocraft.com/), and SMALT (https://www.sanger.ac.uk/tool/smalt-0/). Fifteen different genomes from the genera Citrobacter, Enterobacter, Salmonella, Shigella, and Yersinia, at varying evolutionary distances to Escherichia coli, were used as references for mapping the aforementioned simulated reads. The best resulting assemblies were selected as input for the software GFinisher [8]. De novo assembly of the simulated reads was done using Unicycler [9] for comparison. The different assemblies were compared to the complete genome of Escherichia coli (NZ_CP027599.1) using QUAST [10] and Circos [11].
Results
Our workflow uses information from multiple reference genomes to obtain an improved assembly of the simulated reads of Escherichia coli (NZ_CP027599.1). It is envisaged that, with the increase in the number of complete genomes of a given genus in NCBI, the information contained in the genomes of related microbes can be exploited to obtain an assembly with improved contiguity, and with no loss of strain-specific information, using the original short-read data from the Sequence Read Archive.
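The headline contiguity comparison QUAST performs rests on metrics such as N50, which can be computed directly from contig lengths. A short sketch (the contig lengths below are invented, not results from this study):

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L cover at least
    half of the total assembly size (one of QUAST's contiguity metrics)."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered * 2 >= total:
            return length

# A fragmented draft vs. a reference-guided reassembly (made-up lengths, bp)
draft  = [120_000, 90_000, 60_000, 30_000, 10_000]
guided = [250_000, 50_000, 10_000]
print(n50(draft), n50(guided))
```

A higher N50 at equal total size indicates the improved contiguity the workflow aims for, though QUAST's misassembly counts matter just as much when judging reference-guided assemblies.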
Conclusions A proof-of-concept using simulated short-read data sets of Escherichia coli is presented to highlight the improvements in the assembly guided by multiple reference genomes.

P12
Reducing redundancy of input data sets to improve inference of transcription factor binding sites
The majority of bacterial genome annotations lack information about transcription factor (TF) binding sites (operators), which control how genomic information is expressed. We are developing the SigmoID application [1] to solve this problem in a highly automated fashion. In brief, the motif discovery algorithm involves analysing 3D crystal structures of TF-operator complexes, finding TFs with the same contacts between operators and DNA-binding domains (CR-tag), and then looking for autoregulatory operator motifs in the promoter regions surrounding the genes encoding these TFs. The success of motif discovery strongly depends on the diversity of the promoter region dataset. Assembling appropriate datasets proved to be challenging due to the large sizes and rapid expansion of protein databases. In the first step of our pipeline, homologous TFs with CR-tags identical to the one being studied are retrieved. In our experience, this stage proved to be highly unreliable if public phmmer or blastp servers were used. Local searches require a fast workstation and the maintenance of large databases, which is undesirable taking into account the target audience (bench scientists). Also, many thousands of homologous proteins with a matching CR-tag are expected for many TFs, while no more than 30-50 are usually required. We have replaced the problematic TF homolog search step with fast lookup tables. The tables match a CR-tag to the IDs of all proteins with this tag. Usage of the PIR representative proteome databases [2] with different co-membership thresholds as the sequence source for building lookup tables mostly solved the excessive redundancy problem. The optimal number of homologs could often be achieved by simply taking the IDs of the proteins from one of the five lookup tables. If the number of homologs was still excessive, an additional promoter region clustering step was performed.
We have found MeShClust [3] to be the optimal tool at this stage. The efficiency of different clustering approaches and search options was tested by inferring operator motifs for Escherichia coli TFs from several protein families. The double clustering approach proved to be the fastest and produced better motifs in some cases, as it did not have to resort to random selection of sub-optimal promoter regions when their number was excessive. We have also noticed many cases of SigmoID producing "good" operator motifs matching experimental data when such a motif was absent from the RegulonDB database [4] or was incorrect there. The SigmoID v2 software with CR-tag lookup tables for 13 TF families is available for download on GitHub at https://github.com/nikolaichik/SigmoID.
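The CR-tag lookup table described above amounts to an inverted index from tag to protein IDs. A minimal sketch of the data structure follows; `extract_cr_tag`, the fixed residue positions, and the toy sequences are all stand-ins for illustration (SigmoID derives the actual tag from TF-operator crystal structure contacts):

```python
from collections import defaultdict

def extract_cr_tag(seq, positions=(2, 5, 8)):
    """Stand-in for CR-tag extraction: slice fixed residue positions.
    In SigmoID the contact residues come from 3D structure analysis."""
    return "".join(seq[i] for i in positions if i < len(seq))

def build_lookup(proteome):
    """Map each CR-tag to the IDs of all proteins carrying it, so that
    homolog retrieval becomes a dictionary lookup rather than a search."""
    table = defaultdict(list)
    for protein_id, seq in proteome.items():
        table[extract_cr_tag(seq)].append(protein_id)
    return table

# Toy representative proteome: protein ID -> sequence
proteome = {"P1": "MKTRLQVEHGA", "P2": "MATRYQVEHGW", "P3": "MKSALNVDHGG"}
table = build_lookup(proteome)
print(dict(table))
```

Building such tables once per representative proteome, at several co-membership thresholds, gives the tiered lookup the abstract describes without requiring bench scientists to run local phmmer or blastp searches.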