Next-generation sequencing (NGS) technologies are widely used to answer key biological questions at the scale of the entire genome and with unprecedented depth [1–4]. Whether the goal is to determine genetic or genomic variations, to catalogue transcripts and assess their expression levels, to identify DNA-protein interactions or chromatin modifications, or to survey the species diversity of an environmental sample, all these tasks are now tackled with large-scale sequencing and require computer-intensive, though different, bioinformatic analyses [5–7].
Identification of genetic variations can be addressed by whole genome sequencing (WGS) or whole exome sequencing (WES) of single individuals. WGS is particularly attractive because it gives access to the full spectrum of genetic variations, i.e. coding and non-coding Single Nucleotide Variations (SNV) and short insertion-deletion variants (indels), as well as Copy Number Variants (CNV) and Structural Variants (SV) [2, 8]. In practice, outside major genome centers, and a fortiori for translation into clinical routine, the development of this approach is still constrained by various difficulties: the organization of production, the still high cost, the actual error rate of the technologies (~1 error per 100 kb, i.e. ~30,000 erroneous variant calls for the whole genome), and the sheer volume of data to store and transfer, which requires intensive informatics infrastructures as well as robust bioinformatics and filtering procedures to retain only clinically relevant variants [8, 9]. As new genomes are sequenced, for example in the context of large projects like the 1000 Genomes Project, the number of variations expected to be novel may decrease. Nevertheless, the first complete individual constitutional genome sequencing studies reported 3-4 million SNPs per genome, 80-90% of which overlapped with the National Center for Biotechnology Information public SNP database (dbSNP), still leaving about 0.5 million novel variations to sift through per genome.
While WGS remains an appealing long-term prospect, WES, which focuses only on the coding regions of the genome, has become within a few years the strategy of choice for identifying coding allelic variants in rare human monogenic disorders. Thanks to DNA enrichment techniques, targeted sequencing of coding regions decreases the cost and improves the efficiency of large-scale discovery of coding variations compared with sequencing the entire human genome. The human exome, made of ~180,000 exons for a size of ~30 Mbp, is 1.5% of the total human genome. Thus, a targeted selection strategy not only reduces the cost but also accelerates the discovery of coding genetic variants that cause rare Mendelian diseases. In 2009, Ng et al., using an intersection recurrence strategy, provided proof of concept that a gene responsible for a rare dominantly inherited disorder (Freeman-Sheldon syndrome) could be identified using WES of independent index cases. Since then, a growing number of papers have confirmed the success of this strategy [14–17].
Up to now, classical approaches such as linkage analysis using genetic markers have been extensively used to identify the molecular basis of nearly 3,500 Mendelian disorders. But for over 3,500 Mendelian disorders, the gene remains unknown [18, 19]. The limited number of patients with rare diseases, or limited access to related family members, has frequently been an obstacle to conducting linkage analysis. With the emergence of NGS technologies, the long and tedious classical linkage analysis for human Mendelian disorders will be replaced by more direct identification of the causal variation(s) and the corresponding gene. Moreover, in numerous cases no karyotypic or array-CGH anomaly is found, or Sanger sequencing of known mutated genes or of neighboring genes in a pathway of interest gives a negative result, because of the low depth of this first-generation sequencing technology. Exome-scale sequencing therefore represents a technological breakthrough in the history of medical genetics, both in fundamental research for disease gene discovery and, consequently, in terms of new diagnostic methods and personalized medicine [12, 14, 16, 21].
Numerous algorithms and software tools have been developed to efficiently manage terabytes of raw sequence variation data from WES. Commonly adopted variation discovery pipelines include successive bioinformatics steps for quality control of the short reads, alignment of the short reads to a reference sequence, variation calling and variation annotation [1, 19, 22–24]. Generally, ~20,000 variations per individual exome are obtained. The challenge lies in efficient filtering strategies to find the causal variant(s) and corresponding gene for a rare disease among these thousands of candidates. With this aim, additional analytical procedures involving various heuristic filtering strategies have emerged [19, 24]. Usually, common variations (more than 90% of the total) are excluded first. This is done by comparison with publicly available databases of human genetic variations and with privately available variants from other exome sequencing projects. To narrow down the search among the remaining variations (often between 200 and 500), other filters take into account the type of variation (focusing on presumed deleterious allelic variants, i.e. nonsynonymous, nonsense, stop loss, frameshift, splice site) and evaluate the functional effect of variations on gene products. Various criteria are usually inspected for this task, such as the physical properties of the wild-type and variant amino acids, the structural properties affecting protein dynamics and stability, the integrity of functional motifs and of binding domains or sites involved in post-translational processing and cellular localization of proteins, and evolutionary properties derived from a sequence alignment [21–24]. Besides the molecular nature and effects of the alternative allelic variants, filtering strategies also have to take into account the mode of inheritance of the disorder suggested by the pedigree (recessive or dominant model for Mendelian disorders, or sporadic cases).
Finally, by taking advantage of multiple individuals, intersection or differential exome strategies can drastically reduce the remaining variations to those affecting only several genes.
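The successive filters described above can be illustrated with a minimal Python sketch. It is not the implementation of any published tool: the `Variant` record, the effect labels, the zygosity encoding and all function names are hypothetical, chosen only for illustration. The inheritance rules are deliberately simplified assumptions (dominant: heterozygous variants; recessive: a homozygous variant or at least two heterozygous variants in the same gene), and the functional-effect predictions mentioned in the text are not modelled.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the presumed deleterious variant types listed in the text.
DELETERIOUS = {"nonsynonymous", "nonsense", "stop_loss", "frameshift", "splice_site"}


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal record for one annotated variant of one individual."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    effect: str    # e.g. "nonsynonymous", "synonymous"
    zygosity: str  # "het" or "hom" (assumed encoding)

    @property
    def key(self):
        return (self.chrom, self.pos, self.ref, self.alt)


def exclude_common(variants, known_keys):
    """Drop variants already present in a set of known (common) variant keys."""
    return [v for v in variants if v.key not in known_keys]


def keep_deleterious(variants):
    """Keep only variants whose type is presumed deleterious."""
    return [v for v in variants if v.effect in DELETERIOUS]


def candidate_genes(variants, model):
    """Return genes compatible with a simplified inheritance model (assumption)."""
    if model == "dominant":
        return {v.gene for v in variants if v.zygosity == "het"}
    if model == "recessive":
        homozygous = {v.gene for v in variants if v.zygosity == "hom"}
        het_counts = Counter(v.gene for v in variants if v.zygosity == "het")
        compound = {gene for gene, n in het_counts.items() if n >= 2}
        return homozygous | compound
    raise ValueError(f"unknown inheritance model: {model}")


def shared_candidates(exomes, known_keys, model):
    """Intersection strategy: genes that remain candidates in every individual."""
    per_individual = [
        candidate_genes(keep_deleterious(exclude_common(vs, known_keys)), model)
        for vs in exomes
    ]
    return set.intersection(*per_individual) if per_individual else set()


# Toy data (invented for illustration only).
patient_a = [
    Variant("1", 100, "A", "G", "GENE1", "nonsynonymous", "het"),
    Variant("2", 200, "C", "T", "GENE2", "synonymous", "het"),
    Variant("3", 300, "G", "A", "GENE3", "nonsense", "het"),
]
patient_b = [
    Variant("1", 150, "T", "C", "GENE1", "frameshift", "het"),
    Variant("3", 300, "G", "A", "GENE3", "nonsense", "het"),
]
known = {("3", 300, "G", "A")}  # stands in for a database of common variants

print(sorted(shared_candidates([patient_a, patient_b], known, "dominant")))
# ['GENE1']
```

In this toy example the GENE3 variant is discarded because it is already known, the GENE2 variant because it is synonymous, and only GENE1 carries a candidate variant in both individuals. The intersection is computed at the gene level rather than the variant level, so that independent index cases carrying different mutations in the same gene still converge on it.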
As exome-scale sequencing is now positioned as a method of choice for disease gene discovery and personalized medicine, the success of the unavoidable strategies for filtering thousands of variations depends on their implementation in user-friendly and versatile software tools. End users with no computational skills must be able to apply and combine different filtering approaches on their own, depending on their assumptions and study design, in order to extract a limited list of likely candidate genes underlying a genetic disease.
With this aim, in partnership with and for medical geneticists, we developed EVA (Exome Variation Analyzer), a free, user-friendly, web-based software tool dedicated to filtering strategies for medical projects investigated with exome sequencing. EVA integrates the main filters dealing with common variations, molecular types, inheritance mode and multiple samples. Here we report a demonstrative case study in which EVA led to the identification of a new candidate gene related to a rare form of Alzheimer's disease. We discuss our development choices and the position of EVA among other recently published filtering tools.