DMDtoolkit: a tool for visualizing the mutated dystrophin protein and predicting the clinical severity in DMD

Background Dystrophinopathy is one of the most common human monogenic diseases which results in Duchenne muscular dystrophy (DMD) and Becker muscular dystrophy (BMD). Mutations in the dystrophin gene are responsible for both DMD and BMD. However, the clinical phenotypes and treatments are quite different in these two muscular dystrophies. Since early diagnosis and treatment results in better clinical outcome in DMD it is essential to establish accurate early diagnosis of DMD to allow efficient management. Previously, the reading-frame rule was used to predict DMD versus BMD. However, there are limitations using this traditional tool. Here, we report a novel molecular method to improve the accuracy of predicting clinical phenotypes in dystrophinopathy. We utilized several additional molecular genetic rules or patterns such as “ambush hypothesis”, “hidden stop codons” and “exonic splicing enhancer (ESE)” to predict the expressed clinical phenotypes as DMD versus BMD. Results A computer software “DMDtoolkit” was developed to visualize the structure and to predict the functional changes of mutated dystrophin protein. It also assists statistical prediction for clinical phenotypes. Using the DMDtoolkit we showed that the accuracy of predicting DMD versus BMD raised about 3% in all types of dystrophin mutations when compared with previous methods. We performed statistical analyses using correlation coefficients, regression coefficients, pedigree graphs, histograms, scatter plots with trend lines, and stem and leaf plots. Conclusions We present a novel DMDtoolkit, to improve the accuracy of clinical diagnosis for DMD/BMD. This computer program allows automatic and comprehensive identification of clinical risk and allowing them the benefit of early medication treatments. DMDtoolkit is implemented in Perl and R under the GNU license. This resource is freely available at http://github.com/zhoujp111/DMDtoolkit, and http://www.dmd-registry.com. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1504-4) contains supplementary material, which is available to authorized users.


Background
Duchenne muscular dystrophy (DMD) is an X-linked recessive disorder caused by dystrophin gene mutations [1]. It occurs in boys with an incidence rate of 1/3500 [2,3]. DMD patients usually show symptoms between 3 and 5 years old. They tend to lose ability to walk by age 12 years and succumb to cardiopulmonary failure from late teens to early 20s. Both DMD and BMD (a milder phenotype) are caused by mutations in the dystrophin gene. Dystrophin is the largest gene in human genome, spaning 2.4 Mb and containing 79 exons. The full-length transcript expressed in human skeletal muscle encodes a protein of 3685 amino acids, which gives rise to a 427 kDa dystrophin protein (Dp427m) that links cytoskeletal actin to the extracellular matrix via the sarcolemmal dystrophinassociated glycoprotein complex (DGC). Dp427m is composed of four domains: an amino-terminal actinbinding domain (ABD), a central rod domain that contains spectrin-like repeats, a cysteine-rich domain, and a unique carboxy-terminal domain.
The theory currently used to predict whether a mutation will result in a DMD or BMD phenotype is the readingframe rule (Monaco rule): "Adjacent exons that can maintain an open reading frame (ORF) in the spliced mRNA despite a deletion event would give rise to the less severe BMD phenotype and predict the production of a lower molecular weight, semifunctional dystrophin protein. Adjacent exons that cannot maintain an ORF because of frame shifted triplet codons would give rise to the more severe DMD phenotype due to the production of a truncated, nonfunctional dystrophin protein [4]". In-frame mutations, such as deletion of exons 45-47 whose length is 474 bp (i.e., 158 codons), would maintain the ORF and usually lead to BMD. In the case of DMD, the well-known types of DMD-causing mutations include large mutations [large deletions (larger than 1 exon), large duplications (larger than 1 exon)], small mutations [small deletions (less than 1 exon), small insertions (less than 1 exon)], splice site mutations (less than 10 bp from exon), point mutations (nonsense, missense), and mid-intronic mutations. Large deletions, such as deletion of exon 45 whose length is 176 bp (causing frameshift), are the most commonly observed and account for about 68% of the total mutations. The second common mutation is large duplications, such as duplication of exon 2, that account for about 11% [5]. Large deletions usually occur in the rod domain while large duplications mostly occur in the ABD domain.
Currently there are a number of databases reporting correlation between DMD genotype and phenotype. These include the Leiden muscular dystrophy pages (http://www. dmd.nl/) in the Netherlands [6], the UMD-DMD (http:// www.umd.be/DMD/) [7], the eDystrophin (http://edystro phin.genouest.org/) in France, and the TREAT-NMD DMD Global database (http://umd.be/TREAT_DMD/) in Belgium [5]. They offer a web-based query for existing mutations, showing their effects on the function of dystrophin gene and protein, and the frequency of each mutation. Although eDystrophin correlates information between protein isoforms and structures with pathology phenotypes it only shows structure of dystrophin protein and phenotype distribution for existing in-frame mutations. The small insertions or deletions to the splice sites of dystrophin gene appear to follow the reading-frame rule, but it is sometimes difficult to apply to a novel mutation or a nonsense mutation or a combination of multiple mutations. Furthermore, exceptions to the reading-frame rule have been widely reported. Given this limitation, we considered the potential underlying mechanisms of DMD and proposed using several other rules or patterns such as "ambush hypothesis" [8], "hidden stop codons" [9] and "exonic splicing enhancer (ESE)" [10,11] to distinguish between DMD and BMD of various types of mutations.
We previously built a Registration Network of Genetic Diseases database in China (www.dmd-registry.com) with information of more than 1400 Chinese DMD/BMD patients. We have now established a collaboration with the Lilac Garden (www.dxy.cn) according to the upcoming "One City, One Doctor Project" [12]. Lilac Garden is the leading online network and service provider in China. Our doctors and researchers in the field of clinical medicine and life sciences are establishing close working relationships with patients to improve the established database. These are important for developing an effective management team in the field. Early and accurate diagnosis is key to an effective treatment.
In the present study, we developed a computer software DMDtoolkit, which was based on Perl (Practical extraction and reporting language) and R environment, to provide an aid to the diagnosis of DMD. We also took into the consideration of other molecular characteristics such as mutated protein structure, pedigree of DMD family, and frequency of mutations. The DMDtoolkit is provided in the Additional files and can be downloaded from http://github.com/zhoujp111/DMDtoolkit or http:// www.dmd-registry.com after registration.

Implementation
The DMDtoolkit (including DMDtoolkit.pl, DMDtoolkit.R, etc.) was designed to Perl and R by the Department of Neurology in the General Hospital of Chinese People's Armed Police Forces. It is a free software for statistical computing and graphics. DMDtoolkit is a tool for analyzing the dystrophin mutations, predicting structure and features of the disordered protein, and visualizing statistical and genetic test results. It can help the clinicians and patients to better understand DMD.

Environment
Perl is a scripting language first created by Larry Wall to be used as a supplement to the programming which is freely available for download and general use [13]. R is a language and environment for statistical computing and graphics, also freely available for download and academic use [14]. These two platforms can be used jointly to quickly and effectively analyze and visualize the data. All codes were designed using ActivePerl version 5.16.2 and R version 3.0.2 on Windows 10 Professional platform. We refer the reader to the section ' Availability and Requirements' at the end of this article, which is a summary of the software involved in this version of DMDtoolkit.

Data analysis and visualization framework Smartly screening of incomplete data
We found patients' medical records are often incomplete in the clinical indicators for DMD patients. In order to maximize the use of existing data, a module of automatic screening was developed. An example is provided in Table 1.
This DMD child made three visits to the clinic and four sets of indicators were collected: body mass index (BMI), left ventricular end-diastolic dimension (LVEDD), sniff nasal inspiratory pressure (SNIP) and Wechsler Intelligence Scale for Children (WISC). Due to the nonlinear changes of some indicators, the imputation method might not be suitable to use [15]. However, the module named SmartScreen.R was coded with imputation method based on random forest which was a type of ensemble machine learning algorithm. In our work, the most informative data could be to select according to weighted score which is the sum of weight value of all indicators [16]. The weight for each indicator equals to one by default and can be changed by parameter settings. One or more than one indicator of interest can be set indispensable, which means that if any of them was missing, the entire data would be discarded. We used the following formula to calculate the scores: i is the column number of the first indicator, and j is the column number of the last indicator; weight vector can be set via command or be changed in the program.

Assisted diagnosis for DMD/BMD
Reading-frame rule has traditionally been used to distinguish between DMD and BMD, which has been shown to hold true for about 90% of patients [5,6]. Another two methods were later developed: the length of mutated protein [8] and the number of potential stop-gains [9]. The length of mutated protein method was initiated by ambush hypothesis. Fanin et al. emphasized the threshold effect and estimated that the size of a molecule needed to ensure the integrity and function of the dystrophinassociated glycoprotein (DAG) complex should be at least 200 kDa (about 43 exons or 2000 aa) [8]. In this study, the threshold was identified as 3000 aa, which is explained in the following paragraphs. Seligmann et al. revealed that hidden stop codons prevent off-frame gene reading, which was named potential stop-gains in our research. Thus, the number of potential stop-gains was associated with harmfulness of the mutation [9]. At this work, the transcript carrying a mutation or multiple mutations was translated into the mutated protein, and the length of mutated protein and the number of potential stop-gains were calculated. The cutoff value for the length of mutated protein was identified in DMD as less than 3000 aa or more than 3685 aa (outside of the length of normal protein). The cutoff value for the number of potential stop-gains to be regarded in DMD was ≠1, since normal protein has only one stop codon. We combined all three rules (the reading frame rule, the protein length, and the potential stop gains) to predict a DMD versus BMD. Some other rules or patterns were also applied. For example, large in-frame deletions in the central rod domain removing more than 35 exons usually led to DMD [8], and mutations in the cysteine-rich domain usually resulted in DMD [17,18]. The effect of exonic splicing enhancer (ESE) was also considered. A file named "ESE matrices.txt" (in the Additional file 1 "codes_DMDtoolkit") which contains the matrices of serine/arginine-rich (SR) proteins was used to predict the ESE effect. Cartegni et al. revealed that point mutations responsible for genetic diseases may cause aberrant splicing. Such mutations can disrupt splicing by directly inactivating or creating a splice site, by activating a cryptic splice site or by interfering with splicing regulatory elements [19]. A patient would be diagnosed as "DMD" by the joint predication if any method predicted him as "DMD". False positive rate (FPR) and false negative rate (FNR) were calculated for each method. Here a "false positive" means that a "BMD" is falsely predicted as a "DMD", while a "false negative" means that a "DMD" is falsely predicted as a "BMD".
We made several assumptions during the data analysis on DMD/BMD prediction: a. For exon splice sites, we assumed that only the nearby exons would be skipped. b. For missing the promoter region, we assumed that it could not create a transcript. c. For a combination of multiple mutations, the 5′ mutation would be firstly considered. If the transcript was predicted to stop translating before another mutation (3′ mutation), the 3′ mutation would be ignored. d. For some large deletions, several supplementary rules or patterns [8,20,21] for the reading-frame rule were applied.
The patient data used for this test were selected from the TREAT-NMD DMD Global database [5], Flanigan's DMD patients [19] and DMD patients of GHCPAPF (General Hospital of Chinese People's Armed Police Forces). The data were prepared in plain text format, such as "Flanigan's_DMD_patients.txt" in the Additional file 2 "data_DMDtoolkit". This research was approved by research ethics committee and medical ethics committee of General Hospital of Chinese People's Armed Police Forces. We clearly confirmed that signed informed consents were obtained from parents of DMD/BMD children or BMD patients in adult. The mutations which cannot tell the exact change of nucleotide sequence, such as c.1335ins680, were filtered since DMDtoolkit conducts the prediction via translating nucleotide sequence to protein sequence.

Visualization of mutated protein
We drew the sequence of the mutated protein according to its mutation, motifs, and potential protein length, then applied the reading-frame rule. We also analyzed the combination of multiple mutations (such as Large duplication + Small deletion, Splice site + Nonsense). RGui was used to execute the code and display the statistics in the figures (Fig. 1). More snapshots of GUIs can be found in "codes_DMDtoolkit/Manual.docx" in the Additional file 1.
Users familiar with R can also use R studio which includes a code editor, debugging and visualization tools. We selected some common mutations from TREAT-NMD database [5] (3,6) performed in the R Console creates the figure of trend line which is displayed in the R Graphics alone was not able to answer the protein length and stopgain number seemed to be able to avoid this problem.

Visualizing distribution of mutations and pedigree
Basic characteristics such as distribution of mutations in DMD patients and their pedigrees was graphed for easy understanding with its respective modules. The test data were selected from DMD patients of GHCPAPF.

Key features and functionalities
The functions of DMDtoolkit include: 1) aided diagnosis for DMD/BMD using genetic testing 2) drawing the sequence and motifs of mutated protein 3) drawing pedigree of DMD family 4) smart screening data to maximize the use of existing data 5) performing statistical analysis for DMD population and visualizing results.

Assisted diagnosis
DMDtoolkit was used according to four rules: readingframe rule, length of potential protein, number of potential stop-gains, ESE rule, and several patterns on location of mutations. This created three result files: *.diag, *.diag2 and *.diag3 (in the Additional file 3 "results_DMDtoolkit/ results_diagnosis/"). The differences between the three files were the extent of application of reading-frame rule and whether applying patterns or not: *.diag was restricted to exon deletions/duplications; *.diag2 was expanded to small deletions/duplications and splice sites; and *.diag3 applied ESE rule to nonsense mutations, and applied size and location info to in-frame deletions. The following results were obtained based primarily on the *.diag2 file (first four columns of statistics in Table 2) and partly on the *.diag3 file (the last column of statistics in Table 2). The detailed results can be found in the Additional file 3 "prediction results of DMD patients.xlsx" in the folder "results_DMDtoolkit/results_diagnosis/". Based on the reading-frame rule, the accuracy, FPR (false positive rate) and FNR (false negative rate), of predicting DMD/BMD were 91.0%, NA (not applicable), 9   The joint prediction uses all the above three methods. Two criteria were used to determine the joint judgment. First, if a positive judgment comes from two of the three methods, the result is regarded as positive (i.e., DMD/ DMD/BMD will be judged as DMD). By this criterion, the accuracy, FPR and FNR were 93.9%, NA, 6 . We took this result as the "joint prediction (Rules)" in Table 2.
For nonsense mutations, the accuracy of protein length, stop-gain number and the joint prediction for TREAT-NMD DMD patients was 82.4, 99.9 and 99.9%, respectively; for Flanigan's DMD patients they were 70.9, 85.4 and 85.4%, respectively. The accuracy for DMD patients of GHCPAPF was 94.7, 100.0 and 100.0% respectively. The FPR and FNR of protein length, stopgain number and the joint prediction were shown in Table 2. After application of ESE rule [10,11] the accuracy of ESE disrupted mutations among DMD patients in TREAT-NMD, Flanigan and GHCPAPF dropped to 0/ 70 = 0%, 6/49 = 12.2% and 0/5 = 0%, respectively. Therefore, the ESE rule was not used in the DMDtoolkit.

Visualization
DMDtoolkit can draw the sequence of a mutant protein and turn the document into a pdf file (Fig. 2). For protein with multiple mutations resulting in more than two frameshifts, it is difficult to apply the reading-frame rule to predict the mutant protein because the stop-gain may happen before the second frameshift, or the upstream frameshift may change the downstream missense to nonsense. Visualization is an easy way to show the change of a mutated protein, such as c.9563+1G>A plus c.9568C>T (Fig. 2). The Additional file 3 "results_DMDtoolkit/results_diagnosis/*.pdf", such as "case7-1 (combination of multiple mutations).pdf", showed examples of the seven mutation types. DMDtoolkit expanded the R package "kinship" [27] to draw multiple pedigrees at once (Fig. 3). DMDtoolkit can also draw the top N mutations' distribution (N can be set via command parameter) (Fig. 4).

Discussion
According to the first criterion, the joint prediction method was 1.9, 0 and 1.6% higher than the readingframe rule on accuracy of DMD patients from TREAT-NMD, Flanigan's and GHCPAPF groups respectively. According to the second criterion, the joint prediction was 4.1, 3.2 and 2.0% more than the reading-frame rule on accuracy of DMD patients from TREAT-NMD, Flanigan's and GHCPAPF, respectively. The improvement of accuracy mainly originated from the decrease of FNR in DMD patients with large deletions/duplications, and it benefited from the length of potential protein method.
For large deletions, the application of the supplemental patterns improved the total accuracy of joint prediction method without patterns (i.e., "joint prediction (Rules)" in Table 2) up by 1.7, 0.3, 0% and up to 96.8, 88.8 and 94.0% in DMD patients of TREAT-NMD, Flanigan and GHCPAPF, respectively. The improvement was due to the identification of in-frame deletions removing both the actin-binding domain and part of the central rod domain which usually cause DMD.
Future plans for development include the integration of data on pathways and protein-protein interaction (PPI) networks [28]. These will allow more comprehensive analyses on the biological processes of dystrophin and its interactive genes. An automated machine learning approach [29] will also be exploited to quantitatively predict the procession of disease using all available risk/ benefit indicators as well as the probability of BMD/ IMD/DMD.

Conclusions
DMDtoolkit is a unique computer software specifically developed to provide an easy way to analyze the mutant dystrophin protein in order to predict the diagnosis of DMD/BMD. This is achieved by combining genomic analysis with a bioinformatic approach. As for the prediction of DMD/BMD, DMDtoolkit provides a unique advantage when compared with previous predictions solely  Fig. 3 A pedigree of BMD father and DMD son. The family has inherited a disease-causing mutation, i.e., Exon45-47del, which led to BMD. Individual 13 was the proband, the BMD father. Individual 23 was his DMD son with a novo mutation of exon48-52del