- Open Access
DMDtoolkit: a tool for visualizing the mutated dystrophin protein and predicting the clinical severity in DMD
BMC Bioinformatics volume 18, Article number: 87 (2017)
Dystrophinopathy is one of the most common human monogenic diseases and results in Duchenne muscular dystrophy (DMD) or Becker muscular dystrophy (BMD). Mutations in the dystrophin gene are responsible for both DMD and BMD. However, the clinical phenotypes and treatments differ considerably between these two muscular dystrophies. Because early diagnosis and treatment lead to better clinical outcomes in DMD, accurate early diagnosis is essential for efficient management. Previously, the reading-frame rule was used to predict DMD versus BMD, but this traditional tool has limitations. Here, we report a novel molecular method to improve the accuracy of predicting clinical phenotypes in dystrophinopathy. We utilized several additional molecular genetic rules or patterns, such as the “ambush hypothesis”, “hidden stop codons” and “exonic splicing enhancer (ESE)”, to predict the expressed clinical phenotype as DMD versus BMD.
A software tool, “DMDtoolkit”, was developed to visualize the structure and to predict the functional changes of the mutated dystrophin protein. It also assists statistical prediction of clinical phenotypes. Using DMDtoolkit, we showed that the accuracy of predicting DMD versus BMD rose by about 3% across all types of dystrophin mutations when compared with previous methods. We performed statistical analyses using correlation coefficients, regression coefficients, pedigree graphs, histograms, scatter plots with trend lines, and stem and leaf plots.
We present a novel tool, DMDtoolkit, to improve the accuracy of clinical diagnosis for DMD/BMD. This computer program allows automatic and comprehensive identification of clinical risk, giving patients the benefit of early medication treatment. DMDtoolkit is implemented in Perl and R under the GNU license. It is freely available at http://github.com/zhoujp111/DMDtoolkit and http://www.dmd-registry.com.
Duchenne muscular dystrophy (DMD) is an X-linked recessive disorder caused by dystrophin gene mutations. It occurs in boys with an incidence rate of 1/3500 [2, 3]. DMD patients usually show symptoms between 3 and 5 years of age. They tend to lose the ability to walk by the age of 12 years and succumb to cardiopulmonary failure between their late teens and early 20s. Both DMD and BMD (a milder phenotype) are caused by mutations in the dystrophin gene. Dystrophin is the largest gene in the human genome, spanning 2.4 Mb and containing 79 exons. The full-length transcript expressed in human skeletal muscle encodes a protein of 3685 amino acids, which gives rise to a 427 kDa dystrophin protein (Dp427m) that links cytoskeletal actin to the extracellular matrix via the sarcolemmal dystrophin-associated glycoprotein complex (DGC). Dp427m is composed of four domains: an amino-terminal actin-binding domain (ABD), a central rod domain that contains spectrin-like repeats, a cysteine-rich domain, and a unique carboxy-terminal domain.
The theory currently used to predict whether a mutation will result in a DMD or BMD phenotype is the reading-frame rule (Monaco rule): “Adjacent exons that can maintain an open reading frame (ORF) in the spliced mRNA despite a deletion event would give rise to the less severe BMD phenotype and predict the production of a lower molecular weight, semifunctional dystrophin protein. Adjacent exons that cannot maintain an ORF because of frame shifted triplet codons would give rise to the more severe DMD phenotype due to the production of a truncated, nonfunctional dystrophin protein”. In-frame mutations, such as deletion of exons 45-47 whose length is 474 bp (i.e., 158 codons), would maintain the ORF and usually lead to BMD. In the case of DMD, the well-known types of DMD-causing mutations include large mutations [large deletions (larger than 1 exon), large duplications (larger than 1 exon)], small mutations [small deletions (less than 1 exon), small insertions (less than 1 exon)], splice site mutations (less than 10 bp from exon), point mutations (nonsense, missense), and mid-intronic mutations. Large deletions, such as deletion of exon 45 whose length is 176 bp (causing a frameshift), are the most commonly observed and account for about 68% of the total mutations. The second most common mutation type is large duplications, such as duplication of exon 2, which account for about 11%. Large deletions usually occur in the rod domain while large duplications mostly occur in the ABD domain.
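The arithmetic behind the two deletion examples above can be checked with a short sketch. It is written in Python purely for illustration and is not part of DMDtoolkit (which is implemented in Perl and R); the function name `reading_frame_rule` is hypothetical, and the rule is reduced here to its simplest form, namely whether the net change in coding length is a multiple of three.

```python
def reading_frame_rule(net_length_change_bp):
    """Simplified reading-frame (Monaco) rule.

    A net change in coding length that is a multiple of 3 keeps the open
    reading frame and is called 'BMD'; any other change shifts the frame
    and is called 'DMD'. Deletions are negative, duplications positive.
    """
    return "BMD" if net_length_change_bp % 3 == 0 else "DMD"


# Exon lengths taken from the text above.
exon_45_47_deletion = -474   # 474 bp = 158 codons, frame preserved
exon_45_deletion = -176      # 176 bp is not a multiple of 3, frameshift

print("exon 45-47 deletion:", reading_frame_rule(exon_45_47_deletion))
print("exon 45 deletion:   ", reading_frame_rule(exon_45_deletion))
```

The sketch ignores everything except length, which is exactly why the rule struggles with nonsense mutations and combinations of mutations, as discussed in the following paragraphs.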
Currently there are a number of databases reporting correlations between DMD genotype and phenotype. These include the Leiden muscular dystrophy pages (http://www.dmd.nl/) in the Netherlands, the UMD-DMD (http://www.umd.be/DMD/), the eDystrophin (http://edystrophin.genouest.org/) in France, and the TREAT-NMD DMD Global database (http://umd.be/TREAT_DMD/) in Belgium. They offer a web-based query for existing mutations, showing their effects on the function of the dystrophin gene and protein, and the frequency of each mutation. Although eDystrophin correlates information on protein isoforms and structures with pathology phenotypes, it only shows the structure of the dystrophin protein and the phenotype distribution for existing in-frame mutations. Small insertions or deletions and splice site mutations of the dystrophin gene appear to follow the reading-frame rule, but the rule is sometimes difficult to apply to a novel mutation, a nonsense mutation, or a combination of multiple mutations. Furthermore, exceptions to the reading-frame rule have been widely reported. Given this limitation, we considered the potential underlying mechanisms of DMD and proposed using several other rules or patterns, such as the “ambush hypothesis”, “hidden stop codons” and “exonic splicing enhancer (ESE)” [10, 11], to distinguish between DMD and BMD for various types of mutations.
We previously built a Registration Network of Genetic Diseases database in China (www.dmd-registry.com) with information on more than 1400 Chinese DMD/BMD patients. We have now established a collaboration with the Lilac Garden (www.dxy.cn) as part of the upcoming “One City, One Doctor Project”. Lilac Garden is the leading online network and service provider in China. Our doctors and researchers in the field of clinical medicine and life sciences are establishing close working relationships with patients to improve the established database. These relationships are important for developing an effective management team in the field. Early and accurate diagnosis is key to effective treatment.
In the present study, we developed a software tool, DMDtoolkit, based on Perl (Practical extraction and reporting language) and the R environment, to aid the diagnosis of DMD. We also took into consideration other molecular characteristics such as mutated protein structure, pedigree of the DMD family, and frequency of mutations. DMDtoolkit is provided in the Additional files and can be downloaded from http://github.com/zhoujp111/DMDtoolkit or http://www.dmd-registry.com after registration.
DMDtoolkit (including DMDtoolkit.pl, DMDtoolkit.R, etc.) was written in Perl and R by the Department of Neurology in the General Hospital of Chinese People’s Armed Police Forces. It is free software for statistical computing and graphics. DMDtoolkit is a tool for analyzing dystrophin mutations, predicting the structure and features of the disordered protein, and visualizing statistical and genetic test results. It can help clinicians and patients to better understand DMD.
Perl is a scripting language first created by Larry Wall, and it is freely available for download and general use. R is a language and environment for statistical computing and graphics, also freely available for download and academic use. These two platforms can be used jointly to analyze and visualize data quickly and effectively. All code was developed using ActivePerl version 5.16.2 and R version 3.0.2 on the Windows 10 Professional platform. We refer the reader to the section ‘Availability and Requirements’ at the end of this article, which summarizes the software involved in this version of DMDtoolkit.
Data analysis and visualization framework
Smart screening of incomplete data
We found that the medical records of DMD patients are often incomplete with respect to clinical indicators. In order to maximize the use of existing data, a module for automatic screening was developed. An example is provided in Table 1.
This DMD child made three visits to the clinic and four sets of indicators were collected: body mass index (BMI), left ventricular end-diastolic dimension (LVEDD), sniff nasal inspiratory pressure (SNIP) and Wechsler Intelligence Scale for Children (WISC). Due to the non-linear changes of some indicators, the imputation method might not be suitable. However, the module named SmartScreen.R was coded with an imputation method based on random forest, a type of ensemble machine learning algorithm. In our work, the most informative data can be selected according to a weighted score, which is the sum of the weight values of all indicators. The weight for each indicator equals one by default and can be changed by parameter settings. One or more indicators of interest can be set as indispensable, which means that if any of them is missing, the entire record is discarded. We used the following formula to calculate the scores:
Here, i is the column number of the first indicator and j is the column number of the last indicator; the weight vector can be set via the command line or changed in the program.
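The screening idea described above can be sketched as follows. This is an illustrative Python sketch, not the SmartScreen.R code: it assumes that the score of a record is the sum of the weights of those indicators that are actually recorded (columns i to j), that a record missing any indispensable indicator is discarded, and that ties go to the earliest visit. The function names and all indicator values are hypothetical.

```python
def weighted_score(record, weights, required=()):
    """Score one visit record.

    record   : dict mapping indicator name -> value, or None if missing
    weights  : dict mapping indicator name -> weight (1 by default in the text)
    required : indicators that must be present; otherwise the record is dropped
    Returns None for a dropped record, else the sum of weights of the
    indicators that are present (an assumed reading of the scoring formula).
    """
    if any(record.get(name) is None for name in required):
        return None
    return sum(w for name, w in weights.items() if record.get(name) is not None)


def most_informative(visits, weights, required=()):
    """Return the visit with the highest score (earliest visit wins ties)."""
    scored = [(weighted_score(v, weights, required), idx) for idx, v in enumerate(visits)]
    kept = [(score, idx) for score, idx in scored if score is not None]
    if not kept:
        return None
    _, best_idx = max(kept, key=lambda t: (t[0], -t[1]))
    return visits[best_idx]


# Hypothetical values for three clinic visits of one child.
visits = [
    {"BMI": 16.2, "LVEDD": None, "SNIP": None, "WISC": 85},
    {"BMI": 16.8, "LVEDD": 38, "SNIP": 45, "WISC": None},
    {"BMI": None, "LVEDD": 40, "SNIP": 50, "WISC": 88},
]
weights = {"BMI": 1, "LVEDD": 1, "SNIP": 1, "WISC": 1}

print(most_informative(visits, weights))                      # second visit
print(most_informative(visits, weights, required=("WISC",)))  # third visit
```

Raising the weight of one indicator, or marking it as required, changes which visit is kept, which mirrors the parameter settings described in the text.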
Assisted diagnosis for DMD/BMD
The reading-frame rule has traditionally been used to distinguish between DMD and BMD, and has been shown to hold true for about 90% of patients [5, 6]. Two further methods were later developed: the length of the mutated protein and the number of potential stop-gains. The length of mutated protein method was motivated by the ambush hypothesis. Fanin et al. emphasized the threshold effect and estimated that the size of a molecule needed to ensure the integrity and function of the dystrophin-associated glycoprotein (DAG) complex should be at least 200 kDa (about 43 exons or 2000 aa). In this study, the threshold was set at 3000 aa, as explained in the following paragraphs. Seligmann et al. revealed that hidden stop codons prevent off-frame gene reading; these are termed potential stop-gains in our research. Thus, the number of potential stop-gains was associated with the harmfulness of the mutation. In this work, the transcript carrying one or more mutations was translated into the mutated protein, and the length of the mutated protein and the number of potential stop-gains were calculated. For the length of the mutated protein, the cutoff for DMD was less than 3000 aa or more than 3685 aa (outside the length of the normal protein). For the number of potential stop-gains, the cutoff for DMD was ≠1, since the normal protein has only one stop codon. We combined all three rules (the reading-frame rule, the protein length, and the potential stop-gains) to predict DMD versus BMD. Some other rules or patterns were also applied. For example, large in-frame deletions in the central rod domain removing more than 35 exons usually led to DMD, and mutations in the cysteine-rich domain usually resulted in DMD [17, 18]. The effect of exonic splicing enhancers (ESE) was also considered.
A file named “ESE matrices.txt” (in the Additional file 1 “codes_DMDtoolkit”), which contains the matrices of serine/arginine-rich (SR) proteins, was used to predict the ESE effect. Cartegni et al. revealed that point mutations responsible for genetic diseases may cause aberrant splicing. Such mutations can disrupt splicing by directly inactivating or creating a splice site, by activating a cryptic splice site or by interfering with splicing regulatory elements. A patient would be diagnosed as “DMD” by the joint prediction if any method predicted him as “DMD”. False positive rate (FPR) and false negative rate (FNR) were calculated for each method. Here a “false positive” means that a “BMD” is falsely predicted as a “DMD”, while a “false negative” means that a “DMD” is falsely predicted as a “BMD”.
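The two additional rules can be illustrated with a minimal Python sketch (not the DMDtoolkit Perl code). It assumes that the protein length is the number of codons translated before the first in-frame stop codon, and that the number of potential stop-gains is the count of stop codons met when the whole mutated coding sequence is read in frame; the source does not spell out the exact counting procedure, so both definitions and all function names are assumptions. The cutoffs (3000 aa, 3685 aa, ≠1) are those given in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(cds):
    """Split a coding sequence into complete codons, dropping a partial tail."""
    usable = len(cds) - len(cds) % 3
    return [cds[k:k + 3] for k in range(0, usable, 3)]


def protein_length(cds):
    """Number of amino acids translated before the first in-frame stop codon."""
    length = 0
    for codon in codons(cds):
        if codon in STOP_CODONS:
            break
        length += 1
    return length


def count_in_frame_stops(cds):
    """Assumed definition of 'potential stop-gains': all in-frame stop codons."""
    return sum(1 for codon in codons(cds) if codon in STOP_CODONS)


def length_rule(aa_length, low=3000, high=3685):
    """DMD if the mutated protein is shorter than `low` or longer than `high`."""
    return "DMD" if aa_length < low or aa_length > high else "BMD"


def stop_gain_rule(n_stops):
    """DMD unless exactly one stop codon is found, as in the normal transcript."""
    return "DMD" if n_stops != 1 else "BMD"


# Toy sequence (hypothetical): a premature TAA followed by the regular TGA.
toy_cds = "ATGGCTTAAGCTTGA"
print(protein_length(toy_cds), count_in_frame_stops(toy_cds))
print(length_rule(2999), length_rule(3685), stop_gain_rule(2))
```

With the real 3685-codon transcript, the same functions would be applied to the mutated sequence after the deletion, duplication or point mutation has been introduced.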
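The joint call and the error rates can be sketched as below, in Python for illustration only. The `min_votes` parameter is a hypothetical name: a value of 1 corresponds to calling DMD when any method predicts DMD, and a value of 2 to the stricter two-of-three vote that is also evaluated in the results. The sketch assumes that accuracy, FPR and FNR all use the total number of patients as the denominator, which is how the reported fractions appear to be computed (for example 369/435, 23/435 and 43/435).

```python
def joint_prediction(calls, min_votes=1):
    """Combine per-method calls ('DMD' or 'BMD') into one call.

    calls     : e.g. [reading_frame_call, length_call, stop_gain_call]
    min_votes : number of 'DMD' calls needed for a joint 'DMD' call
    """
    dmd_votes = sum(1 for call in calls if call == "DMD")
    return "DMD" if dmd_votes >= min_votes else "BMD"


def rates(predicted, observed):
    """Accuracy, FPR and FNR as fractions of all patients (assumed denominator).

    False positive: observed BMD predicted as DMD.
    False negative: observed DMD predicted as BMD.
    """
    pairs = list(zip(predicted, observed))
    total = len(pairs)
    correct = sum(1 for p, o in pairs if p == o)
    false_pos = sum(1 for p, o in pairs if p == "DMD" and o == "BMD")
    false_neg = sum(1 for p, o in pairs if p == "BMD" and o == "DMD")
    return {
        "accuracy": correct / total,
        "FPR": false_pos / total,
        "FNR": false_neg / total,
    }


# Hypothetical per-method calls for one patient.
calls = ["BMD", "DMD", "BMD"]
print(joint_prediction(calls))               # any-method criterion
print(joint_prediction(calls, min_votes=2))  # two-of-three criterion

# Hypothetical cohort of four patients.
print(rates(["DMD", "DMD", "BMD", "BMD"], ["DMD", "BMD", "DMD", "BMD"]))
```

When a cohort contains only DMD patients, no false positive is possible, which is consistent with FPR being reported as NA for such a cohort.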
We made several assumptions during the data analysis on DMD/BMD prediction:
For splice site mutations, we assumed that only the nearby exons would be skipped.
For mutations that remove the promoter region, we assumed that no transcript would be created.
For a combination of multiple mutations, the 5′ mutation would be considered first. If the transcript was predicted to stop translating before another mutation (3′ mutation), the 3′ mutation would be ignored.
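The third assumption can be sketched on a toy sequence, again in Python and purely as an illustration. Mutations are represented here as hypothetical `(start, ref, alt)` edits in 0-based coordinates of the unmutated coding sequence; they are applied from 5′ to 3′, and a 3′ edit is skipped when translation has already terminated upstream of it. Overlapping edits are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(cds):
    """0-based nucleotide index of the first in-frame stop codon, or None."""
    for k in range(0, len(cds) - 2, 3):
        if cds[k:k + 3] in STOP_CODONS:
            return k
    return None


def apply_edit(cds, start, ref, alt):
    """Replace `ref` at position `start` with `alt` (an empty alt is a deletion)."""
    if cds[start:start + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return cds[:start] + alt + cds[start + len(ref):]


def effective_mutations(cds, edits):
    """Apply edits 5' to 3'; stop once translation ends before the next edit.

    Returns (edits actually applied, resulting sequence).
    """
    applied, current, offset = [], cds, 0
    for start, ref, alt in sorted(edits):
        stop = first_stop_index(current)
        position = start + offset          # coordinate in the edited sequence
        if stop is not None and stop + 3 <= position:
            break                          # translation already ended upstream
        current = apply_edit(current, position, ref, alt)
        offset += len(alt) - len(ref)
        applied.append((start, ref, alt))
    return applied, current


# Toy coding sequence (hypothetical): ATG GCC GCT AAG GCT GCT GCT TAA
toy_cds = "ATGGCCGCTAAGGCTGCTGCTTAA"
frameshift = (3, "GC", "")       # 2-bp deletion that exposes a hidden TAA
downstream = (15, "G", "A")      # 3' substitution

applied, mutated = effective_mutations(toy_cds, [frameshift, downstream])
print(applied)    # only the 5' frameshift is used
print(mutated)
```

In this toy case the 2-bp deletion shifts the frame so that a hidden stop codon is read at the third codon, and the downstream substitution never reaches the translated protein.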
The patient data used for this test were selected from the TREAT-NMD DMD Global database, Flanigan’s DMD patients and DMD patients of GHCPAPF (General Hospital of Chinese People’s Armed Police Forces). The data were prepared in plain text format, such as “Flanigan’s_DMD_patients.txt” in the Additional file 2 “data_DMDtoolkit”. This research was approved by the research ethics committee and medical ethics committee of the General Hospital of Chinese People’s Armed Police Forces. We confirm that signed informed consent was obtained from the parents of DMD/BMD children or from adult BMD patients. Mutations for which the exact change of nucleotide sequence cannot be determined, such as c.1335ins680, were filtered out, since DMDtoolkit conducts the prediction by translating the nucleotide sequence to a protein sequence.
Visualization of mutated protein
We drew the sequence of the mutated protein according to its mutation, motifs, and potential protein length, then applied the reading-frame rule. We also analyzed combinations of multiple mutations (such as Large duplication + Small deletion, Splice site + Nonsense). RGui was used to execute the code and display the statistics in the figures (Fig. 1). More snapshots of GUIs can be found in “codes_DMDtoolkit/Manual.docx” in the Additional file 1. Users familiar with R can also use RStudio, which includes a code editor, debugging and visualization tools. We selected some common mutations from the TREAT-NMD database as test data (51 mutation types in total, in the Additional file 2 “data_DMDtoolkit/data_diagnosis/DMDsamples.txt”): large deletions (≥1 exon) (freq ≥100), large duplications (≥1 exon) (freq ≥10), small deletions (<1 exon) (freq ≥4), small insertions (<1 exon) (freq ≥3), splice sites (<10 bp from exon) (freq ≥4), nonsense (freq ≥10). We simulated two combinations of multiple mutations with common mutations from TREAT-NMD, i.e., exon56-62dup plus c.9204_9207delCAAA and c.9563+1G>A plus c.9568C>T. In addition, there were six combinations of multiple mutations (exon5-19dup plus exon38-41dup, exon29dup plus exon45dup, exon45-55dup plus exon65-79dup, exon5-18dup plus exon19-41del plus exon42dup plus exon43-44del, exon10-16dup plus exon22-24dup, exon50-60dup plus exon63-79dup) from Flanigan’s DMD patients, and seven DMD patients from GHCPAPF (exon1del plus exon2dup plus Dp427cdel, exon31-43dup plus c.4000G>A p.Gly1334Arg, exon45del plus exon47-52del, exon50del plus exon52del, c.1898dupA plus c.5234G>A p.Arg1745His, exon1-12del plus Dp427c-490ntdel, c.7096C>A p.Gln2366Lys plus c.10101_10103delAGA p.Glu3367del). Using “c.1898dupA plus c.5234G>A p.Arg1745His” as an example, we investigated whether the frameshift c.1898dupA would change the downstream missense c.5234G>A to a nonsense.
The reading-frame rule alone was not able to answer this question, whereas the protein length and stop-gain number methods seemed to be able to avoid this problem.
Visualizing distribution of mutations and pedigree
Basic characteristics, such as the distribution of mutations in DMD patients and their pedigrees, were graphed for easy understanding with the respective modules. The test data were selected from DMD patients of GHCPAPF.
Key features and functionalities
The functions of DMDtoolkit include:
aided diagnosis for DMD/BMD using genetic testing
drawing the sequence and motifs of mutated protein
drawing pedigree of DMD family
smart screening data to maximize the use of existing data
performing statistical analysis for DMD population and visualizing results.
DMDtoolkit was used according to four rules: reading-frame rule, length of potential protein, number of potential stop-gains, ESE rule, and several patterns on location of mutations. This created three result files: *.diag, *.diag2 and *.diag3 (in the Additional file 3 “results_DMDtoolkit/results_diagnosis/”). The differences between the three files were the extent of application of reading-frame rule and whether applying patterns or not: *.diag was restricted to exon deletions/duplications; *.diag2 was expanded to small deletions/duplications and splice sites; and *.diag3 applied ESE rule to nonsense mutations, and applied size and location info to in-frame deletions. The following results were obtained based primarily on the *.diag2 file (first four columns of statistics in Table 2) and partly on the *.diag3 file (the last column of statistics in Table 2). The detailed results can be found in the Additional file 3 “prediction results of DMD patients.xlsx” in the folder “results_DMDtoolkit/results_diagnosis/”. Based on the reading-frame rule, the accuracy, FPR (false positive rate) and FNR (false negative rate), of predicting DMD/BMD were 91.0%, NA (not applicable), 9.0% (large deletions/duplications in 5161/5681 = 90.8% for accuracy, NA for FPR, 520/5681 = 9.2% for FNR; small deletions/duplications in 475/483 = 98.3% for accuracy, NA for FPR, 8/483 = 1.7% for FNR; splice sites in 155/198 = 78.3% for accuracy, NA for FPR, 43/198 = 21.7% for FNR) in TREAT-NMD DMD patients. They were 85.0, 5.8, and 9.2% (large deletions/duplications in 369/435 = 84.8% for accuracy, 23/435 = 5.3% for FPR, 43/435 = 9.8% for FNR; small deletions/duplications in 69/78 = 88.5% for accuracy, 7/78 = 9.0% for FPR, 2/78 = 2.6% for FNR; splice sites in 17/22 = 77.3% for accuracy, 1/22 = 4.5% for FPR, 4/22 = 18.2% for FNR) in Flanigan’s DMD patients. 
In GHCPAPF patients they were 92.0, 0.4, and 7.6% (large deletions/duplications in 213/231 = 92.2% for accuracy, 0/231 = 0% for FPR, 18/231 = 7.8% for FNR; small deletions/duplications in 15/17 = 88.2% for accuracy, 1/17 = 5.9% for FPR, 1/17 = 5.9% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). According to the length of potential protein, the accuracy, FPR and FNR were 91.0%, NA, 9.0% (large deletions/duplications in 5283/5681 = 93.0% for accuracy, NA for FPR, 398/5681 = 7.0% for FNR; small deletions/duplications in 408/483 = 84.5% for accuracy, NA for FPR, 75/483 = 15.5% for FNR; splice sites in 99/198 = 50.0% for accuracy, NA for FPR, 99/198 = 50.0% for FNR) in TREAT-NMD DMD patients. They were 83.2, 6.7, 10.1% (large deletions/duplications in 374/435 = 86.0% for accuracy, 31/435 = 7.1% for FPR, 30/435 = 6.9% for FNR; small deletions/duplications in 59/78 = 75.6% for accuracy, 4/78 = 5.1% for FPR, 15/78 = 19.2% for FNR; splice sites in 12/22 = 54.5% for accuracy, 1/22 = 4.5% for FPR, 9/22 = 40.9% for FNR) in Flanigan’s DMD patients. In the GHCPAPF group they were 92.8, 2.4, 4.8% (large deletions/duplications in 217/231 = 93.9% for accuracy, 5/231 = 2.2% for FPR, 9/231 = 3.9% for FNR; small deletions/duplications in 13/17 = 76.5% for accuracy, 1/17 = 5.9% for FPR, 3/17 = 17.6% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). For the number of potential stop-gains, the accuracy, FPR and FNR were 91.3%, NA, 8.7% (large deletions/duplications in 5180/5681 = 91.2% for accuracy, NA for FPR, 501/5681 = 8.8% for FNR; small deletions/duplications in 476/483 = 98.6% for accuracy, NA for FPR, 7/483 = 1.4% for FNR; splice sites in 155/198 = 78.3% for accuracy, NA for FPR, 43/198 = 21.7% for FNR) in TREAT-NMD DMD patients.
They were 85.2, 6.0, 8.8% (large deletions/duplications in 370/435 = 85.1% for accuracy, 24/435 = 5.5% for FPR, 41/435 = 9.4% for FNR; small deletions/duplications in 69/78 = 88.5% for accuracy, 7/78 = 9.0% for FPR, 2/78 = 2.6% for FNR; splice sites in 17/22 = 77.3% for accuracy, 1/22 = 4.5% for FPR, 4/22 = 18.2% for FNR) in Flanigan’s DMD patients. In the GHCPAPF group they were 93.6, 0.4, 6.0% (large deletions/duplications in 217/231 = 93.9% for accuracy, 0/231 = 0% for FPR, 14/231 = 6.1% for FNR; small deletions/duplications in 15/17 = 88.2% for accuracy, 1/17 = 5.9% for FPR, 1/17 = 5.9% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). If we use 2000 aa  as the threshold of the length of potential mutated protein, the accuracy, FPR and FNR were 31.3%, NA, 68.7% (large deletions/duplications in 1634/5681 = 28.8% for accuracy, NA for FPR, 4047/5681 = 71.2% for FNR; small deletions/duplications in 293/483 = 60.7% for accuracy, NA for FPR, 190/483 = 39.3% for FNR; splice sites in 155/198 = 78.3% for accuracy, NA for FPR, 43/198 = 21.7% for FNR) in TREAT-NMD DMD patients. They were 44.9, 5.0, 50.1% (large deletions/duplications in 185/435 = 42.5% for accuracy, 23/435 = 5.3% for FPR, 227/435 = 52.2% for FNR; small deletions/duplications in 49/78 = 62.8% for accuracy, 4/78 = 5.1% for FPR, 25/78 = 32.1% for FNR; splice sites in 6/22 = 27.3% for accuracy, 0/22 = 0% for FPR, 16/22 = 72.7% for FNR) in Flanigan’s DMD patients. In GHCPAPF group they were 38.4, 1.9, 59.7% (large deletions/duplications in 81/231 = 35.1% for accuracy, 4/231 = 1.7% for FPR, 146/231 = 63.2% for FNR; small deletions/duplications in 10/17 = 58.8% for accuracy, 1/17 = 5.9% for FPR, 6/17 = 35.3% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). Thus, we chose the 3000 aa as the threshold. The two new methods have a similar accuracy to the reading-frame method.
The joint prediction uses all the above three methods. Two criteria were used to determine the joint judgment. First, if a positive judgment comes from two of the three methods, the result is regarded as positive (i.e., DMD/DMD/BMD will be judged as DMD). By this criterion, the accuracy, FPR and FNR were 93.9%, NA, 6.1% (large deletions/duplications in 5341/5681 = 94.0% for accuracy, NA for FPR, 340/5681 = 6.0% for FNR; small deletions/duplications in 476/483 = 98.6% for accuracy, NA for FPR, 7/483 = 1.4% for FNR; splice sites in 155/198 = 78.3% for accuracy, NA for FPR, 43/198 = 21.7% for FNR) in TREAT-NMD DMD patients. They were 85.0, 6.0, 9.0% (large deletions/duplications in 369/435 = 84.8% for accuracy, 24/435 = 5.5% for FPR, 42/435 = 9.7% for FNR; small deletions/duplications in 69/78 = 88.5% for accuracy, 7/78 = 9.0% for FPR, 2/78 = 2.6% for FNR; splice sites in 17/22 = 77.3% for accuracy, 1/22 = 4.5% for FPR, 4/22 = 18.2% for FNR) in Flanigan’s DMD patients. In the GHCPAPF group they were 93.6, 0.4, and 6.0% (large deletions/duplications in 217/231 = 93.9% for accuracy, 0/231 = 0% for FPR, 14/231 = 6.1% for FNR; small deletions/duplications in 15/17 = 88.2% for accuracy, 1/17 = 5.9% for FPR, 1/17 = 5.9% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). Second, if we get a positive judgment based on one of the three methods, the result is regarded as positive. The accuracy, FPR and FNR would increase to 95.1%, NA, and 4.9% (large deletions/duplications in 5419/5681 = 95.4% for accuracy, NA for FPR, 262/5681 = 4.6% for FNR; small deletions/duplications in 476/483 = 98.6% for accuracy, NA for FPR, 7/483 = 1.4% for FNR; splice sites in 155/198 = 78.3% for accuracy, NA for FPR, 43/198 = 21.7% for FNR) in TREAT-NMD DMD patients. 
They were 88.2, 7.3, and 4.5% (large deletions/duplications in 386/435 = 88.7% for accuracy, 31/435 = 7.1% for FPR, 18/435 = 4.1% for FNR; small deletions/duplications in 69/78 = 88.5% for accuracy, 7/78 = 9.0% for FPR, 2/78 = 2.6% for FNR; splice sites in 17/22 = 77.3% for accuracy, 1/22 = 4.5% for FPR, 4/22 = 18.2% for FNR) in Flanigan’s DMD patients. In GHCPAPF group they were 94.0, 2.4, 3.6% (large deletions/duplications in 218/231 = 94.4% for accuracy, 5/231 = 2.2% for FPR, 8/231 = 3.5% for FNR; small deletions/duplications in 15/17 = 88.2% for accuracy, 1/17 = 5.9% for FPR, 1/17 = 5.9% for FNR; splice sites in 1/1 = 100.0% for accuracy, 0/1 = 0% for FPR, 0/1 = 0% for FNR). We took this result as the “joint prediction (Rules)” in Table 2.
For nonsense mutations, the accuracy of protein length, stop-gain number and the joint prediction for TREAT-NMD DMD patients was 82.4, 99.9 and 99.9%, respectively; for Flanigan’s DMD patients they were 70.9, 85.4 and 85.4%, respectively. The accuracy for DMD patients of GHCPAPF was 94.7, 100.0 and 100.0% respectively. The FPR and FNR of protein length, stop-gain number and the joint prediction were shown in Table 2. After application of ESE rule [10, 11] the accuracy of ESE disrupted mutations among DMD patients in TREAT-NMD, Flanigan and GHCPAPF dropped to 0/70 = 0%, 6/49 = 12.2% and 0/5 = 0%, respectively. Therefore, the ESE rule was not used in the DMDtoolkit.
For large deletions, several supplemental patterns to the reading-frame rule were applied. It was reported that in-frame deletions within exons 2–8 caused severe BMD, whereas deletions in the major hotspot generally caused typical BMD [21, 22]. In-frame deletions removing both the actin-binding domain and part of the central rod domain usually cause DMD [8, 20, 23]. Large in-frame deletions in the central rod domain removing more than 35 exons usually led to DMD, while deletions of no more than 35 exons likely led to BMD [8, 24, 25]. Mutations in the cysteine-rich domain usually resulted in DMD [17, 18], whereas deletions in the syntrophin-binding domain (exons 71-74) were reported in some BMD patients, and mutations located in or downstream of exon 74 were found in both BMD and DMD patients [17, 26]. For the mutations to which the supplementary patterns were applied, the accuracy, FPR and FNR were 100% (16/16), NA, 0% (0/16) in DMD patients of TREAT-NMD; 80.0% (4/5), 20.0% (1/5), 0% (0/5) in DMD patients of Flanigan; and 50.0% (1/2), 50.0% (1/2), 0% (0/2) in DMD patients of GHCPAPF (see sheet “Supplementary patterns” in Additional file 3 “prediction results of DMD patients.xlsx” for details).
The accuracy, FPR and FNR of the joint prediction for the six and seven combinations of multiple mutations in Flanigan and GHCPAPF were 83.3% (5/6), 0% (0/6), 16.7% (1/6), and 85.7% (6/7), 0% (0/7), 14.3% (1/7), respectively. By comparison, the accuracy, FPR and FNR of the reading-frame rule were 83.3% (5/6), 0% (0/6), 16.7% (1/6) and 42.9% (3/7), 0% (0/7), 57.1% (4/7) in Flanigan and GHCPAPF, respectively. Please see the sheet “Multiple mutations” in Additional file 3 “prediction results of DMD patients.xlsx” for details.
DMDtoolkit can draw the sequence of a mutant protein and save the drawing as a pdf file (Fig. 2). For a protein with multiple mutations resulting in more than two frameshifts, it is difficult to apply the reading-frame rule to predict the mutant protein because the stop-gain may happen before the second frameshift, or the upstream frameshift may change the downstream missense to nonsense. Visualization is an easy way to show the change in a mutated protein, such as c.9563+1G>A plus c.9568C>T (Fig. 2). The Additional file 3 “results_DMDtoolkit/results_diagnosis/*.pdf”, such as “case7-1 (combination of multiple mutations).pdf”, shows examples of the seven mutation types. DMDtoolkit extended the R package “kinship” to draw multiple pedigrees at once (Fig. 3). DMDtoolkit can also draw the distribution of the top N mutations (N can be set via a command parameter) (Fig. 4).
According to the first criterion, the accuracy of the joint prediction method was 2.9, 0 and 1.6 percentage points higher than that of the reading-frame rule for DMD patients from the TREAT-NMD, Flanigan’s and GHCPAPF groups, respectively. According to the second criterion, the accuracy of the joint prediction was 4.1, 3.2 and 2.0 percentage points higher than that of the reading-frame rule for DMD patients from TREAT-NMD, Flanigan’s and GHCPAPF, respectively. The improvement in accuracy mainly originated from the decrease of FNR in DMD patients with large deletions/duplications, and it benefited from the length of potential protein method.
For large deletions, the application of the supplemental patterns improved the total accuracy of the joint prediction method without patterns (i.e., “joint prediction (Rules)” in Table 2) by 1.7, 0.3, 0% to 96.8, 88.8 and 94.0% in DMD patients of TREAT-NMD, Flanigan and GHCPAPF, respectively. The improvement was due to the identification of in-frame deletions removing both the actin-binding domain and part of the central rod domain, which usually cause DMD.
Future plans for development include the integration of data on pathways and protein-protein interaction (PPI) networks. These will allow more comprehensive analyses of the biological processes of dystrophin and its interacting genes. An automated machine learning approach will also be exploited to quantitatively predict the progression of disease using all available risk/benefit indicators, as well as the probability of BMD/IMD/DMD.
DMDtoolkit is a software tool specifically developed to provide an easy way to analyze the mutant dystrophin protein in order to predict the diagnosis of DMD/BMD. This is achieved by combining genomic analysis with a bioinformatic approach. For the prediction of DMD/BMD, DMDtoolkit provides an advantage over previous predictions based solely on the reading-frame rule. It can automatically and rapidly predict clinical phenotypes even in the presence of multiple mutations. The accuracy of the current joint method is about 3% higher than that of the reading-frame rule alone. Its advantage is due to the bioinformatics approach combining the three different methods for prediction.
Basic statistics include calculation of summary, correlation coefficient, regression coefficient, and t test (in the Additional file 3 “supplement1.docx” in the folder “results_DMDtoolkit/results_statistics & graph/”). Basic graphs include pedigree, histogram, scatter plot with trend line, stem and leaf plot, and cluster dendrogram (in the Additional file 3 “supplement2.docx” in the folder “results_DMDtoolkit/results_statistics & graph/”). These results can help patients and clinicians more easily understand the disease and detect risk/benefit indicators.
Availability of data and materials
Project name: DMDtoolkit.
Archived version: 1.0.
Operating system(s): Platform independent.
Programming language: R and Perl.
Other requirements: R 3.0 or higher, ActivePerl 5.16 or higher.
License: GNU GPL.
Any restrictions to use by non-academics: licence needed.
Abbreviations
BMD: Becker muscular dystrophy
BMI: Body mass index
DMD: Duchenne muscular dystrophy
ESE: Exonic splicing enhancer
FNR: False negative rate
FPR: False positive rate
GHCPAPF: General Hospital of Chinese People’s Armed Police Forces
GNU: GNU’s Not Unix
IMD: Intermediate muscular dystrophy
LVEDD: Left ventricular end-diastolic dimension
Perl: Practical extraction and reporting language
SNIP: Sniff nasal inspiratory pressure
WISC: Wechsler Intelligence Scale for Children
Acknowledgements
The authors thank all members of the Department of Neurology at the General Hospital of Chinese People’s Armed Police Forces for support and ideas during development. We thank Dr. Ching H. Wang and Dr. Jinqian Zhang for their critical reading and revisions of this manuscript.
Funding
This work was supported by the Capital Characteristic Clinic Project (grant No. Z151100004015025). This funding supported the design of the study and the collection, analysis, and interpretation of data.
Authors’ contributions
JZ and SW developed the theory. All authors carried out the project. JZ wrote the software, which all authors tested and debugged. JX and YN collected the DMD clinical data. JZ wrote the draft of the manuscript, which SW revised and approved. All authors agreed to this publication. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
The parents of the patient (i.e., the example in Table 1) gave informed written consent to publish this information.
Ethics approval and consent to participate
Throughout this paper, the TREAT-NMD data set is anonymized and was obtained from the TREAT-NMD DMD Global Database. Flanigan’s data set is anonymized and was obtained from their published research (PMID: 19937601). The GHCPAPF data set is anonymized and was obtained from the General Hospital of Chinese People’s Armed Police Forces. Signed informed consent was obtained from the parents of all children with DMD/BMD and from the adult BMD patients in the GHCPAPF group. This research was approved by the research ethics committee and the medical ethics committee of the General Hospital of Chinese People’s Armed Police Forces.
Additional file 1: codes_DMDtoolkit. This folder includes all the code, the manual (“Manual.docx”) and the databases/annotations required by DMDtoolkit. For example, “ESE matrices.txt” contains the matrices of serine/arginine-rich (SR) proteins. (RAR 2602 kb)
Additional file 2: data_DMDtoolkit. This folder includes all the data used in the present manuscript, such as “Flanigan’s_DMD_patients.txt” for aided diagnosis of DMD/BMD, “DMDsamples.txt” for drawing the mutated protein, and “pedigree.txt” for drawing the pedigree of a DMD family. (RAR 28 kb)
Additional file 3: results_DMDtoolkit. This folder includes all the results produced by DMDtoolkit, such as “prediction results of DMD patients.xlsx”, which contains the original prediction results for DMD patients from the TREAT-NMD, Flanigan’s and GHCPAPF data sets; “case7-1 (combination of multiple mutations).pdf”, a vector illustration of a combination of multiple mutations; “supplement1.docx”, with examples of basic statistics (summary statistics, correlation coefficient, regression coefficient, and t test); and “supplement2.docx”, with examples of basic graphs (pedigree, histogram, scatter plot with trend line, stem and leaf plot, and cluster dendrogram). (RAR 5435 kb)