Proceedings of the 16th Annual UT-KBRIN Bioinformatics Summit 2016: bioinformatics

I1 Proceedings of the Sixteenth Annual UT-KBRIN Bioinformatics Summit 2017
Eric C Rouchka, Julia H Chariker, David A Tieri, Juw Won Park
Department of Computer Engineering and Computer Science, University of Louisville, Duthie Center for Engineering, Louisville, KY 40292, USA; Kentucky Biomedical Research Infrastructure (KBRIN) Bioinformatics Core, 522 East Gray Street, Louisville, KY 40292, USA; Department of Psychological and Brain Sciences, University of Louisville, Louisville, KY 40292, USA; Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY 40292, USA
Correspondence: Eric C Rouchka (eric.rouchka@louisville.edu)
BMC Bioinformatics 2017, 18(Suppl 9):I1

Oguz Akbilgic (University of Tennessee Health Science Center, UTHSC) followed with a talk titled "Probabilistic Symbolic Pattern Recognition (PSPR) in Clinical Decision Making." This presentation focused on the use of the PSPR method for identifying predictors of pathophysiology from a number of clinical features. He gave a specific example of its use in detecting paroxysmal atrial fibrillation using clustering of ECGs. Arash Shaban-Nejad (UTHSC) continued the session with the presentation "Urban Health Intelligence for Public Health Planning and Policy Development." This talk discussed the correlations between socioeconomic status and population health. Dr. Shaban-Nejad discussed PopHR [3], a knowledge-based platform for integrating, analyzing, and visualizing population health data. Rishi Kamalsweran (UTHSC) closed the Biomedical Informatics Session with the presentation "Dynamic Visual Analytics and Event Stream Processing." In this talk, Dr. Kamalsweran discussed the prospects of bringing analytics to the bedside in order to predict the onset of disease. A number of approaches that have been developed for bringing earlier, personalized care to patients [4][5][6] were discussed, in addition to methods for visualizing dynamic streaming data [7].

Session II: Systems Biology
The Systems Biology session began with a presentation by Qui Liu (Vanderbilt University) on "Translating Multi-Omics Data into Colorectal Cancer Biology." In this presentation, Dr. Liu focused on the two aims of integrating multi-dimensional data: understanding the relationships between the different types of data, and understanding both the latent and observable phenotypes.
She discussed integrative techniques they have used for transcriptomics, proteomics, and miRNA [8] as well as other methods for integrating multi-omics data with clinical applications, such as determining the cause of resistance to chemotherapeutics in colon cancer [9,10]. Bruce Ramshaw (University of Tennessee, Knoxville) followed with the presentation "Complex Systems Science Applied to Health Care." Dr. Ramshaw discussed how many of the issues with health care today are due to a reductionist view, which leads to increasing fragmentation and administration. He suggested that clinical health care should instead be approached from a complex systems science view in order to change the assumptions and resulting tools for clinicians and clinical researchers. He showed how implementing a collaborative team built on this view led to substantial savings in a health care system through shorter postoperative stays and reduced material costs [11]. The third speaker in the Systems Biology session, Rachel McCord (University of Tennessee, Knoxville), presented "The 3D Genome: Folding, Misfolding, and Unfolding." In this talk, Dr. McCord discussed how DNA folding leads to biological function when the genome folds itself into a 3-dimensional shape, leading to a number of different interactions, or 3D compartments, between chromosomes. In cancer and other diseases, translocations interrupt these interactions. She also introduced the methods they have employed for measuring genome folding, including Hi-C [12,13]. These techniques were used to study the loss of 3D genome compartments in progeria patients [14].
David Ussery (University of Arkansas for Medical Sciences) continued with the presentation "What can 100,000 Bacterial Genomes Teach Us about Evolution?" in which he discussed the genetic diversity of bacterial genomes, and showed that no single protein is conserved among all living organisms, but that functional domains are conserved [15]. This analysis has been made possible through 20 years of bacterial genome sequencing [16] as well as the increased availability of high-throughput sequencers such as the MinION nanopore sequencers. Robert Flight (University of Kentucky) finished off the Systems Biology session with his presentation "Meta- and Multi-Omic Analyses Using Annotations." In this presentation, Dr. Flight discussed the use of annotation enrichment, which can be applied to various -omics data sets, for analysis of a particular phenotype. He discussed categoryCompare [17], an approach he developed for such analysis, and used its extended version to show its utility for analyzing the effects of three different gene knockouts on Juvenile Batten Disease.

Session III: Metabolomics
Richard Higashi (University of Kentucky) kicked off the Metabolomics session on Sunday morning. During his presentation, Dr. Higashi discussed methods developed for measuring the flux of labeled metabolites through a system, and the corresponding issues with modeling such flux and visualizing the resulting data sets [18-23], with a specific example in cancer. Christine Fillmore Brainson (University of Kentucky) closed the Metabolomics session with the presentation "Integrating Epigenetics, Transcriptomics and Metabolomics Datasets from a Lung Cancer Model." This presentation, which takes a systems approach to modeling disease, looked at how different tri-methylation events are affected by carcinomas and how they correlate with transcription. In addition, she discussed how metabolism affects stability.

Session IV: Single Cell Omics and Other NGS
The final scientific session of the summit focused on the use of high-throughput sequencing datasets to address questions that could not previously be studied. Corey Watson (University of Louisville) began this session with a talk "Genomics of the Functional Antibody Response in Human." During this presentation, Dr. Watson discussed the high degree of variability found within the IgH region, both in terms of longer variants and SNPs, and how high-throughput sequencing can be used to resolve some of these variants [24][25][26][27][28]. He discussed how these regions also have high variation in copy number, and conveyed how his lab is beginning to use long sequencing reads to address issues with reassembling this highly variable region. Eric Rouchka (University of Louisville) followed with the presentation "Identification of Cleavage Site Intervals for Alternative 3' UTR dynamics." During this presentation, Dr. Rouchka discussed the development of an algorithm for detecting 3' UTR lengthening and shortening events [29], which was motivated by previous findings on localization based on alternative 3' UTR usage [30] and the detection of a number of alternative 3' UTR events within nervous system processes [31,32]. Arthur Hunt (University of Kentucky) followed with the presentation "The Intersection of Alternative Polyadenylation and RNA Quality Control." During this presentation, Dr. Hunt described the prevalence of polyadenylation within plant genomes [33][34][35]. He also introduced methods his lab has developed for inexpensive library construction for high-throughput sequencing [36,37]. Juw Won Park (University of Louisville) ended the session with his talk on alternative splicing and circular RNAs. He showed that an organism's protein diversity is not determined solely by the number of genes it possesses, but also by its ability to utilize alternative splicing of its genes.
He also showed that circular RNAs, which participate in biological functions such as gene regulation by modulating microRNA activity [38], can exhibit alternative splicing [39]. He discussed the software that he developed to detect differential alternative splicing events from RNA-Seq data [40][41][42]. He also introduced an approach that can estimate the abundance of circular RNAs relative to their linear forms from RNA-Seq data.

Poster Session
A poster session and reception were held on Saturday evening, with a total of 41 posters presented across 14 categories. The largest represented categories included transcriptomics, bioinformatics algorithms, phylogenetics, protein structure and proteomics, and systems biology and networks. Nineteen of the poster abstracts along with one speaker abstract are highlighted within this supplement. Prior to the poster session, 34 of the posters were presented during a one-minute blitz session used to introduce the posters and their topics.

In recent years, knowledge extraction from biomedical data has become a major challenge [1]. Machine learning has provided advanced tools for representation learning in the biomedical field, but the performance of conventional machine learning algorithms is feature dependent. These features are designed by human experts in the relevant domains, and identifying which features are most appropriate for a given task remains a difficult problem. Deep learning is an advancement in machine learning that addresses this problem.

Materials and methods
We used a deep neural network built on a Takagi-Sugeno fuzzy inference system to learn data representations in the form of fuzzy structures [2]. A generic architecture built by connecting layers of Takagi-Sugeno fuzzy inference systems as nodes is elaborated, and the various parameters involved in it are discussed. This architecture has an input layer, multiple hidden layers and an output layer. The last two layers of the network have a Takagi-Sugeno fuzzy inference system as their fundamental building unit [3]. Training is carried out using gradient descent to identify all parameters in the architecture from the training data. The proposed architecture is applied to a two-class prostate cancer dataset [4] containing 102 samples and 10,509 genes. Individual training error ranking is used to select the best features. These features are then passed to the network to learn an intricate fuzzy representation in the form of multiple distinct fuzzy rule bases that are intelligible to a human. The identified fuzzy rule bases consist of linguistic IF-THEN rules, which may prove helpful in diagnosing disease at the time a patient is examined.
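To make the building unit concrete, the following is a minimal sketch of first-order Takagi-Sugeno inference for a single node. It is an illustration only, not the authors' architecture: the Gaussian membership functions, the product rule for combining memberships, and all names (`gaussian_membership`, `ts_infer`, the two example rules) are assumptions introduced here.

```python
import math

def gaussian_membership(x, center, width):
    """Degree to which x belongs to a fuzzy set (Gaussian shape is an assumption)."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def ts_infer(inputs, rules):
    """First-order Takagi-Sugeno inference for one node.

    Each rule is (antecedents, coeffs, bias): antecedents holds one
    (center, width) pair per input, and the consequent is the linear
    function bias + sum(coeffs[i] * inputs[i]).
    """
    weighted_sum = 0.0
    total_strength = 0.0
    for antecedents, coeffs, bias in rules:
        # Firing strength: product of the input memberships (assumed t-norm).
        strength = 1.0
        for x, (center, width) in zip(inputs, antecedents):
            strength *= gaussian_membership(x, center, width)
        consequent = bias + sum(c * x for c, x in zip(coeffs, inputs))
        weighted_sum += strength * consequent
        total_strength += strength
    # Output is the firing-strength-weighted average of the rule consequents.
    return weighted_sum / total_strength

# Two hypothetical IF-THEN rules over two normalized expression features.
example_rules = [
    ([(0.0, 1.0), (0.0, 1.0)], [0.0, 0.0], 0.0),  # IF both low  THEN output 0
    ([(1.0, 1.0), (1.0, 1.0)], [0.0, 0.0], 1.0),  # IF both high THEN output 1
]
```

In a trained network, the membership centers, widths, and consequent coefficients would be the parameters fitted by gradient descent; here they are fixed by hand so the rule base stays readable.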

Results and conclusions
The proposed network is compared, on the basis of AUC (area under the ROC curve), with a deep learning model that uses a neural network with softmax fine-tuning. The use of a Takagi-Sugeno fuzzy inference system may improve the performance of deep neural networks on transcriptome-based cancer classification.

Background
Socio-economic risk factors (race, urban residence and poverty) significantly contribute to pediatric asthma prevalence [1][2][3]. However, the direct effects of the built environment and neighborhood have not been thoroughly examined due to the scarcity and heterogeneity of available data. According to the theory of social determinants of health [4], direct examination of residential factors is crucial in order to systematically analyze the prevalence of asthma in children and understand its underlying etiology. Knowledge-based platforms enable the integration of multiple data sources into a smart and consistent population health surveillance system [5].

Methods and Results
Using a knowledge-based population health analytics platform, we compare localized pediatric asthma prevalence in 32 zip-code areas in Memphis, TN, combining data on 6,538 encounters from Le Bonheur Children's Hospital in Memphis and the Shelby County Health Department, US census data, and neighborhood quality survey data provided by our partners. Expanding the existing socio-economic models, we explore the neighborhood effects on localized asthma prevalence, which is measured as the number of pediatric asthma encounters observed in each zip-code area. We find that asthma encounters are disproportionately distributed. We use the population size of each zip-code area as a control variable. Poverty is known to have a positive association with asthma in the U.S. [4]. Thus, we include the poverty ratio of each zip-code area in the socio-economic model. Furthermore, the encounter data show that 86.2% of encounters involve African-American patients and 8.5% involve White patients. Correspondingly, we add the African-American population ratio to the socio-economic model to control for the unique composition of the urban area and our sample. For the neighborhood model, we introduce blight and broken window variables to test whether the living conditions of the neighborhood significantly influence the degree of pediatric asthma prevalence. To predict the prevalence, we run multivariable regression models (Table 1). The baseline model agrees with previous studies showing that the poverty level is positively associated with asthma prevalence. However, when the racial factor is introduced, poverty loses its statistical significance. Furthermore, in the neighborhood models, the blight and broken window variables are positively and significantly associated with prevalence even after controlling for all socio-economic variables. We found that asthma prevalence is more sensitive to environmental factors. The explanatory power of the neighborhood models also increases to approximately 77%.

Conclusions
The integration of multiple data sources allows us to unpack the systematic prevalence patterns and broaden our understanding of the asthma epidemic in urban areas. Pediatric asthma is disproportionately prevalent in poor, low-quality neighborhoods. Using socio-environmental indicators, public health organizations can implement intelligent surveillance systems for neighborhood-level monitoring of major upstream determinants of health. It is worth mentioning that the racial composition of our sample is skewed relative to the city-level racial composition of 53.5% African American and 42.0% White (2015 Census); therefore, much caution is needed when making inferences about broader contexts. We are in the process of acquiring additional data sets to address this limitation of the study.

Background
Direct infusion Fourier-transform mass spectrometry (FTMS) allows for high-throughput detection of thousands of metabolites. Typically, the majority of the observed spectral features do not correspond to known metabolites and thus cannot be placed into existing metabolic networks. Without accurate metabolite assignment, discerning their roles in biological systems is not possible. MS assignment remains difficult due to the low abundance of some detected metabolites, the volume of data produced by FTMS, the small m/z differences between isotopologues, and the lack of sufficient chemical structural information. Additional phenomena producing large numbers of spectral artifacts further complicate FTMS assignment. False assignments, including those made on artifact peaks, can create large interpretative errors.

Materials and Methods
Through manual inspection of FTMS spectra, we identified FTMS-unique artifacts that result in regions of abnormally high peak density (HPD), which we collectively refer to as HPD artifacts. We have implemented an algorithm in Python 3 to identify regions of spectra that have the HPD property and likely contain a large number of artifactual peaks. First, our algorithm divides a spectrum into a number of overlapping windows approximately 1 m/z in width and calculates the peak density of each window (number of peaks/window width in m/z). Second, the peak density of each window is compared against the peak density of neighboring sets of windows, and a modified chi-squared statistic is calculated for each comparison. High statistic values correspond to regions of spectra with the HPD property. This approach robustly identifies HPD artifacts and is tolerant to changes in signal-to-noise, peak densities, etc. that can vary between different FTMS instruments and experimental designs. Once identified, these artifacts can be excluded from subsequent analyses. However, if HPD artifact location correlates with sample class or another experimental variable, more complex methods of artifact removal must be employed to avoid confounds and additional interpretative errors.
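The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the modified chi-squared statistic, so the statistic used here, (observed - expected)^2 / (expected + 1), is a stand-in, and the window step of 0.5 m/z, the neighborhood size, and the function names are all hypothetical.

```python
import math
from bisect import bisect_left

def window_densities(peak_mzs, width=1.0, step=0.5):
    """Peak density (peaks per m/z) for overlapping windows across a spectrum."""
    peaks = sorted(peak_mzs)
    if not peaks:
        return []
    windows = []
    start = float(math.floor(peaks[0]))
    while start <= peaks[-1]:
        # Count peaks with start <= m/z < start + width.
        count = bisect_left(peaks, start + width) - bisect_left(peaks, start)
        windows.append((start, count / width))
        start += step
    return windows

def hpd_scores(windows, n_neighbors=4):
    """Score each window against the mean density of its neighbors.

    The statistic is an assumed stand-in for the modified chi-squared
    statistic; only windows denser than their neighbors get a nonzero score.
    """
    scores = []
    for i, (start, density) in enumerate(windows):
        lo = max(0, i - n_neighbors)
        hi = min(len(windows), i + n_neighbors + 1)
        neighbors = [windows[j][1] for j in range(lo, hi) if j != i]
        if not neighbors:
            scores.append((start, 0.0))
            continue
        expected = sum(neighbors) / len(neighbors)
        excess = density - expected
        stat = (excess * excess) / (expected + 1.0) if excess > 0 else 0.0
        scores.append((start, stat))
    return scores
```

Windows whose score exceeds a chosen threshold would then be flagged as candidate HPD regions; the threshold itself would need to be tuned per instrument and experimental design.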

Results and conclusions
Using our HPD detector, we have identified three types of HPD artifacts: i) fuzzy sites representing small regions of m/z space with a 'fuzzy' appearance due to the extremely high number of peaks present; ii) ringing due to a very intense peak producing side bands of decreasing intensity that are symmetrically distributed around the main peak; and iii) partial ringing where only a subset of the side bands are observed for an intense peak. Fuzzy sites and partial ringing appear to be novel artifacts previously unreported in the literature and we hypothesize that all three artifact types derive from Fourier transformation-based issues. We have developed a set of tools to detect these artifacts and are developing new methods to mitigate or eliminate their effects on FTMS spectra and downstream analyses.

P5
Large-scale microarray data based feature selection for improved molecular classification
Liangqun Lu, Bernie J Daigle, Jr.

Background
Supervised feature selection for high-dimensional biological data is a critical component in the development of accurate diagnostic/prognostic molecular classifiers for complex diseases. Wrapper methods and other embedded techniques closely linked to learning algorithms have been widely applied to this task, while feature selection methods incorporating prior biological knowledge are less commonly used. However, these knowledge-driven methods have the potential to simultaneously improve classification performance as well as model interpretability.

Materials and methods
We adopted a Bayesian strategy for knowledge-driven feature selection to improve gene expression-based classification. By collecting and analyzing microarray gene expression profiles across hundreds of thousands of samples from the Gene Expression Omnibus (GEO), we have estimated prior probabilities of differential expression for each gene in the human genome. Using these probabilities, we have created a novel feature selection scheme based on the empirical Bayesian limma framework. Use of this knowledge-driven approach leads to the selection of qualitatively different features compared to those selected by knowledge-agnostic approaches.

Results
We have applied our feature selection approach to two publicly available gene expression datasets studying leukemia and asthma. Using both our knowledge-driven feature selection approach as well as a knowledge-agnostic method, we applied supervised support vector machine and logistic regression classifiers. We evaluated classification performance by measuring the area under the receiver operating characteristic curve (AUC). In the asthma dataset, our preliminary results suggest an improvement in AUC resulting from knowledge-driven feature selection. Current work involves applying our method to additional high-dimensional datasets, including recently collected data interrogating posttraumatic stress disorder (PTSD).
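For readers unfamiliar with the evaluation metric, the AUC used above can be computed directly from classifier scores as the fraction of positive-negative sample pairs that the classifier ranks correctly. The sketch below is a generic pure-Python illustration, not the authors' evaluation code; the function name and the 0/1 label encoding are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for the positive class, 0 for the negative class (assumed encoding).
    scores: classifier outputs, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so comparing the AUC of the same classifier trained on knowledge-driven versus knowledge-agnostic feature sets isolates the effect of the feature selection step.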

Background
Protein Nuclear Magnetic Resonance (NMR) plays an important role in the biophysical analysis of proteins, especially in the determination and study of their 3D structure. The accuracy of chemical shift assignments is a vital requirement for many aspects of NMR, especially protein structure determination. Traditional protein NMR technology relies on manual chemical shift referencing procedures that are prone to human error [Fig. 1] and cannot be validated until after the resonance assignment step. We present a Bayesian Model Optimized Reference Correction approach (BaMORC) that can correct referencing before resonance assignment.

Materials and methods
We are developing a statistics-based algorithm to correct referencing by:
1. Computing composition probabilities of the 20 amino acids from the Cα and Cβ resonance pairs in the NMR data of the protein under investigation;
2. Summing the probabilities across all resonance pairs to give an estimate of amino acid (AA) composition; and
3. Minimizing the L1 error between the predicted and actual protein AA composition via a grid search to estimate the referencing value that gives the minimum difference between them.
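The three steps above can be sketched as a small grid search. This is a toy illustration under strong assumptions, not BaMORC itself: each residue type is modeled with an isotropic Gaussian over (Cα, Cβ), which ignores the Cα/Cβ covariance, the two residue types "X" and "Y" and their shift values are made up for the example rather than taken from real chemical shift statistics, and all function names are hypothetical.

```python
import math

def gaussian_density(ca, cb, model):
    """Isotropic 2D Gaussian density; model = (mean_ca, mean_cb, sd)."""
    mean_ca, mean_cb, sd = model
    d2 = (ca - mean_ca) ** 2 + (cb - mean_cb) ** 2
    return math.exp(-d2 / (2.0 * sd * sd)) / (2.0 * math.pi * sd * sd)

def estimate_composition(pairs, offset, models):
    """Steps 1 and 2: per-pair residue-type probabilities, summed and normalized."""
    totals = {aa: 0.0 for aa in models}
    for ca, cb in pairs:
        # Apply the candidate referencing correction to both carbon shifts.
        dens = {aa: gaussian_density(ca - offset, cb - offset, m)
                for aa, m in models.items()}
        norm = sum(dens.values())
        if norm == 0.0:
            continue  # pair is too far from every model to be informative
        for aa, d in dens.items():
            totals[aa] += d / norm
    n = sum(totals.values())
    return {aa: (t / n if n else 0.0) for aa, t in totals.items()}

def l1_error(estimated, actual):
    """L1 distance between estimated and sequence-derived compositions."""
    return sum(abs(estimated[aa] - actual.get(aa, 0.0)) for aa in estimated)

def best_offset(pairs, actual, models, grid):
    """Step 3: grid search for the offset that minimizes the L1 error."""
    return min(grid,
               key=lambda off: l1_error(estimate_composition(pairs, off, models), actual))

# Toy setup: two made-up residue types and spectra mis-referenced by +1.0 ppm.
toy_models = {"X": (50.0, 20.0, 1.5), "Y": (52.0, 22.0, 1.5)}
toy_pairs = [(51.0, 21.0), (53.0, 23.0)]
toy_actual = {"X": 0.5, "Y": 0.5}
toy_grid = [i * 0.5 for i in range(-4, 5)]
```

The search only discriminates between offsets when the residue-type distributions overlap, which is why the toy models are placed close together; with well-separated distributions every nearby offset would yield the same composition estimate.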

Results and conclusions
From our results, we found that cysteine residues should be treated separately based on their oxidized/reduced states [Fig. 2]. In addition, the covariance between Cα and Cβ resonances is a potent but long-ignored statistic that should be utilized in NMR referencing methodology. We have demonstrated that the overall approach is feasible. When BaMORC is applied to the Re-referenced Protein Chemical shift Database RefDB [1], the 90% confidence range is 0.60 ppm, which suggests that the estimated reference value lies between -0.24 ppm and 0.45 ppm, assuming the correct reference value is 0 ppm. Currently we are developing a Shiny web app that will further simplify this protein NMR reference correction procedure. In the web interface, users can upload or paste their NMR peak list data directly into the app. The web app automatically groups the peaks into spin systems and applies the reference correction algorithm we have developed. The results of the analysis are returned as an HTML report and a corrected peak list file. The Shiny web app will provide the biomolecular NMR field with a unique tool that allows the referencing of NMR protein spectra to be corrected and refined at the beginning of NMR protein experiments without using chemical shift assignments or protein 3D structure, both of which the current retrospective referencing correction paradigm requires. Therefore, our method should improve both the speed and quality of protein resonance assignment and downstream NMR-based analyses, including structure determination.

Background
We present a bioinformatics application, MutChart, which streamlines the mutation identification and verification processes. We are using polymerase chain reaction-based random mutagenesis to generate a comprehensive library of mutations in the KCNH2 potassium channel gene that is responsible for ensuring proper heart rhythmicity. As part of this project, it is necessary to sequence a large number of PCR products to assess mutation density and spectrum. While candidate mutations can be identified by comparing the sequence data to a reference, each mutation should be manually validated to ensure its veracity.

Materials and methods
Many software programs are available for viewing raw sequence data for manual verification, but none is designed in a way that facilitates high-throughput visualization and validation. MutChart takes as input raw sequence trace data and the results of a BLAST search against the reference sequence. It then displays each mutation in a window that provides relevant information about the reference and alternate allele, the sequence quality score and, most importantly, a sequence trace plot for a few nucleotides on either side of the query nucleotide. The user views the trace plot to assess the candidate mutation's veracity and then accepts the mutation, rejects it, or marks it as questionable. This action automatically advances the plot window to the next mutation, thereby eliminating the need for further user navigation.

Conclusion
MutChart is the result of a collaboration between computer scientists, who solved a number of challenges related to the processing and visualization of large datasets, and biologists, who provided domain expertise in DNA sequence analysis and interpretation. Among its interesting software solutions, the implementation uses caching to avoid redundantly parsing previously used datasets, and uses dynamic loading to render the voluminous datasets.

Fig. 1 (abstract P6). Traditional protein NMR referencing workflow, which leads to a "chicken-egg dilemma", vs. the proposed automatic Bayesian Model Optimized Reference Correction approach

Fig. 2 (abstract P6). Example alpha and beta carbon shift bivariate distributions. Cα shifts are shown along the x-axis, while Cβ shifts are shown along the y-axis. The cysteine distribution is dramatically different from those of the other amino acid types, and treating it as a single residue type is incorrect. The bottom row shows cysteine separated by oxidation state, which is more useful for estimating the reference in protein NMR

P8
Identifying heteroplasmy in D. carota using whole genome shotgun sequencing without known variants

Background
Organellar genomes are commonly inherited uniparentally, leading to a single genome being passed down without variation. Any recombination, biparental inheritance, or mutation of the organellar genomes leads to variation within the individual, known as heteroplasmy. Heteroplasmy has been observed in many species and is known to have phenotypic consequences, often resulting in reductions of fitness. In humans, it is associated with mitochondrial diseases and cancer. Research on the effects of heteroplasmy on the fitness of plants is limited, but studies suggest such genomic variation to be pervasive; wild carrot (Daucus carota) was found to be 60% heteroplasmic [1].

Materials and methods
MToolBox is an automated pipeline for the identification of heteroplasmy in humans that requires a complete human reference genome [2]. We have adapted this pipeline for more generalized use by allowing the input of any reference nuclear genome. Using a high-quality mapper (Bowtie 2), a duplicate marker (Picard Tools), and the assembler and VCF output generator from MToolBox, we are able to identify heteroplasmy frequencies and locations in a sample without requiring a reference of known heteroplasmic variants.
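As a rough illustration of what a heteroplasmy frequency means at a single site, the sketch below computes the minor-allele read fraction from per-site allele counts and reports sites above a cutoff. It does not reproduce MToolBox's calling logic; the input structure, the function name, and the depth and fraction thresholds are all assumptions chosen for the example.

```python
def heteroplasmic_sites(allele_counts, min_depth=100, min_fraction=0.05):
    """Return (position, minor-allele fraction) for candidate heteroplasmic sites.

    allele_counts maps a position to a dict of allele -> supporting read count,
    as might be tallied from a VCF or pileup (hypothetical input format).
    """
    sites = []
    for position, counts in sorted(allele_counts.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to estimate a frequency reliably
        ranked = sorted(counts.values(), reverse=True)
        minor = ranked[1] if len(ranked) > 1 else 0
        fraction = minor / depth
        if fraction >= min_fraction:
            sites.append((position, fraction))
    return sites
```

Because no list of known variants is consulted, every sufficiently covered position is a candidate, which matches the reference-free spirit of the adapted pipeline; in practice the thresholds would need to account for sequencing error rates.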

Results
Using whole genome shotgun (WGS) sequencing of four individuals of wild carrot, D. carota, we have identified high-confidence heteroplasmic sites in the mitochondrial and chloroplast genomes. Ongoing work involves searching for patterns of heteroplasmy within the population (e.g., if it is more prevalent in exons or introns) and documenting the effects of heteroplasmy on fitness. In the future, we plan to scale up our analysis to over 190 samples of D. carota.

P9
Applying deep learning to predict phenotype based on genetic

Background
GeneNetwork [www.genenetwork.org] is a web tool that enables analysis of genetic and gene expression datasets across large panels of recombinant inbred mice [1]. Analysis of GeneNetwork data is challenging due to variability in microarray platforms, normalization methods, and biological factors. The goal of this project was to develop an analysis pipeline using literature-derived functional cohesion to evaluate GeneNetwork output and to extract meaningful insights.

Materials and methods
The workflow for our analysis pipeline is shown in Fig. 3. Using GeneNetwork, we identified the top 200 genes whose expression levels correlated with Sirt3 expression in liver tissues across BXD recombinant inbred mice. We examined Sirt3-correlated gene networks in seven liver datasets derived from different microarray platforms and normalization methods. For two datasets, two different Sirt3 probesets were analyzed. Literature cohesion p-values (LPv) were calculated for the top 200 Sirt3-correlated genes using the GeneSet Cohesion Analysis Tool [http://binf1.memphis.edu/gcat/], which was developed previously by our group [2]. To evaluate our approach, we used a gold-standard set of 429 Sirt3 target proteins, which were previously reported to be differentially acetylated in liver tissues from Sirt3 knockout mice compared with wildtype controls [3]. Recall refers to the number of genes shared between a Sirt3-correlated gene network and the gold-standard set. Functional enrichment analysis was performed using DAVID [https://david.ncifcrf.gov/].

Results
We found a very high correlation (R² = 0.97) between literature cohesion of Sirt3-correlated gene networks and recall of the gold-standard set (Fig. 4). Functional enrichment analysis of the network with the lowest LPv revealed that the Sirt3-correlated genes belong to the following Gene Ontology classifications, among many others: Mitochondrion (p-value = 4.3E-42), Oxidoreductase Activity (p-value = 2.3E-40), Lipid Metabolism (p-value = 1.2E-12), and Synthesis of Amino Acid (p-value = 1.7E-7). These results are consistent with previous reports that Sirt3 is a key regulator of mitochondrial metabolic processes [3].
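The evaluation above rests on two simple quantities: recall, defined in the Materials and methods as the number of genes shared between a network and the gold-standard set, and the squared correlation between literature cohesion and recall across networks. A minimal sketch of both follows; the function names are hypothetical, and whether LPv was transformed (for example to a log scale) before correlation is not stated in the abstract, so the caller is left to supply whichever values are appropriate.

```python
def gold_standard_recall(network_genes, gold_standard):
    """Number of genes shared by a correlated-gene network and the gold standard."""
    return len(set(network_genes) & set(gold_standard))

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)
```

With one cohesion score and one recall value per network, `r_squared` over the nine networks would give a figure comparable to the one reported, provided the same scale for LPv is used.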

Conclusions
Our results provide proof-of-concept that literature cohesion analysis can rapidly identify biologically meaningful gene networks from the vast amount of genomic data accumulating in publicly available resources such as Genenetwork.org and Gene Expression Omnibus (GEO). We posit that our approach will facilitate discovery from high-throughput genomic data.

§ Equally contributing authors

Background
As disease states are either precipitated by or result in metabolic dysregulation, metabolite concentrations can be utilized for determining physiological processes that are differentially impacted across disease states. For example, while coagulation is a homeostatic response to vascular injury, dysregulation can lead to pathological thrombosis, the cause of acute myocardial infarction, a leading cause of death in humans. To determine such dysregulation, a representation of the metabolome in a non-pathological state or a reference phenotype is needed. We sought a Gaussian Graphical Modeling (GGM) approach for constructing a reference metabolome that incorporates prior knowledge of biochemical structural similarity. A full joint distribution representation was sought to facilitate inference regarding partial correlation structure. We evaluated the method for constructing a plasma metabolome for a stable, yet diseased state from human subjects presenting with Coronary Artery Disease (CAD). This representation will provide a reference for systems-level comparisons across the disease state transition from stable to acute myocardial infarction.

Materials and methods
The Graphical Lasso (gLASSO: graphical least absolute shrinkage and selection operator) was proposed for the estimation of sparse inverse covariance matrices for multivariate Gaussian distributions. The gLASSO algorithm estimates the inverse of the covariance matrix by maximizing the L1-penalized log-likelihood function via coordinate descent. Ambroise et al. proposed modifying the regularization term to incorporate an adaptive penalty. In previous applications, the adaptive penalization for estimating concentration matrices was predicated on assuming a latent clustering of the variables, to be estimated by expectation maximization or other clustering approaches. We instead devise an adaptive penalty that varies inversely with molecular similarity. Molecular similarity was defined via the Tanimoto distance measure using bitwise atom-pair fingerprinting and was used for generating adaptive penalties in constructing a plasma metabolome for the phenotype of interest.

Fig. 3 (abstract P10). Analysis workflow

Fig. 4 (abstract P10). Correlation between literature cohesion and gold-standard gene recall for Sirt3 gene networks in liver. Literature cohesion p-value (LPv) and recall were calculated for nine Sirt3 gene networks (each included 200 Sirt3 correlated genes) derived from seven different BXD recombinant inbred mouse datasets
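The similarity-based adaptive penalty in the metabolomics Materials and methods can be illustrated with a short sketch. It is not the authors' code: fingerprints are represented here as Python integers used as bit vectors, the penalty is taken to be a base penalty multiplied by (1 - Tanimoto similarity), which is only one of several ways a penalty could vary inversely with similarity, and the function names are hypothetical.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto coefficient of two bit-vector fingerprints stored as ints."""
    shared = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    union = bin(fp_a | fp_b).count("1")    # bits set in either fingerprint
    return shared / union if union else 0.0

def adaptive_penalties(fingerprints, base_penalty=1.0):
    """Pairwise L1 penalties that shrink as structural similarity grows.

    fingerprints maps a metabolite name to its fingerprint. The returned dict
    maps (name_a, name_b) to base_penalty * (1 - similarity), an assumed form
    of the inverse relationship described in the text.
    """
    names = list(fingerprints)
    penalties = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = tanimoto_similarity(fingerprints[a], fingerprints[b])
            penalties[(a, b)] = base_penalty * (1.0 - sim)
    return penalties
```

The resulting pairwise values would populate the off-diagonal entries of a penalty matrix passed to a graphical lasso solver that accepts element-wise penalties, so that edges between structurally related metabolites are shrunk less than edges between unrelated ones.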

Results
We constructed a reference plasma metabolome for a CAD phenotype. We observed that for a fixed number of edges in the Gaussian Graphical Model (GGM). As expected, adaptive penalization increased the likelihood of edge formation between metabolites that are structurally related.

Conclusions
While our evaluation does not provide evidence that biochemical fingerprint-based adaptive penalization increases the overall likelihood of a GGM in representing a metabolome, a theoretical evaluation is needed. A framework for the probabilistic integration of prior biochemical knowledge in constructing metabolomics-based graphical models remains desirable for facilitating pathway and biochemical process level inference.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.