Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data
- Charles Cole†1,
- Konstantinos Krampis†2,
- Konstantinos Karagiannis1,
- Jonas S Almeida3,
- William J Faison1,
- Mona Motwani1,
- Quan Wan1,
- Anton Golikov4,
- Yang Pan1,
- Vahan Simonyan4 and
- Raja Mazumder1, 5
© Cole et al.; licensee BioMed Central Ltd. 2014
Received: 5 November 2013
Accepted: 22 January 2014
Published: 27 January 2014
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.
To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
Keywords: SRA, TCGA, nsSNV, SNV, SNP, Next-gen, NGS, Phylogenetics, Cancer
Researchers in cancer biology are well aware of the possibilities that Big Data can offer and there are already several efforts underway to generate relevant large-scale cancer genomics data [1–3]. International and national networks of collaboration, such as the International Cancer Genome Consortium (ICGC), the Global Cancer Genomics Consortium (GCGC), the Early Detection Research Network (EDRN) and other NCI programs (like The Cancer Genome Atlas program (TCGA)), are generating increasingly large amounts of data, the vast majority of which is from next-generation sequencing (NGS) technologies. TCGA data is expected to surpass 100 petabytes by the completion of the project. In a recent survey of the files generated by the TCGA initiative we found that its file count has been doubling every 7 months since 2010, with a total count above 700,000 files. Although it is desirable that cancer biologists use this data to develop and test hypotheses, realistically, few wet-laboratory researchers have the infrastructure or the knowledge of scores of complex bioinformatics tools needed to glean a higher understanding from the disparate sequence files and complex, scattered annotations. These challenges are leading to the development of tools and secondary databases which are expected to democratize Big Data use [8–10], and initiatives such as the Human Variome Project have started playing an important role by providing guidelines that encourage standardizing and sharing of information related to human genetic variation [11–13].
Biological information is usually concentrated in databases of two main types: primary databases comprised of raw data submitted by researchers, and secondary databases that extract and filter the information available from the primary databases and add annotations generated either manually or automatically through the efforts of biocurators [14–16]. One of the problems often faced by end users of Big Data is the lack of curated information in primary NGS data repositories such as the NCBI Short Read Archive (NCBI-SRA) and The Cancer Genomics Hub (CGHub). It is expected that curated secondary databases will help organize the Big Data and make it more user-friendly, similar to what secondary database development efforts like RefSeq and UniProtKB/Swiss-Prot have done and are still doing for GenBank. Additional higher level databases like Pfam, PIRSFs, KEGG and others organize objects into functional groups and provide information on biological function, networks and processes. It is clear that knowledge can be gained when raw data moves in a vertical fashion, from millions of bases of DNA or RNA into proteins, then to protein families, and finally into networks of interrelated biological processes. Currently, NGS data in public repositories are not well connected to molecular biology resources and reference datasets, and validated methods for data processing and filtering are not always available, necessitating significant bioinformatics expertise to use and analyze such information. In this paper, we describe a workflow to curate and analyze NGS data from control (normal tissue) and case (tumor) samples derived from cancer patients. We chose to curate publicly available NGS data to provide users an unbiased view of variation that is present at the individual person level and is not yet completely captured in dbSNP, thereby providing a better understanding of the human variome.
Additionally, based on our previous work on functional analysis of non-synonymous Single Nucleotide Variations (nsSNVs) from dbSNP, UniProt and COSMIC, we show how proteome-wide analysis of variation can provide a detailed view of the distribution of variation and possible functional impact [26–29].
Currently, there are thousands of large-scale sequence datasets from cancer case and control samples that are available from primary short read data repositories such as NCBI SRA and TCGA. It is expected that comprehensive and integrated analysis of this data will lead to novel discoveries. We believe that computational and manual curation of this data will provide unprecedented value in cancer research that will eventually lead to better cancer detection, therapies, and care. As proof-of-concept, we have analyzed nsSNVs from 55 samples (22 cases and 33 controls) obtained from 20 breast cancer patients and have recorded the analysis results in the Curated Short Read (CSR) archive. The samples provide a rich source of sequence data that can be mined to extend and complement mutation and single-nucleotide polymorphism (SNP) information available from dbSNP, UniProt, COSMIC and other variation databases. We intend to curate and analyze representative samples from all datasets that are available through TCGA. For our initial study, through focused analysis of the breast cancer samples, we show how a workflow that identifies novel variations, explores the effects of nsSNVs on the human proteome and classifies patients based on Single Nucleotide Variations (SNVs) can provide a higher level of information that can be used by researchers to evaluate experimental targets and also to generate and test hypotheses related to personalized medicine. To facilitate implementation of this workflow by other users, we provide an nsSNV analysis tool, SNVDis, that can be used by researchers or biocurators interested in evaluating the effects of variation. With large-scale informatics fast becoming an integral component of cancer research, the workflow described here can be easily applied to other datasets.
Architecture and computational environment
The human reference genome GRCh37 (Genome Reference Consortium Human Reference 37, GCA_000001405.1) is downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). UniProtKB protein amino acid position and ID mapping is done using SNVDis and ID Mapping services [28, 32]. Functional and sequence information is obtained from RefSeq, Conserved Domain Database, UniProtKB/Swiss-Prot and CCDS.
Data selection criteria
For this study, we concentrate on sequence data from breast cancer cases and controls. Data sets derived from twenty patients are selected for analysis. Selection is based on the availability of clinical information, race, and paired case and control samples. In addition, the presence of both exome and RNA-Seq data from the same patient is used as a selection criterion, because our users deem these to be high-priority datasets: many hypothesis-driven questions can be answered through their comparative analysis. Information is retrieved using the data matrix available at https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm. “BRCA-Breast invasive carcinoma” is selected as “Disease type” and “Clinical” as the “Data Type”. Patients from three different races are included: three patients are African-American, one is Asian, and the rest are White. All patients are female and had no previous history of malignancy. The tumor and the matched control samples can be identified by the TCGA barcode associated with each sample. A TCGA barcode is a collection of identifiers, as in the sample ID TCGA-CH-5739-01A-11D-1576-08, where 5739 is the participant number and 01 is the sample type. Tumor sample types range from 01 to 09 and matched normal sample types from 10 to 19, depending on the type of tumor and normal sample. The short reads are then mapped to the human reference sequence and further analyzed to identify variations.
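The barcode convention described above can be sketched in a few lines of Python. This is only an illustration, not part of the published pipeline: the function name `parse_tcga_barcode` and the returned dictionary keys are hypothetical, and the letter that follows the two-digit sample type is simply ignored.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA sample barcode and classify it as tumor or normal.

    Hypothetical helper for illustration; only the fields described in
    the text (participant number, two-digit sample type) are extracted.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    participant = fields[2]
    sample_type = fields[3][:2]          # "01A" -> "01"
    code = int(sample_type)
    if 1 <= code <= 9:
        kind = "tumor"                   # case sample
    elif 10 <= code <= 19:
        kind = "normal"                  # matched control sample
    else:
        kind = "other"
    return {"participant": participant, "sample_type": sample_type, "kind": kind}


if __name__ == "__main__":
    print(parse_tcga_barcode("TCGA-CH-5739-01A-11D-1576-08"))
```

Grouping the parsed records by participant is then enough to pair each case with its matched controls.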
SNV filtration and annotation
After the raw SNV data is generated using Bowtie and SAMtools, filters are used to select high-quality SNVs with the desired coverage (>10 reads) and quality score (>20). The filtration process also rejects detected SNVs that fall outside the exome regions, which may be caused by non-unique regions in the genome.
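A minimal sketch of this filter is shown below. It assumes SNV calls have already been parsed into simple records; the `SNVCall` type, the `passes_filters` function and the layout of `exome_regions` (chromosome mapped to inclusive start/end intervals) are hypothetical names chosen for illustration and do not reflect the Perl scripts used in the pipeline.

```python
from typing import NamedTuple


class SNVCall(NamedTuple):
    chrom: str
    pos: int      # 1-based genomic position
    depth: int    # number of reads covering the position
    qual: float   # variant quality score


def passes_filters(call, exome_regions, min_depth=10, min_qual=20.0):
    """Keep calls with coverage > min_depth, quality > min_qual, inside the exome."""
    if call.depth <= min_depth or call.qual <= min_qual:
        return False
    # exome_regions: {chrom: [(start, end), ...]} with inclusive coordinates
    return any(start <= call.pos <= end
               for start, end in exome_regions.get(call.chrom, []))


if __name__ == "__main__":
    regions = {"chr1": [(100, 200)]}
    calls = [
        SNVCall("chr1", 150, 25, 40.0),   # kept
        SNVCall("chr1", 150, 10, 40.0),   # rejected: coverage not above 10
        SNVCall("chr1", 500, 25, 40.0),   # rejected: outside exome regions
    ]
    print([c for c in calls if passes_filters(c, regions)])
```

For a whole exome, the linear scan over intervals would normally be replaced by an interval index, but the filtering logic stays the same.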
nsSNV distribution on functional sites
Experimental post-translational modification (PTM) sites are obtained from UniProtKB/Swiss-Prot and dbPTM 3.0, which provides experimentally verified PTM sites. Python scripts are used to remove redundancy from the nsSNV dataset, map the resulting unique nsSNVs to PTM motifs in UniProtKB proteins, and count how many PTM sites carry an nsSNV. Any nsSNV that causes an unacceptable change in a given PTM motif is considered a loss of the PTM site. To evaluate the effects of nsSNVs obtained from case and control samples, statistics are generated and a heatmap is constructed using R. To obtain the heatmap, the binary matrix of presence or absence of an SNV at each position is loaded into R and, using the packages ggplot2 and reshape, each cell is colored either red (‘present in only case or control’) or green (‘present in both case and control’). This information is then plotted with the amino acid positions along the vertical axis.
The CGquery and Gene Torrent utilities (https://cghub.ucsc.edu/software/downloads.html) are used to search for and retrieve BAM files from CGHub. Alignment is performed using Bowtie version 0.12.5. SNVs are called using SAMtools version 0.1.18 and bcftools version 0.1.17. The pipeline consists of a series of Perl scripts and the above-mentioned software, which are called using a wrapper script. The wrapper is accessed via the command line and accepts two arguments: a list of fastq files that need to be analyzed and the location where the output should be placed. The statistics are generated with R.
Phylogenetic analysis and SNV visualization
A novel algorithm (phyloSNP) has been developed to create SNV-based phylogenetic trees. The first step involves creating an alignment that contains the genomic sequence around SNVs. For this study we chose to include zero, one and two nucleotides upstream and downstream of every SNV to create an SNV-shrunk genome. More specifically, the SNV-shrunk genome alignments are created using phyloSNP (https://hive.biochemistry.gwu.edu/hive/dna.cgi?cmd=phylosnp) by concatenating, for each sample, the regions of the genome that have an SNV. If one sample has an SNV at a particular position, then the SNV-shrunk genomes of all other samples also include that region. The SNV-shrunk genomes of all samples therefore form an alignment. This alignment is then used to generate neighbor-joining phylogenetic trees using Clustal with 100 bootstrap replicates. Bootstrap values indicate the confidence of the branches in the estimated trees. The trees are viewed in TreeView.
Filtered SNVs are submitted to the Seattleseq 137 Annotation Service to obtain positional and functional annotation. Additional functional analysis of proteins affected by nsSNVs is performed based on methods described earlier [26–28]. Briefly, nsSNV data is uploaded into the SNVDis database and integrated with protein sequence features obtained from UniProtKB/Swiss-Prot, the Conserved Domain Database (CDD) and RefSeq of NCBI. SNVDis provides graphical and tabular output of variations that affect functionally annotated sequence sites. SNVDis also reports whether certain pathways and domains affected by nsSNVs are over- or under-represented. Gene Ontology analysis of genes affected by rare variants is performed using PANTHER tools [43, 44].
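The construction of SNV-shrunk genomes can be illustrated with a short Python sketch. It is not the phyloSNP implementation: it assumes a single contiguous reference string, 0-based positions, and substitutions only, and the function name `snv_shrunk_alignment` and its argument layout are hypothetical.

```python
def snv_shrunk_alignment(reference, sample_snvs, flank=1):
    """Build one concatenated SNV-shrunk sequence per sample.

    reference   : reference sequence as a string
    sample_snvs : {sample_name: {position (0-based): alternate base}}
    flank       : nucleotides kept upstream and downstream of each SNV
    """
    # Union of SNV positions over all samples: every sample contributes a
    # window at every position, so all output sequences have equal length.
    positions = sorted({p for snvs in sample_snvs.values() for p in snvs})
    alignment = {}
    for sample, snvs in sample_snvs.items():
        pieces = []
        for pos in positions:
            start = max(0, pos - flank)
            end = min(len(reference), pos + flank + 1)
            window = list(reference[start:end])
            # Apply this sample's own SNVs that fall inside the window.
            for i in range(start, end):
                if i in snvs:
                    window[i - start] = snvs[i]
            pieces.append("".join(window))
        alignment[sample] = "".join(pieces)
    return alignment


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    snvs = {"case": {2: "T"}, "control": {7: "A"}}
    for name, seq in snv_shrunk_alignment(ref, snvs, flank=1).items():
        print(name, seq)
```

Because every sample contributes a window at every SNV position seen in any sample, the resulting sequences are equal in length and can be written out as an alignment for a tree-building program.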
Case and control – Samples derived from paired tumor (case) and normal (control) tissue; dbSNP overlap – SNVs that are also found in dbSNP; novel SNV – found only in the analyzed dataset; rare SNV – found in less than 10% of the samples analyzed; common SNV – found in 90% of the samples analyzed.
Results and discussion
SNVs are widely used to identify disease-causing genes and the history of populations [45–47]. Many advances in the diagnosis and treatment of cancers have been made through such mutation discovery and analysis [48–52]. Combining the results of several studies (meta-analysis) can increase the power of the analysis. These meta-analyses combine the results (SNVs) from multiple studies and, using different statistical tools, identify the SNVs most associated with a specific disease or phenotype. Analysis of samples across different studies would provide a glimpse of the heterogeneity that is present in the population, and this information can then be used by researchers to connect genomic changes to diseases. Additionally, availability of variation data from control samples can provide a more comprehensive understanding of the human variome in addition to what has been determined by projects such as the 1000 Genomes project. The effects that a specific variation has on protein function have been the focus of studies for quite some time, with several tools that predict SNV effects [54–57]. Proteome-wide analysis of variation that affects known functional sites [26–28, 58] is another way of estimating how variation can affect function at a system level and whether there are specific domains or pathways that are more prone to having variations.
There is a great need in biological research and discovery for curated metadata that is associated with short sequence reads. Just like GenBank, NCBI-SRA and other public repositories of short sequence reads around the world are all primary databases with minimal or no curation. This means that it is extremely difficult for users to search for and retrieve studies that can be used for additional analysis, or to browse analysis results that are associated with specific genes of interest.
Sequencing has identified key disease-specific mutations in many cancers, where the authors filter variation information from dbSNP to identify cancer-specific variations. The data in dbSNP does not yet capture all possible individual level variation. Hence we intend to focus on analyzing and curating samples which can be used in conjunction with dbSNP data to better understand the human variome. For comparison purposes both cases and controls are analyzed. The key fields that we focus our curation efforts on are as follows: 1) Study, experiment and sample title, type, abstract and associated publications; 2) Organism name and taxonomy ID; 3) Additional information wherever applicable such as sample type, tissue site, clinical status, age, gender, ethnicity and Gleason score; 4) Identification of nsSNVs; 5) Mapping of nsSNVs to dbSNP. In this project, for tasks one, two and three, data is obtained from TCGA files and manually verified. Publications that use TCGA data files are searched for in PubMed and manually checked to confirm that they report analysis of the cancer-specific data under consideration. For tasks four and five, a computational approach that involves read mapping and SNV calling followed by spot checks is performed. All metadata in CSR is manually verified and entered. Samples whose GC content falls outside the acceptable range of 38-48% are not processed for curation.
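The GC content check at the end of this step is straightforward to express in code. The sketch below is an assumption about how such a check could be written, not the curation pipeline itself; the function names are hypothetical, and ambiguous bases such as N are excluded from the denominator as a design choice of this example.

```python
def gc_fraction(reads):
    """Fraction of G and C among all unambiguous bases in an iterable of reads."""
    gc = total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")  # skip N and other codes
    return gc / total if total else 0.0


def acceptable_gc(reads, low=0.38, high=0.48):
    """True when the sample's GC content lies within the 38-48% window."""
    return low <= gc_fraction(reads) <= high


if __name__ == "__main__":
    sample_reads = ["GCAAT", "ATGCA"]
    print(gc_fraction(sample_reads), acceptable_gc(sample_reads))
```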
Table: Snapshot of information obtained upon searching the CSR database with a protein or gene accession number (column: RefSeq nucleotide AC; sample type shown: Breast Cancer Case)
After the SNVs are called, filtering procedures as described in Materials and Methods are used to identify high-quality SNVs. In order to investigate the distribution of the variants from 55 samples (some patients have more than one control or case) derived from 20 patients, we perform two types of comparison: 1) We compare with dbSNP to calculate the proportion of known and novel variants that we identify through our pipeline. 2) Within this study, comparison is conducted by calculating the common and rare SNVs and the concordance (see descriptions of concordance, novel, common and rare SNVs in Methods) between case and control sets. It is possible that sequencing errors can lead to identification of SNVs which in reality may not be present. Liu et al. performed a comprehensive study in which they showed that the read preprocessing step did not improve the accuracy of variant calling, but that the ability to flag duplication, local realignment and recalibration steps helped reduce false positives, and that sequencing depth was also important. The study also noted that SAMtools performed quite well in identifying SNVs. Nonetheless, validation of the novel nsSNVs identified through NGS analysis can be performed using traditional Sanger sequencing of PCR products. For example, novel variations found in this study, if also identified in the NCI-60 exome samples for breast cancer cell lines, can be easily validated using the procedure mentioned above. Further validation of the nsSNVs can also be performed using peptide mass-spectrometry. Such validation will become critical if any of the novel nsSNVs identified through this study is found in several samples and is hypothesized to be related to the disease.
Overlap with dbSNP and analysis of novel variations
The total pool of SNVs is further grouped by their calculated frequency among samples in this study. The SNVs with frequency higher than 10% are defined as common SNVs while the SNVs lower than this frequency are grouped into rare SNVs. The difference in percentage between novel and known (not present or present in dbSNP) SNVs is illustrated in Figure 3B. As shown, in both case and control samples, the common SNVs have a higher (almost 99%) overlap with dbSNP, while rare SNVs, which are present in less than 10% of the samples, have around 90% overlap with dbSNP. It is indeed possible that some of the novel mutations identified are cancer drivers, as suggested in a recent paper by Khurana et al.
In order to explore the distribution preference of novel and known SNVs on genomic functional regions, all the SNVs were annotated using the Seattleseq Annotation 137 web service. Although the difference is not statistically significant, in our dataset novel SNVs appear more likely to affect protein coding regions, as missense, stop-gain and splicing variants (Figure 3C).
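The grouping and the dbSNP comparison described here reduce to simple set operations. The following sketch assumes each SNV is represented by an identifier and the set of samples in which it was called; the function names and the handling of an SNV at exactly 10% (treated as rare here) are assumptions of this example.

```python
def classify_by_frequency(snv_samples, n_samples, threshold=0.10):
    """Label each SNV 'common' (frequency > threshold) or 'rare'.

    snv_samples: {snv_id: set of sample names carrying the SNV}
    """
    return {
        snv: "common" if len(samples) / n_samples > threshold else "rare"
        for snv, samples in snv_samples.items()
    }


def dbsnp_overlap(snvs, dbsnp_ids):
    """Fraction of the given SNVs that are already present in dbSNP."""
    snvs = set(snvs)
    return len(snvs & set(dbsnp_ids)) / len(snvs) if snvs else 0.0


if __name__ == "__main__":
    calls = {
        "chr1:100:A>G": {"s1", "s2", "s3"},
        "chr2:200:C>T": {"s1"},
    }
    labels = classify_by_frequency(calls, n_samples=20)
    known = {"chr1:100:A>G"}
    for group in ("common", "rare"):
        members = [s for s, label in labels.items() if label == group]
        print(group, len(members), dbsnp_overlap(members, known))
```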
Distribution of SNVs in cases and controls
The methodology adopted for functional analysis is based on our earlier work [26, 27]. In summary, we first evaluated the overall impact of the nsSNVs from the case and control samples on the entire proteome (proteome-wide analysis) in terms of effects on functional sites such as active sites, binding sites, co- and post-translational modification sites. Then we evaluate which domains and pathways are most affected by variation. Additionally, we also perform Gene Ontology, pathway and keyword analysis of the novel nsSNVs to better understand the effects of variations which are presumably rare.
Proteome-wide analysis of the effects of nsSNVs
A broad analysis of all identified nsSNVs and also of novel nsSNVs (variations not found in dbSNP and other variation databases) was undertaken to get a comprehensive overview of the functional impact of variation. For this analysis all the nsSNVs derived from the CSR project are integrated into SNVDis, a CSR companion tool for proteome-wide analysis. SNVDis is integrated into a High Performance Integrated Virtual Environment (HIVE) that allows proteome-wide analysis of the nsSNVs. The SNVDis tool home page shows nine sources of variation data, two of which come from this study (TCGA-Breast-Control and TCGA-Breast-Case). The default proteome on which the analysis is performed is the human proteome as defined by UniProtKB/Swiss-Prot. In the analysis box in SNVDis users can choose the type of analysis they wish to perform. For example, selecting TCGA-Breast-Control and TCGA-Breast-Case and binding site will retrieve all nsSNVs that alter protein binding sites (as defined by UniProtKB/Swiss-Prot and CDD curators). This tool provides a comprehensive overview of how the nsSNVs affect active sites, binding sites, N-linked glycosylation sites, protein domains and pathways. For example, selecting the active sites (includes site annotations from both UniProt and CDD) that are affected by nsSNVs from the breast cancer case and control samples retrieves 56 sites in 44 proteins.
For pathway and domain analysis SNVDis estimates the number of variations expected in the domain or pathway based on a uniform distribution of nsSNVs. For pathway analysis, UniProtKB/Swiss-Prot is selected, TCGA-Breast-Control and TCGA-Breast-Case are selected from the ‘Select dataset’ box, and the ‘Pathways’ tab is selected from the ‘nsSNV analysis on’ box, followed by selection of the ‘by significance’ option. A p-value cutoff of 0.0000001 is chosen. The top five pathways that are affected when both cases and controls are taken together are Nicotine degradation (observed: 81, expected: 33.2, p-value: 1.11E-16), FAS signaling pathway (observed: 81, expected: 33.2, p-value: 1.11E-16), DPP-SCW signaling pathway (observed: 76, expected: 37.9, p-value: 5.83E-10), Cadherin signaling pathway (observed: 433, expected: 280.0, p-value: 3.93E-20) and Blood coagulation (observed: 156, expected: 87.2, p-value: 1.69E-13). No immediate correlation to cancer pathways can be drawn from this analysis other than the fact that changes to signaling pathways are considered to be important in cancer progression.
Using a protocol similar to the one described above, the top five functional domains (sorted by p-value) that are affected in case and control samples are found to be almost identical. For the case samples they are the transmembrane olfactory receptor (Pfam ID: PF13853), the cysteine rich domain that occurs alongside the TIL domain (PF12714), the glycoprotein-fucosylgalactoside alpha-N-acetylgalactosaminyltransferase domain (PF03414), a mammalian taste receptor protein domain (PF05296) and a protein kinase domain (PF00069). For the control samples, the glycoside hydrolase family 18 domain (PF00704) has an over-representation of nsSNVs instead of the kinase domain. Interestingly, the greatest difference observed between the control and case analyses is that the hyaluronan/mRNA binding domain (PF04774) is affected significantly more in the case samples (observed: 17; expected: 3.6; p-value: 1.26E-12) than in the control samples (observed: 9; expected: 3; p-value: 6.41E-03). Increased levels of hyaluronan have already been correlated with breast cancer and are often used as a marker. More samples would need to be analyzed to confirm this correlation.
As described earlier, the number of case-only or control-only nsSNVs is less than 20% of the total variations, and analyzing for over- or under-representation in cases and controls separately did not result in any appreciable differences in the highly affected pathways and domains. Therefore, a more detailed analysis of just the novel nsSNVs and their effects on functional sites was undertaken, the results of which are described in the next section.
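The idea of an expected count under a uniform distribution can be made concrete with a small calculation. The text does not state which statistical test SNVDis applies, so the binomial upper tail used below is an assumption chosen only to illustrate the observed-versus-expected comparison; the function names and the example numbers are hypothetical.

```python
import math


def expected_count(total_nssnvs, region_length, proteome_length):
    """Expected nsSNVs in a region if variants fall uniformly along the proteome."""
    return total_nssnvs * region_length / proteome_length


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 nsSNVs, a pathway covering 0.5% of all residues.
    n, frac = 1000, 0.005
    print("expected:", expected_count(n, 50, 10000))
    print("P(X >= 20):", binomial_upper_tail(20, n, frac))
```

A pathway or domain would be flagged as over-represented when this tail probability falls below the chosen cutoff.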
Enrichment analysis of novel nsSNV affected proteins
In addition to the proteome-wide analysis of all nsSNVs, an analysis of genes that are impacted by novel nsSNVs was also performed. For this analysis, from case samples a total of 17,177 novel nsSNV-containing gene accession numbers are mapped to 8,896 UniProtKB/Swiss-Prot proteins, and for novel nsSNVs in controls 13,523 gene identifiers are mapped to 6,961 UniProtKB/Swiss-Prot proteins. The number of proteins is lower than the number of initial RefSeq gene identifiers because UniProtKB/Swiss-Prot entries represent the canonical protein whereas the genes in RefSeq can represent different isoforms. An initial analysis using the UniProtKB/Swiss-Prot keyword ‘Disease’ shows that the keyword is over-represented in the list of genes having novel nsSNVs from cases (observed: 1524; expected: 1211; p-value: 5.22E-21). The keyword also appears to be over-represented for novel nsSNVs in the controls, albeit with a less significant p-value (observed: 1155; expected: 948; p-value: 6.46E-09). Based on UniProtKB/Swiss-Prot protein entry annotations of genes that are considered oncogenes, proto-oncogenes and tumor suppressors, there seems to be a slight over-representation of tumor suppressor genes among novel nsSNVs that are found only in case samples (observed: 43; expected: 27; p-value: 3.84E-03). It is important to note that cancer disease annotations in UniProtKB/Swiss-Prot (or in any other database) are far from comprehensive. As more patient samples are analyzed and the disease-specific annotations improve, it will be possible to determine through this type of analysis whether specific genes that are involved in cancer do indeed have a higher level of mutations in both controls and cases.
Functional analysis of novel nsSNV containing genes
PANTHER pathway: Integrin signaling pathway; Cadherin signaling pathway; Endothelin signaling pathway; Gonadotropin releasing hormone receptor pathway; Nicotinic acetylcholine receptor signaling pathway
PANTHER protein classification: Cell adhesion molecule
GO biological process: Cellular component organization; Protein modification process
GO molecular function: Transmembrane transporter activity; Enzyme regulator activity
GO cellular component: MHC protein complex
Over- and under-representation of Gene Ontology (GO) terms, PANTHER pathways and protein classification in the list of genes which have novel nsSNVs provides an overview of what broad effects these novel variations might have. The major terms for GO Biological Processes that are highly over-represented (p-value >1E-24) are metabolic process, cellular process and transport (primarily protein and ion transport). Table 2 provides a breakdown of the major terms. For GO Molecular Function the major terms include catalytic activity, binding (includes protein binding) and transporter activity (p-value >1E-9). Other notable GO Molecular Function terms include kinase activity (observed: 432; expected: 314.80; p-value: 1.60E-08) and ion channel activity (observed: 242; expected: 186.80; p-value: 7.75E-03). Many of the pathways identified as over-represented in the gene list are associated with cancer [67, 68]. For example, it is known that integrins and cadherins initiate signaling pathways that control the activity of Rho family GTPases. Other over-represented terms are related to cell adhesion, cell communication and signal transduction, all of which are associated with proteins that play active roles during tissue development and tumour metastasis. Additional over-represented GO terms such as protein modification (e.g. post-translational modification (PTM)) and enzymatic activity were further investigated to see if any of the mutations result in loss of a functional PTM motif or of active or binding sites of proteins.
Based on our previous analysis results on the effects of variation on active sites of proteins and N-linked glycosylation (NLG) sites [26, 27, 29], we here provide additional details on these two important functional sites. Out of the four proteins whose active site is disrupted (protein arginine N-methyltransferase 6 (PRMT6), chymase (CMA1), kallikrein-5 (KLK5), sphingomyelin phosphodiesterase 3 (SMPD3)), two have it disrupted in both cases and controls. PRMT6 active site variation is detected in 3 samples (one case and two controls from one patient, TCGA-BH-A0DK); CMA1 variation is detected in two case samples from the same patient (TCGA-A7-A0DB-01A-11D-A272-09, TCGA-A7-A0DB-01C-02D-A272-09); KLK5 variation exists in 75% of the samples (16 case samples and 25 control samples); SMPD3 variation exists in only one patient sample (TCGA-A7-A0DB-01C-02D-A272-09). KLK5 and its trypsin-like serine protease paralogs have been associated with several cancers. The KLK5 active site variation, which possibly deactivates the enzyme, is found in a high percentage of samples (both cases and controls); this is most likely possible because one of the other paralogs (there are 15 members in the Kallikrein subfamily according to UniProtKB/Swiss-Prot) might be able to compensate for the loss of activity of one of its members. For CMA1 and SMPD3, the active site disruptions are found only in cases. Chymases are known to convert angiotensin I to angiotensin II, and an influence of angiotensin I-converting enzyme gene polymorphisms on gastric cancer risk has been proposed before [75, 76]. Finding this mutation in a breast cancer tumor provides a novel target for further investigation of the role of this gene in breast cancer. Similarly, SMPD3 is known to catalyze the hydrolysis of sphingomyelin to form ceramide and phosphocholine, and ceramide mediates cellular functions such as apoptosis and growth arrest [20, 77]. Mutations in SMPD3 have implicated the ceramide pathway in human leukemias.
Identification of loss of function mutation of SMPD3 due to a mutation in the active site of the enzyme in breast cancer tumor cells provides for the first time a potential association of this gene with breast cancer.
The asparagine-X-serine/threonine (NXS/T) motif, where X is any amino acid except proline, is the consensus motif for NLG. Therefore, mutations in this motif can lead to loss of NLG. Previously, we have shown through proteome-wide analysis how germline mutations can result in loss-of-glycosylation (LOG) . In analyzing the somatic mutations of breast cancer patients we find 56 such LOG mutations in cases and 64 LOGs in controls (Additional file 1: Table S1). Out of these LOG sites 5 are unique to case samples (P08F94, Fibrocystin/PKHD1, position 830; P15151, Poliovirus receptor/PVR, position 122; P40126, L-dopachrome tautomerase/DCT, position 170; P52797, Ephrin-A3/EFNA3, position 102; P58170, Olfactory receptor 1D5/OR1D5, position 7; Q9NYQ6, Cadherin EGF LAG seven-pass G-type receptor 1/CELSR1, position 1289. All of these mutations appear to be novel except for Fibrocystin N- > S mutation at position 830 which amongst other mutations has been implicated in polycystic kidney disease [79, 80] but not cancer.
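The motif rule stated above lends itself to a compact check. The sketch below scans a protein sequence for NXS/T sequons and reports which ones are destroyed by a single amino acid substitution. It is an illustration only: the analysis in this study relies on annotated glycosylation sites, whereas this example works from the sequence motif alone, and the function names are hypothetical.

```python
import re

# N, then any residue except P, then S or T; the lookahead finds overlapping sequons.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(sequence):
    """Return the 1-based positions of the asparagine in every NXS/T sequon."""
    return {m.start() + 1 for m in SEQUON.finditer(sequence)}


def lost_glycosylation_sites(sequence, position, new_residue):
    """Sequons present in the original sequence but absent after the substitution.

    position is 1-based; new_residue replaces the amino acid at that position.
    """
    mutated = sequence[:position - 1] + new_residue + sequence[position:]
    return sorted(sequon_positions(sequence) - sequon_positions(mutated))


if __name__ == "__main__":
    protein = "MKNVSA"                                # toy sequence, sequon at N3
    print(lost_glycosylation_sites(protein, 3, "S"))  # N->S removes the sequon
    print(lost_glycosylation_sites(protein, 5, "T"))  # S->T keeps it
```

Note that a substitution at the middle position only abolishes the motif when the new residue is proline, which the pattern captures through `[^P]`.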
It is our hypothesis that phylogenetic analysis of SNVs can provide a better way of characterizing control and case samples. SNV-based characterization methods have been used before [81, 82], but currently there are no phylogenetic analysis tools that directly generate trees from human SNVs identified in a next-generation sequence analysis pipeline. As described in Materials and Methods, we have developed phyloSNP, which generates SNV-shrunk genome alignments that can be used to create phylogenetic trees with existing tree-building software such as MEGA and Clustal.
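The idea of an SNV-shrunk alignment can be sketched as follows. This is an illustration only and not the phyloSNP implementation, whose details are given in Materials and Methods; the function names, sample names, positions and bases below are all hypothetical. The sketch assumes that each sample is represented by its SNV calls, keeps only positions that vary in at least one sample, and fills in the reference base wherever a sample has no call.

```python
def snv_shrunk_alignment(reference, sample_snvs):
    """Build one aligned row per sample containing only variable columns.

    reference   : dict mapping position -> reference base
    sample_snvs : dict mapping sample name -> {position: alternate base}
    """
    # Keep every position that is variant in at least one sample.
    positions = sorted({p for snvs in sample_snvs.values() for p in snvs})
    rows = {}
    for sample, snvs in sample_snvs.items():
        # Use the sample's alternate base if called, else the reference base.
        rows[sample] = "".join(snvs.get(p, reference[p]) for p in positions)
    return positions, rows


def to_fasta(rows):
    """Serialize the aligned rows as FASTA text for a tree-building tool."""
    return "".join(f">{name}\n{seq}\n" for name, seq in rows.items())


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    reference = {101: "A", 250: "G", 300: "T"}
    samples = {
        "case_1": {101: "C", 300: "G"},
        "control_1": {250: "A"},
    }
    positions, rows = snv_shrunk_alignment(reference, samples)
    print(positions)
    print(to_fasta(rows), end="")
```

Because every row has the same length and column order, the resulting FASTA text is already an alignment and can be passed to tree-building software without a separate alignment step.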
Web interface and usage
The CSR metadata home page (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr) contains all available data related to this study. RefSeq gene and protein accession numbers can be used to retrieve results for single genes, and SNV downloads are also available from the website. Variation results for single proteins are integrated with variation data from dbSNP, COSMIC and UniProt, making it easy to identify variations that are sample specific and to see whether they overlap with any known variations. Additionally, the sample pages link to the cBio cancer genomics portal, which provides additional information on the TCGA samples. Annotation data from TCGA data portal files are manually checked, combined with computationally analyzed mapping and SNV data, and entered into the CSR database. Clicking on ‘Reviewed short read data’ brings the user to the ‘Curated SRA Browsing Interface’, which contains headers linked to ‘study’, ‘experiment’ and ‘sample’. Clicking on each of these links takes the user to the respective curated datasets, which can be browsed. The study, experiment and sample records are hierarchical. Whenever possible, identifiers are inherited from the primary database (in this case TCGA) to ensure easy tracking of data. For the experiments, new identifiers were created to indicate whether the samples are cases or controls.
The approach described here can be used by others to analyze and document variations from other individuals available from TCGA and from the many other datasets available via dbGaP, thereby providing a better view of the human variome. It is expected that this type of bioinformatic analysis, in conjunction with phenotypic information, will allow researchers to correlate variation information with functional changes. An additional immediate impact of the CSR database will be access to variation data from individuals for specific genes. For example, a user studying the NM_001099771.2 gene (UniProt accession A5A3E0; POTE ankyrin domain family member F) might be interested in knowing what variations are present in this gene. A survey of variations for this gene in dbSNP shows that there are 186 known SNPs. From our analysis we find an additional 22 SNVs, one of which is an nsSNV that results in the loss of a phosphorylation site (Y|F mutation at amino acid position 918). The protein is known to be expressed in breast cancer cell lines. In studies such as these, if a rare nsSNV is observed in a critical functional site, one would most likely try to estimate its impact through additional experimentation. For example, where the variation affects the active site of an enzyme, a follow-up study could assay for buildup of substrate or a decrease in the level of product. If the nsSNV affects a PTM site such as an N-linked glycosylation site, additional analytical studies would probably be needed to evaluate the impact. System-level impact of nsSNVs and additional network analysis can also be applied to better understand the personal genome [88–90]. For additional proteome-wide analysis of the impact of nsSNVs one can use SNVDis, as described in earlier sections.
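The comparison against dbSNP described above amounts to separating called SNVs into those already catalogued and those that are not. A minimal Python sketch of that step follows; it assumes that variants are matched on chromosome, position, reference allele and alternate allele, which is an assumption made for illustration and not a description of how CSR performs the comparison. The `SNV` type, the function name and all coordinates are hypothetical.

```python
from typing import NamedTuple


class SNV(NamedTuple):
    chrom: str
    pos: int   # 1-based genomic coordinate
    ref: str   # reference allele
    alt: str   # alternate allele


def split_known_novel(called, known):
    """Partition called SNVs into catalogued and novel lists,
    preserving the order of the calls."""
    known_set = set(known)
    catalogued = [v for v in called if v in known_set]
    novel = [v for v in called if v not in known_set]
    return catalogued, novel


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    known = [SNV("chr2", 1000, "A", "G"), SNV("chr2", 1500, "C", "T")]
    called = [SNV("chr2", 1000, "A", "G"), SNV("chr2", 1200, "T", "C")]
    catalogued, novel = split_known_novel(called, known)
    print(len(catalogued), "catalogued;", len(novel), "novel")
```

Matching on all four fields means that a different alternate allele at a known position is still reported as novel; a position-only comparison would be a looser design choice.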
CloudBioLinux (http://www.cloudbiolinux.org) is a bioinformatics Virtual Machine (VM) that is implemented to run on Amazon EC2, on the open-source private Cloud platform Eucalyptus (http://open.eucalyptus.com/), and on the desktop using VirtualBox (http://www.virtualbox.org). A VM is a fully featured UNIX server in the form of a single downloadable binary file that executes on Clouds and desktop virtualization platforms. Cloud BioLinux includes a full Ubuntu Linux (http://www.ubuntu.com) and the Galaxy bioinformatics workbench interface, and users can start the VMs with a few clicks on the Amazon EC2 Cloud without the need for any advanced technical knowledge. The Cloud BioLinux VM includes a suite of bioinformatic programming libraries for R, Perl, Ruby, and Python in addition to more than 100 pre-configured bioinformatics tools including BLAST, Glimmer, HMMER, PHYLIP, RasMol, Genespring, Clustalw, and the EMBOSS analysis suite. Because the VM is accessible on the Amazon Cloud, it provides researchers with a large-scale, virtualized informatics infrastructure without the financial or time burden of owning and maintaining hardware. It can therefore democratize access to computational resources for smaller laboratories that use NGS or other high-throughput genomics technologies for biological experiments.
We leveraged the software libraries and tools available in the Cloud BioLinux VM to install and run our pipeline with minimal effort, as most dependencies were already available inside the VM. Furthermore, the VM format allows us to distribute the pipelines pre-configured and ready to execute in a single binary VM file that users can download and run on their desktop computers. This allows other researchers in the community to use our pipelines for their data analysis needs without spending time installing bioinformatics tools or software libraries to satisfy the dependencies of the pipeline. The VM is available for download at (https://s3.amazonaws.com/cloudbiolinuxvms/cloudbiolinuxsra/cloudbiolinuxsra.ova) and users can boot it on their desktop following the instructions on the VirtualBox website (http://www.virtualbox.org/manual/ch01.html#ovf), while technical help is available through the Cloud BioLinux user group forum (https://groups.google.com/forum/?fromgroups#!forum/cloudbiolinux).
This workflow takes an SRA file and a reference sequence and calculates the coverage and SNVs of the reads when aligned against the reference sequence. The program is run from the command line using Perl and accepts as its parameters the base name and location of the reference index file, the name and location of the short read file, the name and location of the reference, and the output directory. The output includes a file containing putative SNPs, sequence coverage, and alignment statistics. The pipeline was run with a test input dataset from the NCBI-SRA (SRR052047.sra) and sample Bowtie2 indexes of the reference genome. The total run-time on a Cloud BioLinux Virtual Machine (VM, see Methods for more details) that used one CPU core out of four and two gigabytes (GB) of memory on a laptop computer was approximately fifteen minutes.
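To make the four parameters concrete, the following Python sketch assembles a plausible sequence of commands for such a workflow. It is not the Perl program distributed in the VM, whose internals are not shown here. Bowtie2 indexes are named in the text and SAMtools appears in the reference list; the use of `fastq-dump` for SRA conversion and of `bcftools` for variant calling, the output file names, and the function name `build_commands` are all assumptions made for illustration, written against recent versions of those tools. The function only builds the commands and does not execute them.

```python
import shlex
from pathlib import Path


def build_commands(index_base, sra_file, reference_fasta, out_dir):
    """Return the command lines for a single-end align-and-call run.

    All output file names below are hypothetical choices.
    """
    out = Path(out_dir)
    fastq = out / (Path(sra_file).stem + ".fastq")
    sam = out / "aligned.sam"
    bam = out / "aligned.sorted.bam"
    pileup = out / "pileup.vcf"
    vcf = out / "putative_snvs.vcf"
    return [
        # Convert the SRA archive to FASTQ (assumed SRA Toolkit step).
        ["fastq-dump", "--outdir", str(out), str(sra_file)],
        # Align unpaired reads against the Bowtie2 index.
        ["bowtie2", "-x", str(index_base), "-U", str(fastq), "-S", str(sam)],
        # Sort and index the alignments.
        ["samtools", "sort", "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
        # Per-base coverage and alignment statistics (written to stdout).
        ["samtools", "depth", str(bam)],
        ["samtools", "flagstat", str(bam)],
        # Pile up against the reference, then call variant sites only.
        ["bcftools", "mpileup", "-f", str(reference_fasta), "-o", str(pileup), str(bam)],
        ["bcftools", "call", "-mv", "-o", str(vcf), str(pileup)],
    ]


if __name__ == "__main__":
    # Hypothetical paths; SRR052047.sra is the test dataset named in the text.
    for cmd in build_commands("index/ref", "SRR052047.sra", "ref.fa", "out"):
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems if the commands are later passed to `subprocess.run`; the `samtools depth` and `samtools flagstat` steps print to standard output, so a caller would need to capture or redirect them.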
In this study we developed a workflow that involves identification and analysis of nsSNVs and curation of the metadata associated with TCGA samples. This information is available for browsing and download from the CSR database, which we plan to update continuously with representative datasets from all types of cancer, with an initial focus on curating data from patients who have both exome and RNA-sequencing data with matched cases and controls. We consider these datasets important; currently there are 615 such patients with all the data and sample types mentioned above. We plan to adhere to the evolving Human Variome Project recommendations and guidelines in terms of data formats and sharing. We also provide a CloudBioLinux and proteome-wide analysis platform to allow users to analyze NGS data in their research or biocuration pipelines. It is our belief that as more datasets are curated by our group and others, we will gain a better understanding of the variability in human populations. One limitation to adopting the workflow proposed here is that individual researchers may lack the resources and expertise to perform next-generation sequence analysis. To overcome this limitation, our group, in collaboration with the US Food and Drug Administration, has been implementing the High-performance Integrated Virtual Environment (HIVE), which provides novel and known sequence read mapping and variation calling algorithms in a highly parallelized environment. It is our goal to provide both an enterprise-level HIVE for institutions and a HIVE-in-a-box for individual users, thereby democratizing the ability of scientists to work on Big Data.
CCDS: Consensus coding sequence
CSR: Curated short reads
TCGA: The cancer genome atlas
nsSNV: Non-synonymous single-nucleotide variation.
We want to thank P Satti and JH Yu for help with database and interface development. We thank the TCGA tumor-specific groups for providing the data. We also thank Robert Foreman and Garrett Fields for providing HIVE system support. All computations were performed at the High-performance Integrated Virtual Environment (HIVE) located at The George Washington University and implemented/co-developed by Drs. Raja Mazumder and Vahan Simonyan. This project is supported in part by U01 CA168926 and the Research Participation Program at the Center for Biologics Evaluation and Research administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration.
- Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F, Eerola I, Gerhard DS, et al: International network of cancer genome projects. Nature. 2010, 464 (7291): 993-998. 10.1038/nature08987.
- NCI-TCGA: The Cancer Genome Atlas [TCGA]. 2012, http://cancergenome.nih.gov
- Boehm JS, Hahn WC: Towards systematic functional characterization of cancer genomes. Nat Rev Genet. 2011, 12 (7): 487-498. 10.1038/nrg3013.
- ICGC: International Cancer Genome Consortium. 2012, http://www.icgc.org/
- Eswaran J, Gupta S, Dutt A, Toi M, Pillai M, Costa L, Knapp S, Badwe R, R K: The global cancer genomics consortium: interfacing genomics and cancer medicine. Cancer Res. 2012, 72 (15): 3720-3724.
- Srivastava S: The early detection research network: 10-year outlook. Clin Chem. 2013, 59 (1): 60-67. 10.1373/clinchem.2012.184697.
- TCGA: TCGA Data Primer. 2012, https://wiki.nci.nih.gov/display/TCGA/TCGA+Data+Primer
- Deus HF, Veiga DF, Freire PR, Weinstein JN, Mills GB, Almeida JS: Exposing the cancer genome atlas as a SPARQL endpoint. J Biomed Inform. 2010, 43 (6): 998-1008. 10.1016/j.jbi.2010.09.004.
- Schroeder MP, Gonzalez-Perez A, Lopez-Bigas N: Visualizing multidimensional cancer genomics data. Genome medicine. 2013, 5 (1): 9. 10.1186/gm413.
- DE R, Gruneberg A, HF D, MM T, JS A: A self-updating roadmap of the cancer genome atlas. Bioinformatics. 2013, in press.
- Patrinos GP, Smith TD, Howard H, Al-Mulla F, Chouchane L, Hadjisavvas A, Hamed SA, Li XT, Marafie M, Ramesar RS, et al: Human variome project country nodes: documenting genetic information within a country. Hum Mutat. 2012, 33 (11): 1513-1519. 10.1002/humu.22147.
- Kohonen-Corish MR, Smith TD, Robinson HM, delegates of the 4th Biennial Meeting of the Human Variome Project Consortium: Beyond the genomics blueprint: the 4th human variome project meeting, UNESCO, Paris, 2012. Genet Med. 2013, 15 (7): 507-12. 10.1038/gim.2012.174.
- Celli J, Dalgleish R, Vihinen M, Taschner PE, den Dunnen JT: Curating gene variant databases (LSDBs): toward a universal standard. Hum Mutat. 2012, 33 (2): 291-297. 10.1002/humu.21626.
- Gaudet P, Arighi C, Bastian F, Bateman A, Blake JA, Cherry MJ, D’Eustachio P, Finn R, Giglio M, Hirschman L, et al: Recent advances in biocuration: meeting report from the fifth international biocuration conference. Database (Oxford). 2012, 2012: bas036.
- Gaudet P, Mazumder R: Biocuration virtual issue 2012. Database (Oxford). 2012, 2012: bas011.
- Burge S, Attwood TK, Bateman A, Berardini TZ, Cherry M, O’Donovan C, Xenarios L, Gaudet P: Biocurators and biocuration: surveying the 21st century challenges. Database (Oxford). 2012, 2012: bar059.
- Kodama Y, Shumway M, Leinonen R: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40 (Database issue): D54-D56.
- CGHub: The Cancer Genomics Hub. 2013, https://cghub.ucsc.edu/
- Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40 (Database issue): D130-D135.
- UniProt_Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2013, 40 (Database issue): D71-D75.
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the national center for biotechnology information. Nucleic Acids Res. 2011, 39 (Database issue): D38-D51.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (Database issue): D290-D301.
- Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, et al: PIRSF: family classification system at the protein information resource. Nucleic Acids Res. 2004, 32 (Database issue): D112-D114.
- Tanabe M, Kanehisa M: Using the KEGG database resource. Curr Protoc Bioinformatics. 2012, Chapter 1: Unit1 12.
- Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, Kok CY, Jia M, Ewing R, Menzies A, et al: COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res. 2010, 38 (Database issue): D652-D657.
- Dingerdissen H, Motwani M, Karagiannis K, Simonyan V, Mazumder R: Proteome-wide analysis of non-synonymous single-nucleotide variations in active sites of human proteins. FEBS J. 2013, 280 (6): 1542-62. 10.1111/febs.12155.
- Mazumder R, Morampudi KS, Motwani M, Vasudevan S, Goldman R: Proteome-wide analysis of single-nucleotide variations in the N-glycosylation sequon of human genes. PloS one. 2012, 7 (5): e36212. 10.1371/journal.pone.0036212.
- Karagiannis K, Simonyan V, Mazumder R: SNVDis: a proteome-wide analysis service for evaluating nsSNVs in protein functional sites and pathways. Genomics Proteomics Bioinformatics. 2013, 11 (2): 122-6. 10.1016/j.gpb.2012.10.003.
- Lam PV, Goldman R, Karagiannis K, Narsule T, Simonyan V, Soika V, Mazumder R: Structure-based comparative analysis and prediction of N-linked glycosylation sites in evolutionarily distant eukaryotes. Genomics Proteomics Bioinformatics. 2013, 11 (2): 96-104. 10.1016/j.gpb.2012.11.003.
- Satti P, Simonyan V, Mazumder R: Storage and biocuration of extra-large (XL) data sets from next-generation sequencing technologies. 5th International Biocuration Conference: April 2–4. 2012, Washington DC.
- Afgan E, Chapman B, Jadan M, Franke V, Taylor J: Using cloud computing infrastructure with CloudBioLinux, CloudMan, and Galaxy. Curr Protoc Bioinformatics. 2012, Chapter 11: Unit11 19.
- Huang H, McGarvey PB, Suzek BE, Mazumder R, Zhang J, Chen Y, Wu CH: A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics. 2011, 27 (8): 1190-1191. 10.1093/bioinformatics/btr101.
- Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, et al: CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 2003, 31 (1): 383-387. 10.1093/nar/gkg087.
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009, 19 (7): 1316-1323. 10.1101/gr.080531.108.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34 (Database issue): D622-D627.
- R_Development_Core_Team: A Language and Environment for Statistical Computing. 2005, Vienna, Austria: R Foundation for Statistical Computing.
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 21: 2947-2948.
- Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci. 1996, 12 (4): 357-358.
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250.
- National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
- Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD: PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the gene ontology consortium. Nucleic Acids Res. 2010, 38 (Database issue): D204-D210.
- Mi H, Muruganujan A, Casagrande JT, Thomas PD: Large-scale gene function analysis with the PANTHER classification system. Nature protocols. 2013, 8 (8): 1551-1566. 10.1038/nprot.2013.092.
- Collins FS, Guyer MS, Charkravarti A: Variations on a theme: cataloging human DNA sequence variation. Science. 1997, 278 (5343): 1580-1581. 10.1126/science.278.5343.1580.
- Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273 (5281): 1516-1517. 10.1126/science.273.5281.1516.
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407 (6803): 513-516. 10.1038/35035083.
- Shu XO, Long J, Lu W, Li C, Chen WY, Delahanty R, Cheng J, Cai H, Zheng Y, Shi J, et al: Novel genetic markers of breast cancer survival identified by a genome-wide association study. Cancer Res. 2012, 72 (5): 1182-1189. 10.1158/0008-5472.CAN-11-2561.
- Penney KL, Schumacher FR, Kraft P, Mucci LA, Sesso HD, Ma J, Niu Y, Cheong JK, Hunter DJ, Stampfer MJ, et al: Association of KLK3 (PSA) genetic variants with prostate cancer risk and PSA levels. Carcinogenesis. 2011, 32 (6): 853-859. 10.1093/carcin/bgr050.
- Negm RS, Verma M, Srivastava S: The promise of biomarkers in cancer screening and detection. Trends Mol Med. 2002, 8 (6): 288-293. 10.1016/S1471-4914(02)02353-5.
- Diamandis M, White NM, Yousef GM: Personalized medicine: marking a new epoch in cancer patient management. Mol Cancer Res. 2010, 8 (9): 1175-1187. 10.1158/1541-7786.MCR-10-0264.
- Garnett MJ, Edelman EJ, Heidorn SJ, Greenman CD, Dastur A, Lau KW, Greninger P, Thompson IR, Luo X, Soares J, et al: Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012, 483 (7391): 570-575. 10.1038/nature11005.
- Begum F, Ghosh D, Tseng GC, Feingold E: Comprehensive literature review and statistical considerations for GWAS meta-analysis. Nucleic Acids Res. 2012, 40 (9): 3777-3784. 10.1093/nar/gkr1255.
- Ng PC, Henikoff S: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31 (13): 3812-3814. 10.1093/nar/gkg509.
- McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F: Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010, 26 (16): 2069-2070. 10.1093/bioinformatics/btq330.
- Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007, 35 (11): 3823-3835. 10.1093/nar/gkm238.
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30 (17): 3894-3900. 10.1093/nar/gkf493.
- Konstantinos K, Simonyan V, Goldman R, Mazumder R: SNVDis: a proteome-wide analysis service for evaluating nsSNVs in protein functional sites and pathways. 2nd Annual Beyond The Genome Conference: September 19-22. 2011, Washington DC.
- Kumar A, White TA, MacKenzie AP, Clegg N, Lee C, Dumpit RF, Coleman I, Ng SB, Salipante SJ, Rieder MJ, et al: Exome sequencing identifies a spectrum of mutation frequencies in advanced and lethal prostate cancers. Proc Natl Acad Sci USA. 2011, 108 (41): 17087-17092. 10.1073/pnas.1108745108.
- McEntyre J, Lipman D: PubMed: bridging the information gap. CMAJ. 2001, 164 (9): 1317-1319.
- Liu Q, Guo Y, Li J, Long J, Zhang B, Shyr Y: Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics. 2012, 13 (Suppl 8): S8.
- Abaan OD, Polley EC, Davis SR, Zhu YJ, Bilke S, Walker RL, Pineda M, Gindin Y, Jiang Y, Reinhold WC, et al: The exomes of the NCI-60 panel: a genomic resource for cancer biology and systems pharmacology. Cancer Res. 2013, 73 (14): 4372-4382. 10.1158/0008-5472.CAN-12-3342.
- Tanner S, Shen Z, Ng J, Florea L, Guigo R, Briggs SP, Bafna V: Improving gene annotation using peptide mass spectrometry. Genome Res. 2007, 17 (2): 231-239. 10.1101/gr.5646507.
- Lam HY, Clark MJ, Chen R, Natsoulis G, O’Huallachain M, Dewey FE, Habegger L, Ashley EA, Gerstein MB, Butte AJ, et al: Performance comparison of whole-genome sequencing platforms. Nat Biotechnol. 2012, 30 (1): 78-82.
- Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, et al: Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008, 321 (5897): 1801-1806. 10.1126/science.1164368.
- Huang L, Grammatikakis N, Yoneda M, Banerjee SD, Toole BP: Molecular characterization of a novel intracellular hyaluronan-binding protein. J Biol Chem. 2000, 275 (38): 29829-29839. 10.1074/jbc.M002737200.
- Juliano RL: Integrin signals and tumor growth control. Princess Takamatsu Symp. 1994, 24: 118-124.
- Arthur WT, Noren NK, Burridge K: Regulation of Rho family GTPases by cell-cell and cell-matrix adhesion. Biol Res. 2002, 35 (2): 239-246.
- Fukata M, Kaibuchi K: Rho-family GTPases in cadherin-mediated cell-cell adhesion. Nat Rev Mol Cell Biol. 2001, 2 (12): 887-897. 10.1038/35103068.
- Gruber AD, Pauli BU: Tumorigenicity of human breast cancer is associated with loss of the Ca2+-activated chloride channel CLCA2. Cancer Res. 1999, 59 (21): 5488-5491.
- Koo BH, Hurskainen T, Mielke K, Aung PP, Casey G, Autio-Harmainen H, Apte SS: ADAMTSL3/punctin-2, a gene frequently mutated in colorectal tumors, is widely expressed in normal and malignant epithelial cells, vascular endothelial cells and other cell types, and its mRNA is reduced in colon cancer. International journal of cancer Journal international du cancer. 2007, 121 (8): 1710-1716. 10.1002/ijc.22882.
- Ranney MK, Ahmed IS, Potts KR, Craven RJ: Multiple pathways regulating the anti-apoptotic protein clusterin in breast cancer. Biochim Biophys Acta. 2007, 1772 (9): 1103-1111. 10.1016/j.bbadis.2007.06.004.
- Patsialou A, Wyckoff J, Wang Y, Goswami S, Stanley ER, Condeelis JS: Invasion of human breast cancer cells in vivo requires both paracrine and autocrine loops involving the colony-stimulating factor-1 receptor. Cancer Res. 2009, 69 (24): 9498-9506. 10.1158/0008-5472.CAN-09-1868.
- Batra J, O’Mara T, Patnala R, Lose F, Clements JA: Genetic polymorphisms in the human tissue kallikrein (KLK) locus and their implication in various malignant and non-malignant diseases. Biol Chem. 2012, 393 (12): 1365-1390.
- Sugimoto M, Furuta T, Shirai N, Ikuma M, Sugimura H, Hishida A: Influences of chymase and angiotensin I-converting enzyme gene polymorphisms on gastric cancer risks in Japan. Cancer Epidemiol Biomarkers Prev. 2006, 15 (10): 1929-1934. 10.1158/1055-9965.EPI-06-0339.
- Zhang Y, He J, Deng Y, Zhang J, Li X, Xiang Z, Huang H, Tian C, Huang J, Fan H: The insertion/deletion (I/D) polymorphism in the Angiotensin-converting enzyme gene and cancer risk: a meta-analysis. BMC Med Genet. 2011, 12: 159. 10.1186/1471-2350-12-159.
- Marchesini N, Osta W, Bielawski J, Luberto C, Obeid LM, Hannun YA: Role for mammalian neutral sphingomyelinase 2 in confluence-induced growth arrest of MCF7 cells. J Biol Chem. 2004, 279 (24): 25101-25111. 10.1074/jbc.M313662200.
- Kim WJ, Okimoto RA, Purton LE, Goodwin M, Haserlat SM, Dayyani F, Sweetser DA, McClatchey AI, Bernard OA, Look AT, et al: Mutations in the neutral sphingomyelinase gene SMPD3 implicate the ceramide pathway in human leukemias. Blood. 2008, 111 (9): 4716-4722. 10.1182/blood-2007-10-113068.
- Bergmann C, Senderek J, Sedlacek B, Pegiazoglou I, Puglia P, Eggermann T, Rudnik-Schoneborn S, Furu L, Onuchic LF, De Baca M, et al: Spectrum of mutations in the gene for autosomal recessive polycystic kidney disease (ARPKD/PKHD1). J Am Soc Nephrol. 2003, 14 (1): 76-89. 10.1097/01.ASN.0000039578.55705.6E.
- Furu L, Onuchic LF, Gharavi A, Hou X, Esquivel EL, Nagasawa Y, Bergmann C, Senderek J, Avner E, Zerres K, et al: Milder presentation of recessive polycystic kidney disease requires presence of amino acid substitution mutations. J Am Soc Nephrol. 2003, 14 (8): 2004-2014. 10.1097/01.ASN.0000078805.87038.05.
- Pandya GA, Holmes MH, Petersen JM, Pradhan S, Karamycheva SA, Wolcott MJ, Molins C, Jones M, Schriefer ME, Fleischmann RD, et al: Whole genome single nucleotide polymorphism based phylogeny of Francisella tularensis and its application to the development of a strain typing assay. BMC Microbiol. 2009, 9: 213. 10.1186/1471-2180-9-213.
- Van Geystelen A, Decorte R, Larmuseau MH: AMY-tree: an algorithm to use whole genome SNP calling for Y chromosomal phylogenetic applications. BMC Genomics. 2013, 14: 101. 10.1186/1471-2164-14-101.
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.
- Kaufman JS, Cooper RS: Commentary: considerations for use of racial/ethnic classification in etiologic research. Am J Epidemiol. 2001, 154 (4): 291-298. 10.1093/aje/154.4.291.
- Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, et al: The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery. 2012, 2 (5): 401-404. 10.1158/2159-8290.CD-12-0095.
- Lee Y, Ise T, Ha D, Saint Fleur A, Hahn Y, Liu XF, Nagata S, Lee B, Bera TK, Pastan I: Evolution and expression of chimeric POTE-actin genes in the human genome. Proc Natl Acad Sci USA. 2006, 103 (47): 17885-17890. 10.1073/pnas.0608344103.
- Nakagawa H, Wakabayashi-Nakao K, Tamura A, Toyoda Y, Koshiba S, Ishikawa T: Disruption of N-linked glycosylation enhances ubiquitin-mediated proteasomal degradation of the human ATP-binding cassette transporter ABCG2. FEBS J. 2009, 276 (24): 7237-7252. 10.1111/j.1742-4658.2009.07423.x.
- Khurana E, Fu Y, Chen J, Gerstein M: Interpretation of genomic variants using a unified biological network approach. PLoS Comput Biol. 2013, 9 (3): e1002886. 10.1371/journal.pcbi.1002886.
- Kamphans T, Krawitz PM: GeneTalk: an expert exchange platform for assessing rare sequence variants in personal genomes. Bioinformatics. 2012, 28 (19): 2515-2516. 10.1093/bioinformatics/bts462.
- Capriotti E, Nehrt NL, Kann MG, Bromberg Y: Bioinformatics for personal genome interpretation. Brief Bioinform. 2012, 13 (4): 495-512. 10.1093/bib/bbr070.
- Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11 (8): R86. 10.1186/gb-2010-11-8-r86.
- Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J: Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics. 2010, 11 (Suppl 12): S4. 10.1186/1471-2105-11-S12-S4.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.