In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *600185), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer. The OMIM entries for all three genes mention leukemia, melanoma, breast cancer, colorectal cancer, pancreatic cancer, stomach cancer, ovarian cancer and prostate cancer. The entry for BRCA2 also mentions cancer of the brain, larynx, cervix, uterus, thyroid and kidney. The entry for BRAF also mentions lymphoma and cancer of the lung, bladder, testes, cervix and uterus. The entry for CDKN2A also mentions lymphoma and cancer of the lung, bladder, brain, esophagus and kidney. Each gene defines a large group of related cancers.
The numbers of genes associated with each pair of cancer types are summarized in the siteXsite matrix (Figure 1). Diagonal cells in the matrix contain the total number of genes identified for each cancer type; off-diagonal cells contain the number of genes identified for both the row and the column cancer types. For example, there were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. The cancer mentioned by the greatest number of OMIM entries was leukemia, and the greatest number of OMIM gene entries that mention a combination of two cancers was 143, for lymphoma and leukemia. For some pairs of cancer sites, no genes were identified.
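As an illustration of how such a matrix can be tallied, the following is a minimal sketch that uses only the three genes and the cancer types listed above as input. The function name `site_by_site`, the short site labels and the dictionary layout are assumptions made for this example; they are not taken from CGMIM itself.

```python
from itertools import combinations_with_replacement

# Cancer types mentioned in the OMIM entries of three genes, as listed in the text.
# The short site labels are illustrative, not CGMIM's own vocabulary.
SHARED = {"leukemia", "melanoma", "breast", "colorectal",
          "pancreas", "stomach", "ovary", "prostate"}
gene_sites = {
    "BRCA2": SHARED | {"brain", "larynx", "cervix", "uterus", "thyroid", "kidney"},
    "BRAF": SHARED | {"lymphoma", "lung", "bladder", "testes", "cervix", "uterus"},
    "CDKN2A": SHARED | {"lymphoma", "lung", "bladder", "brain", "esophagus", "kidney"},
}


def site_by_site(gene_sites):
    """Count genes per cancer site (diagonal) and per pair of sites (off-diagonal)."""
    sites = sorted(set().union(*gene_sites.values()))
    matrix = {a: {b: 0 for b in sites} for a in sites}
    for mentioned in gene_sites.values():
        # Each gene contributes once to every site it mentions and
        # once to every unordered pair of sites it mentions.
        for a, b in combinations_with_replacement(sorted(mentioned), 2):
            matrix[a][b] += 1
            if a != b:
                matrix[b][a] += 1  # keep the matrix symmetric
    return sites, matrix


sites, matrix = site_by_site(gene_sites)
print(matrix["leukemia"]["leukemia"])  # genes that mention leukemia
print(matrix["lymphoma"]["lung"])      # genes that mention both lymphoma and lung
print(matrix["larynx"]["esophagus"])   # a pair of sites with no gene in common
```

With only these three genes as input, the diagonal cell for leukemia is 3, the lymphoma and lung cell is 2 (BRAF and CDKN2A), and the larynx and esophagus cell is 0, which mirrors the empty cells noted for some pairs of sites.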
The numbers in the off-diagonal cells depend on the number of genes related to the individual cancers. Based on the number of OMIM entries that mention leukemia and lymphoma individually, the number expected to mention both is 98.3, and the ratio of the observed to expected values is 1.5 (95% CI 1.3–1.7). (In equation (1), G_LEUKEMIA = 643, G_LYMPHOMA = 297 and N = 1943.) This indicates there are about 50% more genes related to both cancers than would be expected by chance. Table 1 lists the 20 pairs of cancer types for which the ratio of the observed to expected number of genes in the siteXsite matrix is greatest. The table indicates that fewer than three genes in OMIM should mention both cancer of the esophagus and cancer of the stomach by chance, but 21 entries mention both cancers. This more than seven-fold discrepancy suggests that cancers of the esophagus and stomach might be more closely related than the current literature indicates. Similar conclusions might be drawn for the other pairs of cancer types in Table 1.
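The arithmetic behind these figures can be checked with a short sketch. It assumes that equation (1) gives the expected count under independence, i.e. the product of the two single-site gene counts divided by N; that form is inferred from the reported value of 98.3 and is not restated in this section. The function names are illustrative, and the confidence interval is not reproduced because the method used to compute it is not described here.

```python
def expected_shared(g_a, g_b, n):
    """Expected number of genes mentioning both cancers if mentions were independent.

    Assumed form of equation (1): E = G_A * G_B / N.
    """
    return g_a * g_b / n


def observed_to_expected(observed, g_a, g_b, n):
    """Ratio of the observed number of shared genes to the expected number."""
    return observed / expected_shared(g_a, g_b, n)


N = 1943  # genes related to cancer identified by CGMIM

# Leukemia (643 genes) and lymphoma (297 genes), 143 genes mention both.
print(round(expected_shared(643, 297, N), 1))            # 98.3
print(round(observed_to_expected(143, 643, 297, N), 1))  # 1.5

# Esophagus (45 genes) and stomach (121 genes), 21 genes mention both.
print(round(expected_shared(45, 121, N), 1))             # 2.8
print(round(observed_to_expected(21, 45, 121, N), 1))    # 7.5
```

The second pair reproduces the statement that fewer than three genes would be expected to mention both esophageal and stomach cancer, and that the observed count is more than seven times that value.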
We randomly selected 25 genes related to cancer and manually reviewed the text of the corresponding OMIM entries. All of the entries correctly mentioned one or more types of cancer, but for 20% of those entries, one of the cancers was mentioned only in the context of evidence suggesting no association.
CGMIM can assist in designing effective studies of genetically-related cancers. CGMIM uses a high-quality database of genetic information to produce a summary of gene and cancer associations. A group of cancer types might be related by physical proximity in the body (e.g., prostate and bladder cancer), a shared physiologic function (e.g., cancers involving the digestive tract), a common exposure (e.g., cancers caused by air pollution) or a common genetic characteristic (e.g., cancers in tissues that express BRCA1). The identification of such groups becomes more difficult and time-consuming as the literature about genes and cancer expands, and efficient text-mining tools have increasing value.
Groups of cancers that share genetic factors are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents in several ways. First, grouping cancers will be especially useful if a group combines several cancers that are rare and difficult to study individually. Second, knowledge of genetic pathways might suggest an environmental factor associated with all of the cancers. For example, a grouping defined by a vitamin receptor gene would suggest vitamin intake as a possible environmental agent in the etiology of all of the cancers. Third, CGMIM will allow us to design studies that might extend gene-cancer associations to include cancers at other sites. The groups can also be used to identify cancers that should be considered together in a definition of family history, and in the selection of genetic tests that might be adopted for high-risk families. During development of CGMIM, we observed changes from one week to another in OMIM and in the cancer groups that CGMIM produced. This illustrates the need for a tool that can routinely perform the analysis, as opposed to a set of results based on the OMIM contents from a particular day.
OMIM is based on published material from the scientific literature. The number of genes identified by our program does not necessarily indicate the relatedness of two or more cancer types, but rather what is known about those cancers. This reflects what research has been funded, performed and published. There is more funding for certain types of cancer, there are more journals that address certain types of cancer, and there are more people studying certain types of cancer. Nonetheless, published information reflects our knowledge base, and the scientific literature is hence a valid basis for identifying cancer groups and genes for further study. The strength of the evidence also varies. In some cases, evidence about an association was based on studies of cell lines or non-human organisms. In other cases, evidence was based on anecdotal observations in a small number of people. Some associations were based on several independent studies that each involved hundreds of patients.
Some sentences in OMIM contain phrases such as "is not related to breast cancer". We could not create an algorithm that recognized all negative references without overlooking valid positive ones. Some OMIM entries report both negative and positive evidence of an association. These "mixed" entries are tallied as positive reports by CGMIM, consistent with our interest in positive associations. Other sentences in OMIM describe evidence of gene expression in both cancerous and normal tissue, e.g., "... has been shown to be expressed in breast cancer cells and prostate cells". Such sentences are incorrectly interpreted as mentions of prostate cancer. Manual review of OMIM indicated that a minority of apparent associations (about 20%) between a gene and a specific type of cancer were the result of negative evidence and are thus "false-positive" text-mining associations. We suggest that a manual review of OMIM associations always precede subsequent study design and analysis. We assume the 20% excess is included in every cell of the siteXsite matrix. Thus expected values also include the 20% excess, and the O/E ratios are not affected.
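The difficulty with negative references can be illustrated with a small sketch. It is not CGMIM's algorithm: the two regular expressions, the function names and the second example sentence are hypothetical and serve only to show why term matching counts a negated mention, and why a simple negation cue is suited to flagging sentences for manual review rather than to automatic exclusion.

```python
import re

# Hypothetical patterns for illustration only; CGMIM's actual term list is not shown here.
CANCER_TERM = re.compile(r"\b(breast|prostate)\s+cancer\b", re.IGNORECASE)
NEGATION_CUE = re.compile(r"\b(not|no|without|absence of)\b", re.IGNORECASE)


def naive_mentions(sentence):
    """Return the cancer sites that plain term matching would count for a sentence."""
    return sorted({m.group(1).lower() for m in CANCER_TERM.finditer(sentence)})


def needs_review(sentence):
    """Flag a sentence for manual review if it names a cancer and contains a negation cue."""
    return bool(CANCER_TERM.search(sentence)) and bool(NEGATION_CUE.search(sentence))


negative = "The gene is not related to breast cancer"
# Hypothetical sentence: a positive association that also contains a negation cue.
positive = "Expression was found in breast cancer but not in normal tissue"

print(naive_mentions(negative))  # the negated mention is still counted
print(needs_review(negative))    # flagged for manual review
print(needs_review(positive))    # also flagged, although the association is positive
```

Both sentences are flagged, so discarding every flagged sentence would remove the valid positive mention in the second one. This is the trade-off described above, and it is one reason to treat such flags as prompts for manual review.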
Other databases might be used as the basis for assessing scientific knowledge regarding genetic cancer groupings, but OMIM offers several advantages. OMIM is based on all publications in the PubMed database that are related to a specific human gene or trait. Results based on mining all of PubMed would be of interest, but would involve a much larger volume of literature and lack the expert review that is characteristic of OMIM. More specialized cancer groupings also might be created using computerized conference proceedings or journal contents. Likewise, a list of synonyms might be determined from other sources such as the UMLS (Unified Medical Language System) Specialist Lexicon of the National Library of Medicine. We used ICD-O terminology because it is the basis for most scientific writing on cancer.
This project used resources that have been developed by the US National Institutes of Health and Human Genome Project.[3] Our approach is exhaustive of the information reported in OMIM, will produce a computer algorithm for near-automatic updating of the review, and has the potential to be extended to other computerized databases. We will use CGMIM along with other criteria to guide the design of studies of genes and environment in cancer etiology.