Development of our prioritization model required a controlled vocabulary to describe phenotypes and knowledge of the semantic relationships between the vocabulary terms. Additionally, the genes that are to be ranked must be annotated with the appropriate vocabulary terms. The first two requirements are satisfied using an ontology that includes a shared vocabulary and defined relationships between vocabulary terms. Appropriate annotation of ontology terms to genes satisfies the last requirement. For the work presented here, we used the Human Phenotype Ontology (HPO) [26–28], which was specifically developed to represent human disease phenotypic abnormality knowledge. HPO concepts are organized in a strict hierarchy, where child terms assume “is-a” relationships with their ancestor terms. In addition to the ontology, the HPO resources include a gene-to-phenotype mapping that provides known associations between Entrez gene IDs [29, 30] and HPO terms. The true-path rule applies to all annotations, such that if a gene is annotated with a specific HPO term, it inherits all ancestors of that term. The specific files used for this work were hp.obo build 687, and genes-to-phenotype build 26. This HPO version contains 9,965 phenotype terms, of which 6,346 either directly or indirectly annotate at least one gene. The genes-to-phenotype file provides 61,784 direct annotations for 2,488 genes.
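The true-path rule described above can be sketched in a few lines; the parent map and term identifiers below are illustrative stand-ins, not values taken from the actual HPO files:

```python
# Hypothetical parent map: each term -> set of its direct parent terms.
PARENTS = {"HP:0000003": {"HP:0000002"}, "HP:0000002": {"HP:0000001"}}

def propagate(direct_terms, parents):
    """True-path rule: a gene annotated with a term inherits all of that
    term's ancestors (parents maps each term to its parent terms)."""
    inherited = set(direct_terms)
    frontier = list(direct_terms)
    while frontier:
        t = frontier.pop()
        for p in parents.get(t, ()):
            if p not in inherited:
                inherited.add(p)
                frontier.append(p)
    return inherited

# A gene directly annotated with HP:0000003 inherits both ancestors.
inherited = propagate({"HP:0000003"}, PARENTS)
```

Because HPO terms may have multiple parents, the sketch walks the parent map as a graph rather than a simple chain.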

### Semantic similarity

Our model is based on a concept known as semantic similarity. We assume that ontology terms are used to annotate a set of objects, and that the true-path rule applies for all term assignments. In this case, a given term will be more or less specific than others with respect to the number of objects it annotates. Consider two term sets, *P*_{1}* = {p*_{1}*, p*_{2}*, …, p*_{K}*}* and *P*_{2}* = {p*_{1}*, p*_{2}*, …, p*_{J}*}*, selected from the ontology. Given the semantic structure of the ontology and the specificity of the terms, we can compare the similarity, or likeness, of these sets with an appropriate semantic similarity model.

Our semantic similarity model requires a mathematical description of each ontology term’s specificity. Given an ontology term, *p*, we quantify the term’s specificity using its information content, *IC*, as defined by Resnik [31]:

\mathit{IC}\left(p\right)=\ln\left(\frac{N}{{n}_{p}}\right)

(1)

where *N* is the total number of objects annotated by all terms and *n*_{p} is the number of objects annotated either directly or indirectly, via the true-path rule, by the term *p*. Given two terms *p*_{1} and *p*_{2}, let *A(p*_{1}*, p*_{2}*)* be the set of all common ontological ancestors of *p*_{1} and *p*_{2}, noting that *p*_{1} and *p*_{2} are not included in *A* unless *p*_{1}* = p*_{2}. We define the most informative common ancestor of *p*_{1} and *p*_{2}, *p*_{MICA}, to be the term in *A* with the maximum information content. We define the termwise semantic similarity as [21]:

\mathit{si}{m}_{t}\left({p}_{1},{p}_{2}\right)=\mathit{IC}\left({p}_{\mathit{MICA}}\right)=\ln\left(\frac{N}{{n}_{{p}_{\mathit{MICA}}}}\right)

(2)

where *n*_{pMICA} is the number of objects annotated directly or indirectly by the term *p*_{MICA}.
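Equations 1 and 2 can be illustrated with a minimal sketch over a toy ontology. The terms, genes, and parent map below are hypothetical stand-ins for the HPO ontology and its gene annotations, not the actual data used in this work:

```python
import math

# Toy ontology: term -> set of its direct parents.
PARENTS = {"root": set(), "A": {"root"}, "B": {"root"},
           "A1": {"A"}, "A2": {"A"}}

# Hypothetical direct annotations: term -> set of annotated genes.
DIRECT = {"A1": {"g1"}, "A2": {"g2"}, "B": {"g3"}}

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    result, frontier = {term}, [term]
    while frontier:
        t = frontier.pop()
        for p in PARENTS[t]:
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def annotated_objects(term):
    """True-path rule: a term annotates the genes of itself and all descendants."""
    return {g for t, genes in DIRECT.items() if term in ancestors(t) for g in genes}

N = len({g for genes in DIRECT.values() for g in genes})  # total annotated objects

def ic(term):
    """Equation 1: IC(p) = ln(N / n_p)."""
    return math.log(N / len(annotated_objects(term)))

def sim_t(p1, p2):
    """Equation 2: IC of the most informative common ancestor; p1 and p2 are
    excluded from the candidate ancestors unless p1 == p2."""
    common = ancestors(p1) & ancestors(p2)
    if p1 != p2:
        common -= {p1, p2}
    return max(ic(t) for t in common)
```

In this toy example, `sim_t("A1", "A2")` resolves the MICA to `"A"` (annotating 2 of the 3 genes), giving ln(3/2), while terms whose only common ancestor is the root score 0.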

In order to compare two term sets, we require a term-set semantic similarity model. Given two arbitrary term sets, *P*_{1}* = {p*_{1}*, p*_{2}*, …, p*_{K}*}* and *P*_{2}* = {p*_{1}*, p*_{2}*, …, p*_{J}*}*, we define the asymmetric term-set semantic similarity as:

\mathit{si}{m}_{\mathit{asym}}\left({P}_{1}\to {P}_{2};\mathit{si}{m}_{X}\right)=\frac{1}{K}{\displaystyle \sum _{{p}_{k}\in {P}_{1}}}\underset{{p}_{j}\in {P}_{2}}{\text{max}}\mathit{si}{m}_{X}\left({p}_{k},{p}_{j}\right)

(3)

where *sim*_{X} is a termwise similarity function such as Equation 2. Note, Equation 3 is asymmetric, i.e., its output is dependent on the order of the arguments *P*_{1} and *P*_{2}. A symmetric term-set semantic similarity model can then be constructed as follows:

\mathit{si}{m}_{\mathit{sym}}\left({P}_{1},{P}_{2};\mathit{si}{m}_{X}\right)=\frac{\mathit{si}{m}_{\mathit{asym}}\left({P}_{1}\to {P}_{2};\mathit{si}{m}_{X}\right)+\mathit{si}{m}_{\mathit{asym}}\left({P}_{2}\to {P}_{1};\mathit{si}{m}_{X}\right)}{2}

(4)

Note that Equations 3 and 4 describe a family of semantic similarity models that depend on the termwise comparison function *sim*_{X}, a number of which are described in the literature [21, 31, 32].
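Equations 3 and 4 translate directly into code. The sketch below is parameterized by the termwise function `sim_x`; the exact-match stand-in used for illustration is not one of the published termwise models:

```python
def sim_asym(P1, P2, sim_x):
    """Equation 3: for each term in P1, take its best match in P2 under
    sim_x, then average over the K terms of P1."""
    return sum(max(sim_x(pk, pj) for pj in P2) for pk in P1) / len(P1)

def sim_sym(P1, P2, sim_x):
    """Equation 4: mean of the two asymmetric comparisons."""
    return (sim_asym(P1, P2, sim_x) + sim_asym(P2, P1, sim_x)) / 2

# Stand-in termwise function for illustration only (exact match, not Resnik).
exact = lambda a, b: 1.0 if a == b else 0.0
```

With the exact-match stand-in, `sim_asym(["a", "b"], ["b", "c"], exact)` averages one hit and one miss to 0.5, and `sim_sym` of identical sets is 1.0.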

Given a term-set semantic similarity model, we can rank an annotated object set relative to some query set. Here, we wish to rank a list of genes relative to a query set that represents the patient’s phenotype. We first assume that a set of HPO phenotype terms, *P*_{p}* = {p*_{1}*, p*_{2}*, …, p*_{K}*}*, is used to characterize patient abnormalities, and that any given gene, *g*, is annotated with a set of phenotype terms, *P*_{g}* = {p*_{1}*, p*_{2}*, …, p*_{J}*}*, that indicate a known link between mutations in the gene and the associated phenotype abnormalities. For each gene, we assign a score computed as the semantic similarity between *P*_{p} and *P*_{g} and then sort the genes in descending order by score. In the event of identical scores across multiple genes, all tied genes are assigned a rank equal to the mean of their sequential positions. For example, if four genes share the same similarity score and occupy positions *{i, i* + *1, i* + *2, i* + *3}* in the sorted list, then each gene is assigned the rank *(4i* + *6)/4*, rounded down to the nearest integer.
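The ranking and tie-handling scheme above can be sketched as follows; the gene names and scores are hypothetical:

```python
def rank_genes(scores):
    """Sort genes by descending similarity score; tied genes all receive the
    floor of the mean of their sequential (1-based) positions."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        # Advance j past the run of genes tied with position i.
        j = i
        while j < len(ordered) and scores[ordered[j]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j) // 2  # floored mean of positions i+1 .. j
        for gene in ordered[i:j]:
            ranks[gene] = mean_rank
        i = j
    return ranks
```

For four genes tied at positions 2 through 5, the floored mean is (2 + 5) // 2 = 3, matching the *(4i + 6)/4* example with *i* = 2.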

### Statistical significance

To compensate for annotation bias, objects can be ordered by some statistical significance measure of the semantic similarity observed between the query term set and the terms annotated to each object. Specifically, we computed the one-sided p-value associated with an observed semantic similarity score. A similarity score distribution model is required for each annotated object and for each fixed query term set length to determine the p-value of an observed similarity score. Consider two objects, *g*_{1} and *g*_{2}, each annotated with different ontology terms. Assume *K*_{1} ontology terms are randomly selected to represent the query set, and that the semantic similarities between the query set and the terms annotated to *g*_{1} and *g*_{2} are recorded. The score distribution observed for *g*_{1} will differ from that observed for *g*_{2}. Similarly, the distribution obtained for *g*_{1} with randomly selected query sets of *K*_{2} terms will differ from the distribution for *g*_{1} with query sets of *K*_{1} terms, provided *K*_{2}* ≠ K*_{1}.

Assuming we have similarity score distribution models for each gene annotated by HPO terms and for each phenotype query set length, *K*, and given an HPO phenotype query set representing a patient’s phenotypes, we can compute a semantic similarity score for each gene in some gene set as described in the previous section. If we now assign to each gene the p-value associated with the semantic similarity score observed for that gene, we can rank the genes in ascending order by p-value. If two or more genes have the same p-value, their relative rank is determined by their similarity scores. For example, if four genes have the same p-value and occupy positions *{i, i* + *1, i* + *2, i* + *3}* in the sorted list, each gene is assigned the rank *i* - *1* + *s*_{r}, where *s*_{r} is the gene’s position amongst the four genes when sorted by similarity score.

It is not computationally feasible to sample the entire similarity score space for all genes annotated by HPO terms because the number of possible query phenotype sets grows exponentially with the query set length, *K*. Consequently, we estimated the similarity distribution models using Monte Carlo sampling. For each combination of gene and query set length in the range [2, 10], we randomly selected 100,000 query phenotype sets and computed the corresponding semantic similarity to the terms annotated to the gene. Note that for *K = 1*, we sampled the entire space of 6,346 query terms. The Monte Carlo sampling was conducted using a custom Scala Akka application deployed on Amazon’s Elastic Compute Cloud [33–35].
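The sampling step can be outlined as below. This is a single-process sketch of the idea, not the distributed Scala Akka application used in this work; the term-overlap similarity passed in the usage example is a hypothetical placeholder for the term-set model of Equation 4:

```python
import random

def sample_null_scores(gene_terms, all_terms, k, n_samples, sim, rng):
    """Monte Carlo null distribution: similarity of random k-term query
    sets to a fixed gene annotation set, returned sorted for CDF lookup."""
    scores = []
    for _ in range(n_samples):
        query = rng.sample(all_terms, k)  # k distinct random query terms
        scores.append(sim(query, gene_terms))
    return sorted(scores)
```

A seeded generator keeps a given gene's table reproducible, e.g. `sample_null_scores({"a", "b"}, list("abcdefghij"), 3, 200, overlap, random.Random(0))`.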

From the Monte Carlo samples, we created tables representing the similarity score cumulative distribution function (*CDF*) for each combination of gene and patient phenotype set length. Given a similarity score, *s*, for gene, *g*, and patient phenotype set length, *K*, we estimate the one-sided p-value, i.e., the probability of randomly obtaining a similarity score greater than or equal to the observed score, from the *CDF* as:

1-\mathit{CDF}\left(s;g,K\right)

(5)

For query phenotype sets of length greater than 10, we used the tables derived for *K = 10*.
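Against a sorted table of Monte Carlo null scores, the empirical version of Equation 5 is a binary search; the function name and table layout here are illustrative assumptions:

```python
import bisect

def p_value(score, null_scores):
    """One-sided empirical p-value: fraction of the sorted null scores that
    are >= the observed score, i.e. 1 - CDF(s) with a >= convention."""
    idx = bisect.bisect_left(null_scores, score)  # first null score >= score
    return (len(null_scores) - idx) / len(null_scores)
```

`bisect_left` is used rather than `bisect_right` so that null scores exactly equal to the observed score count toward the p-value, matching the "greater than or equal" definition.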

### Data generation

For a given disease, we formed a baseline disease phenotype feature set as the intersection of the disease terms supplied in the online version of [21] and those annotated to the Online Mendelian Inheritance in Man (OMIM)-indicated causative gene, per the HPO genes-to-phenotype resource file. The penetrance of each disease phenotype was also taken from the online version of [21]. A complete list of the disease causative genes and their associated phenotypes is available as an additional file [see Additional file 1].

We simulated 100 baseline patients per disease. To account for gender-specific phenotypes, we first generated a standard uniform random number and designated the patient as male if the number was greater than 0.5, and female otherwise. Then, for each baseline phenotype feature, *p*, associated with the patient’s disease, we generated a new standard uniform pseudo-random number, *α*_{p}, and added *p* to the patient’s phenotype set if *α*_{p}* ≤ f*_{p}, where *f*_{p} is the phenotype penetrance. For example, assume a given disease has two phenotypes, *p*_{1} and *p*_{2}, with penetrance 0.25 and 0.5, respectively. For each simulated patient we then generated two random numbers, *α*_{p1} and *α*_{p2}, assigned *p*_{1} to the patient if *α*_{p1}* ≤ 0.25*, and likewise assigned *p*_{2} if *α*_{p2}* ≤ 0.5*. We tacitly assumed that disease phenotypes occur independently of each other because joint distribution data is not currently available. We ensured that every simulated patient had at least one phenotype term. We denote this patient cohort as “optimal” to indicate that the patient phenotype characteristics were selected from those directly annotated to the causative gene.
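The per-phenotype penetrance draw can be sketched as below. The retry loop is one simple way to satisfy the at-least-one-term requirement; the source does not specify how that guarantee was implemented, so this is an assumption:

```python
import random

def simulate_patient(phenotypes, rng):
    """Simulate one 'optimal' patient: keep each disease phenotype with
    probability equal to its penetrance (phenotypes maps term -> penetrance).
    Redraws until the patient has at least one term (assumed mechanism)."""
    patient = set()
    while not patient:
        patient = {p for p, penetrance in phenotypes.items()
                   if rng.random() <= penetrance}
    return patient
```

Each phenotype is drawn independently, mirroring the paper's independence assumption; note the redraw slightly inflates the marginal frequency of each phenotype relative to its raw penetrance.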

In actual clinical practice, patient phenotype sets will often include noise and imprecision. Imprecision occurs when a less specific phenotype term is used to annotate the patient in place of the phenotype term annotated to the causative gene. For example, assume the phenotype, *p*_{1}, is directly annotated to the causative gene, *g*. Let *p*_{2} be any ontological ancestor of *p*_{1} other than the ontology root term, and assume that *p*_{2} is not annotated to *g*. We defined imprecision as present if *p*_{2} is used to annotate the patient instead of *p*_{1}. We defined noise as occurring when a phenotype that does not directly or indirectly annotate a gene is nevertheless assigned to the patient phenotype set.

To examine the impact of noise, randomly selected noise terms were added to each patient’s baseline optimal phenotype set. Noise terms were generated separately for each patient. Each disease was associated with an allowable set of HPO noise terms, namely all terms that do not directly or indirectly annotate the disease causative gene. For each patient, noise terms were selected randomly from this disease-specific list, so that different patients had different noise terms. The number of noise terms per patient was set to half the number of optimal terms, i.e., those derived from terms annotating the causative gene. For example, a patient with 4 optimal terms was assigned 2 additional randomly selected noise terms, and a patient with 10 optimal terms was assigned 5.

To examine the impact of imprecision, each of the patient’s optimal phenotypes was replaced with a randomly selected ancestor term, other than the ontology root. For each simulated disease, each term directly annotated to the causative gene, i.e., each optimal term, was assigned an allowed imprecision term set composed of all ancestor terms of the optimal term. For each patient, imprecision was added by replacing every optimal term with a randomly selected term from that optimal term’s imprecision set, so that different patients had different imprecision terms.

Finally, to add both noise and imprecision, the noise terms generated for a given patient in the “noise only” case were added to the patient terms generated in the “imprecision only” case.
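The two perturbations can be sketched as below. The term identifiers are hypothetical, and the floor division for odd optimal-term counts is an assumption, since the source only gives even-count examples:

```python
import random

def add_noise(optimal_terms, allowed_noise, rng):
    """Noise condition: add half as many random noise terms as optimal terms
    (floored for odd counts, an assumption)."""
    n_noise = len(optimal_terms) // 2
    return set(optimal_terms) | set(rng.sample(sorted(allowed_noise), n_noise))

def add_imprecision(optimal_terms, ancestors_of, rng):
    """Imprecision condition: replace each optimal term with a randomly
    chosen non-root ancestor (ancestors_of maps term -> list of ancestors)."""
    return {rng.choice(ancestors_of[t]) for t in optimal_terms}
```

The combined condition is then `add_noise(add_imprecision(...), ...)` with the same per-patient noise draw reused, as described above.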

In addition to the simulated data, we also obtained data for actual clinical patients taken from a research cohort currently used as part of a clinical genomics workflow validation project at The Children’s Hospital of Philadelphia and the Perelman School of Medicine at the University of Pennsylvania. The purpose of this project is to validate the ability of the exome sequencing workflow to enable the identification of causative variants. The cohort subjects all have known genetic diseases with a previously identified causative variant confirmed by Sanger sequencing. In the validation study, retrospective exome sequencing was performed for each subject. Genetic counselors, who were not provided the known causative variant, then examined the exome sequence results to identify the causative variant. For a number of these retrospective cases, a genetic counselor provided a list of patient HPO phenotype characteristics deemed relevant to the patient’s condition, which were used to evaluate the semantic similarity algorithm. This research was approved by the Institutional Review Board of The Children's Hospital of Philadelphia. Written informed consent was obtained from the patient for the publication of this report and any accompanying images.

### Data access

Two additional files accompany this article. “Additional file 1” is a .pdf file that contains the OMIM disease ID, name, causative gene, associated HPO terms and penetrance values used for the 33 simulated Mendelian diseases used in this study. “Additional file 2” is a .txt file containing the hearing loss phenotype gene list developed at The Children’s Hospital of Philadelphia and used for evaluation of the clinical cases in this study. The custom source code used to implement the algorithm and generate the simulated data in this work is publicly available at: https://github.com/cbmi/phenomantics.