An integrated pharmacokinetics ontology and corpus for text mining

Background Drug pharmacokinetics parameters, drug interaction parameters, and pharmacogenetics data have been unevenly collected in different databases and published extensively in the literature. Without an appropriate pharmacokinetics ontology and a well annotated pharmacokinetics corpus, it is difficult to develop text mining tools for pharmacokinetics data collection from the literature and pharmacokinetics data integration from multiple databases. Description A comprehensive pharmacokinetics ontology was constructed. It can annotate all aspects of in vitro pharmacokinetics experiments and in vivo pharmacokinetics studies. It covers all drug metabolism and transportation enzymes. Using our pharmacokinetics ontology, a PK-corpus was constructed to present four classes of pharmacokinetics abstracts: in vivo pharmacokinetics studies, in vivo pharmacogenetic studies, in vivo drug interaction studies, and in vitro drug interaction studies. A novel three-level hierarchical annotation scheme was proposed and implemented to tag key terms, drug interaction sentences, and drug interaction pairs. The utility of the pharmacokinetics ontology was demonstrated by annotating three pharmacokinetics studies, and the utility of the PK-corpus was demonstrated by a drug interaction extraction text mining analysis. Conclusions The pharmacokinetics ontology annotates both in vitro pharmacokinetics experiments and in vivo pharmacokinetics studies. The PK-corpus is a highly valuable resource for the text mining of pharmacokinetics parameters and drug interactions.


Background
Pharmacokinetics (PK) is a very important translational research field, which studies drug absorption, disposition, metabolism, excretion, and transportation (ADMET). PK systematically investigates the physiological and biochemical mechanisms of drug exposure in multiple tissue types, cells, animals, and human subjects [1]. There are two major molecular mechanisms of a drug's PK: metabolism and transportation. Drug metabolism mainly happens in the gut and liver, while drug transportation exists in all tissue types. If PK can be interpreted as what the body does to the drug, pharmacodynamics (PD) can be defined as what the drug does to the body. A drug's pharmacodynamic effect ranges widely from molecular signals (such as its targets or downstream biomarkers) to clinical symptoms (such as efficacy or side effect endpoints) [1].
Drug-drug interaction (DDI) is another important pharmacology concept. It is defined as a change in one drug's PK or PD response due to the presence of another drug. PD based drug interaction has a wide range of interpretations (e.g. from molecular markers to clinical endpoints). PK based drug interaction mechanisms are very well defined: metabolism enzyme based and transporter based DDIs. Pharmacogenetic (PG) variations in a drug's PK and PD pathways can also affect its responses [1]. In this paper, we focus our discussion on PK, PK based DDI, and PK related PG.
Although significant efforts have been invested to integrate biochemistry, genetics, and clinical information for drugs, significant gaps exist in the area of PK. For example, DrugBank (http://www.drugbank.ca/) doesn't have in vitro PK and its associated DDI data; DiDB (http://www.druginteractioninfo.org/) doesn't have sufficient PG data; and PharmGKB (http://www.pharmgkb.org/) doesn't have sufficient in vivo and in vitro PK and its associated DDI data. As an alternative approach to collecting PK data from the published literature, text mining has just started to be explored [1][2][3][4].
For both database construction and literature mining, the main challenge of PK data integration is the lack of a PK ontology. In this paper, we first developed a PK ontology and then constructed a PK corpus, which facilitated DDI text mining from the literature.

Construction and content
PK Ontology is composed of several components: experiments, metabolism, transporter, drug, and subject (Table 1). Our primary contribution is the ontology development for the PK experiment, and integration of the PK experiment ontology with other PK-related ontologies.
Experiment specifies in vitro and in vivo PK studies and their associated PK parameters. Table 2 presents definitions and units of the in vitro PK parameters. The PK parameters of the single drug metabolism experiment include the Michaelis-Menten constant (Km), maximum velocity of the enzyme activity (Vmax), intrinsic clearance (CLint), metabolic ratio, and fraction of metabolism by an enzyme (fm,enzyme) [5]. In the transporter experiment, the PK parameters include apparent permeability (Papp), the ratio of basolateral-to-apical permeability to apical-to-basolateral permeability (Re), radioactivity, and uptake volume [6]. There are multiple drug interaction mechanisms: competitive inhibition, non-competitive inhibition, uncompetitive inhibition, mechanism based inhibition, and induction [7]. IC50 is the inhibitor concentration that reduces enzyme activity to 50%; it is substrate dependent, and it doesn't imply the inhibition mechanism. Ki is the inhibition constant for competitive inhibition, non-competitive inhibition, and uncompetitive inhibition. It represents the inhibitor concentration that reduces enzyme activity to 50%, and it is independent of substrate concentration. Kdeg is the degradation rate constant for the enzyme. KI is the concentration of inhibitor associated with half maximal inactivation in mechanism based inhibition, and Kinact is the maximum degradation rate constant in the presence of a high concentration of inhibitor in mechanism based inhibition. Emax is the maximum induction rate, and EC50 is the concentration of inducer that is associated with half maximal induction.
The in vitro experiment conditions are presented in Table 3. Metabolism enzyme experiment conditions include buffer, NADPH sources, and protein sources. In particular, protein sources include recombinant enzymes, microsomes, hepatocytes, etc. Sometimes, genotype information is available for the microsome or hepatocyte samples. Transporter experiment conditions include bi-directional transporter, uptake/efflux, and ATPase. Other factors of in vitro experiments include pre-incubation time, incubation time, quantification methods, sample size, and data analysis methods. All of this information can be found in the FDA draft guidance (http://www.abclabs.com/Portals/0/FDAGuidance_DraftDrugInteractionStudies2006.pdf).
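To illustrate how several of the in vitro parameters above relate to each other, the following is a minimal sketch using textbook Michaelis-Menten and competitive inhibition equations. These formulas and the function names are assumptions brought in for illustration; they are not taken from the ontology tables. The last function shows why IC50 depends on substrate concentration while Ki does not.

```python
def mm_velocity(s, vmax, km):
    """Michaelis-Menten reaction velocity at substrate concentration s."""
    return vmax * s / (km + s)


def intrinsic_clearance(vmax, km):
    """CLint approximated as Vmax / Km (textbook form, valid when s << Km)."""
    return vmax / km


def competitive_velocity(s, i, vmax, km, ki):
    """Velocity under competitive inhibition at inhibitor concentration i."""
    return vmax * s / (km * (1.0 + i / ki) + s)


def ic50_competitive(s, km, ki):
    """IC50 for a competitive inhibitor (Cheng-Prusoff relation).

    The result grows with s, which is why IC50 is substrate dependent
    while Ki is not.
    """
    return ki * (1.0 + s / km)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    vmax, km, ki, s = 100.0, 5.0, 2.0, 10.0
    ic50 = ic50_competitive(s, km, ki)
    print("CLint:", intrinsic_clearance(vmax, km))
    print("IC50 at s=10:", ic50)
    print("fraction of activity at IC50:",
          competitive_velocity(s, ic50, vmax, km, ki) / mm_velocity(s, vmax, km))
```

At an inhibitor concentration equal to the computed IC50, the inhibited velocity is half of the uninhibited velocity, which matches the definition of IC50 given above.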
The in vivo PK parameters are presented in Table 4. All of the information is summarized from two textbooks [1,8]. There are several main classes of PK parameters: area under the concentration curve parameters (AUCinf, AUCSS, AUCt, AUMC); drug clearance parameters (CL, CLb, CLu, CLH, CLR, CLpo, CLIV, CLint, CL12); drug concentration parameters (Cmax, CSS); extraction ratio and bioavailability parameters (E, EH, F, FG, FH, FR, fe, fm); rate constants, including the elimination rate constant k, absorption rate constant ka, urinary excretion rate constant ke, Michaelis-Menten constant Km, distribution rate constants (k12, k21), and the two rate constants in the two-compartment model (λ1, λ2); blood flow rates (Q, QH); time parameters (tmax, t1/2); volume of distribution parameters (V, Vb, V1, V2, Vss); the maximum rate of metabolism, Vmax; and ratios of PK parameters that present the extent of the drug interaction (AUCR, CL ratio, Cmax ratio, Css ratio, t1/2 ratio).
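As a small illustration of how two of these parameters are typically obtained, the sketch below estimates AUCt with the linear trapezoidal rule and then forms the AUC ratio (AUCR) between an interaction phase and a control phase. The trapezoidal rule is a common non-compartmental choice assumed here, and the concentration values and function names are hypothetical.

```python
def auc_trapezoid(times, concentrations):
    """AUC from time zero to the last sample by the linear trapezoidal rule."""
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two matched time/concentration points")
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        area += dt * (concentrations[k] + concentrations[k - 1]) / 2.0
    return area


def auc_ratio(auc_with_interacting_drug, auc_alone):
    """AUCR: fold change in substrate exposure caused by the second drug."""
    return auc_with_interacting_drug / auc_alone


if __name__ == "__main__":
    # Hypothetical sampling times (h) and plasma concentrations (ng/mL).
    t = [0.0, 1.0, 2.0, 4.0]
    alone = [0.0, 10.0, 6.0, 2.0]
    with_inhibitor = [0.0, 18.0, 14.0, 8.0]
    print("AUCR:", auc_ratio(auc_trapezoid(t, with_inhibitor),
                             auc_trapezoid(t, alone)))
```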
It is also shown in Table 4 that two types of pharmacokinetics models are usually presented in the literature: the non-compartment model and one- or two-compartment models. Multiple items need to be considered in an in vivo PK study. The hypotheses include the effect [...] points. The sample type includes blood, plasma, and urine. The drug quantification methods include HPLC/UV, LC/MS/MS, LC/MS, and radiographic methods. CYP450 family enzymes predominantly exist in the gut wall and liver. Transporters are tissue specific. Table 5 presents the tissue specific transporters and their functions. The probe drug is another important concept in pharmacology research. An enzyme's probe substrate is a substrate that is primarily metabolized or transported by this enzyme. In order to experimentally prove whether a new drug inhibits or induces an enzyme, its probe substrate is always utilized to demonstrate this enzyme's activity before and after inhibition or induction. An enzyme's probe inhibitor or inducer is one that primarily inhibits or induces this enzyme. Similarly, an enzyme's probe inhibitor needs to be utilized if we investigate whether a drug is metabolized by this enzyme. Table 6 presents all the probe inhibitors, inducers, and substrates of CYP enzymes. Table 7 presents all the probe inhibitors, inducers, and substrates of the transporters. All of this information was collected from the industry standard (http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm064982.htm) and has been reviewed in a leading pharmacology journal [9].
Metabolism The cytochrome P450 superfamily (officially abbreviated as CYP) is a large and diverse group of enzymes that catalyze the oxidation of organic substances. The substrates of CYP enzymes include metabolic intermediates such as lipids and steroidal hormones, as well as xenobiotic substances such as drugs and other toxic chemicals. CYPs are the major enzymes involved in drug metabolism and bioactivation, accounting for about 75% of the total number of different metabolic reactions [10]. CYP enzyme names and genetic variants were mapped from the Human Cytochrome P450 (CYP) Allele Nomenclature Database (http://www.cypalleles.ki.se/). This site describes the effects of CYP450 genetic mutations on the protein sequence and enzyme activity, with associated references.
Transport proteins are proteins that serve the function of moving other materials within an organism. Transport proteins are vital to the growth and life of all living things. They are involved in the movement of ions, small molecules, or macromolecules, such as another protein, across a biological membrane. They are integral membrane proteins; that is, they exist within and span the membrane across which they transport substances. Their names and genetic variants were mapped from the Transporter Classification Database (http://www.tcdb.org). In addition, we added the probe substrates and probe inhibitors to each of the metabolism and transportation enzymes (see the description above).
Drug names were created using the drug names from DrugBank 3.0 [11]. DrugBank consists of 6,829 drugs, which can be grouped into the categories of FDA-approved, FDA-approved biotech, nutraceutical, and experimental drugs. The drug names are mapped to generic names, brand names, and synonyms.

PK corpus
A PK abstract corpus was constructed to cover four primary classes of PK studies: clinical PK studies (n = 56); clinical pharmacogenetic studies (n = 57); in vivo DDI studies (n = 218); and in vitro drug interaction studies (n = 210). The PK corpus was constructed manually. The abstracts of clinical PK studies were selected from our previous work, in which the most popular CYP3A substrate, midazolam, was investigated [4]. The clinical pharmacogenetic abstracts were selected based on the most polymorphic CYP enzyme, CYP2D6. We think these two selection strategies represent the in vivo PK and PG studies well. For the drug interaction studies, the abstracts were randomly selected from a PubMed query that used the probe substrates/inhibitors/inducers for metabolism enzymes reported in Table 6.

Table 4 (excerpt) Pharmacokinetics models and study designs
Pharmacokinetics Models
Non-Compartment: Uses drug concentration measurements directly to estimate PK parameters, such as AUC, CL, Cmax, Tmax, t1/2, F, and V. (Reference: GP p409)
One Compartment Model: Assumes the whole body is a homogeneous compartment and the distribution of the drug from the blood to tissue is very fast. It assumes either a first order or a zero order absorption rate and a first order elimination rate. Its PK parameters include (ka, V, CL, F). (Reference: RT p34; GP p1)
Two Compartment Model: Assumes the whole body can be divided into two compartments: the central compartment (i.e. systemic compartment) and the peripheral compartment (i.e. tissue compartment). It assumes either a first order or a zero order absorption rate and first order elimination and distribution rates. Its PK parameters include (ka, V1, V2, CL, CL12, F). (Reference: GP p84)
Study Designs
Hypothesis: Bioequivalence, drug interaction, pharmacogenetics, and disease conditions.
Design: Single arm or multiple arms; cross-over or fixed order design; with or without randomization; with or without stratification; prescreening or no-prescreening; prospective or retrospective studies; and case reports or cohort studies.
Sample size: The number of subjects, and the number of plasma or urine samples per subject.

Once the abstracts had been assigned to the four classes, they were annotated manually (Figure 1). The annotation was first carried out by three master's-level annotators (Shreyas Karnik, Abhinita Subhadarshini, and Xu Han) and one Ph.D. annotator (Lang Li). They have different training backgrounds: computational science, biological science, and pharmacology. Any differentially annotated terms were further checked by Sara K. Quinney and David A. Flockhart, one Pharm.D. and one M.D. scientist with extensive pharmacology training. Where these two annotators disagreed, a group review was conducted (Drs Quinney, Flockhart, and Li) to reach the final agreed annotations. In addition, a random subset of 20% of the abstracts that had consistent annotations among the four annotators (three master's-level and one Ph.D.) was double checked by two Ph.D.-level scientists.
A structured annotation scheme was implemented to annotate three layers of pharmacokinetics information: key terms, DDI sentences, and DDI pairs (Figure 2). The DDI sentence annotation scheme depends on the key terms, and the DDI pair annotations depend on the key terms and DDI sentences. The annotation schemes are described as follows.
Key terms include drug names, enzyme names, PK parameters, numbers, mechanisms, and change. The boundaries of these terms among different annotators were judged by the following standard.
Drug names were defined mainly on DrugBank 3.0 [11]. In addition, drug metabolites were also tagged, because they are important in in vitro studies. The metabolites were identified by either a prefix or a suffix: oxi, hydroxyl, methyl, acetyl, N-dealkyl, N-demethyl, nor, dihydroxy, O-dealkyl, and sulfo. These prefixes and suffixes arise from the reactions of phase I metabolism (oxidation, reduction, hydrolysis) and phase II metabolism (methylation, sulphation, acetylation, glucuronidation) [13]. Enzyme names covered all the CYP450 enzymes. Their names are defined in the human cytochrome [...] PK parameters were annotated based on the defined in vitro and in vivo PK parameter ontology in Table 2 [...]
The middle level annotation focused on the drug interaction sentences. Because the two interacting drugs were not necessarily both present in the sentence, sentences were categorized into two classes: Clear DDI Sentence (CDDIS): two drug names (or a drug-enzyme pair in the in vitro study) are in the [...]
Once DDI sentences were labeled, the DDI pairs in the sentences were further annotated. Because of the fundamental difference between in vivo DDI studies and in vitro DDI studies, their DDI relationships were defined differently. In in vivo studies, three types of DDI relationships were defined (Table 8): DDI, ambiguous DDI (ADDI), and non-DDI (NDDI). Four conditions are specified to determine these DDI relationships. Condition 1 (C1) requires that at least one drug or enzyme name is contained in the sentence; condition 2 (C2) requires that the other interacting drug or enzyme name can be found from the context if it is not in the same sentence; condition 3 (C3) specifies numeric rules that define the DDI relationships based on the PK parameter changes; and condition 4 (C4) specifies the language expression patterns for DDI relationships. Using the rules summarized in Table 8, DDI, ADDI, and NDDI can be defined by C1 ∧ C2 ∧ (C3 ∨ C4). The priority rank of in vivo PK parameters is AUC > CL > t1/2 > Cmax. In in vitro studies, six types of DDI relationships were defined (Table 8). DDI, ADDI, and NDDI were similar to in vivo DDIs, but three more drug-enzyme relationships were further defined: DEI, ambiguous DEI (ADEI), and non-DEI (NDEI). C1, C2, and C4 remained the same for in vitro DDIs. The main difference is in C3, in which either Ki or IC50 (inhibition) or EC50 (induction) was used to define the DDI relationship quantitatively. The priority rank of in vitro PK parameters is Ki > IC50. Table 9 presents eight examples of how DDIs or DEIs were determined in the sentences.

Figure 1 PK corpus annotation flow chart.
Figure 2 A three level hierarchical PK and DDI annotation scheme.
Table 6 In vivo probe inhibitors/inducers/substrates of CYP enzymes (columns: CYP enzymes; Inhibitors; Inducers; Substrates).

Table 8 DDI relationship definitions (recoverable rows)
DDI: C4: significant, obviously, markedly, greatly, pronouncedly, etc.
Ambiguous DDI (ADDI): C3: the PK parameter with the highest priority* has a p-value < 0.05 but 0.67 < FC < 1.50; or FC > 1.50 or FC < 0.67, but a p-value > 0.05.
Non-DDI (NDDI): C3: the PK parameter with the highest priority* has a p-value > 0.05 and 0.67 < FC < 1.50. C4: minor significance, slightly, little or negligible effect, doesn't interact, etc.
Non-DEI (NDEI): [...]
Note: C1: At least one drug or enzyme name has to be contained in the sentence. C2: Need to label the drug name if it is not from the same sentence. C3: PK-parameter and value dependent. C4: Significance statement. *Priority issue: When C3 and C4 occur and conflict, C3 dominates the sentence. **For the priority of PK parameters: AUC > CL > t1/2 > Cmax; the priority of in vitro PK parameters: Ki > IC50.
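The numeric C3 rules for in vivo studies can be sketched as a small classifier. This is a minimal illustration, not the annotation tool used for the corpus: the DDI rule (a fold change outside 0.67 to 1.50 together with p < 0.05) is inferred as the complement of the ADDI and NDDI rules, boundary values such as a fold change of exactly 1.50 are treated as large because a 1.5 fold-change with p < 0.05 is described as a DDI, and all function and variable names are hypothetical.

```python
# Priority of in vivo PK parameters, as stated in the text: AUC > CL > t1/2 > Cmax.
IN_VIVO_PRIORITY = ["AUC", "CL", "t1/2", "Cmax"]


def classify_by_c3(fold_change, p_value, fc_low=0.67, fc_high=1.50, alpha=0.05):
    """Return 'DDI', 'ADDI' or 'NDDI' from a fold change (FC) and p-value.

    Assumptions: FC on the boundary counts as a large change, and a p-value
    exactly equal to alpha counts as not significant.
    """
    large_change = fold_change >= fc_high or fold_change <= fc_low
    significant = p_value < alpha
    if large_change and significant:
        return "DDI"
    if large_change or significant:
        return "ADDI"
    return "NDDI"


def classify_sentence(parameter_changes):
    """Classify using only the highest-priority PK parameter that is reported.

    parameter_changes maps a parameter name to a (fold_change, p_value) tuple.
    """
    for name in IN_VIVO_PRIORITY:
        if name in parameter_changes:
            return classify_by_c3(*parameter_changes[name])
    raise ValueError("no recognised PK parameter in the sentence")


if __name__ == "__main__":
    # Hypothetical p-values; only "less than 0.05" is known for this case.
    print(classify_sentence({"Cmax": (1.3, 0.01), "AUC": (1.5, 0.01)}))
    # Urinary ratio 0.006 -> 0.014 with P = .26: large change, not significant.
    print(classify_by_c3(0.014 / 0.006, 0.26))
```

Because AUC outranks Cmax, the first call is decided by the AUC change alone, even though the Cmax change on its own would fall in the ambiguous range.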
Krippendorff's alpha [14] was calculated to evaluate the reliability of annotations from the four annotators. The frequencies of key terms, DDI sentences, and DDI pairs are presented in Table 10. Their Krippendorff's alphas are 0.953, 0.921, and 0.905, respectively. Please note that the total DDI pairs refer to the total pairs of drugs within a DDI sentence from all DDI sentences.
The PK corpus was constructed by the following process. Raw abstracts were downloaded from PubMed in XML format. The XML files were then converted into the GENIA corpus format following the gpml.dtd from the GENIA corpus [15]. Sentence detection in this step was accomplished with the Perl module Lingua::EN::Sentence, which was downloaded from The Comprehensive Perl Archive Network (CPAN, www.cpan.org). GENIA corpus files were then tagged with the three levels of PK and DDI annotations described above. Finally, a cascading style sheet (CSS) was implemented to give the entities in the corpus different colours. This feature allows users to visualize annotated entities. We would like to acknowledge that a DDI Corpus was recently published as part of the text mining competition DDIExtraction 2011 (http://labda.inf.uc3m.es/DDIExtraction2011/dataset.html). Its DDIs were clinical outcome oriented, not PK oriented. They were extracted from DrugBank, not from PubMed abstracts. Our PK corpus complements that corpus very well.
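For readers unfamiliar with the statistic, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It assumes that each annotated unit (for example a sentence or a drug pair) carries one label per annotator, with None marking a missing label; the data layout and function name are hypothetical and are not taken from the corpus tooling.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    annotators assigned to one unit (None = annotator gave no label).
    """
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels are not pairable
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical labels from four annotators on five candidate drug pairs.
    labels = [
        ["DDI", "DDI", "DDI", "DDI"],
        ["NDDI", "NDDI", "NDDI", "NDDI"],
        ["DDI", "DDI", "ADDI", "DDI"],
        ["ADDI", "ADDI", "ADDI", None],
        ["NDDI", "NDDI", "NDDI", "NDDI"],
    ]
    print(round(krippendorff_alpha_nominal(labels), 3))
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the values above 0.9 reported for the corpus indicate highly consistent annotation.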
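The first steps of this pipeline (reading PubMed XML and detecting sentences) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard library rather than the Perl module named above, the regular-expression sentence splitter is a naive stand-in for Lingua::EN::Sentence, the sample record and its PMID are made up, and the conversion to GENIA format is not shown.

```python
import re
import xml.etree.ElementTree as ET

# A made-up record that follows the usual PubMed XML element names.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Abstract>
          <AbstractText>Drug A increased the AUC of drug B by 2.1-fold. No change in half-life was observed.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when a capital letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def abstracts_to_sentences(xml_text):
    """Map each PMID to the list of sentences in its abstract."""
    root = ET.fromstring(xml_text)
    sentences_by_pmid = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # itertext() also collects text nested inside inline markup.
        chunks = ["".join(node.itertext()) for node in article.iter("AbstractText")]
        sentences_by_pmid[pmid] = split_sentences(" ".join(chunks))
    return sentences_by_pmid


if __name__ == "__main__":
    for pmid, sentences in abstracts_to_sentences(SAMPLE_XML).items():
        for index, sentence in enumerate(sentences, start=1):
            print(pmid, index, sentence)
```

A rule-based splitter of this kind will mis-split abbreviations such as "e.g. Drug", which is one reason a dedicated sentence detection module is preferable for a real corpus.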

Table 9 Examples of DDI and DEI determination (recoverable entries, listed as PMID, sentence, comment)
[...] Because of the word "significantly", (verapamil, lovastatin) is a DDI.
20209646 The clearance of mitoxantrone and etoposide was decreased by 64% and 60%, respectively, when combined with valspodar.
20012601 The (AUC (0-infinity)) of norverapamil and the terminal half-life of verapamil did not significantly changed with lovastatin coadministration.
17304149 Compared with placebo, itraconazole treatment significantly increase the peak plasma concentration (Cmax) of paroxetine by 1.3 fold (6.7 2.5 versus 9. [...] Comment: AUC has a higher rank than Cmax, and it had a 1.5 fold-change and a p-value of less than 0.05; thus, (itraconazole, paroxetine) is a DDI.
13129991 The mean (SD) urinary ratio of dextromethorphan to its metabolite was 0.006 (0.010) at baseline and 0.014 (0.025) after St John's wort administration (P=.26). Comment: The change in the PK parameter is more than 1.5 fold but the p-value is > 0.05. Thus, (dextromethorphan, St John's wort) is an ADDI.
19904008 The obtained results show that perazine at its therapeutic concentrations is a potent inhibitor of human CYP1A2.
19230594 After human hepatocytes were exposed to 10 microM YM758, microsomal activity and mRNA level for CYP1A2 were not induced while those for CYP3A4 were slightly induced.
19960413 From these results, DPT was characterized to be a competitive inhibitor of CYP2C9 and CYP3A4, with K(i) values of 3.5 and 10.8 microM in HLM and 24.9 and 3.5 microM in baculovirus-insect cell-expressed human CYPs, respectively. Comment: Because K(i) was larger than 10 microM, (DPT, CYP2C9) and (DPT, CYP3A4) are ADEIs.

Example 1: An annotated tamoxifen pharmacogenetics study
This example shows how to annotate a pharmacogenetics study with the PK ontology. We used a published tamoxifen PG study [16]. The key information from this tamoxifen PG trial was extracted as a summary list. Then the pre-processed information was mapped to the PK ontology (column 2 in Additional file 1: Table S1). This PG study investigates the genetic effects (CYP3A4, CYP3A5, CYP2D6, CYP2C9, CYP2B6) on the tamoxifen pharmacokinetics outcome (tamoxifen metabolites) among breast cancer patients. It was a single arm longitudinal study (n = 298); patients took SOLTAMOX(TM) 20 mg/day, and the drug steady state concentration was sampled 1, 4, 8, and 12 months after the start of tamoxifen treatment. The study population was a mix of Caucasians and African Americans. In Additional file 1: Table S1, the trial summary is organized by the PK ontology.

Example 2 midazolam/ketoconazole drug interaction study
This was a cross-over three-phase drug interaction study [17] (n = 24) between midazolam (MDZ) and ketoconazole (KTZ). Phase I was MDZ alone (IV 0.05 mg/kg and PO 4 mg); phase II was MDZ plus KTZ (200 mg); and phase III was MDZ plus KTZ (400 mg). Genetic variables include CYP3A4 and CYP3A5. The PK outcome is the MDZ AUC ratio before and after KTZ inhibition. Its PK ontology based annotation is shown in column three of Additional file 1: Table S1.

Example 3 in vitro Pharmacokinetics Study
This was an in vitro study [18], which investigated the drug metabolism activities of three enzymes (CYP3A4, CYP3A5, and CYP3A7) in a recombinant system. Using 10 CYP3A substrates, the authors compared the relative contributions of the three enzymes to the metabolism of the 10 drugs. Its PK ontology based annotation is shown in Additional file 1: Table S2.

Example 4 A drug interaction text mining example
We implemented the approach described by [19] for the DDI extraction. Prior to performing DDI extraction, the testing and validation DDI abstracts in our corpus were pre-processed and converted into the unified XML format [19]. The following steps were conducted. Drugs were tagged in each of the sentences using a dictionary based on DrugBank. This step revised the drug name annotations described above for the corpus. One purpose was to reduce redundant synonymous drug names. The other purpose was to keep only the parent drugs and remove the drug metabolites from the tagged drug names in our initial corpus, because parent drugs and their metabolites rarely interact. In addition, enzymes (i.e. CYPs) were also tagged as drugs, since enzyme-drug interactions have been extensively studied and published. The regular expression of enzyme names in our corpus was used to remove redundant synonymous gene names. Each of the sentences was subjected to tokenization, PoS tagging, and dependency tree generation using the Stanford parser [20]. All C(n, 2) = n(n − 1)/2 drug pairs from the n tagged drugs in a sentence were generated automatically, and they were assigned the default label of no drug interaction. Please note that if a sentence had only one drug name, this sentence didn't have a DDI. This setup limited us to considering only CDDI sentences in our corpus. The drug interaction labels were then manually flipped based on their true drug interaction annotations from the corpus. Please note that our corpus had annotated DDIs, ADDIs, NDDIs, DEIs, ADEIs, and NDEIs. Here only DDIs and DEIs were labeled as true DDIs. The other ADDIs, NDDIs, ADEIs, and NDEIs were all categorized as no drug interaction.
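The pair generation and labelling step can be sketched as below. This is a minimal illustration of the idea rather than the pipeline of [19]: the function name, the dictionary layout for the gold annotations, and the example entities are all hypothetical.

```python
from itertools import combinations

# Only these corpus labels count as a true interaction; ADDI, NDDI, ADEI and
# NDEI (and unannotated pairs) fall into the no-interaction class.
POSITIVE_LABELS = {"DDI", "DEI"}


def candidate_pairs(tagged_entities, gold_annotations):
    """Generate all C(n, 2) entity pairs of one sentence with binary labels.

    tagged_entities: drug or enzyme names tagged in the sentence.
    gold_annotations: maps frozenset({a, b}) to a corpus label such as "DDI".
    Returns a dict mapping each (a, b) pair to True or False.
    """
    unique = list(dict.fromkeys(tagged_entities))  # drop repeats, keep order
    labelled = {}
    for a, b in combinations(unique, 2):
        # The default is "no interaction"; flip only for DDI or DEI.
        labelled[(a, b)] = gold_annotations.get(frozenset((a, b))) in POSITIVE_LABELS
    return labelled


if __name__ == "__main__":
    entities = ["midazolam", "ketoconazole", "CYP3A4"]
    gold = {
        frozenset(("midazolam", "ketoconazole")): "DDI",
        frozenset(("ketoconazole", "CYP3A4")): "ADEI",
    }
    for pair, is_interaction in candidate_pairs(entities, gold).items():
        print(pair, is_interaction)
```

A sentence with a single tagged entity yields no pairs, which mirrors the restriction to sentences that contain both interacting names.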
Sentences were then represented as dependency graphs built around the interacting components (drugs) (Figure 3). The graph representation of a sentence was composed of two items: i) the dependency graph structure of the sentence; and ii) the sequence of PoS tags (which was transformed into a linear order "graph" by connecting the tags with a constant edge weight). We used the Stanford parser [20] to generate the dependency graphs. Airola et al. proposed combining these two graphs into one weighted, directed graph. This graph was fed into a support vector machine (SVM) for DDI/non-DDI classification. More details about the all paths graph kernel algorithm can be found in [19].
DDI extraction was implemented separately on the in vitro and in vivo DDI corpora. Table 11 presents the training and testing sample sizes in both corpus sets, and Table 12 presents the DDI extraction performance. In extracting in vivo DDI pairs, the precision, recall, and F-measure in the testing set are 0.67, 0.79, and 0.73, respectively. In the in vitro DDI extraction analysis, the precision, recall, and F-measure in the testing set are 0.47, 0.58, and 0.52, respectively. In our earlier DDI research published in the DDIExtraction 2011 Challenge [21], we used the same algorithm to extract both in vitro and in vivo DDIs at the same time, and the reported F-measure was 0.66. This number lies between our current in vivo DDI extraction F-measure of 0.73 and in vitro DDI extraction F-measure of 0.52.
Error analysis was performed on the testing samples. Table 13 summarizes the results. Among the known reasons for the false positives and false negatives, the most frequent one is that there are multiple drugs in the sentence, or the sentence is long. The other reasons include the following: there is no direct DDI relationship between two drugs, but the presence of some words, such as dose and increase, may lead to a false positive prediction; the DDI is presented in an indirect way; or some NDDIs are inferred from adjectives (little, minor, negligible).
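The reported F-measures are consistent with the balanced F-measure (the harmonic mean of precision and recall), as the short check below shows. The use of the balanced form is an assumption here, and the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print("in vivo: ", round(f_measure(0.67, 0.79), 2))
    print("in vitro:", round(f_measure(0.47, 0.58), 2))
```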
annotator; DF confirmed the disagreed annotations and double checked the PK terminologies and study design; and LLi contributed the idea, guided this research, and wrote the manuscript. All authors read and approved the final manuscript.