LINCS cell line information extraction and mapping from different resources
As of June 15, 2017, 1097 cell lines were extracted from the LINCS Data Portal. Of these, 794 could be mapped directly to CLO through exact name matching followed by manual verification. Because a cell line may have several synonyms, the lexical mapping between the two resources compared both the default label and the synonyms. The data types related to these cell lines are listed in Fig. 2a. Meanwhile, the ChEMBL database included 637 cell line entries that have LINCS IDs. Of these, 451 also have CLO IDs, and 51 of the remaining 186 could be mapped to CLO using name matching. The ChEMBL data types available for these cell lines are shown in Fig. 2b.
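The lexical mapping step can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the helper names (`normalize`, `build_name_index`, `match_cell_line`), the case-folding and whitespace normalization, and the synonym lists are assumptions made for illustration. Only the CLO IDs and labels come from the text.

```python
def normalize(name: str) -> str:
    """Case-fold and collapse whitespace (an assumed normalization step)."""
    return " ".join(name.split()).casefold()


def build_name_index(clo_terms: dict) -> dict:
    """Map every normalized label or synonym to the CLO IDs that carry it."""
    index = {}
    for clo_id, term in clo_terms.items():
        for name in [term["label"], *term.get("synonyms", [])]:
            index.setdefault(normalize(name), set()).add(clo_id)
    return index


def match_cell_line(names: list, index: dict) -> set:
    """Return candidate CLO IDs whose label or synonym matches any given name."""
    hits = set()
    for name in names:
        hits |= index.get(normalize(name), set())
    return hits


# Toy CLO records: IDs and labels are from the text, synonyms are illustrative.
clo_terms = {
    "CLO_0003684": {"label": "HeLa cell", "synonyms": ["HeLa"]},
    "CLO_0003704": {"label": "Hep G2 cell", "synonyms": ["HepG2", "Hep G2"]},
}
index = build_name_index(clo_terms)

# A LINCS record would contribute its default name plus any synonyms.
candidates = match_cell_line(["HepG2"], index)
```

Candidates returned by such a lookup would still go through manual verification, since an identical name does not guarantee an identical cell line.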
Among the 1097 LINCS cell lines each with a unique LINCS cell line ID (e.g., LCL-1512 for HeLa cell), 466 had ChEMBL, LINCS, and CLO IDs, 279 had both LINCS and CLO IDs, and 352 LINCS cell lines did not have any CLO IDs.
Note that one LINCS ID sometimes maps to multiple CLO IDs. For example, the Hep G2 cell line (http://www.atcc.org/products/all/HB-8065.aspx) has the LINCS ID LCL-1925, and it is mapped to three CLO IDs: CLO_0003704 (label: Hep G2 cell), CLO_0050856 (label: RCB1648 cell), and CLO_0050858 (label: RCB1886 cell). Although their annotations show that all three are Hep G2 cells, CLO_0003704 was originally assigned based on an annotation from the ATCC cell line repository, and the other two come from the Japan RIKEN cell line bank with different registry information. In the current CLO, we assert the two RIKEN cell line cell terms as subclasses of CLO_0003704, considering that these cells may have accumulated genetic variations over their long passage history. In this case, all three CLO cell line cell terms have the same LINCS cell line ID, which is recorded using the annotation property ‘Cell line LINCS ID’.
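The one-to-many case can be made concrete with a small Python sketch that groups CLO terms by their ‘Cell line LINCS ID’ annotation value. The tuple layout and variable names are illustrative assumptions; the IDs and labels are those given in the text.

```python
from collections import defaultdict

# (CLO ID, label, 'Cell line LINCS ID' annotation value), as stated in the text.
annotated_terms = [
    ("CLO_0003704", "Hep G2 cell", "LCL-1925"),
    ("CLO_0050856", "RCB1648 cell", "LCL-1925"),
    ("CLO_0050858", "RCB1886 cell", "LCL-1925"),
    ("CLO_0003684", "HeLa cell", "LCL-1512"),
]

# Group CLO terms by the LINCS ID they are annotated with.
clo_by_lincs = defaultdict(list)
for clo_id, _label, lincs_id in annotated_terms:
    clo_by_lincs[lincs_id].append(clo_id)

# LINCS IDs that are attached to more than one CLO term.
one_to_many = {k: v for k, v in clo_by_lincs.items() if len(v) > 1}
```

In this toy set, four annotated CLO terms correspond to only two distinct LINCS IDs, which is why a count of annotated CLO terms can exceed the number of LINCS cell lines.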
CLO modeling and design pattern generation
In CLO, the basic unit for representing a cell line is the term ‘cell line cell’, which is defined as “a cultured cell that is part of a cell line - a stable and homogeneous population of cells with a common biological origin and propagation history in culture” [8]. As shown in Fig. 1, the new cell line information identified from the LINCS project and the ChEMBL database consists of various names, descriptions, and data resource IDs. Such information can be effectively represented as specific annotation types. This strategy is reflected in a simple CLO design pattern model (Fig. 3), which was generated based on the general CLO design pattern reported in the original CLO paper [8].
For example, for the HeLa cell (CLO_0003684), based on the updated design pattern, we added the following information to CLO: ‘Cell line LINCS ID: LCL-1512’ and ‘seeAlso: EFO: EFO_0001185; CHEMBL: CHEMBL3308376; CVCL: CVCL_0030’.
Most LINCS cell lines were originally derived from human patients with specific cancers, and many of these diseases were not included in CLO. In this study, we imported the corresponding disease terms from the Human Disease Ontology (DOID) [20]. To represent the relation between a cell line cell and a disease, we generated a new object property called ‘is disease model for’ (CLO_0000179). For example, for the HeLa cell, an OWL SubClassOf axiom was generated to represent its usage in studying cervical adenocarcinoma:
‘HeLa cell’ (CLO_0003684): ‘is disease model for’ some ‘cervical adenocarcinoma’
It is noted that in CLO, the new object property ‘is disease model for’ is equivalent to the original object property ‘is model for’, an object property originated by the EBI cell line project (http://www.ebi.ac.uk/cellline#is_model_for) [8]. The EBI cell line project relation is obsolete. Replacing the obsolete legacy object property ‘is model for’ with the new CLO relation supports the ontology updating and standardization.
The direct link between a disease and a cell line used as a model to study that disease is required by the LINCS data structure. In addition to this direct link, CLO also represents the origin of a cell line in a formal ontological manner. The disease that is modeled by a cell line is often the disease of the particular human patient from whom the first passage of the cell line was originally generated. For example, the HeLa cell originated from cervical adenocarcinoma cells isolated in 1951 from a cervical cancer patient, an African American woman [21]. To represent the relation between the disease and the patient (the original source of the cell line), the Fig. 4 design pattern was applied. For example, the following OWL SubClassOf axiom represents a human-cell relation for the HeLa cell in CLO:
‘HeLa cell’: ‘derives from’ some (‘epithelial cell’ and (part_of some (‘uterine cervix’ and (part_of some (‘Homo sapiens’ and (‘has disease’ some ‘cervical adenocarcinoma’))))))
It is noted that HeLa cell is listed in CLO as a subclass of ‘immortal human uterine cervix-derived epithelial cell line cell’ (CLO_0000636), where the relation between human and the cell is clearly stated.
The above axiom also shows that the long chain of relations (i.e., cell line cell – cell type – tissue – organ – organism – disease) makes it technically inefficient to query the relation between the cell line cell and the disease ‘cervical adenocarcinoma’. A shortcut relation (or object property) replaces a chain of multiple relations and classes with a single relation between two classes. Therefore, a new shortcut relation ‘derives originally from patient having disease’ (Fig. 4) was generated to directly link the cell line cell and the disease, as shown in the following OWL SubClassOf axiom:
‘HeLa cell’: ‘derives originally from patient having disease’ some ‘cervical adenocarcinoma’
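The benefit of the shortcut can be sketched in Python on a toy edge list. This is only an illustration under strong assumptions: class-level “some” restrictions are flattened into plain edges, and the helper names `follow_chain` and `add_shortcuts` are hypothetical. It is not how CLO or an OWL reasoner stores or derives the relation.

```python
# Toy edge list mirroring the long HeLa axiom: (subject, relation, object).
# Flattening OWL existential restrictions into plain edges is a simplification.
edges = [
    ("HeLa cell", "derives from", "epithelial cell"),
    ("epithelial cell", "part_of", "uterine cervix"),
    ("uterine cervix", "part_of", "Homo sapiens"),
    ("Homo sapiens", "has disease", "cervical adenocarcinoma"),
]

CHAIN = ["derives from", "part_of", "part_of", "has disease"]
SHORTCUT = "derives originally from patient having disease"


def follow_chain(start, relations, edges):
    """Follow the relations in order and return every node reached at the end."""
    frontier = {start}
    for relation in relations:
        frontier = {o for s, r, o in edges if r == relation and s in frontier}
    return frontier


def add_shortcuts(cell_lines, edges):
    """Add one shortcut edge per (cell line, disease) pair found via the chain."""
    return edges + [
        (cell, SHORTCUT, disease)
        for cell in cell_lines
        for disease in sorted(follow_chain(cell, CHAIN, edges))
    ]


expanded = add_shortcuts(["HeLa cell"], edges)

# With the shortcut in place, the disease is a single-hop lookup.
diseases = {o for s, r, o in expanded if s == "HeLa cell" and r == SHORTCUT}
```

Without the shortcut, every query has to walk four relations; with it, the same answer is available from one edge.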
Although ‘is disease model for’ and ‘derives originally from patient having disease’ both represent a relation between a cell line cell and a disease, the two relations differ in meaning. The shortcut relation ‘derives originally from patient having disease’ states that the cell line cell was originally derived from a patient with a specific disease. The relation ‘is disease model for’ indicates that the cell line can be used to study a specific disease, and this disease can be, but does not have to be, the same as the disease of the patient from whom the cell line cell was derived. For example, the HeLa cell can be used as a cell line model to study cervical adenocarcinoma, but it can also be used to study many other diseases such as polio and Newcastle disease [21, 22].
CLO modeling of cell features under regular cell culture conditions
In this study, we used the MCF 10A cell line cell as an example to show how CLO can be used to model cell features.
MCF 10A cell line cell is non-tumorigenic [23]. CLO represents such knowledge using the following OWL SubClassOf axiom:
‘MCF 10A’: has_quality some non-tumorigenic
where the non-tumorigenic is represented as a quality, and the relation between MCF 10A cell line cell and the quality can be represented using the object property has_quality.
MCF 10A cell line cells exhibit three dimensional growth in collagen and form domes in confluent cultures [23]. We can use the following OWL SubClassOf axiom to represent such knowledge:
‘MCF 10A’: ‘participates in’ some (‘three dimensional cell growth’ and (‘has participant’ some collagen))
In this case, ‘three dimensional cell growth’ (CLO_0037311) is a process, and both MCF 10A cell line cell and collagen are participants of such a process. Since GO does not have such a ‘three dimensional cell growth’ term, we generated the term using a tentative CLO ID (CLO_0037311) and listed it as a subclass of the ‘cell growth’. Here the collagen is a component needed for the three dimensional cell growth. Collagen (CHEBI_3815) is a group of fibrous proteins of very high tensile strength that form the main component of connective tissue in animals.
CLO modeling of cellular responses to special agent treatments
How can CLO represent the response of a cell line cell to a specific agent that is not part of the regular cell culture medium? Here we again use the MCF 10A cell line cell as an example.
It is known that MCF 10A mammary epithelial cells undergo apoptosis following actin depolymerization. The MCF 10A response can be represented in the following OWL SubClassOf axiom:
‘MCF 10A’: ‘participates in’ some (‘apoptotic process’ and ‘preceded by’ some ‘actin depolymerization’ and (‘induced by cell culture reagent’ some latrunculin-A))
In this case, the apoptotic process is represented by a GO term (GO_0006915). This process occurs in MCF 10A cells after actin depolymerization (GO_0030042) is induced by the cell culture reagent latrunculin A (CHEBI_69136), a bicyclic macrolide natural product consisting of a 16-membered bicyclic lactone attached to the rare 2-thiazolidinone moiety [24].
Sometimes, cell line cells are genetically engineered by transfection to generate a new cell line. A transfection process deliberately introduces naked or purified nucleic acids into eukaryotic cells such as cell line cells. For example, the MCF10A-Er-Src cell line cell is derived from the MCF 10A cell through transfection. As a result, the MCF10A-Er-Src cell contains ER-Src, a derivative of the Src kinase oncoprotein that is fused to the ligand-binding domain of the estrogen receptor (ER). The MCF10A-Er-Src cell line cell is therefore not a subtype of the MCF 10A cell: the transfection process makes the new cell an MCF 10A-derived cell type rather than a subtype of MCF 10A per se. Specifically, CLO represents the formation of the new MCF10A-Er-Src cell line cell with the following OWL SubClassOf axiom:
‘MCF10A-Er-Src cell’: ‘is specified output of’ some (‘cell line cell transfection’ and (‘has specified input of’ some ‘MCF 10A cell’))
Integration of LINCS cell line information into CLO
Based on the mapping and the design pattern models (Figs. 3 and 4), extra data available in the LINCS Data Portal and ChEMBL were integrated into CLO. To improve efficiency, manual annotation and editing were combined with an Ontorat [16]-assisted automated process.
The new information added to CLO includes two parts:
(1) Newly obtained data (Fig. 2), e.g., LINCS cell line IDs and disease information, were added to 795 existing CLO cell line cell terms. All the disease information was mapped to the Human Disease Ontology (DOID) [20].
(2) 352 LINCS cell lines unavailable in CLO were newly added to CLO. Each of these cell lines was assigned a new non-redundant CLO ID based on the CLO cell line naming convention [8]. The parent terms of these newly added CLO cell lines were determined by the cell type, tissue, organ, and organism. All the cell lines were found to be derived from humans. The diseases of the source patients were mostly primary cancers; three cell lines were derived from patients with benign tumors.
LINCS-CLOview: LINCS cell line subset of CLO
A CLO subset of LINCS cell lines, abbreviated as LINCS-CLOview, was generated. LINCS-CLOview can be considered a “community view” [25], or a slim, of CLO’s implementation of LINCS cell lines for the LINCS research community. As of July 1, 2017, LINCS-CLOview contained 1924 terms, including 1825 classes, 25 object properties, 61 annotation properties, and 13 instances. These terms include 1315 cell line cell terms with CLO IDs. The other terms were imported from 17 other ontologies, for example, the Basic Formal Ontology (BFO) [26], the Cell Type Ontology (CL) [3], and the Ontology for Biomedical Investigations (OBI) [4]. The LINCS-CLOview source code is included in the master CLO GitHub repository. The detailed statistics of LINCS-CLOview are available at: http://www.ontobee.org/ontostat/LINCS-CLOview.
As a subset of CLO, LINCS-CLOview has the same hierarchical structure and design patterns as the CLO. BFO is the top-level ontology with which CLO is aligned. Since BFO is also the top-level ontology for over 100 ontologies (e.g., CL and OBI), such an alignment makes LINCS-CLOview easily integrated with other ontologies, such as CL for cell types, and OBI for cell line related processes.
SPARQL query of LINCS-CLOview information
The Ontobee SPARQL web query program can be used to conveniently query detailed information in LINCS-CLOview. For example, Ontobee SPARQL was used to count the cell line cells that have LINCS cell line IDs (i.e., LCL-xxxx) (Fig. 5). The script recursively queries all class terms under the branch of ‘cell line cell’ (CLO_0000001) in LINCS-CLOview, identifies those terms having the ‘Cell line LINCS ID’ annotation (CLO_0000178), and counts these cell line cell terms. As shown in the figure, the number of unique cell line cell terms with LINCS cell line IDs in LINCS-CLOview (or CLO) is 1133. This number is greater than the 1097 LINCS cell lines extracted in our process because one LINCS ID may be mapped to more than one cell line in CLO, as indicated at the beginning of the Results section. If the LINCS cell line IDs are not considered, the cell line cell branch in the LINCS community view of CLO contains 1541 cell line cell terms. The difference between these two numbers reflects the many intermediate-layer cell line cell terms between the LINCS cell lines (with LINCS IDs) and ‘cell line cell’ (CLO_0000001) in LINCS-CLOview.
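The counting logic can be sketched as follows. The SPARQL text is a hypothetical query written in the spirit of the description above; it is not the script shown in Fig. 5, and it assumes no particular graph URI. The Python part reproduces the same logic on a toy hierarchy whose parent links are simplified, with real intermediate terms mostly omitted.

```python
from collections import deque

# Hypothetical SPARQL in the spirit of the described query (not the Fig. 5 script).
COUNT_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:  <http://purl.obolibrary.org/obo/>
SELECT (COUNT(DISTINCT ?cell) AS ?n)
WHERE {
  ?cell rdfs:subClassOf* obo:CLO_0000001 .
  ?cell obo:CLO_0000178 ?lincs_id .
}
"""

ROOT = "CLO_0000001"  # 'cell line cell'

# Toy subclass links (child -> parents); the Hep G2 parent is simplified.
parents = {
    "CLO_0000636": ["CLO_0000001"],  # intermediate-layer term
    "CLO_0003684": ["CLO_0000636"],  # HeLa cell
    "CLO_0003704": ["CLO_0000001"],  # Hep G2 cell
    "CLO_0050856": ["CLO_0003704"],  # RCB1648 cell
    "CLO_0050858": ["CLO_0003704"],  # RCB1886 cell
}
# 'Cell line LINCS ID' annotation values taken from the text.
lincs_ids = {
    "CLO_0003684": "LCL-1512",
    "CLO_0003704": "LCL-1925",
    "CLO_0050856": "LCL-1925",
    "CLO_0050858": "LCL-1925",
}


def terms_under(root, parents):
    """Breadth-first walk over subclass links; the root itself is excluded."""
    children = {}
    for child, supers in parents.items():
        for sup in supers:
            children.setdefault(sup, []).append(child)
    seen, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


branch = terms_under(ROOT, parents)
with_lincs_id = {t for t in branch if t in lincs_ids}
distinct_lincs_ids = {lincs_ids[t] for t in with_lincs_id}
```

In the toy data, the branch holds five terms, four of which carry a LINCS ID, yet only two distinct LINCS IDs exist. The same three-way gap appears in the real numbers: all terms in the branch, terms with LINCS IDs, and LINCS cell lines.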
In this study, different SPARQL scripts were developed and used to analyze the LINCS cell lines from various aspects. An example of such SPARQL analysis is illustrated in the next section.
Analysis of LINCS cell lines by querying LINCS-CLOview
With the availability of LINCS-CLOview, we were able to analyze LINCS cell lines from different aspects. The tools used in our analyses include SPARQL-based queries, Protégé OWL editor visualization, and Ontobee statistics display and queries. Below we describe our analyzed results from three main aspects: related diseases, cell types, and tissues/organs.
Our study found that LINCS cell lines are associated with 121 diseases. These 121 diseases include three benign neoplasms, i.e., breast fibrocystic disease (associated with MCF 10A and MCF 10F cells), kidney angiomyolipoma (associated with the 621-101 cell), and male reproductive organ benign neoplasm (associated with the BPH-1 cell). The other 118 diseases are various types of cancers. Fig. 6 shows a hierarchical DOID structure of organ system cancers related to these LINCS cell lines.
The hierarchical structure of DOID (Fig. 6) helps in understanding the diseases associated with LINCS cell lines. For example, Fig. 6 shows that 8 LINCS cell lines (e.g., HeLa cell) were derived from patients with cervical adenocarcinoma, 1 with cervical clear cell adenocarcinoma (a specific type of cervical adenocarcinoma), and 6 with cervical squamous cell carcinoma. These diseases all belong to cervix carcinoma. In addition, ‘cervix carcinoma’ is directly associated with 2 LINCS cell lines (i.e., C-33 A and C-4 II cell line cells). Therefore, to study the cellular signatures of cervix carcinoma, we would consider all 17 cell lines (8 + 1 + 6 + 2) rather than only the 2 cell lines directly annotated as derived from a patient having cervix carcinoma.
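The roll-up over the disease hierarchy can be checked with a short Python sketch. The direct counts are those reported above; the parent links are a simplified assumption in which any intermediate DOID terms are omitted, and the function names are illustrative.

```python
# Direct counts of LINCS cell lines per disease, as reported in the text.
direct_counts = {
    "cervix carcinoma": 2,
    "cervical adenocarcinoma": 8,
    "cervical clear cell adenocarcinoma": 1,
    "cervical squamous cell carcinoma": 6,
}

# Simplified is-a links (child -> parent); intermediate DOID terms are omitted.
parent_of = {
    "cervical adenocarcinoma": "cervix carcinoma",
    "cervical clear cell adenocarcinoma": "cervical adenocarcinoma",
    "cervical squamous cell carcinoma": "cervix carcinoma",
}


def is_under(disease, ancestor):
    """True if disease equals ancestor or reaches it by following parent links."""
    while disease is not None:
        if disease == ancestor:
            return True
        disease = parent_of.get(disease)
    return False


def rolled_up_count(ancestor):
    """Sum the direct counts of the ancestor and all of its descendants."""
    return sum(n for d, n in direct_counts.items() if is_under(d, ancestor))


total = rolled_up_count("cervix carcinoma")
more_specific = total - direct_counts["cervix carcinoma"]
```

Here `total` is 17, and `more_specific` is 15, the number of cell lines annotated with a disease more specific than cervix carcinoma.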
To further illustrate the usage of LINCS-CLOview, we generated a SPARQL script that queries the cell lines originally derived from human patients having more specific disease names under cervix carcinoma (Fig. 7). Consistent with Fig. 6, our query identified 15 new cell line cell types (e.g., HeLa cell line cell) that belong to this category, and 5 identified cell line cell types are shown in Fig. 7.
We also examined the tissue and organ types from which the LINCS cell lines were derived. In CLO, the multi-species anatomy ontology UBERON [27] is used to represent tissues and organs. In total 131 UBERON terms have been used in LINCS-CLOview to refer to various anatomic locations from which LINCS cell lines were derived. A part of the UBERON structure is illustrated in Fig. 8.
The cell types of LINCS cell lines were also analyzed. The Cell Type Ontology (CL) [3] is used in CLO to represent the cell types of different cell lines. In total, 43 CL cell types, such as epithelial cell, B cell, and T cell, are included in LINCS-CLOview. Each of these cell types is linked to different cell line cells or to the parent terms of cell line cells. For a project studying cellular signatures related to a specific cell type, LINCS-CLOview provides a feasible way to identify which cell line cells to use.