Conceptual modeling (CM) is the activity of describing aspects of the world for the purpose of understanding and communication [1]. Regardless of the research area, it answers fundamental questions by identifying what concepts are relevant and their relationships with each other. Conceptual models make mental representations of the world explicit, which helps to establish common ontological frameworks that facilitate both communication and knowledge evolution in complex domains [2].
An example of such a complex and vast domain is genomics, where understanding the genome and all of its intrinsic relationships in order to decipher the code of life is a huge challenge. There are two main reasons for the complexity of the genomic domain. The first one is the existence of relevant concepts that are not clearly defined. Even the definitions of the most elemental concepts, such as the concept of “gene”, are open to discussion [3]. The second one is that it is an ever-changing domain, with new knowledge emerging continuously [4]. Therefore, the genomic domain is a particularly good candidate for applying CM techniques.
Our research group has extensive experience applying CM techniques in the genomic domain. For years, our main goal has been to understand, identify, and conceptualize those elements that are relevant in this domain. As a result of this work, two different conceptual schemes have been generated. The first one is the Conceptual Schema of the Human Genome (CSHG), which is intended to improve Precision Medicine and genetic diagnosis. The second one is the Conceptual Schema of the Citrus Genome (CSCG), intended to identify the genetic cause of relevant phenotypes in the agri-food field.
The CSHG is proof of how CM can help to improve domain understanding and communication. For years, our main research line has focused on humans, and the CSHG has provided a more explicit and precise understanding of the human genome. The CSHG has proven to be valid and useful, and it has been successfully applied in a series of real-world use cases. Among the most recent works, the following can be highlighted:
- Identifying and managing genomic variations related to the treatment of Alzheimer’s disease [5].
- Developing a CM-based framework to improve the data quality processes of precision medicine [6].
- Reporting of early diagnosis of alcohol sensitivity [7].
- Identifying variations with a relevant role in the development of colorectal cancer [8].
- Developing Genome Information Systems to: (i) improve the diagnosis of congenital cataracts [9], (ii) support the prioritization of variations [10], (iii) support the variation annotation process [11], and (iv) increase interaction and collaboration in the process of diagnosing genetic diseases [12].
In parallel with the above research, we have carried out additional work in a special use case in a new and different context in which the subject under study is not humans but citrus. A new Conceptual Schema (CS) has been developed for the citrus genome with the collaboration of the Valencian Institute of Agrarian Research (IVIA). This CS serves the IVIA as ontological support to better perform their research regarding the genetic improvement of crops. The workflow developed is composed of the following steps: Step 1: plant genome sequencing is performed; Step 2: variations of interest are identified; Step 3: genes of interest to improve a desired citrus trait (e.g., drought resistance) are identified; Step 4: genetic modification techniques are applied to citrus crops [13].
The genome is what explains what we humans understand by life on our planet. Since we share a common conceptual background, genome representation is a problem that affects all species of living beings. However, having a CS that only focuses on the human genome can be seen as a limitation in this context. A study of any different species could require a new CS to be created, or adapted, in order to appropriately cover its particularities. Our research group has developed two different conceptual schemes: the CSHG and the CSCG. However, their conceptual background is assumed to be the same because in both cases we are talking about the genome.
Conceptual Schema of the Human Genome
For years, the creation of a CSHG has been the main goal and a fundamental tool of our work [14]. The result is a CS that is divided into multiple views, which has provided us with a fundamental tool to communicate more effectively with domain experts and to develop Genome Information Systems following a Model-Driven Development approach. As more knowledge about the genomic fundamentals of life has accumulated, the CSHG has evolved in parallel. The CSHG has had two major updates.
Version 1 of the CSHG was the first attempt to generate a holistic conceptualization of the genomic domain. This version precisely defined the most basic concepts of the domain, ignoring some of the more complex aspects (e.g., pseudo-genes or proteins coded by multiple genes). Version 1 focused on characterizing genes, their mutations, and their phenotypic aspects. It was divided into three views: Gene-Mutation, Genome, and Transcription. The Gene-Mutation view modeled the concept of gene. It characterized its fundamental parts, including gene variations (e.g., insertions or deletions), regulatory elements (e.g., promoters or terminators), and sequence parts (e.g., coding regions). The Genome view modeled individual genomes. This view offered a general perspective of the genome, structuring genomes into chromosomes and chromosomes into segments. Chromosome segments were divided into genic segments and non-genic segments. The Transcription view modeled the protein-coding process. This view included primary transcripts, exons, introns, spliced transcripts, open reading frames, and proteins.
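As an illustration, the structure of the Genome view described above (genomes made of chromosomes, chromosomes made of genic and non-genic segments) can be sketched with a few Python data classes. This is only a minimal sketch and not the actual CSHG: the class names, the attributes (`start`, `end`, `gene_name`), and the example values are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChromosomeSegment:
    """A region of a chromosome (attributes are hypothetical)."""
    start: int
    end: int


@dataclass
class GenicSegment(ChromosomeSegment):
    """A segment that contains a gene; the gene name is a placeholder."""
    gene_name: str = ""


@dataclass
class NonGenicSegment(ChromosomeSegment):
    """A segment that does not contain a gene."""


@dataclass
class Chromosome:
    name: str
    segments: List[ChromosomeSegment] = field(default_factory=list)

    def genic_segments(self) -> List[GenicSegment]:
        # Keep only the segments that are specialized as genic.
        return [s for s in self.segments if isinstance(s, GenicSegment)]


@dataclass
class Genome:
    """An individual genome, structured in chromosomes."""
    chromosomes: List[Chromosome] = field(default_factory=list)


# Hypothetical example data.
chr1 = Chromosome("1", [GenicSegment(100, 500, "GENE_A"), NonGenicSegment(501, 900)])
genome = Genome([chr1])
genic = genome.chromosomes[0].genic_segments()
```

Modeling genic and non-genic segments as specializations of one segment class mirrors the division stated in the text; other encodings (for example, a single class with a type attribute) would be equally possible.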
The first major update of the CSHG (from version 1 to version 1.1) included the Phenotype view in the model. Version 1.1 provided a more consistent CS. Syndromes (pathologic phenotypes) were included in the model with a multi-level classification based on their severity and characteristics. Genotype information was linked to phenotype information, in order to better model the effects of variations in the genome. The Phenotype view is linked to variations to indicate whether a variation is responsible for modifying a phenotypic aspect.
The second major update of the CSHG (from version 1.1 to version 2) changed how the genome sequence is comprehended and represented. This version changed the perspective from gene-focused to chromosome-focused. The concept of gene is no longer the main element of the genome because it is not always feasible to describe the DNA structure in terms of genes. Instead, the main element is the chromosome element. This change allows any relevant part of the genome (not just genes) to be easily represented and characterized. Three more changes were made to the CSHG. The first change is that the Genome view was removed from Version 2 because the genome of each individual human should be clearly differentiated from the generic human genome. This change allows genomic analysis to be performed more easily. The second change is the explicit representation of Single Nucleotide Polymorphisms (SNPs). The third change is the addition of pathways. Pathways are represented as inter-dependent events where a set of inputs produces a set of outputs.
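The pathway representation mentioned above (inter-dependent events, each turning a set of inputs into a set of outputs) can be illustrated with a small Python sketch. It is not the CSHG pathway view itself: the `Event` class, the `dependencies` helper, and the entity names `A` to `D` are hypothetical, and the rule used here (an event depends on another when it consumes one of its outputs) is an assumption about what "inter-dependent" means.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Event:
    """A pathway event: a set of input entities produces a set of outputs."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]


def dependencies(events: List[Event]) -> Dict[str, List[str]]:
    """Map each event name to the events whose outputs it consumes."""
    deps: Dict[str, List[str]] = {}
    for consumer in events:
        deps[consumer.name] = [
            producer.name
            for producer in events
            if producer is not consumer and producer.outputs & consumer.inputs
        ]
    return deps


# Hypothetical two-event pathway: step_2 consumes entity B produced by step_1.
pathway = [
    Event("step_1", frozenset({"A"}), frozenset({"B"})),
    Event("step_2", frozenset({"B", "C"}), frozenset({"D"})),
]
deps = dependencies(pathway)
```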
This last version of the CSHG is divided into five views: (i) the structural view describes the structural parts that determine the sequence of the genome; (ii) the transcription view models the elements that take part in the protein-coding process; (iii) the variation view focuses on the structural changes in the genome sequence; (iv) the pathway view breaks down metabolic pathways into their fundamental events, specifying the entities that take part in them; and (v) the bibliography and databank view provides information regarding the origin of the data.
Conceptual Schema of the Citrus Genome
Citrus is a particularly relevant crop. It is cultivated worldwide with a production of more than 100 million tonnes. The Citrus genus includes oranges, lemons, grapefruits, and pummelos, among others. In total, it is composed of more than 1600 species. Citrus genome resources are abundant. The first citrus variety was sequenced in 2003 [15] and several genomes have been sequenced since then. By the time this article was written, more than 67 species had been sequenced multiple times, with more than 200,000 genes identified (https://www.citrusgenomedb.org/data_overview/1). Comparative genomics comprises a broad set of analyses that cover variations of gene content, large genome rearrangements, structural variants, and small polymorphisms. Our use case focuses on studying small polymorphisms, more specifically, single nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs). These variations are of great interest for plant breeding because they have proven to be critical determinants of major traits of agricultural interest.
Unlike the CSHG, which was developed to be as generic as possible in order to serve multiple use cases, the CSCG was developed for a specific use case. Because of that, the modeling process and the philosophy of the resulting CS are notably different. The use case that motivated the generation of the CSCG consists of establishing reliable genotype-phenotype relationships, i.e., the observable traits in the varieties (phenotype) that are caused by the genetic code (genotype). An example is the variations in the genetic code that make a variety drought resistant. This type of study is significantly different from the ones that we had previously carried out on the human genome. In the case of the human genome, the studies focus on identifying relevant variations (i.e., variations that are known to cause a given condition) in populations, especially for clinical purposes in precision medicine, where early diagnosis and selection of the right treatment become the main goals. In the case of the citrus genome, experts focus on identifying which variations are relevant (i.e., discovering which variations cause a given condition).
In citrus studies, it is crucial to properly prioritize (i.e., identify and select) those variations that have an impact on the phenotype, specifically focusing on those variations that could have a notable impact on a trait of interest of citrus plants. These analyses are problematic and inefficient, and they involve several manual tasks that are slow, difficult to perform, and prone to human error. We have grouped these tasks in a four-step workflow:
- Step 1: Plant genome sequencing. The genome sequence of citrus plants of interest is obtained and compared to a reference sequence. A set of identified variations is associated with each sequenced citrus variety. It is worth mentioning that several crops from the same species that have slightly different characteristics are sequenced.
- Step 2: Identification of variations of interest. The variations that could have a potential link with phenotypes of interest are identified through orthology prediction and statistical methods. The identification process is divided into three tasks:
  1. Select Variety Groups: There are tens of sequenced citrus varieties. Working with multiple varieties is a hard task because of the huge amount of data that each one has associated with it. In order to work with such a huge amount of data, bioinformaticians need to work with a subset of the data. The selection of this subset is done based on specific phenotypes of interest. Consequently, two groups are created in this task. The first group is composed of a set of citrus varieties that highly express a phenotype of interest. The second group is composed of a set of citrus varieties that do not express it. Examples of phenotypes of interest include fruit sweetness, resistance to drought, or the absence of premature fruit abscission.
  2. Compare Groups: The next task is to compare the two groups. There is a plethora of attributes and variables that can be used to filter the data prior to the comparison exercise. This filtering is crucial for two reasons. The first reason is to remove low-quality data and reduce noise. The second reason is to reduce the amount of genomic data in order to speed up the comparison. Examples of such filtering operations include establishing thresholds for data attributes such as the read depth or delimiting the region of the genome to be analyzed. Although comparing two citrus varieties or applying a single filter is challenging but feasible, the cost and complexity of these tasks increase dramatically as the number of varieties in the defined groups or the number of applied filters increases.
  3. Visualize Result: The number of identified variations can be unmanageable after comparing the groups. Users need to examine the data fluidly to identify potential genetic variations of interest. By examining, we mean visualizing how the data is distributed based on specific criteria and interactively analyzing it (e.g., showing or hiding data columns and performing data operations such as grouping, sorting, and pivoting).
- Step 3: Characterization of genes of interest. Genes of the sequenced citrus varieties that have their expression, efficiency, or functionality modified in a disruptive way by variations of interest are identified and analyzed. As a result, assumptions regarding potential genes of interest that require experimental validation emerge. Genes of interest are those that have a significant role in a phenotype of interest.
- Step 4: Application of genetic modifications. The previously obtained assumptions are validated by applying genetic modifications through molecular techniques.
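The three tasks of Step 2 can be sketched in Python as follows. This is a simplified illustration under explicit assumptions, not the procedure used by the domain experts: a variation is identified by a hypothetical `(chromosome, position, reference, alternative)` tuple, the only filter is a read-depth threshold, and the comparison criterion (variations present in every expressing variety and absent from every non-expressing variety) is an assumption, since the text does not fix a specific criterion. All variety names and values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical variation identifier: (chromosome, position, reference, alternative).
VariationKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class Call:
    """A variation observed in one variety, with its read depth."""
    variation: VariationKey
    read_depth: int


def select_groups(phenotypes: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Task 1: split varieties by whether they express the phenotype."""
    expressing = sorted(v for v, has in phenotypes.items() if has)
    non_expressing = sorted(v for v, has in phenotypes.items() if not has)
    return expressing, non_expressing


def filtered(calls: List[Call], min_depth: int) -> Set[VariationKey]:
    """Task 2 (filter): keep calls at or above the read-depth threshold."""
    return {c.variation for c in calls if c.read_depth >= min_depth}


def compare_groups(
    calls_by_variety: Dict[str, List[Call]],
    expressing: List[str],
    non_expressing: List[str],
    min_depth: int,
) -> List[VariationKey]:
    """Task 2 (compare): assumed criterion, see the lead-in paragraph."""
    sets = {
        v: filtered(calls_by_variety[v], min_depth)
        for v in expressing + non_expressing
    }
    shared = set.intersection(*(sets[v] for v in expressing)) if expressing else set()
    for v in non_expressing:
        shared -= sets[v]
    return sorted(shared)


def group_by_chromosome(variations: List[VariationKey]) -> Dict[str, List[VariationKey]]:
    """Task 3: a simple grouping operation to support visualization."""
    grouped: Dict[str, List[VariationKey]] = {}
    for var in variations:
        grouped.setdefault(var[0], []).append(var)
    return grouped


# Invented example data.
phenotypes = {"variety_a": True, "variety_b": True, "variety_c": False}
calls_by_variety = {
    "variety_a": [Call(("chr1", 100, "A", "G"), 30), Call(("chr2", 50, "T", "C"), 25)],
    "variety_b": [Call(("chr1", 100, "A", "G"), 40), Call(("chr2", 50, "T", "C"), 5)],
    "variety_c": [Call(("chr3", 10, "G", "GA"), 35)],
}
expressing, non_expressing = select_groups(phenotypes)
candidates = compare_groups(calls_by_variety, expressing, non_expressing, min_depth=10)
by_chromosome = group_by_chromosome(candidates)
```

In this invented data set, the low read depth of the `chr2` call in `variety_b` removes it before the comparison, which shows how filtering both reduces noise and shrinks the data to compare. Each additional variety adds one more set to intersect or subtract, which hints at why the cost grows with the size of the groups.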
The generated knowledge is highly valuable to researchers because it allows citrus varieties to be modified so that they can potentially increase or decrease the level of expression of phenotypes of interest. However, as a consequence of the complexity of these tasks, extracting knowledge is slow and requires considerable effort.
An additional aspect is to study the implications of relevant variations in citrus varieties at an evolutionary level. The origin of citrus has been a matter of controversy [15]. Nevertheless, the phylogeny of ancestral species and their relationship with domesticated varieties have been determined using genomic, phylogenetic, and biogeographic analyses [16]. The findings of Wu et al. indicate that evolutionary relationships between species of the same genus should be taken into account, which raises the need for conceptualizing their underlying mechanisms. Our approach is the first one that ontologically defines these aspects in a CS. These relationships are modeled by describing the orthology group concept, which allows us to infer relationships between citrus species in both genes and proteins.
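To make the orthology group concept more concrete, the following Python sketch groups genes from different species and retrieves, for a given gene, its counterparts in other species. It is only an assumed, simplified reading of the concept and does not reproduce the CSCG: the class and method names are hypothetical, the gene and group identifiers are invented, and proteins are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Gene:
    gene_id: str   # invented identifier
    species: str


@dataclass
class OrthologyGroup:
    """A set of genes from different species assumed to share an ancestor."""
    group_id: str
    members: List[Gene] = field(default_factory=list)

    def species(self) -> Set[str]:
        # Species that are related to each other through this group.
        return {g.species for g in self.members}

    def orthologs_of(self, gene: Gene) -> List[Gene]:
        # Members of the group that belong to a different species.
        return [g for g in self.members if g.species != gene.species]


# Invented example: one gene in each of two citrus species.
gene_s = Gene("gene_s1", "Citrus sinensis")
gene_c = Gene("gene_c1", "Citrus clementina")
group = OrthologyGroup("og_1", [gene_s, gene_c])
related = group.orthologs_of(gene_s)
```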
When working with citrus domain experts, we noticed that they work more with technologically-oriented data than with purely biological data. For instance, they rely on the use of variant annotations and functional effect prediction software. This data mixes biological and non-biological information and is much more file-format oriented. This means that citrus data is stored as obtained, which results in a mix of different concepts. In our previous work with human genomic data, the genomic data that we accessed had been transformed by its maintainers into a specific model, increasing its abstraction and making it technology-agnostic; however, the citrus data that we worked with did not undergo this process. Thus, the information is tied to the technologies used and their limitations. For example, there is no distinction between data that indicates the quality of the sequencing process of variants and data that describes their biological significance. Consequently, there is a loss of semantics that limits domain understanding.
We are perfectly aware that, generally speaking, the genome provides the common, holistic knowledge to understand life as we perceive it on our planet, independently of any particular species. Nevertheless, our experience in the real working domains of human genome-based applications (e.g., precision medicine) in the CSHG case, and our experience in the case of analyzing links between genome variations and their associated phenotypes in the CSCG case have clearly shown us that the conceptual views that are used in these different working environments are different. Depending on the peculiarities of the problem under investigation, the relevant data that must be considered changes.
To deal with these particularities in the case of citrus, the CSCG was developed following a conceptual modeling method that emphasizes explicitly separating biological and non-biological data by adopting a multi-model-oriented approach. It proposes starting with a purely-biological CS to which additional non-biological conceptual schemes are appended. The resulting CS takes into account the intricate relationships between these two types of data, allowing us to recover the previously hidden semantics of the data. A full view of the CSCG can be seen in [17].
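The separation between biological and non-biological data can be illustrated with the following Python sketch, in which a purely biological class is kept free of technology-dependent attributes and the non-biological information is attached through separate classes. It is a minimal sketch under stated assumptions and not the CSCG: all class names, attributes, the tool name `hypothetical_predictor`, and the values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Variation:
    """Biological schema: what the variation is."""
    chromosome: str
    position: int
    reference: str
    alternative: str


@dataclass(frozen=True)
class SequencingEvidence:
    """Non-biological schema: how well the variation was observed."""
    variety: str
    read_depth: int
    quality: float


@dataclass(frozen=True)
class PredictedEffect:
    """Non-biological schema: output of an effect prediction tool."""
    tool: str
    effect: str


@dataclass
class AnnotatedVariation:
    """Links the biological core to the appended non-biological schemas."""
    variation: Variation
    evidence: List[SequencingEvidence] = field(default_factory=list)
    effects: List[PredictedEffect] = field(default_factory=list)

    def max_read_depth(self) -> int:
        # Sequencing quality is queried without touching biological data.
        return max((e.read_depth for e in self.evidence), default=0)


# Invented example values.
snp = Variation("chr1", 100, "A", "G")
record = AnnotatedVariation(
    snp,
    evidence=[SequencingEvidence("variety_a", 30, 0.99)],
    effects=[PredictedEffect("hypothetical_predictor", "missense")],
)
```

Keeping `Variation` independent of the other classes means that the same biological fact can be reused when the sequencing technology or the annotation tool changes, which is the kind of semantic separation that the text describes.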
Even though the scenarios that have motivated the generation of these conceptual schemes are different in their particularities, they do share common concepts. This led us to the question of whether each species actually needs a CS that is adapted to it specifically, or whether it is possible to have a single, holistic CS that works for every species and that can adapt to the idiosyncrasies of individual species.
Our work has been limited to the particularities of the selected working domains (the human genome and the citrus genome), where different genome components are considered to be relevant depending on the purpose of the corresponding data analytics. Nevertheless, it seems clear that the inner workings of the eukaryotic genome share the same underlying foundations (i.e., the genome of a eukaryotic cell consists of a set of chromosomes located in the nucleus with extrachromosomal DNA found in the mitochondria) [18]. For example, the spatial arrangement of the genome in eukaryotic species follows the same strategy, i.e., linear chromosomes [18]. Besides, centromeres and telomeres are composed of tandem arrays of repetitive sequences in eukaryotic cells [19] and, when compared to prokaryotic cells, eukaryotic cells have developed more complex and versatile regulatory strategies of DNA replication [20]. Also, gene orthology studies show that there are genes with similar functionality among species. In addition, low-level interactions of biological pathways (i.e., interactions among molecules in a cell leading to a specific product or cell change) change very little between closely related species.
In this work, we present a new CS, called the Conceptual Schema of the Genome (CSG), that is species-independent. The CSG provides a holistic perspective of the genome so that any specific working domain could have its conceptual view inferred from that global CS. The CSG is based on the two previously existing ones (i.e., the CSHG and the CSCG). It not only generates conceptual views to work in the human domain and the citrus domain, but it also potentially works with any eukaryote species.