The rapid growth of data in biology and biomedicine makes a complete analysis of this information by a researcher almost impossible without automated text-mining tools. It should be especially noted that scientific publications are the main source of information for the vast majority of molecular genetic databases, which together contain only a fraction of the verified facts about molecular interactions presented in publications. At present, the number of scientific publications (according to the PubMed database) exceeds 28 million, and the number of international patents exceeds 40 million, which allows them to be regarded as large text data. In fact, owing to the information explosion, a paradoxical situation has arisen in the biological sciences: part of the new knowledge presented in publications has no real chance of reaching most researchers. One solution is the development of computer tools that implement a full cycle of knowledge engineering and knowledge management, including automated extraction of knowledge from scientific publications, patents, and databases; formalization and accumulation of the extracted data; and provision of such data to end users for fundamental and practical tasks.
In particular, the integration of information in knowledge bases, with its further analysis, can enable the generation of new hypotheses and the derivation of novel knowledge about complex processes or mechanisms from information about the individual subsystems that constitute them. This permits the reconstruction of the molecular genetic mechanisms underlying the functioning of living systems, which is a necessary condition for addressing almost any important task in modern biology and biomedicine, including the search for drug targets, the assessment of the potential efficacy and toxicity of new drugs in pre-clinical trials, the identification of biomarker molecules for effective diagnostic systems, the search for candidate genes for genotyping, etc.
Thus, methods for the automated retrieval of information from unstructured textual data (text mining) have been actively developed; such methods are particularly popular in biology, biomedicine, and various medical applications, including clinical decision support [3, 4], systems and integrative biology, curation of biological/biomedical databases, and pharmacovigilance inspections. Modern text-mining techniques allow automated analysis of a wide range of information sources, including full texts of scientific publications, patents and article abstracts [8, 9], as well as electronic health records of patients and health-related data from social networks.
Automated recognition of the names of biological entities in natural-language texts is one of the primary tasks of any text-mining-based research. The aim of this task is to identify the names of objects of a specific type (proteins, genes, drugs, etc.) within raw text. Solving the name-recognition task involves a number of difficulties, primarily caused by the incompleteness of existing dictionaries of object names and by the lack of universal rules for naming newly discovered objects. Additional challenges arise from the high synonymy of biological object names, together with stylistic devices often used by authors, including anaphora, epiphora, coreferent mentions, etc. Existing name-recognition approaches can be classified into three main categories: dictionary-based methods, rule-based methods, and machine-learning methods. The ANDSystem tool combines rule-based and dictionary-based methods.
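As a minimal sketch of the dictionary-based component of such approaches (the gene identifiers and synonym lists below are invented for illustration), name recognition can be reduced to matching known surface forms against the text and resolving overlapping hits:

```python
import re

# Toy synonym dictionary: canonical identifier -> known surface forms.
# Entries are invented for illustration only.
GENE_DICT = {
    "TP53": ["TP53", "p53", "tumor protein 53"],
    "TNF":  ["TNF", "TNF-alpha", "tumor necrosis factor"],
}

def recognize_entities(text):
    """Return (canonical_id, surface_form, start, end) for dictionary hits."""
    hits = []
    for canonical, synonyms in GENE_DICT.items():
        for name in synonyms:
            # Word-boundary match, case-insensitive, to avoid partial-word hits.
            for m in re.finditer(r"\b" + re.escape(name) + r"\b",
                                 text, re.IGNORECASE):
                hits.append((canonical, m.group(0), m.start(), m.end()))
    # Resolve overlaps naively: keep the longest match at each start offset.
    best = {}
    for h in hits:
        if h[2] not in best or (h[3] - h[2]) > (best[h[2]][3] - best[h[2]][2]):
            best[h[2]] = h
    return sorted(best.values(), key=lambda h: h[2])

print(recognize_entities("p53 suppresses TNF-alpha signalling."))
```

A real system would also handle the issues named above: synonym incompleteness, ambiguous names shared by several objects, and coreferent mentions, none of which this sketch addresses.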
Methods for the retrieval of information on interactions between biological objects can likewise be divided into three main groups: (1) methods based on statistically significant co-occurrence of object names in texts; (2) rule-based methods; and (3) machine-learning methods. The main advantages of the first approach are its ease of implementation and the high completeness (recall) of the search, but its accuracy is not high. Moreover, such an approach cannot detect recently discovered interactions, information about which is not yet reflected in a large number of articles; nor can it determine specific parameters of interactions, such as the interaction type or its direction.
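The statistical core of the first group can be sketched as a hypergeometric test: given how many documents mention each object and how many mention both, it estimates the probability of observing at least that many co-occurrences by chance (the document counts below are invented):

```python
from math import comb

def cooccurrence_pvalue(n_docs, n_a, n_b, n_ab):
    """Hypergeometric tail probability P(X >= n_ab): the chance that two
    terms mentioned in n_a and n_b documents out of n_docs co-occur in at
    least n_ab documents purely by chance."""
    upper = min(n_a, n_b)
    total = comb(n_docs, n_b)
    return sum(comb(n_a, k) * comb(n_docs - n_a, n_b - k)
               for k in range(n_ab, upper + 1)) / total

# Invented example: 10,000 abstracts; a gene mentioned in 120 of them,
# a disease in 90, and both together in 15.
p = cooccurrence_pvalue(10_000, 120, 90, 15)
print(f"{p:.3g}")
```

Pairs with a p-value below a chosen threshold would be reported as putative associations; note that the test assigns no interaction type or direction, which is exactly the limitation mentioned above.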
Rule-based methods achieve a high level of accuracy of information retrieval but, at the same time, have relatively low completeness. An alternative approach to automated information retrieval that does not require manually created rules is machine learning, which has been widely applied in recent years. Examples of such methods are the naive Bayes classifier, decision trees, conditional random fields, and structured support vector machines, as well as deep-learning algorithms based on neural networks. In turn, these methods can be divided into three large categories: (1) learning with labelled data (supervised learning); (2) learning with unlabelled data (semi-supervised learning); and (3) hybrid methods based on integrated training schemes. Supervised-learning methods commonly require large corpora of textual data with annotated interactions. At present, the second group of machine-learning methods is developing actively, and special attention is being paid to those that use external knowledge bases for training, the so-called distant-supervision methods. Such methods combine the advantages of both supervised and unsupervised learning. An approach of this kind assumes that any sentence mentioning a pair of objects from the knowledge base most likely describes the interaction between these objects. To reduce errors arising because certain sentences from a positive training set may mention pairs of objects without describing their interactions, various approaches have been proposed, including multi-instance learning [19,20,21]. These approaches have performed well in tasks of identifying protein-protein interactions and gene-disease associations and in the analysis of catalytic reactions [13, 22,23,24,25,26].
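The distant-supervision labelling step described above can be sketched as follows (the knowledge-base pairs, sentences, and entity annotations are invented; real systems work with far larger corpora and run entity recognition first):

```python
# Toy knowledge base of known interacting pairs (invented for illustration).
KNOWN_INTERACTIONS = {("TP53", "MDM2"), ("TNF", "IL6")}

def distant_labels(sentences):
    """Assign a noisy 'interaction' label to every sentence that mentions
    both members of a known pair; entity mentions are assumed to be
    pre-recognized and attached to each sentence."""
    labeled = []
    for text, entities in sentences:
        ents = set(entities)
        positive = any({a, b} <= ents for a, b in KNOWN_INTERACTIONS)
        labeled.append((text, "interaction" if positive else "no_interaction"))
    return labeled

corpus = [
    ("MDM2 ubiquitinates TP53.", ["MDM2", "TP53"]),
    ("TP53 was measured alongside IL6.", ["TP53", "IL6"]),
]
print(distant_labels(corpus))
```

The second sentence in the toy corpus illustrates the weakness of the assumption: even a sentence that merely mentions a known pair would be labelled positive, which is precisely the noise that multi-instance approaches are designed to mitigate.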
The most convenient form for representing extracted knowledge in biology and biomedicine is the semantic network, where nodes correspond to biological objects and edges to interactions between them. The biological objects are molecular genetic entities (genes, proteins, metabolites, drugs, microRNAs), biological processes, diseases, etc. The interactions between objects comprise molecular, physical, and chemical (protein-protein interactions, catalytic reactions), regulatory (activation, suppression, etc.), and associative (co-occurrence of object names in one sentence or document) relationships. The object types used and the relationships between them are specified by the domain ontology, which is specific to each computer system. On the basis of such ontologies, the molecular, cellular, and physiological mechanisms of functioning of living organisms under normal and pathological conditions can be described and visualized.
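The node and edge typing described above can be sketched as a minimal typed graph (the object and interaction type names are illustrative, not any particular system's ontology):

```python
from collections import defaultdict

class SemanticNetwork:
    """Minimal typed graph: nodes carry an object type, directed edges
    carry an interaction type."""
    def __init__(self):
        self.node_type = {}
        self.edges = defaultdict(list)   # source -> [(target, interaction)]

    def add_node(self, name, obj_type):
        self.node_type[name] = obj_type

    def add_edge(self, src, dst, interaction):
        self.edges[src].append((dst, interaction))

net = SemanticNetwork()
net.add_node("TP53", "gene")
net.add_node("p53", "protein")
net.add_node("apoptosis", "biological process")
net.add_edge("TP53", "p53", "expression")
net.add_edge("p53", "apoptosis", "upregulation")
print(net.edges["p53"])
```

An ontology, in these terms, is the fixed vocabulary of allowed `obj_type` and `interaction` values plus the rules stating which combinations may be connected.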
Among computer programs that use text-mining methods for automated knowledge extraction, special attention should be paid to software and information systems implementing a so-called full cycle of knowledge engineering, including knowledge extraction, integration, and presentation of data to the end user. Well-known examples of such systems are STRING (http://string-db.com), Coremine (https://www.coremine.com/), Pathway Commons (https://www.pathwaycommons.org/), MetaCore (https://clarivate.com/products/metacore/), and Ingenuity (https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/). All the extracted information is automatically stored in the special knowledge bases of such systems, and the data are presented in response to user queries. Moreover, such systems often include various statistical methods, for example for gene prioritization or the identification of overrepresented processes or diseases, permitting the user to analyse and interpret experimental data. It should be especially noted that the depth and detail of the knowledge representation in such systems is determined by the terms and concepts of the ontology used. In particular, the Ingenuity system makes use of seven object types (proteins, genes, complexes, cells, tissues, drugs, and diseases), the relationships between which are described by the following types of interactions: transcriptional regulation, miRNA-mRNA targeting, phosphorylation cascades, and protein-protein or protein-DNA interactions.
The use of such ontologies facilitates a wide range of problems related to the analysis and visualization of disease mechanisms and gene expression, as well as proteomics and metabolomics data analysis. However, their narrow focus leads to a loss of information: the limitations of such ontology descriptions mean that modern systems can extract only a small part of the knowledge presented in the texts of scientific publications. For example, in Ingenuity and other computer systems, diseases are represented by a kind of generalized object that does not take into account various pathological conditions and dysfunctions which cannot be considered independent diseases or their symptoms. At the same time, the number of references to such terms in the scientific literature, according to our estimates, can run into the hundreds of thousands. In particular, scientific articles contain a vast amount of information characterizing such dysfunctions as biomarkers, predictors, and risk factors for diseases. Huge amounts of information remain unused by existing systems, including, for example, environmental factors, which are also risk factors for many diseases.
Earlier, we developed the ANDSystem tool, a software and information system based on syntactic-semantic rules and designed for the automated extraction of medical and biological knowledge from scientific publications [12, 28, 29]. The original ontology of ANDSystem provides a highly detailed description of the subject area. In particular, a distinctive characteristic of ANDSystem is that all interactions are classified by organism and cell line. Unlike in most existing systems, proteins and genes are treated by ANDSystem as separate entities, and catalytic reactions can be represented by an enzyme, a substrate, and a product, potentially including many participants.
The knowledge base of ANDSystem, created by the automated analysis of more than 25 million texts of PubMed abstracts and dozens of biological and biomedical factual databases, is a unique resource featuring formalized information on more than 20 million interactions of various types (more than 25 types) between molecular genetic objects (proteins, genes, metabolites, microRNAs), biological processes, phenotypic traits, drugs and their side effects, and diseases, including: (1) physical interactions with the formation of molecular complexes (protein/protein, protein/DNA, metabolite/protein); (2) catalytic reactions and proteolytic events involving the substrate/enzyme/product, as well as substrate/product transformation events in the case of complex reactions where there is no description of the involved enzymes; (3) regulatory interactions, divided into positive and negative regulation of gene expression, function/activity, transport, and protein stability involving proteins, metabolites, and drugs, regulation of protein translation involving miRNAs, and regulation of biological processes and phenotypic traits involving proteins, metabolites, and drugs; and (4) associative interactions of genes, proteins, metabolites, biological processes, and phenotypic traits with diseases, etc. Each interaction in ANDSystem is characterized by its participants, type, and direction, and also by the organism and cell in which it occurred according to literature sources and external databases. It should be noted that, unlike in similar tools, genes and proteins within ANDSystem are treated as separate objects linked by a directed interaction type (gene -> protein), expression, and are classified by individual organisms. Furthermore, every interaction has a cell-type attribute indicating where the events related to this interaction were observed. The developed system surpasses its well-known analogues in the number of interaction types and object types.
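The attributes listed above — participants, type, direction, organism, and cell — suggest a record structure along the following lines (the field names and values are hypothetical, not ANDSystem's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One formalized interaction record; fields are illustrative only."""
    participants: tuple   # e.g. ("gene:TP53", "protein:p53")
    kind: str             # one of the >25 interaction types
    direction: str        # "directed" or "undirected"
    organism: str         # organism in which the event was reported
    cell_type: str        # cell/tissue in which the event was observed

rec = Interaction(("gene:TP53", "protein:p53"), "expression",
                  "directed", "Homo sapiens", "hepatocyte")
print(rec.kind, rec.organism)
```

A knowledge base of this shape supports exactly the queries described later in the text, such as restricting a reconstructed network to one organism or one cell type.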
A number of scientific studies have been carried out with ANDSystem, in particular: an analysis of data from high-throughput proteomic experiments related to the study of Helicobacter pylori and its contribution to the development of gastritis and gastric tumors; a study of the proteomic profile of the urine of healthy persons under normal conditions and under the influence of space-flight factors [31, 32]; an analysis of the tissue-specific effects of gene knockouts and a search for potential drug targets; and the identification of new regulatory molecular genetic mechanisms in the life cycle of the hepatitis C virus. The application of ANDSystem to the analysis of comorbid diseases has allowed the prediction of new molecular genetic mechanisms of the comorbidity of asthma and hypertension, as well as of the dystropy (inverse comorbidity) that determines the relationship between asthma and tuberculosis; the study of the molecular mechanisms of the comorbid relationship between pre-eclampsia, diabetes (diabetes mellitus and gestational diabetes), and obesity; the investigation of the molecular interactions of glaucoma with more than 20 comorbid diseases; etc. Based on the ANDSystem knowledge base, a number of software tools for the analysis of experimental gene sets have been developed, in particular the web-based FunGeneNet program, which performs automated reconstruction of the associative gene network for a user-specified gene set and identifies genes involved in a greater number of interactions than would be expected by chance. In addition, the NACE tool employs the ANDSystem knowledge base to evaluate the effectiveness of potential signaling in gene networks.
Using the ANDSystem technology, we developed the Solanum TUBEROSUM knowledge base [41, 42], a computer platform for the complex intelligent processing of big data in the field of potato growing, providing: 1) automated analysis of the texts of scientific publications and factual databases, with extraction of knowledge about genetics, markers, breeding, seed production, diagnostics of pathogens, plant-protection agents, and potato-storage technologies; 2) a formalized representation of the extracted information in the knowledge base; 3) user access to these data; and 4) analysis and visualization of user query results. The ontology of the Solanum TUBEROSUM knowledge base contains dictionaries of molecular genetic objects (proteins, genes, metabolites, microRNAs, biomarkers, etc.), potato varieties and their phenotypic traits, potato diseases and pests, biotic and abiotic environmental factors, cultivation agrobiotechnologies, and technologies of potato processing and storage.
Consideration of tissue-specific gene expression in the reconstruction of gene networks is a necessary condition for describing the processes taking place in the cells of a given tissue. The ability to filter genes in gene networks by their expression level is implemented in Ingenuity and some other systems, but was absent from ANDSystem. As gene networks reconstructed with ANDSystem differ from similar networks created with other systems, the inclusion of information on tissue-specific gene expression in ANDSystem became a timely task. Thus, this paper is dedicated to the new version of ANDSystem, which provides the reconstruction of tissue-specific gene networks. For this purpose, we extended ANDSystem with data describing the expression of human genes in 272 different tissues, extracted from the Bgee database. Using the associative gene network of the extrinsic apoptotic signaling pathway as an example, it is shown that filtering genes by their tissue-specific expression affects network characteristics such as the distribution of vertex centrality, the clustering coefficient, network centralization, the number of shortest paths, etc. Such effects can determine differences in the functioning of the same processes in different tissues. The new version of ANDSystem is available at http://www-bionet.sscc.ru/and/cell/.
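To illustrate how filtering by tissue-specific expression changes network characteristics, the following sketch restricts a toy associative network (the gene names, edges, and expression sets are invented) to the genes expressed in a tissue and recomputes node degrees:

```python
# Toy associative network and per-tissue expression sets (all invented).
EDGES = [("FAS", "FADD"), ("FADD", "CASP8"),
         ("CASP8", "CASP3"), ("TNF", "CASP8")]
EXPRESSED = {"liver": {"FAS", "FADD", "CASP8", "CASP3"},
             "brain": {"FAS", "CASP8", "CASP3"}}

def tissue_subnetwork(tissue):
    """Keep only edges whose both endpoints are expressed in the tissue."""
    genes = EXPRESSED[tissue]
    return [(a, b) for a, b in EDGES if a in genes and b in genes]

def node_degrees(edges):
    """Raw (unnormalized) degree of each node in an edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

for tissue in ("liver", "brain"):
    sub = tissue_subnetwork(tissue)
    print(tissue, sub, node_degrees(sub))
```

Dropping a single unexpressed gene (here, FADD in "brain") disconnects part of the network and shifts the degree distribution, which is the kind of effect the centrality, clustering, and shortest-path statistics mentioned above quantify at full scale.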