Using BioPAX-Parser (BiP) to enrich lists of genes or proteins with pathway data

Background Pathway enrichment analysis (PEA) is a well-established methodology for interpreting a list of genes and proteins of interest related to a condition under investigation. This paper extends our previous work, in which we introduced a preliminary comparative analysis of pathway enrichment analysis tools, by providing more case studies and by comparing the enrichment performance of BiP with that of other well-known PEA software tools.

Methods PEA uses pathway information to discover connections between a list of genes and proteins and the underlying biological mechanisms, helping researchers to explain lists of biological entities of interest that are otherwise disconnected from their biological context.

Results We compared the results of BiP with those of three existing pathway enrichment analysis tools, namely Centrality-based Pathway Enrichment, pathDIP, and Signaling Pathway Impact Analysis, considering three cancer types (colorectal, endometrial, and thyroid), for a total of six datasets (that is, two datasets per cancer type) obtained from The Cancer Genome Atlas and Gene Expression Omnibus databases. We measured the overlap between the enrichment results obtained from each pair of datasets related to the same cancer.

Conclusion BiP identified some well-known pathways related to the investigated cancer types, validated by the available literature. We also used the Jaccard and meet-min indices to evaluate the stability and the similarity of the enrichment results obtained from each pair of cancer datasets. The results show that BiP provides more stable enrichment results than the other tools.

Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04297-z.

and Social Sciences, University "Magna Graecia", Catanzaro, Italy
Full list of author information is available at the end of the article

To perform pathway enrichment analysis, the user needs an internet connection, a pathway enrichment analysis framework, Java, R, and a web browser installed on their computer.
• To obtain differentially expressed genes from microarray data, users can use GEO (https://www.ncbi.nlm.nih.gov/geo), a web repository that collects microarray data sets for many different diseases.

Data Sets Download
GEO data sets download
G.1 After connecting to GEO, the user enters the disease of choice in the Search box, taking care to select GEO DataSets from the drop-down menu located to the left of the Search box. Clicking Search displays the number of items found; clicking on that number opens the search results page.
G.2 The results can then be filtered according to the researcher's interest, making it easier to find the desired data set.
G.3 On the data set information page, clicking "Analyze with GEO2R" opens the page needed to obtain the differentially expressed genes.
G.4 On the GEO2R page, the user first sets the groups to use in the analysis by clicking "Define groups", then selects the appropriate samples and assigns them to their group.
G.5 Clicking the "Analyze" button located at the bottom runs the differential gene expression analysis.
G.6 The top results are shown in the table at the bottom of the page. Select "Download full table" to obtain the complete results.
The main steps listed above are shown in Figure 1.
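The table downloaded in step G.6 still has to be reduced to a list of differentially expressed genes before enrichment. The following is a minimal Python sketch of that step, not part of the original protocol. It assumes a tab-separated table with columns named `adj.P.Val`, `logFC`, and `Gene.symbol`, and it assumes that multiple symbols for one probe are separated by `///`. The function name `select_deg`, the two thresholds, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io


def select_deg(table_text, max_adj_p=0.05, min_abs_logfc=1.0):
    """Return the sorted gene symbols that pass both (illustrative) thresholds."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    genes = set()
    for row in reader:
        symbol = (row.get("Gene.symbol") or "").strip()
        try:
            adj_p = float(row["adj.P.Val"])
            logfc = float(row["logFC"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric statistics
        if not symbol:
            continue  # skip probes without a gene annotation
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc:
            # Assumption: several symbols on one probe are joined by "///"
            genes.update(s.strip() for s in symbol.split("///") if s.strip())
    return sorted(genes)


# Made-up rows that only mimic the assumed column layout
sample = (
    "ID\tadj.P.Val\tP.Value\tt\tB\tlogFC\tGene.symbol\tGene.title\n"
    "p1\t0.001\t0.0001\t8.1\t5.2\t2.4\tTP53\texample row\n"
    "p2\t0.200\t0.0500\t2.0\t-3.1\t1.8\tKRAS\texample row\n"
    "p3\t0.010\t0.0010\t-6.3\t3.0\t-1.5\tAPC///MYC\texample row\n"
    "p4\t0.020\t0.0020\t4.0\t1.0\t0.3\tBRAF\texample row\n"
)

deg = select_deg(sample)
print("\n".join(deg))  # one gene symbol per line
```

With a real download, the file content would be read from disk and passed to the function in place of `sample`, and the thresholds would be set according to the study design.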

TCGA data sets download
T.1 After connecting to the Genomic Data Commons Portal (https://portal.gdc.cancer.gov), the user enters the disease of interest in the Search box, or clicks on the human figure shown in the right corner. As a result, the Explorer page opens.
T.2 From the Explorer page, the gene list can be downloaded by selecting the Gene tab.
T.3 On the cBioPortal (https://www.cbioportal.org) web page, the user can annotate gene lists (if available) by selecting the data set of interest from those listed on the main page. After selecting the annotation data set, clicking the "Query By Gene" button located at the bottom of the page opens the query building page.
T.4 On the query building page, the user pastes the previously downloaded gene list to annotate into the text area, then clicks "Submit Query".
T.5 The results are shown in the table at the bottom of the page. Select "Download full table" to obtain the results.
The main steps listed above are shown in Figure 2.

Enrichment
The gene lists obtained in the previous steps are then used to perform pathway enrichment analysis (PEA) with an enrichment tool.

BiP
To perform PEA with BiP, the user must launch BiP, load the gene or protein list, and select the pathway database to use for computing the enrichment. The gene list can contain gene symbols or UniProt IDs. The user can also choose to use any downloaded pathway data in BioPAX format for the analysis. Results are shown in tabular format and saved in a comma-separated values (CSV) or txt file. A more detailed vignette of the full BiP analysis capabilities is available at https://gitlab.com/giuseppeagapito/bip.

CePa
To perform PEA with CePa, it is necessary to write a simple R script. The user must run R, load the CePa package, and then load the gene list using the "read.csv()" command. The gene list can contain gene symbols or UniProt IDs. The pathway enrichment analysis is performed with the "cepa.all()" function. CePa uses its embedded KEGG pathway database to compute the enrichment. CePa is available at http://cran.r-project.org/web/packages/CePa/. The following simple R script computes pathway enrichment with CePa.

    library("CePa")
    data(PID.db)  # pathway catalogues bundled with CePa
    # read the disease-genes input file
    genes <- read.csv("/genes.txt", sep = "\t")
    colnames(genes) <- "list"  # add the name to the column
    res <- cepa.all(dif = genes$list, pc = PID.db$KEGG)  # run the PEA analysis
    plot(res)  # display the PEA results

pathDIP
To perform PEA with pathDIP, the user must connect to the pathDIP web site and paste the gene list into the Search box. The gene list can contain gene symbols, Entrez Gene IDs, or UniProt IDs. It is important to choose the correct pathway sources to use for the analysis. Before running the analysis, the user can choose whether to download or visualize the results. Downloaded results are provided in a txt file. A more detailed vignette of the full pathDIP analysis capabilities is available at http://ophid.utoronto.ca/pathDIP/.

SPIA
To perform PEA with SPIA, the user must run R, load the SPIA package, and then write the R script. The user must load the gene list, which contains Entrez IDs, using the "read.csv()" command. The pathway enrichment analysis is executed through the "spia()" function. SPIA is available at http://bioconductor.org/packages/SPIA/. The following simple R script computes pathway enrichment with SPIA.

    library(SPIA)
    # read the disease-genes input
    df <- read.csv("/dataset.txt", sep = "\t", header = TRUE)
    decolon <- df$log2FC
    # name each log2 fold change with its gene ID
    names(decolon) <- as.vector(df$enterez)
    allcolen <- df$enterez
    # run the PEA analysis
    res <- spia(de = decolon, all = allcolen, organism = "hsa", nB = 2000)
    plotP(res)  # display the PEA results

Researchers can use the enriched pathways to give biological meaning to large lists of genes and proteins of interest that are detached from their biological context, making these lists easier to use in clinical and therapeutic scenarios.
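Once a tool has been run on two data sets of the same cancer type, the two lists of enriched pathways can be compared with the Jaccard and meet-min indices mentioned in the abstract. The following minimal Python sketch uses the standard set-based definitions of both indices: the Jaccard index divides the size of the intersection by the size of the union, and the meet-min index divides it by the size of the smaller set. The function names and the pathway labels are hypothetical, and the exact procedure used for the published comparison may differ.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def meet_min_index(a, b):
    """|A ∩ B| / min(|A|, |B|); defined here as 0.0 when either set is empty."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Made-up enrichment results for two data sets of the same cancer type
run_1 = ["Pathway A", "Pathway B", "Pathway C", "Pathway D"]
run_2 = ["Pathway B", "Pathway C", "Pathway E"]

jaccard = jaccard_index(run_1, run_2)    # 2 shared pathways out of 5 distinct
meet_min = meet_min_index(run_1, run_2)  # 2 shared pathways out of 3 in the smaller list

print(f"Jaccard:  {jaccard:.3f}")
print(f"Meet-min: {meet_min:.3f}")
```

Both indices range from 0 (no shared pathways) to 1. The meet-min index is less sensitive to a size difference between the two result lists, because it is normalized by the smaller one.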