Hypergraph models of biological networks to identify genes critical to pathogenic viral response

Background Representing biological networks as graphs is a powerful approach to reveal underlying patterns, signatures, and critical components from high-throughput biomolecular data. However, graphs do not natively capture the multi-way relationships present among genes and proteins in biological systems. Hypergraphs are generalizations of graphs that naturally model multi-way relationships and have shown promise in modeling systems such as protein complexes and metabolic reactions. In this paper we seek to understand how hypergraphs can more faithfully identify, and potentially predict, important genes based on complex relationships inferred from genomic expression data sets. Results We compiled a novel data set of transcriptional host response to pathogenic viral infections and formulated relationships between genes as a hypergraph where hyperedges represent significantly perturbed genes, and vertices represent individual biological samples with specific experimental conditions. We find that hypergraph betweenness centrality is a superior method for identification of genes important to viral response when compared with graph centrality. Conclusions Our results demonstrate the utility of using hypergraphs to represent complex biological systems and highlight central important responses in common to a variety of highly pathogenic viruses. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04197-2.


Background
Identifying molecular signatures critical to a biological process requires an accurate model of both the process and the biological system in which it occurs. Thus it is essential that such a model be able to represent its target with complexity commensurate with that of the system itself, rather than presenting only a simplified view. Commonly, biological systems and processes present as complex networks of interacting entities, for example within and between genes, pathways, and complexes. Graphs are frequently used to model these interactions, but since graphs can only capture interactions between pairs of entities, they fall short in many cases and are not able to model the full complexity present in biological systems and processes.
In this paper we investigate the role that hypergraph models, as mathematical generalizations of graph models, can play in providing the necessary complexity to capture multi-way interactions in biological systems inferred from genomic expression data. We assert that the complexity provided by hypergraphs more closely represents the systems being studied. In order to validate this assertion we provide a comparison between the use of graph and hypergraph centrality metrics to identify genes that are critical in host responses to viral infection. Our findings show that the genes identified using a hypergraph model align better with genes previously known to correlate with viral response than do genes identified using similar metrics from graphs or using average fold change for each gene across all experimental conditions.

Hypergraphs for complex network models
While graph-based methods have been quite successful in the biological domain, their ability to model complex relationships amongst entities is necessarily limited. Graphs inherently model relationships (edges) between pairs of entities (vertices). But biological systems are replete with relationships among many entities, for example in protein complexes, transcription factor and microRNA regulation networks, lipid and metabolite enzyme-substrate interactions, metabolic networks, pathways, and protein function annotations. Relationships may be interactions, for example, metabolites working together in a metabolic process, or they may represent some commonality among the entities, like genes being differentially expressed in the same conditions, or regulated by the same transcription factor. In a graph model all of these multi-way relationships would be represented as groups of pairs of subunits, which would not fully capture how groups of components interact or have similar behavior.
Sometimes sets of related components are already understood, and sometimes they need to be discovered in experimental data, like high-throughput 'omics. In either event, a higher order mathematical model is needed. The mathematical object that natively represents multi-way interactions amongst entities is called a "hypergraph". In contrast to a graph, in a hypergraph the relationships amongst entities (still called vertices) are connected generally by "hyperedges", where each hyperedge is an arbitrary subset of vertices. Thus every graph is a hypergraph in which each hyperedge happens to have exactly two vertices. A challenge for scientists is to recognize the presence of hypergraph structure in their data, and to judge the relative value of representing them natively as hypergraphs or reducing them to graph structures. Mathematically, hypergraphs are closely related to "bipartite graphs", those with two distinct sets of nodes so that there are connections only between the two sets, and not within each set. In a bipartite graph corresponding to a given hypergraph one group of nodes plays the role of hypergraph vertices, and the other of the hyperedges themselves. While this observation can be helpful in the design of scalable hypergraph algorithms it falls outside the scope of this paper.
Hypergraph models allow for higher fidelity representation of data that may contain multi-way relationships, albeit at the price of a higher complexity model. An example using a small subset of transcriptomic expression data is shown in Figure 1. In the upper left is an expression matrix with log 2 -fold change values for five genes (rows) across four experimental conditions (columns). The lower left shows a hypergraph representation of the data, with each gene modeled as a hyperedge surrounding those conditions (vertices) for which the log 2 -fold change is greater than 2. Those cells in the expression matrix are shown in bold, distinguishing those conditions which are included in that gene's hyperedge. The upper right of Figure 1 shows a matrix produced from one possible graph-based approach to representing these data. Here each pair of genes is related if there is at least one condition for which both genes have log 2 -fold change greater than 2. This would then be interpreted as an "adjacency matrix" of a graph, which is then shown in the lower right. It can be seen that this graph representation necessarily loses a great deal of information, boiling down the rich interaction structure that we know to be present to a fully connected graph on all five genes. For example, the hypergraph shows that two pairs of genes-AARS and ABHD11, AASDHPPT and ABCB6-are much more related than other pairs. This fact is not apparent in the graph model. Although graphs and graph theory dominate network science applications and methods [3], hypergraphs are well-known objects in mathematics and computer science. They have a history of use in a range of applications [12,16,31], and are seeing increasingly wide adoption [1,13,14,17,28]. In the biological literature we have seen hypergraphs used to model gene and protein interaction networks, pathways, and metabolic networks as derived from a variety of data types. In many of these cases the authors derive hypergraphs from an underlying graph, rather than directly from data. For example, Chitra built a hypergraph model based on an existing graph model of gene interaction networks [6]. They adapt the PageRank algorithm to hypergraphs in order to study disease-gene prioritization, and find that for monogenic diseases hypergraph PageRank noticeably outperforms graph PageRank. Tran studies protein function prediction building a graph from a similarity matrix derived from gene expression data [35] and then applying soft clustering to this graph to produce a hypergraph. Function prediction using this hypergraph is then shown to be superior to predictions based on graphs. Protein-protein interaction networks are studied by Klamt et al. using graph algorithms to find sets of independent elements or tightly connected elements [16]. In those three papers the authors infer a hypergraph from a graph structure rather than directly from data.
Biological researchers may be familiar with the directed variety of hypergraph (e.g., used by Klamt [16]). These hypergraphs can represent "flows" in that each hyperedge is partitioned into two disjoint groups of vertices such that the first group are understood as inputs and the second as outputs. These and other issues in hypergraph theory and hypernetwork science can have great significance for general complex systems modeling, including biological systems, but will fall outside the scope of this paper.
Ramadan et al. use hypergraphs to model the yeast proteome, where proteins are vertices and complexes are hyperedges [32], and apply an algorithm that finds tightly connected vertices to identify the core proteome. Finally, Zhou and Nakhleh study the claim that metabolic networks are hierarchical and small-world [38]. While this claim comes from a graph model of the networks, Zhou and Nakleh instead model the metabolic networks of E. coli as a hypergraph and show that the claimed hierarchy and scaling properties are not supported. This result in particular conveys a critical message: when biological interactions are simplified into pairwise relationships and modeled using a graph, they can exhibit very different structure than when their true complexity is modeled using a hypergraph. Because of this structural variance, conclusions drawn based on the graph could provide misleading results. Although the data we consider are different, our method is similar to these last two papers in that we build hypergraphs directly from biological data rather than inferring a hypergraph from a standard graph model of the data. We have not observed researchers building hypergraphs directly from 'omics data, as we will in this paper.

Modeling host response in viral pathogenesis
Viral infection causes a response in the host cells in which the expression of a variety of cell systems are up-or down-regulated. The pathogenesis of the infection is reflected in the signature of host responses elicited by each virus. Host response to viral infection has been extensively studied for decades, yet the root mechanisms of why some infections are severe and some are not remain poorly understood. However, highthroughput molecular approaches offer a way to discover novel host response genes, proteins, and pathways that contribute to the systems-level development of pathogenesis. A major advantage of such a systems biology approach to pathobiology is the ability to identify novel, key elements of a biological process, such as which regulators are involved in critical processes. High-throughput profiling methods (e.g. transcriptomics) provide powerful tools for examining how entire systems respond to different perturbations such as acute disease. Network reconstruction provides the opportunity to utilize all available data and is a critically important tool for representing complex sets of interactions [7].
In this paper we build and explore a hypergraph model of host response using transcriptomics data from viral infection by five highly pathogenic viruses in a number of biological systems. We found that gene rankings computed using hypergraph centrality were highly enriched for known immune and infection-related genes. While rankings derived from graphs constructed using other traditional computational biology techniques applied to the same infection data also resulted in rankings enriched for critical genes, we demonstrate that hypergraph-based metrics yield superior enrichment results. These results highlight the usefulness of hypergraphs for exploring mechanisms of virus pathobiology.
Influenza Virus: H7N9 (Wild type and two mutants) in human lung epithelial cells, H1N1 (wild type) in human lung epithelial cells, H1N1 (wild type) in mouse lung, H5N1 (wild type and one mutant) in mouse lung, H7N9 (wild type and two mutants) in mouse lung.

MERS-coronavirus: (Wild type and four mutants) in human lung epithelial cells, (wild type only) in ex vivo human epithelial cells, in ex vivo human lung fibroblasts, and ex vivo human lung microvascular endothelial cells.
SARS-coronavirus: (Wild type and four mutants) in human lung epithelial cells and mouse lung.
West Nile Virus: (Wild type and one mutant) in mouse cerebral cortex, mouse cerebellum and mouse lymph node.
Conversion from mouse to human gene symbols was done by capitalizing the mouse symbols. Quality control, background correction, quantile normalization, and differential expression analysis were performed in R using the limma package [33]. Replicates from each infection condition were compared to corresponding time-matched mock-infected samples to obtain log 2 -fold change values, as well as adjusted p-values.

Hypergraph representations and centrality metrics
Formally, a hypergraph is a structure V, E , with V = {v j } n j=1 a set of vertices, and E = {e i } m i=1 a family of hyperedges with each e i ⊆ V . Hyperedges can come in different sizes, |e i |, possibly ranging from the where |e| = 2 is the same as a graph edge and so it follows that all graphs are hypergraphs, specifically identified as being "2-uniform". In the remainder of the paper where clear from context we may use the terms edge and hyperedge interchangeably.
We construct a hypergraph from transcriptomics data using a threshold approach, much like the example in Figure 1. Again, vertices v j will represent individual biological or experimental "conditions" (e.g., mouse lung cells treated with a strain of Influenza virus and sampled at 8 hours) and hyperedges e i represent genes. Thus for us, a hyperedge e i is a gene i that includes a collection of conditions j as its vertices v j . For each condition, we transform the log 2 -fold change values (relative to uninfected mock) for all of the genes into absolute value z-scores. Then, the vertex representing condition X is contained in the hyperedge representing gene G if the absolute value z-score for G in X is greater than or equal to 2 and the adjusted p-value for that log 2 -fold change measurement is less than 0.05. Since transcriptomics log 2 -fold change values tend to be normally distributed for each condition across all genes a z-score transformation is a reasonable way to get all conditions onto the same scale before applying a threshold. The specific thresholds on z-score and p-value were chosen as commonly used in the field, and in exploring other values we have verified limited sensitivity to them.
In this way, hyperedges correspond to genes, and indicate the groups of conditions in which that gene is both highly perturbed (either up or down) from the mock infected control condition, and for which that perturbation is statistically significant. In the remainder of this paper we will say that the gene is "significantly perturbed" in the condition. Unlike hypergraph models of pathways or metabolic reactions, hypergraphs constructed from high-throughput data do not necessarily represent actual biological interactions but rather capture relationships based on similar behavior among entities.
It is important to point out that this method to construct a hypergraph using thresholds on absolute value of z-score and p-value is a specific case of a flexible framework for how hyperedges can be formed from 'omics data. Applying other thresholds will result in different hypergraph models of the same data, to potentially answer different questions. For example, in order to understand the relationship and behavioral similarity among up-regulated genes one might consider a gene hyperedge to contain those conditions for which the gene has high raw (as opposed to absolute value) z-score or log 2 -fold change, as in the Figure 1 example. One could also form edges from conditions for which a gene has a highly negative z-score or fold change, to explore the structure of down-regulated genes. We chose a threshold on the absolute value of z-score in this paper as an attempt to understand genes which are perturbed at all in response viral infection.
We recognize that this formulation is fundamentally different from a typical graph approach to systems biology data. One such example of a graph approach is context likelihood of relatedness (CLR) in which genes that show similar expression patterns across all conditions, as measured by mutual information, are linked together [10]. Our approach to constructing hypergraphs from the data can be seen as having greater sensitivity and flexibility since it allows similarity between genes to be assessed across any number of conditions rather than requiring assessment across all conditions.
Another difference between our hypergraph formulation and typical graph approaches is that in graph approaches vertices represent genes and edges indicate some relationship between genes such as interaction or expression correlation. Our motivation for swapping the roles of vertices and edges is for the sake of clarity in our description of hypergraph centrality measures below. Moreover, as a technical matter, each hypergraph H determines another one, called its "dual" H * , formed exactly by swapping the roles of vertices and edges [5]. Therefore, the dual to our hypergraph formulation has the more traditional form with genes as vertices, but the description of hypergraph centrality in this setting would be less intuitive.
As in graphs, the way in which hyperedges connect vertices in complex patterns is central to the study of hypergraphs. Our focus in this paper is applying generalizations of graph centrality measures to hypergraphs built from transcriptomics data to identify important genes. In order to define these hypergraph centrality measures we must first introduce the notions of a hypergraph walk and distance [2]. Given two hyperedges e, f ∈ E, an s-walk of length k between e and f is a sequence of hyperedges e 0 , e 1 , . . . , e k such that e 0 = e, e k = f , and s ≤ |e i ∩ e i+1 | for all 0 ≤ i ≤ k − 1. In other words, an s-walk is a sequence of edges such that pairwise intersections between neighboring edges have size at least s. Note that a graph walk is a 1-walk. We note that one could define a hypergraph s-walk on to be between vertices rather than hyperedges, as is typically done in a graph. But as above, for the sake of clarity in defining centrality measures we use this edge-based definition.
Continuing to follow Aksoy et al. [2], for a fixed s > 0, we define the s-distance d s (e, f ) between two edges e, f ∈ E as the length of the shortest s-walk between them. If there is no s-walk between two edges then the s-distance is infinite. Aksoy et al. also define a number of network science methods generalized from graphs to hypergraphs, including vertex degree, diameter, and clustering coefficients. This work will use their generalization of betweenness centrality and harmonic closeness centrality to hypergraphs using the stratification parameter s. where σ s f g is the total number of shortest s-walks from edge f to edge g and σ s f g (e) is the number of those shortest s-walks that contain edge e.
• The harmonic s-closeness centrality of an edge e is the reciprocal of the harmonic mean of all distances from e: HCC s (e) := 1 where E s = {e ∈ E : |e| ≥ s}. We may refer to this as s-closeness in this paper, although elsewhere in the literature this term refers to a slightly different concept where the harmonic mean is replaced with the arithmetic mean.
Intuitively, harmonic s-closeness centrality captures the extent to which a given hyperedge is close in sdistance to other hyperedges. In order to have high harmonic s-closeness a hyperedge must have small s-distance to all (or most) other hyperedges. s-Betweenness, on the other hand, identifies bottlenecks in a hypergraph. A hyperedge with high s-betweenness has many shortest s-walks pass through it. In comparison, the original formulation of betweenness and harmonic closeness centrality in the setting of graphs has the s-distance and number of s-paths replaced simply by graph distance and shortest path.
In order to take into account multiple s values simultaneously in our analysis we average the centrality values across a range of s values and define the average s-betweenness centrality and average harmonic s-closeness centrality as Computing (average) s-centralities for each hyperedge provides a ranked list of hyperedges from most central (high value) to least central. All hypergraph construction, metric calculations, and visualizations were performed using the Python hypergraph library HyperNetX (https://github.com/pnnl/HyperNetX). Some conventional approaches to infer graph structures from high-throughput data use correlated gene expression patterns to build connections. In this context, a gene with high degree (i.e. a hub) has similar expression behavior to many other genes, implicating it as a potential master regulator of gene expression. A gene with high betweenness (i.e. a bottleneck) on the other hand, bridges two regions of the graph indicating that it spans two different behavioral profiles. Genes in this position are potentially involved in causing a transition from one response pattern to another. Thus hubs and bottlenecks may represent master gene expression regulators of two different varieties. Previous work by our group and others has shown that graph vertices in hub and bottleneck positions are significantly enriched for genes critical to the process under study [30,29,37,24]. Given these prior results and biological relevance of centrality in the setting of graphs, we hypothesized that hypergraph average s-betweenness (and potentially average s-harmonic closeness) will have similar biological relevance.

Results
By analyzing the curated omnibus transcriptomic data set described above from cells infected with five different viruses and their mutants using both graph and hypergraph approaches, we illustrate the advantages of applying a hypergraph approach to uncover the underlying molecular signatures and mechanisms common across host response to viral infection broadly.

Hypergraph structure
We create hypergraphs from transcriptomics log 2 -fold change data calculated from gene expression levels of infected experiments relative to time-matched uninfected mock experiments As discussed above, in our hypergraphs hyperedges represent genes and vertices represent conditions. The vertex representing condition X is contained in the hyperedge representing gene G if gene G is significantly perturbed, either up-or down-regulated, in condition X. The entire hypergraph represents n = 179 experimental conditions (vertices) and m =7,782 genes (edges). A small subset of highly connected hyperedges (that is, genes with a large core of common conditions), is shown in Figure 2.
Distributions of even fundamental hypergraph statistics can illuminate some of the complex interaction structure present in the data. Figure 3a shows that the distribution of the sizes of the hyperedges (that is, the number of conditions a gene is significantly perturbed in) is roughly power-law, sometimes referred to as "heavy tailed". This means that there are many genes (1,247 of them) significantly perturbed in only one condition and relatively few genes significantly perturbed in many conditions, with a maximum number of conditions for a single gene on the order of 100. The six largest hyperedges, with sizes greater than 100 in increasing order, correspond to the genes ISG15, IL6, ATF3, RSAD2, USP18, and IFIT1. All of these genes are part of the interferon response, a critical pathway in response to viral infections [25,26].
On the other hand the vertex degree distribution is the number of edges a particular vertex is contained in (that is, the number of genes significantly perturbed for each condition (vertex)). This is shown as a histogram in Figure 3b, and it has a very different shape than the edge size distribution. There are relatively few conditions with small numbers (less than 200) or large numbers (more than 500) of significantly perturbed genes. The most common number of significantly perturbed genes for a condition is between 400 and 500. This peak is likely an artifact of how we choose when a vertex is contained within an edge. The degree of a condition vertex is the number of genes that are significantly perturbed for that condition. By our procedure these are genes with z-score higher than 2 and p-value less than 0.05. If we only used the z-score threshold, and our fold change data are normally distributed for each condition, then we would expect that 5% of the genes would have z-score greater than 2. There are 9,760 genes in our data and 5% of that would be 488 genes, which is roughly where the peak is. The skewness and additional modes of the distribution of degrees is likely due to the addition of the p-value condition.
Finally, Figure 3c shows another power-law distribution, this time of the size of pairwise edge intersections, or in other words, the number of conditions that pairs of genes are both significantly perturbed within. We see that there are many pairs of genes that have few conditions in common and only a few pairs of genes that have many conditions in common, again with a maximum on the order of 100. The pair of genes with largest intersection is not surprisingly the two largest edges, IFIT1 and USP18, with 103 conditions in common. Interestingly, IFIT1 and USP18 are both well-established interferon response genes, with IFIT1 strongly promoting interferon activity, and USP18 serving to dampen the response [4].

Gene importance rankings
As noted above, previous studies using graph approaches with similar viral data have demonstrated that network measures like betweenness centrality could be used to identify critical genes [22]. In the present work we hypothesize that extensions of these common graph metrics to a hypergraph, as defined in the Methods section above, can be leveraged to improve upon this prior work. In particular, we hypothesize that, as has been shown in the graph setting, high s-centrality is correlated with the gene being more important in host response to pathogenic viruses.
We calculate average s-betweenness centrality, BC s (g), and average harmonic s-closeness centrality, HCC s (g), for all genes (hyperedges) in the hypergraph for 1 ≤ s ≤ 50. Both of these s-centrality computations provide a numerical value for each gene that can be used to rank the genes from most important (high centrality) to least (low centrality).
To serve as a simpler, but still hypergraph-based, comparison we created another ranked list using hyperedge size for each gene, i.e., the number of conditions the gene was significantly perturbed in. Larger edge sizes indicate more conditions in which the gene was significantly perturbed and therefore the gene is potentially more important to host response.
In order to compare the centrality rankings from the hypergraph to more common graph methods we employed the CLR methodology that we have used previously to enrich for important genes in a network [22,30,29]. The CLR algorithm was run on the matrix of transcriptomics log 2 -fold change values using parameters spline = 3 and bins = 10 and the resulting matrix was filtered for all values ≥ 2. With this (a) Edge size distribution. Each edge represents a gene, the size of the edge is the number of conditions in which the gene is significantly perturbed.
(b) Vertex degree histogram. Each vertex represents a condition, the degree of a vertex is the number of genes significantly perturbed in that condition.
(c) Pairwise edge intersection size distribution. Each edge represents a gene. The intersection between two edges indicates the set of conditions that both genes are significantly perturbed in. Figure 3: Distributions of simple hypergraph statistics of our hypergraph approach, any two genes with shared information above the threshold have similar expression profiles and form a graph edge. This results in an association graph from which we used the NetworkX graph analytics Python package [11] to calculate vertex degree, betweenness centrality, and harmonic closeness centrality.
To provide a simple baseline for comparison a final ranked list was computed directly from the log 2 -fold change table without using any graph structure. For each gene we computed its average absolute value of log 2 -fold change and ranked the genes from highest to lowest average. Higher values mean the gene is more likely to be highly perturbed from the mock-infected samples in many conditions.
In a supplementary file we provide gene rankings for average s-betweenness centrality and average harmonic s-closeness centrality for s = 50, hyperedge size, CLR graph betweenness, CLR graph closeness, CLR graph degree, and average fold change.

Comparison of rankings
To ascertain whether our hypergraph rankings are more highly enriched for genes known to be important in host response to viral infection, we gathered three distinct sets of genes: 1) all genes associated with the Gene Ontology (GO) term "immune response" (GO:0006955), downloaded from amigo.geneontology.org, referred to as 'IR' hereafter, 2) interferon-stimulated genes gathered from interferome.org (http://www. interferome.org/interferome/search/searchGene.jspx), referred to as 'ISG', and 3) a set of human proteins known to be targets of pathogens acquired from Dyer, et al. [9], referred to as 'PT'. Although this is a limited set in terms of number of targets, it represents a set collected from a wide number of pathogens, both viral and bacterial, and is a conservative set for assessing the performance of our method and making comparisons between different approaches and parameters. In Table 1 we show the size of each gene set (along the diagonal) and the sizes of each pairwise intersection of gene sets (off the diagonal). Since our data encompasses a wide variety of virus types and infection systems, general immune-related sets were deemed suitable for our purpose. In order to measure the performance of our rankings, we applied gene set enrichment analysis (GSEA) [34] to each of our gene rankings (average hypergraph s-centralities, hyperedge size, CLR centralities, CLR vertex degree, and mean fold-change) using the three immune-related sets as target gene sets. The GSEA score of a ranked list, computed for a specific gene set, quantifies how concentrated the gene set is at the extremal values of the list. A high GSEA score means the gene set is concentrated at the top of the list while a low (highly negative) score indicates that the gene set is concentrated towards the bottom of the list. A score closer to zero means that the gene set is more uniformly distributed throughout the ranked list. The significance, or p-value, of an observed enrichment score, ES, is assessed by comparing it with a set of ES 0 randomized scores. Figure 4 shows GSEA scores for all rankings and for all three target gene sets. We note the following conclusions:

IR
• Both hypergraph s-centrality metrics for most s values, as well as hyperedge size, showed much higher enrichment than lists derived from CLR graphs and average fold change.
• But s-betweenness enrichment was universally higher than s-closeness enrichment, suggesting that these two measurements are capturing fundamentally different behavior within hypergraphs, and that s-betweenness appears to be more effective at capturing genes that are important in host responses to viral infection.
• Both centrality enrichment results (betweeness and closeness) improve significantly when larger s values are taken into account, indicating that when higher order interactions are considered, they become more powerful in identifying important genes.
A summary visualization of our results is shown in Figure 5 taking the rankings for both s-betweenness and harmonic s-closeness for the highest level of s = 50 as the representative hypergraph centrality rankings. We compare those with the five other rankings and again see that s-betweenness centrality outperforms all other measures. While harmonic s-closeness centrality outperforms all graph measures it is outperformed by the simple hyperedge size ranking. The p-values, nearly all significantly less than 0.05, are shown in the same plot at the end of the bars. These results demonstrate that hypergraph s-betweenness, but not necessarily harmonic s-closeness, considers the complexity of the hypergraph and provides superior performance over graph metrics with regards to identifying biologically important genes. This aligns with prior work in which betweenness centrality computed for vertices in a graph identifies important genes in a network [22,30,29,15].

Discussion
We draw attention to two primary observations of interest in our results. First is the observation that sbetweenness centrality consistently outperforms s-closeness centrality. At first glance this seems surprising  since betweenness and closeness calculated on the CLR graph have comparable performance. While the explanation for this result is the subject of ongoing investigation, we observe that these two types of centrality are measuring significantly different properties. For both graphs and hypergraphs high harmonic closeness centrality indicates that on average a gene is close to many other genes, while high betweenness centrality means that a gene is on many short paths between other genes. Sometimes these two notions coincide, as seems to be the case in the CLR graph, but there are cases in which they do not. For example, a gene that may be more on the periphery, i.e., not on many (short) paths, could still be very close to a central core.
Being on few short paths this gene would have very low betweenness. However, since it is close to a central core it could have high closeness score.
This seems to be the case in the hypergraphs we are studying. Since there are many conditions that have a lot of significantly perturbed genes (see Figure 3b) there is likely a large central core in the hypergraph that increases the closeness scores for peripheral genes, and perhaps all genes. Indeed we have observed that for small values of s the s-closeness values do not correlate with edge size, however for large s values the s-closeness scores do tend to correlate with edge size. This likely means that for low s values the closeness is somehow washed out by this central core and any variability we see is not significant. In contrast, for s-betweenness we see a correlation between edge size and betweenness at all s values. However, even though s-betweenness is correlated to edge size for all s values its enrichment score is still much larger than that for edge size and so seems to be capturing something more significant about hypergraph structure.
This difference between closeness and betweenness may also be related to the nature of the large gene expression data set used in our study. Since both mouse and human-based gene expression data were included in the hypergraph some genes may serve as bridges between different regions of the hypergraph (e.g. predominantly human regions vs predominantly mouse regions). Genes that are truly important in host response to viral infection would be important across species and more effectively brought to light by the betweenness measure that tends to highlight elements occupying bridge-like positions in the hypergraph. Thus, betweenness centrality may be most useful for identifying critical elements when heterogeneous data sets are analyzed.
Our second observation is that our hypergraph s-betweenness centrality significantly outperforms established graph centrality techniques. This is entirely in keeping with our expectation, as the purpose of hypergraphs is to capture the complex, multi-way interactions present in a system that are beyond the ability of graphs to model. Thus where betweenness centrality has been used in prior studies to identify important biological features the application of hypergraph s-betweenness may promote discovery of additional features of interest. While the finding that hypergraph betweenness represents a new tool for identifying critical hypergraph elements is an exciting contribution of this study, it also presents an additional immediate benefit: genes highly ranked by hypergraph betweenness that do not appear in any of our target gene sets represent potentially novel discoveries of genes central to viral infection. One good example of this is the ZZZ3 gene, which appears in position 4 out of 7,782 in the average hypergraph betweenness ranking, but does not appear in any of the IR, ISG or PT gene sets. ZZZ3 is part of the histone reader ATAC complex, which scans the state of histone modification and contributes to gene activation/repression mechanisms [27]. No known connection between virus infection and ZZZ3 exists, but it may serve a critical role in regulating gene expression in response to general infection.
Similarly, EPHX1, GDF15, and DUSP1 were not included in the three gene sets and ranked 29, 30 and 33, respectively. These genes are identified as an epoxide detoxification component, a stress responsive cytokine and a stress-responsive phosphatase, respectively. These roles may be related to virus-induced stress in host cells, but the specific mechanisms involved are yet to be elucidated. More exploration of these and other highly ranked genes is the subject of future work for us.

Conclusion
The work we present in this paper is similar to much of the work surveyed in our literature review in that we show the value of hypergraphs over traditional graph analysis of biological data. However, our work differs from these prior studies in a number of ways. First, our hypergraphs are built natively from transcriptomics data rather than based on existing graph models of systems. Although still capturing some multi-way complexities, hypergraphs inferred from graphs may include some induced interactions not actually present in the system that is being modeled. Creating hypergraphs natively from the data avoids this imputation. Other papers we surveyed do create hypergraphs natively from other types of data, but rather than applying centrality measures instead study more structural features like highly connected vertices.
Previous work [22,30,29,15] had demonstrated that graph metrics can be used to identify important genes in association graphs, and so we set out to determine if hypergraphs provided an improvement over graphs. To assess performance of (hyper)graphs derived from our large viral infection gene expression data set, we identified three gene sets related to virus/pathogen infection and performed an enrichment analysis of our ranked lists compared to these gene sets. While the sets were partially overlapping they represented relatively distinct aspects of viral infection in general. Our results show that s-betweenness, but not necessarily harmonic s-closeness, was a useful metric that is able to identify key genes in a comprehensive gene expression data set. While s-closeness does outperform both CLR centrality measures, CLR degree, and average fold change, it does not exceed the performance of a simple ranking according to hyperedge size, which does not require the full hypergraph structure to calculate. On the other hand, ranking based on hypergraph s-betweenness outperformed all other metrics.
The hypergraphs we created used samples from a wide range of viruses, strains, cell types, and time since infection. In future work we plan to apply this measure to compare critical genes in viral response across differing sample features. For example, we will split our hypergraph based on pathogenicity (high vs. low), cell type or host, and time since infection (early vs. late). Comparing the critical genes across these different hypergraphs may allow us to discover previously unknown indicators of viral infection for early detection or severity determination. Other future work we plan to pursue includes considering other hypergraph constructions, other data types, and hypergraph algorithms to identify highly connected vertices. We plan to combine transcriptomics with proteomics and other 'omics measurements to understand whether hybrid hypergraphs yield better results or if the inclusion of more data washes out the complexities.