Predicting the tripartite network of mosquito-borne disease

Abstract


Introduction
Emerging viruses continue to pose a threat to human and wildlife populations [1].
A growing body of computational work has explored viral dynamics in the context of species interaction networks using link prediction models.
Typically, these represent hosts and viruses as a bipartite network of either known interactions (that occur in nature [2,3]) or all possible interactions (including, for example, experimental infections [4]), with both represented as links in the network [5]. Host-virus link prediction models are predominantly trained on the genomic, immunological, morphological, and ecological traits of hosts and viruses (e.g., [6,7]), while some approaches also leverage information on the latent structure of the network instead of, or in addition to, these traits [8,9]. The objective of these modeling exercises is to learn about the underlying biology, explain and reproduce patterns found in nature, and anticipate what future dynamics of viral emergence could look like. For example, many models use networks to understand why some viruses can infect humans but others cannot, with the objective of identifying animal viruses that could someday infect humans for the first time.
In most cases, these models assume that any given "link" between a host and a virus could represent a self-contained transmission cycle (though not necessarily onwards transmission, e.g., West Nile virus in humans and horses [10]).
Vector-borne disease (VBD) transmission substantially complicates this conceptual framework. Vector-borne viruses require an additional species, usually an arthropod (hence arthropod-borne viruses, or arboviruses), to move them between hosts, which adds complexity to their ecology, epidemiology, and evolution. For example, in the case of arboviruses, the presence of both virus and suitable hosts is not necessarily sufficient for transmission, and the presence or absence of suitable vectors (e.g., their geographic distributions or host preferences) may be a latent variable in ecological datasets [11]. Moreover, the "compatibility filters" that can be inferred from the host-virus network will be incomplete, as models will miss both the molecular and physiological determinants of vector-virus compatibility (i.e., vector competence) and the behavioral and ecological determinants of vector-host compatibility (i.e., biting preferences, in the case of blood-feeding arthropods). If vectors are entirely omitted from the inference process, a model might therefore reach spurious conclusions about whether a given host and virus are incompatible based on their biology, or otherwise miss key drivers of network structure; for example, arboviruses have been shown repeatedly to have a higher-than-expected host breadth [12].
No one canonical approach exists to address vector transmission in link prediction studies. Vector transmission could be described as a binary trait of viruses, which may help make some distinctions (e.g., separating the ecology of mosquito-borne and tick-borne flaviviruses from counterparts like hepatitis C), but leaves much to be desired in terms of information content (e.g., not distinguishing the tick- and mosquito-borne flaviviruses). The possibility of incorporating more detailed information on vector-borne transmission into these models has been underexplored, likely because arboviruses are usually seen as a complicated exception to existing datasets, rather than a feature with significant impacts on network structure. Incorporating traits characterizing the life cycle of arboviruses might improve model performance, given that virus traits are often sparser than host traits, and their interactions usually have non-additive but positive effects on model performance. However, adding sparse traits that only describe some of the viruses in the network could also reduce accuracy if the network includes a mix of vector-borne and directly-transmitted viruses.
Alternatively, vectors could be added directly into the network as an additional layer of nodes (Figure 1). While previous work has predicted vector-virus networks [13], no study has predicted host-vector-virus networks. Existing network models have been used to predict undetected links in tripartite networks [14], but this has yet to be explored for ecological networks. This approach would be much more informative than the bipartite form, but also requires difficult-to-obtain data: sylvatic VBD cycles tend to be characterized one at a time in scientific literature (e.g., "Culex quinquefasciatus vectors West Nile virus in house finches"). While available datasets could be used to reconstruct these cycles from each of their component parts (biting preferences, vector competence, and host-virus compatibility), to our knowledge, this has not previously been explored in predictive work.
To address this, we developed two new approaches and tested them on mosquito-borne flaviviruses, a well-studied group that includes important zoonoses like dengue, West Nile, yellow fever, and Zika viruses. Through a synthesis of existing data sources, we combined data on mammal-virus associations [12], vector-flavivirus associations [13], and diptera-mammal biting preferences [15]. We combined these data into one mammal-mosquito-flavivirus network, which can also be reduced to a mammal-flavivirus network where viruses' mosquito communities are represented as node metadata. Using boosted regression trees (BRT; a machine learning method popular in ecological modeling, also sometimes called gradient boosting machines), we tested two approaches to predicting vector-borne transmission as an aspect of the host-virus network. First, we predicted the mammal-flavivirus network using every possible combination of host, vector, and virus traits as metadata for any given host-virus association, assuming that additional data layers would enhance model performance. This was generally shown to be true, although the combination of host and vector trait data was not informative compared to the incorporation of viral trait data. Second, we developed a tripartite model of vector-borne disease transmission, in which each link represents a known host-vector-virus link, and attempted to predict those complete cycles using traits of hosts, mosquito vectors, and viruses. We found that these models performed more poorly on average, but that they were able to make better-than-random predictions, including some of relevance to arboviral ecology and human health.

Methods
Host, vector, and virus data
Host-virus interaction data were obtained from the CLOVER database [16], a manually and programmatically curated database of host-virus associations built by reconciling four disparate datasets (the Host-Parasite Phylogeny Project, or HP3 [12]; the Global Mammal Parasite Database v2.0 [17]; the Enhanced Infectious Disease Database [18]; and an unnamed dataset curated by Shaw et al. [19]). We used CLOVER release 0.1.2, which includes data on 5,477 known interactions between 831 viruses and 1,085 mammal species. These data have been carefully cleaned for taxonomic quality control and include detailed metadata on interaction evidence. These data are also part of a larger open database called The Global Virome in One Network (VIRION), the largest open atlas of vertebrate-virus associations [20]. Although more data are available from this source, we restricted our analysis to the manually-curated data to prevent inclusion of spurious interactions.
Vector-virus association data were taken from a previous study that aimed to predict the mosquito-flavivirus network [13]. These data include 334 associations between 180 mosquito species and 37 flaviviruses. Host-vector association data were taken from a recent study of dipteran biting networks [15]. These data describe 1,744 associations between 255 biting dipteran species and 214 hosts (including 67 mammals). Trait data for hosts, vectors, and viruses were assembled from published sources. Thirty-three traits on mosquito life history, ecology, and geography and 22 traits on viral features were taken from the Evans et al. study of the mosquito-flavivirus network [13]. Finally, we used a total of 18 traits on mammal life history, ecology, and morphology from the PanTHERIA database [21].
Modeling approach
Boosted regression tree (BRT) models were used to model host-virus and host-vector-virus associations. BRT models have previously been used to model species distributions [22], predict associations in bipartite networks [23,24,25,5], and in other conservation and management settings (e.g., [26]). Much of this diversity of applications can be attributed to the allowance for nonlinear responses and variable interactions in BRT models. Since the regression tree is hierarchical, "upstream" splits based on one variable influence "downstream" splits, which automatically models variable interactions. Further, boosting enhances learning on complex data: the process produces many regression trees with a small number of splits, and each of these "weak learners" iteratively builds on previous trees to account for the remaining variation. This approach removes the need to partition variance among submodels, as the goal is not to examine the components of variance explained, but to assess overall model performance with the inclusion or exclusion of particular variable sets. Models were trained in the R statistical programming language [27] using the gbm package [28].
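To make the boosting idea concrete, the following is a minimal sketch in Python, not the R gbm code used in the study. It assumes a single numeric covariate, squared-error loss, and one-split trees ("stumps"); the function names, toy data, number of trees, and learning rate are all illustrative assumptions. Because each tree here has only one split, this sketch does not capture variable interactions, which would require deeper trees.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    best = None
    # Candidate thresholds: every distinct value except the largest, so that
    # both sides of the split are non-empty (assumes >= 2 distinct x values).
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit weak learners one at a time, each to the current residuals."""
    baseline = sum(ys) / len(ys)
    fitted = [baseline] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - f for y, f in zip(ys, fitted)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Shrink each tree's contribution so later trees can refine the fit.
        fitted = [f + learning_rate * (left_mean if x <= threshold else right_mean)
                  for x, f in zip(xs, fitted)]
    return baseline, stumps, learning_rate


def predict(model, x):
    baseline, stumps, learning_rate = model
    return baseline + sum(learning_rate * (lm if x <= t else rm)
                          for t, lm, rm in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (hypothetical): one covariate and a noisy binary outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

In this sketch, each stump is fit to what the previous stumps left unexplained, which is the sense in which the weak learners "build on" one another.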

Model 1: Modeling mammal-virus associations as a bipartite network
We used the mammal and virus trait data as described above. However, mosquito vector "traits" were created by calculating the number of mosquito species in a given genus that have been demonstrated to transmit a particular flavivirus [13]. This is because each host-virus association could be transmitted by any number of mosquito species, creating a range of trait values that may be less informative than simply knowing the breadth and composition of the vector community. This resulted in a total of 19 mosquito vector covariates, ranging in value from 0 to 22 species. We removed covariates with less than 25% data coverage, resulting in 13 host traits, 19 mosquito covariates (as virus traits), and 17 virus traits.
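The genus-level vector covariates described above can be sketched as follows. This is an illustrative Python example, not the study's R code; the association list is a small toy set rather than the study dataset, the function name `genus_counts` is hypothetical, and the genus is assumed to be the first word of the species binomial.

```python
from collections import defaultdict

# Toy (mosquito species, virus) associations; not the study data.
vector_virus = [
    ("Culex pipiens", "West Nile virus"),
    ("Culex quinquefasciatus", "West Nile virus"),
    ("Culex pipiens", "West Nile virus"),  # duplicate record, counted once
    ("Aedes aegypti", "Zika virus"),
    ("Aedes albopictus", "Zika virus"),
    ("Aedes aegypti", "dengue virus"),
]


def genus_counts(pairs):
    """For each virus, count the distinct vector species in each genus."""
    species = defaultdict(set)
    for mosquito, virus in pairs:
        genus = mosquito.split()[0]  # assumption: genus is the first word
        species[(virus, genus)].add(mosquito)
    genera = sorted({g for _, g in species})
    viruses = sorted({v for v, _ in species})
    # One covariate per genus; viruses with no vectors in a genus get 0.
    return {v: {g: len(species.get((v, g), ())) for g in genera}
            for v in viruses}


covariates = genus_counts(vector_virus)
```

Each virus then carries one count per mosquito genus, which can be attached to every host-virus pair involving that virus.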
The data were split into 80% training and 20% testing sets, where model performance was assessed on the 20% test set. A total of 20 models per covariate group were fit in order to account for the random train/test split. These same 20 train/test divisions were used across the different covariate models, as we trained every possible combination of host, vector, and virus trait data to predict host-virus associations. Together, this resulted in a dataset that allows the estimation of the relative influence of host traits, viral traits, and vector community data on resulting mammal-virus associations. We sampled background data by randomly combining host and virus species, resulting in 25% known positive associations and 75% background data.
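The design in this paragraph (every combination of covariate groups, a shared set of train/test splits, and a 1:3 ratio of positives to background) can be sketched in Python as below. The helper names, the toy species labels, and the use of seeds to reproduce splits are assumptions for illustration; in particular, whether background pairs were sampled without replacement or excluded known positives is not stated in the text, and both are assumed here.

```python
import random
from itertools import combinations


def covariate_sets(groups=("host", "vector", "virus")):
    """Every non-empty combination of covariate groups (7 for 3 groups)."""
    return [c for k in range(1, len(groups) + 1)
            for c in combinations(groups, k)]


def sample_background(positives, hosts, viruses, ratio=3, seed=0):
    """Draw unique host-virus pairs that are not known positives."""
    known = set(positives)
    candidates = [(h, v) for h in hosts for v in viruses
                  if (h, v) not in known]
    n = ratio * len(known)
    if n > len(candidates):
        raise ValueError("not enough unobserved pairs for this ratio")
    return random.Random(seed).sample(candidates, n)


def split_indices(n, seed, test_fraction=0.2):
    """Reproducible train/test split; reuse the seed across covariate sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


# Toy species labels (hypothetical).
hosts = ["h1", "h2", "h3", "h4"]
viruses = ["v1", "v2", "v3"]
positives = [("h1", "v1"), ("h2", "v3")]

background = sample_background(positives, hosts, viruses)
data = [(pair, 1) for pair in positives] + [(pair, 0) for pair in background]
positive_fraction = sum(label for _, label in data) / len(data)

# The same 20 seeds would be reused for every covariate combination.
splits = [split_indices(len(data), seed) for seed in range(20)]
```

Reusing the same seeds across all seven covariate combinations means that performance differences reflect the covariates rather than the particular split.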
We subset these data in two different ways, to explore how vector data may improve prediction of 1) flaviviruses for which we have some vector data (235 known host-virus associations) and 2) all vector-borne viruses (3,016 host-virus associations). This breakdown corresponds to data subsets of 1) only mosquito-borne flaviviruses present in [13] and 2) all viruses that were recorded as vector-borne (or unknown) in the CLOVER data [16]. We present the flavivirus-specific results here, which are qualitatively similar to the more general models for all vector-borne viruses, which are in the Supplemental Materials.

Model 2: Modeling mammal-mosquito-virus associations as a tripartite network
Using the same data resource as used above on host-virus associations, we now considered the identity of the mosquito vector species, the association between the vector and virus [13], and the feeding association between mosquito vector and mammal species [15]. While host and virus traits were largely the same as considered above, the mosquito vector traits consisted of a set of 33 mosquito vector traits from [13]. Host and virus traits were required to have 75% data coverage (the same as in Model 1) to be included in this analysis. This resulted in 8 host traits, 29 vector traits, and 16 virus traits. A tripartite link (detailing the full host-vector-virus cycle) was only considered if all three associations were present: host-vector, vector-virus, and host-virus. This creates a situation where a host and vector species may interact, and that vector may be infected by a virus, but this is not a confirmed link if there is no evidence that the host is infected by the virus.
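The rule for a confirmed tripartite link can be written as a three-way join. The following Python sketch uses toy associations (not the study data) and a hypothetical function name; it keeps a host-vector-virus triple only when all three pairwise associations are present.

```python
def tripartite_links(host_virus, vector_virus, host_vector):
    """Triples (host, vector, virus) supported by all three pairwise links."""
    host_virus = set(host_virus)
    return {
        (host, vector, virus)
        for host, vector in set(host_vector)
        for vector2, virus in set(vector_virus)
        if vector == vector2 and (host, virus) in host_virus
    }


# Toy associations (illustrative only).
host_vector = {("human", "Aedes aegypti"), ("horse", "Culex pipiens")}
vector_virus = {("Aedes aegypti", "Zika virus"),
                ("Culex pipiens", "West Nile virus")}
host_virus = {("human", "Zika virus")}

links = tripartite_links(host_virus, vector_virus, host_vector)
```

In this toy example, the horse, Culex pipiens, and West Nile virus triple is not returned, because the toy host-virus set has no record of that host-virus pair even though the other two associations exist.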
A total of 135 full tripartite links were documented. We sampled background data by randomly combining host, vector, and virus species and then adding enough unique host-vector-virus background points to have 50% true tripartite links and 50% background data. Models were trained in the same manner as in Model 1.
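A minimal Python sketch of this balanced background sampling follows. The function name and toy labels are hypothetical, and excluding known positives from the background is an assumption that the text does not state explicitly.

```python
import random


def sample_triple_background(positives, hosts, vectors, viruses, seed=0):
    """Draw as many unique random triples as there are known positives."""
    known = set(positives)
    total = len(hosts) * len(vectors) * len(viruses)
    if 2 * len(known) > total:
        raise ValueError("too few possible triples for a 50/50 design")
    rng = random.Random(seed)
    background = set()
    while len(background) < len(known):
        triple = (rng.choice(hosts), rng.choice(vectors), rng.choice(viruses))
        if triple not in known:  # assumption: positives are excluded
            background.add(triple)
    return sorted(background)


# Toy labels (hypothetical).
hosts = ["h1", "h2", "h3"]
vectors = ["m1", "m2"]
viruses = ["v1", "v2"]
positives = [("h1", "m1", "v1"), ("h2", "m2", "v2"), ("h3", "m1", "v2")]

background = sample_triple_background(positives, hosts, vectors, viruses)
```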
Assessing model performance
Model performance was quantified using two measures: accuracy and the area under the receiver operating characteristic curve (AUC). Accuracy was defined as the correctly estimated positives (true positives) and negatives (true negatives) over all the predictions, capturing the fraction of times the model correctly classified host-virus associations in the holdout data. Accuracy is bounded between 0 and 1, where larger values correspond to higher model performance. AUC is a widely used metric of model discrimination that captures the ability of the classifier to rank positive instances higher than negative instances. AUC is bounded between 0 and 1, where a random model will perform with AUC of 0.5 on average, and values closer to 1 indicate higher model performance.
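Both metrics can be computed directly from their definitions. The Python sketch below is illustrative: the labels and scores are made up, and the 0.5 classification threshold used for accuracy is an assumption, since the text does not state which threshold was applied. AUC is computed as the fraction of positive-negative pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """(true positives + true negatives) / all predictions."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up holdout labels and predicted probabilities.
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.1, 0.6, 0.8]

acc = accuracy(labels, scores)   # 4 of 6 correct at the 0.5 threshold
area = auc(labels, scores)       # 8 of 9 positive-negative pairs ranked correctly
```

Unlike accuracy, AUC does not depend on a threshold, which is why the two can disagree when predicted probabilities are poorly calibrated but well ranked.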
Data and code availability
R code and data to reproduce the analyses are available on figshare at https://doi.org/10.6084/m9.figshare.17033309.

Results
Model 1: The mammal-virus models
Models trained only on host (AUC = 0.57) or vector (AUC = 0.46) traits consistently performed poorly at the task of host-virus link prediction (Figure 2), though the viral trait model performed well (AUC = 0.95). Generally, combinations of predictor features led to improved model performance. The full model including host, vector, and virus traits performed extremely well (AUC = 0.96). However, the models using only host and virus traits or only vector and virus traits also performed extremely well (performance differences among these models were essentially indistinguishable; Figure 2). The inclusion of viral traits seems to have been particularly important; for comparison, the model using host and vector traits to predict host-virus associations barely performed better than random (AUC = 0.59).
Variables important for predicting host-virus associations were generally conserved across submodels considering all combinations of host, vector, and virus traits (Figure 3).In the full bipartite model, the most informative variable was whether a virus was found in the Pacific region (likely a proxy for Zika virus, which spread through Pacific islands preceding the epidemic in the Americas).
Other important characteristics predictive of host-virus associations in bipartite models including virus traits were disease severity, genome length, year of virus isolation, whether the virus is found in Africa or Australia, and viral clade. In models that omitted virus traits, the top predictors represented host allometry (body mass and metabolic rate, an unsurprising axis of variation) and Culex association, which likely captures a latent split between some bird-reservoired viruses (e.g., West Nile virus) and primate-reservoired ones (e.g., dengue and Zika virus).
Overall, our results suggest that models learned from vector trait data, particularly in the full model, where the contribution of each individual variable is more diffuse. However, our findings also indicate that the inclusion of vector data only minimally improved performance after data on hosts and viruses was already available. As host-virus models are usually trained only on host and virus trait data, our findings suggest that the incorporation of vector data into a host-virus model is an imperfect way to explore the role of vectors in structuring the host-virus network. However, this also suggests that improved arthropod trait data could improve model performance, and thus the importance of the vector cannot be overlooked.
Finally, we investigated whether including vector trait data would improve performance even if only available for a subset of data informing the network. To test this, we trained the model on a network that included all the arboviruses present in the CLOVER dataset, even though viral trait data and vector associations were only known for flaviviruses. We found that the model using just host and virus traits performed substantially worse here (AUC = 0.70) than the flavivirus-only model with those traits (AUC = 0.95). We found that the best performing models were those that used vector and virus traits (AUC = 0.98) and those that included host, vector, and virus traits (AUC = 0.99; Figure 2). We suggest that this finding indicates that adding data on the vector aspect of transmission may be useful even when it only covers a subset of species in the network.

Model 2: The tripartite model
Models trained on tripartite (i.e., host-vector-virus) associations had moderate explanatory power (mean AUC = 0.64 (0.065); mean accuracy = 0.66 (0.046), across 100 models trained on random subsets). This lower model performance could simply be due to the smaller amount of data used for training (recall that only 135 full tripartite links were known), or the imbalance between the number of potential full tripartite links given host, vector, and virus diversity, and the small number of realized links (see the small number of red links in Figure 4). Although the model's performance was only fair, we found that the model still predicted higher suitability for tripartite links where one or two of the three possible components were confirmed (Figure 5), even though these would be recorded as a "0" outcome variable, the same as if none of them were known.
We suggest that this indicates the model was identifying and reproducing real biological signals of compatibility.
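The comparison summarized in Figure 5 can be sketched as a grouping of unconfirmed triples by how many of their three pairwise components are known. The Python example below is a hedged illustration: the species labels, the predicted probabilities, and the function names are all hypothetical and are not taken from the study's outputs.

```python
from collections import defaultdict


def n_known(triple, host_virus, vector_virus, host_vector):
    """Number of pairwise associations (0 to 3) known for a triple."""
    host, vector, virus = triple
    return (int((host, virus) in host_virus)
            + int((vector, virus) in vector_virus)
            + int((host, vector) in host_vector))


def mean_prediction_by_known(predictions, host_virus, vector_virus, host_vector):
    """Average predicted probability for each count of known components."""
    grouped = defaultdict(list)
    for triple, prob in predictions.items():
        k = n_known(triple, host_virus, vector_virus, host_vector)
        grouped[k].append(prob)
    return {k: sum(v) / len(v) for k, v in sorted(grouped.items())}


# Hypothetical pairwise associations and model outputs.
host_virus = {("h1", "v1")}
vector_virus = {("m1", "v1")}
host_vector = {("h2", "m1")}
predictions = {
    ("h2", "m2", "v2"): 0.10,  # no components known
    ("h1", "m2", "v1"): 0.30,  # host-virus known
    ("h2", "m1", "v1"): 0.50,  # vector-virus and host-vector known
}

summary = mean_prediction_by_known(predictions, host_virus,
                                   vector_virus, host_vector)
```

An increasing mean across the groups with zero, one, and two known components would correspond to the pattern described above.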
The top nine covariates for predicting tripartite (i.e., host-vector-virus) associations were host (n = 5) or virus (n = 4) traits (Figure 6). The top predictors mostly reflected the geography of transmission (host geographic range size, virus transmission in Asia, vector presence in Africa), the life history of the host (age at first birth, lifespan, weaning age, and neonate body mass), and aspects of viral transmission (genome length and transmission by non-mosquito arthropods).
The predictions made by the tripartite model suggest the model may be able to recover interesting or important biologically-plausible interactions. Both the top predicted "undiscovered" human-mosquito-virus links (Table 1) and mammal-mosquito-virus links (Table 2) heavily over-represent a small number of viruses, in particular Wesselsbron virus and West Nile virus. This is driven by the existing level of sampling in the data: West Nile has the greatest number of known hosts (n = 103 species) and mosquito vectors (n = 51); Wesselsbron has the second highest number of vectors (n = 41), though many fewer hosts (n = 11; ranked #13).
This "rich-get-richer" pattern has been previously debated as a strength or weakness for link prediction models; it may be that models are identifying a genuine biological signal of generality (which is known to be true for these viruses), but they may also be recapitulating sampling bias [5,29] and underpredicting link probabilities for undersampled species. Indeed, the richness of flavivirus data available to us in this study is likely largely due to a discovery and data synthesis bump in the wake of the Zika virus epidemic in the Americas. The mammal-mosquito-virus predictions also contain a visible signal of geographic bias: most of the top predictions involve agricultural species (pigs, Sus scrofa; cows, Bos taurus; or sheep, Ovis aries), synanthropic species (black rats, Rattus rattus), or charismatic North American species (the opossum, Didelphis virginiana; the raccoon, Procyon lotor; the white-tailed deer, Odocoileus virginianus). These likely reflect a compounded bias between the host-virus association data and the biting data, the latter of which is particularly limited to North American and European species.
Despite the signal of data bias in these predictions, the models reveal several predictions of biological interest. For example, Anopheles hyrcanus is predicted as a possible vector of Kokobera virus in humans. The virus was implicated in an outbreak of acute polyarticular illness in Australia in the 1980s based on serology, but it remains poorly understood [30]. The virus was first isolated from Culex annulirostris, which also vectors Japanese encephalitis virus and a handful of others; An. hyrcanus is a European and Asian mosquito only currently known to vector Japanese encephalitis virus. Similarly, the model predicts that Culex tritaeniorhynchus (the main vector of Japanese encephalitis virus, found in southeast Asia) could transmit Murray Valley encephalitis virus in wallabies (Macropus agilis). Neither the Australian virus nor the host has been recorded in association with this vector, but as of 2021, the mosquito has been detected in Australia [31], indicating the possibility that this interaction could now emerge.

Discussion
In this study, we considered two approaches to incorporate arboviral life cycles into link prediction models of the mammal-flavivirus network. First, we used a host-virus (bipartite) framework, and assessed the relative influence of including different trait covariates. We found that viral traits were the strongest contributor to model performance, and the incorporation of host and vector traits into the bipartite models did little to improve model performance. Second, we explored how these models could be extended to predict the entire host-vector-virus (tripartite) network. This framing is inherently more complex than the host-virus predictive problem and is severely limited by the availability of training data, but appears promising for future development.
Neither of these approaches provided a complete solution to the host-vector-virus prediction problem, though their limitations differ slightly, with different implications for next steps. Adding vector community data to the bipartite (host-virus) models may be useful where data allow, but may be less important when more detailed, biologically meaningful viral trait data are available. Compared to synthetic datasets of animal ecology, life history, and morphology, only a handful of viral traits (e.g., genome length or disease severity) are available in a standardized format, to the point that viral host range is itself often used as a viral trait (e.g., our "primate" or "bird" traits, or "host breadth"; see Table S1). Recently, some studies have begun to use immunogenetic or genome composition variables to characterize host and virus compatibility more directly [32,33,34,35,36,37]; comparable features for vectors are not yet available or tested in this framework.
Shifting towards these kinds of predictors could help models identify more meaningful signals of virus-animal compatibility, and proportionally reduce the signal of bias in predictions.
In contrast, directly modeling the host-vector-virus tripartite network addresses the nuance of vector transmission head-on, but this problem is more severely data limited. As a result, these predictions are very visibly influenced by the geographic and taxonomic bias in the component datasets. However, these data limitations can be addressed by investment in future work characterizing arboviral life cycles in understudied areas [38]. Vector-virus combinations can be tested in the laboratory, including in model-experiment feedback designs that leverage existing predictions (e.g., [13]), much like model-guided fieldwork can be used to optimize viral discovery [25]. Similarly, further investigation of mosquito biting behavior will help resolve the host-vector component [15], highlighting the need for "basic" natural history research even on mosquitoes that are not known to be primary vectors of human disease.
To our knowledge, our study is the first to attempt modeling the entire tripartite host-vector-virus network. This is a clear knowledge gap in existing approaches to modeling the host-virus network: identifying a suitable host-pathogen association that has no shared vector may not accurately estimate spillover risk. This may be particularly relevant to efforts to identify viruses with undiscovered zoonotic potential, as the presence or absence of human-biting mosquitoes will be a key contributor to their emergence risk [39]. Similarly, the tripartite framework can provide useful insights into the establishment of sylvatic cycles in interepidemic periods or upon expansion into new geographic areas. The ability of arboviruses to persist in non-human hosts may determine whether an epidemic ends as immunity grows (like Zika virus in the Americas, which was primarily transmitted human-to-human by Aedes aegypti and Ae. albopictus) or instead becomes a regular occurrence (e.g., yellow fever in the Americas, which is maintained by Haemagogus spp. and Sabethes spp. in non-human primates, between human epidemics driven by Aedes aegypti). These are likely to be particularly important nuances as arboviruses continue to spread around an increasingly globalized world in a changing climate [40,41,42,43].
The question of how to model multi-layer ecological interaction networks is also likely to have broader implications in computational ecology. For example, there are other cases where researchers are interested in the traits that structure tripartite networks, such as bat-bat fly-pathogen networks or plant-pest-parasitoid networks. Multilayer networks are also a topic of increasing interest in network science and mathematics, which will likely open doors for more advanced predictive approaches than the extensions we propose here. This is therefore a promising space for the development of future models, particularly if approached through the lens of iterative validation and data collection [25].

Figure 5: The tripartite model predicts a higher average probability for associations that have one or two links known (which are still not recorded as positive values in the training data) than those with no elements known to be possible. This suggests that the model is capable of more than just recapitulating the data, and is able to distinguish different levels of biological plausibility within unknown tripartite elements. Axis labels: "Life cycle elements known" and "Predicted probability of tripartite link".

Figure 1: (A) Predicting host-virus associations (a bipartite network) based on host traits (T_h), virus traits (T_v), and vector communities (m(v)) associated with viruses is a different problem than (B) predicting host-vector-virus associations (a tripartite network) based on host traits, vector traits, and virus traits. In this paper, we consider both as approaches to end goals like forecasting potential novel associations or spillover scenarios.

Figure S2: Correlation matrix between model predictions of the full set of subset models including different combinations of host, vector, and virus traits. The full model, including all traits, resulted in the predictions that were most weakly related to the other model predictions, though this model had similar performance to the other models (see main text Figure 2). Lower triangle values and color scale correspond to Pearson's correlation coefficient values.

Table 1: Top predicted epidemic cycles in humans. All vectors are known to be human-biting; all viruses are known to be zoonotic based on either clinical or serological data.