Inferring trends in pollinator distributions across the 1 Neotropics from publicly available data remains 2 challenging despite mobilisation efforts

Abstract

Generally, records of this type are referred to as species occurrence data (sometimes called 64 biological records). 65 Naturalists have been accumulating species occurrence data for centuries. Historically, such data 66 were primarily collected as preserved specimens in museums and herbaria (Newbold, 2010;Spear et 67 al., 2017), and in written accounts (e.g. Oswald and Preston, 2011). More recently, however, this 68 information was also recorded through distribution atlases (e.g., Preston, C.D., Pearman, D.A. & 69 Dines, 2002), and various other structured and unstructured monitoring and citizen science 70 initiatives (Boakes et al., 2010;Pescott et al., 2015;Petersen et al., 2021). Taken together, these data 71 constitute an enormous resource that holds the potential to shape our understanding of species' 72 geographical distributions, as well as how, and potentially why, they have changed over time. To 73 realise this potential, however, they must be accessible to the research community. 74 Species occurrence data have become increasingly accessible over the last two decades. This can be 75 attributed to the mobilisation of historic records from preserved specimens (taken here to include 76 both the digitization of analog records and the deposition of digital records in public databases), the 77 proliferation and growth of citizen science monitoring programs, and the launch of online data 78 portals through which these data can be easily accessed and shared (Ellwood et  . To put this into context, the 80 largest online data portal, the Global Biodiversity Information Facility (GBIF hereafter), currently 81 holds nearly two billion species occurrence records spanning all continents and major taxa (GBIF.org, 82 2021). Approximately ten percent of the records held on GBIF derive from preserved specimens in 83 museums and herbaria that have been mobilised for accession online. Whilst this represents a huge 84 quantity of data, it is estimated that globally, museums and herbaria hold 1.5-2.0 billion preserved 85 specimens (Townsend Peterson et al., 2015). That is to say, up to around ninety percent of these 86 records have not been mobilised for use by the research community, at least not through GBIF. To 87 bridge this gap, resources are now being devoted to national and international data mobilisation 88 initiatives (Nelson and Ellis, 2019; also see e.g. https://www.idigbio.org/). It is essential, therefore, to 89 understand the extent to which specific mobilisation efforts can improve our ability to derive robust 90 estimates of trends in species' distributions. 91 The collection and mobilisation of species occurrence records provide the cornerstone for our 92 understanding of past and current species distributions. However, these activities are typically 93 conducted non-randomly along the axes of space, time an taxonomy; hence, the resultant data are 94 biased towards particular locations, periods and species, respectively ( Kimmig, 2020). These biases become more complicated when multiple datasets, each with their own 97 idiosyncrasies, are aggregated (Whitaker and Kimmig, 2020). Consequently, there is no guarantee 98 that any slice of species occurrence data will be suitable for any particular analytical use. 99 Biases can seriously undermine the estimation of temporal trends in species' distributions, which, in 100 most cases, is a matter of statistical inference: the analyst does not possess a complete census of all 101 species of interest in all places and time periods of interest (i.e., the statistical population) so must 102 instead rely on a sample (the available species occurrence data). Straightforward inference in 103 statistics is predicated on the assumption that the data are sampled randomly from the statistical 104 population of interest (Swinscow, 1997 Fonturbel, 2021a, 2021b). We create a "pre-digitization" dataset by removing the records that were 137 introduced via these two mobilization efforts. We then compare the pre-digitization dataset with the 138 full dataset using three criteria: 1) the total quantity of data after various stages of filtering (e.g. 139 removing records with spatial issues); 2) the extent of any potential biases; and 3) estimates of 140 temporal trends in species' distributions obtained by fitting statistical models to the data. 141

142
Data 143 We extracted occurrence data for Anthophila (GBIF, 2021a, 2021b), Syrphidae (GBIF, 2021c), 144 Phyllostomidae (GBIF, 2021d) and Trochilidae (GBIF, 2021e) collected in the Neotropics (defined 145 here as South and Central America, Mexico and the Caribbean islands) over the period 1950 to 2019 146 from GBIF. We used a bounding box (65 ºS to 40 ºN) to filter the data and subsequently removed 147 records from the USA which fell within its limits. We used the coordinateCleaner R package (Zizka et  148 al., 2019) to flag and remove records with various potential spatial issues: coordinates matching 149 country centroids and capital cities (indicating imprecise geolocation of records from vague locality 150 names), and locations of biodiversity institutes; and records with equal latitude and longitude which 151 can indicate data entry errors. 152

Data assessment
153 Bias heuristics 154 To assess the data for sampling biases, we used five data-driven heuristics. Although the goal is to 155 draw species-level inferences, we apply these heuristics at the taxonomic group level, i.e. separately 156 for the bees, hoverflies, hummingbirds and leaf-nosed bats. It is not possible to assess the data for 157 sampling biases at the species level because they are presence-only: such data provide no 158 information on sampling effort in space or time if a species was not detected. Instead, we use the 159 records for all species in each taxonomic group as a proxy for the spatio-temporal distribution of 160 sampling effort for that group (often called the "target group approach"; see e.g., Phillips et al. the total number of records for a taxonomic group, and the second is the proportion of species 165 known to occur in the Neotropics that have been recorded (i.e., inventory completeness). We 166 acknowledge that these are probably better described as measures of "coverage" than "bias". 167 However, when one looks at how they change over time (as we do here), then they indicate the 168 potential for temporal biases in recording intensity and taxonomic coverage, respectively, both of 169 which will be important to take into account for accurate inference. Information on the number of 170 species known to occur in the Neotropics, derived from the literature, online datasets (specifically 171 for Anthophila), specialists and authorities in each taxonomic group (among the authors), is used to 172 calculate heuristic two (Table 1). 173 The third heuristic is used to indicate preferential sampling of rare species. It is calculated by 174 regressing the total number of records for each species on the number of grid cells (defined below) 175 in which they have been recorded. Each species' deviation from the fitted regression indicates the 176 degree to which it is over-or under-sampled given its recorded range size (Barends et al., 2020). 177 Extending this concept, we use the coefficient of variation (r 2 ) from the model as a measure of 178 "rarity bias". This heuristic ranges from 0, indicating high bias (rare species are over-sampled relative 179 to commoner species), to 1, indicating no bias. Note that where there is a negative correlation 180 between recorded range size and sample size this heuristic becomes problematic to interpret; this 181 problem did not arise here. 182 The fourth heuristic provides a measure of geographic bias; specifically, it measures the degree to 183 which the data deviate from a random distribution in geographic space. This measure is based on 184 the Nearest Neighbour Index (NNI; Clark and Evans, 1954). The NNI is given as the ratio of the 185 average nearest neighbour distance of the empirical sample (using the associated coordinates) to 186 the average nearest neighbour distance of a random distribution of the same density across the 187 same spatial domain. We simulated 15 random distributions of equal density to the occurrence data, 188 which allowed us to present the uncertainty associated with the index. For our NNI, values may 189 range from 0.00 to 2.15: values below 1 indicate that the data are more clustered than a random 190 distribution, values of ~ 1 indicate that the data are randomly distributed, and values above 1 signify 191 over-dispersion relative to a random distribution. We acknowledge that some records available on 192 GBIF have been converted to point locations from, for example, gridded datasets. In these cases, 193 coordinates are only approximate and the NNI may be distorted. 194 The fifth and final heuristic indicates whether the same portion of geographic space has been 195 sampled over time; variation in geographic sampling confounds space and time, and this can result 196 in serious inferential problems if population trends have not been uniform over space. This heuristic 197 comprises a gridded map indicating the number of time periods (defined below) in which each grid 198 cell has been sampled. Of course, changes in the geographic distribution of records could indicate 199 changes in species' distributions and not a bias. However, we suggest that, when working at the 200 taxon group level (i.e., across many species) and at a coarse resolution (see below), changes in which 201 cells have records is most likely to reflect a bias. 202 GBIF on January 7 th 2021. 224

Utility of data for trend estimation 225
To compare the utility of the GBIF data before and after the addition of the two datasets described 226 above, we focussed on Chile, where the newly-mobilised data were collected, and on the bees 227 (Anthophila), because both datasets include a large number of records for this taxon. We began by 228 comparing the total quantity of data before and after digitization, the quantity of records with no 229 spatial issues and the total number of species represented. We then used the five heuristics 230 described earlier to compare the biases in the data pre-and post-digitization. Finally, we compared 231 estimated temporal trends in Anthophila distributions in Chile derived from GBIF before and after 232 the additional data became available. 233

Trend estimation 234
To estimate temporal trends in bee distributions in Chile, we used three statistical models. These 235 include in range size after attempting to correct for changes in recording intensity (see the supplementary 240 material for full details of the models used here). We fitted the RR models at the same resolution as 241 the bias assessment: 1⁰ grid cells in decadal time periods. The Telfer method is slightly different in 242 that it can only be used to compare range sizes between two time periods; hence, we designated the 243 first three and last three decades in our analysis as the first and second periods, respectively (data 244 from the decade in between these periods were not used to fit this model

260
Continental-scale data assessment 261 A plot of the relative number of records against time (Fig. 1A) clearly indicates a temporal bias in 262 data quantity. The number of records of bees, hoverflies, and leaf-nosed bats in each decade is 263 highly variable with no obvious directional trend. The number of records for hummingbirds, on the 264 other hand, shows a marked increase in recent decades (2000-2019). 265 In addition to temporal bias in data quantity, the data are also biased taxonomically, and the extent 266 of these biases varies over time. First, for all taxa, the proportion of known species recorded within 267 GBIF is appreciably < 1. The leaf-nosed bats and hummingbirds are, however, best represented: in 268 the early decades around 75% of species in these groups were recorded and in the later decades this 269 increased to almost 100%. Data are not available for the vast majority of bee and hoverfly species 270 (Fig. 1B). Second, for most groups, rare species tend to be overrepresented in the data. Recall that 271 the taxonomic bias index in Fig. 1C is the r 2 from a regression of the number of records on recorded 272 range size for each species. For bees, leaf-nosed bats and hummingbirds, the index is generally high 273 in the early decades (≥ 0.7); this indicates low potential for selective sampling of rare species. 274 However, the indices fall in later decades which indicates an increased potential for preferential 275 sampling of rare species. The data for hoverflies are most variable in terms of potential rarity bias 276 and contrast with the other groups in that the potential bias is less severe in the later decades. For 277 all groups, there are some decades in which there appears to have been selective sampling of rare 278 species. 279 To reveal the potential for spatial biases in the data, we looked at the degree to which they are 280 clustered in particular portions of the Neotropics using the NNI. For all groups, and in all decades, 281 the data are more clustered than would be expected by chance (Fig. 1D)  percentile calculated by comparing the data to 30 random distributions. Panels E-H show the 307 number of decades in which each 1⁰ grid cell was sampled for each taxon. 308

Data quantity 310
The two newly-mobilised datasets drastically increased the availability of Anthophila records 311 collected in Chile between 1950 and 2019 on GBIF ( Table 2). The total number of records and the 312 number of records without common spatial issues (see Methods) increased approximately sixfold; 313 the number of records with no spatial issues and which are identified to species level increased 314 approximately sevenfold; and the number of species recorded increased from 326 to 356 (Table 2). 315 The increase in species recorded in GBIF represents a move from 70% to 77% of the 464 bee species 316 known to occur in Chile (Lopez-Aliste and Fonturbel, 2021b). 317 Biases 320 Whilst the newly-digitized data drastically increased the quantity of data available for bees in Chile, 321 it did not reduce all forms of bias, and, in some cases, increased their severity. For example, Fig. 2A  322 shows that the vast majority of the new data were collected in decades two, three and four (1960-323 1989). A corollary is that the addition of these data introduced strong temporal biases in data 324 quantity ( Fig. 2A, 2B). Moreover, in the full dataset, on average, preferential sampling of rare species 325 is more apparent (Fig. 2C). Finally, the addition of new records did little to increase the geographical 326 representativeness of the data: the NNIs indicate a similar, if not slightly greater, departure from a 327 random distribution in the full dataset (Fig. 2D). However, we remind the reader that the NNI cannot 328 determine whether the data are non-randomly distributed due to sampling biases or a taxon's true 329 distribution. 330 Whilst the newly-digitised records did little to reduce some forms of bias in the available data, they 331 improved the situation in other respects. The addition of the new data resulted in a more consistent 332 level of taxonomic coverage across decades (~ 30-40 % of species known to occur in Chile; Fig. 2B). 333 They also increased the number of grid cells that have records in multiple decades, with many grid 334 cells even being sampled in all decades (Figs 2E and F). 335 further from 1 are more clustered). Panels E and F show the number of decades in which each 1⁰ 345 grid cell was sampled. 346

Trend estimates 347
It was not possible to fit all models for all 146 species of Anthophila for which data are available in 348 Chile, particularly when using the pre-digitization data. For the Telfer model we omitted species that 349 were not recorded in at least two grid cells in the first time period: see Telfer et al. (2002) and the 350 supplementary material for the rationale. As a result, it was only possible to estimate distribution 351 changes for 32 species using the Telfer method with the pre-digitization data. A separate problem 352 emerged when fitting the relatively complex RR + site model using the pre-digitization data: models 353 for 21 species returned "singular fits". Singular fits occur where the estimated variance of the 354 random intercept is 0, which can indicate that the model is overfitted. As a result, we only included 355 the 304 species for which RR + site models were successfully fitted, but also fitted the simpler RR 356 models which do not include random effects; these models were successfully fitted for all 356 357 species. As we wanted to compare the pre-and post-digitization models, for each model type, we 358 were limited to including only those species whose distribution changes could be estimated using 359 the pre-digitization data (even though many more species' distributions could be estimated using 360 the post-digitization data). 361 Agreement between models fitted using the pre-and post-digitization is generally strong, but there 362 is some variation between model types (Fig. 3). not be produced for B. terrestris using the Telfer method (panel A) due to an absence of records 370 early in the time series (see Telfer et al., 2002). Note that respectively one and three extreme 371 outliers are omitted in panels B and C to enable better visualization of the main cluster of species. 372 Darker points indicate clusters of predictions overlapping for multiple species. Also note that the 373 sign of the Telfer model predictions in panel A does not necessarily indicate whether a species is 374 expanding or declining in absolute terms; rather, they give each species' change relative to other 375 species in the group. 376 To make a simple assessment of whether the newly-digitized data improve our ability to estimate 377 temporal trends in species' distributions, we focused on B. terrestris, which has been continually 378 introduced to Chile since the 1990s (i.e., midway through the time series) and has expanded widely 379 since. We were not able to estimate a trend for B. terrestris using the Telfer method for reasons 380 described in the Methods. For both the pre-and post-digitization datasets, the RR and RR+site 381 models predict that B. terrestris' range size has increased, as one would expect. The addition of the 382 newly-mobilised data had little effect on the predictions; this is indicated by the fact that they fall on 383 the 1:1 line on a plot of the predictions based on the pre-digitization data vs those based on the 384 post-digitization data (Fig. 3). 385

386
In this paper, we have demonstrated the need for analysts to use publicly available species 387 occurrence data with caution when estimating trends in species' distributions. We began by 388 providing evidence of sampling biases in available data on the occurrences of bees, hoverflies, leaf-389 nosed bats, and hummingbirds collected in the Neotropics. We also showed that two recent data 390 digitization efforts reduced some biases in the bee records collected in Chile, but introduced others. 391 Finally, we showed that, despite a dramatic increase in data quantity, statistical models fitted to the 392 pre-and post-digitization datasets produced broadly similar estimates of temporal trends in species' 393 distributions (Fig. 3). 394 The data-driven heuristics used here indicate non-random sampling along the axes of space, time 395 and taxonomy. However, one might not expect presence-only data to be randomly distributed; for 396 example, it is possible that the data are non-randomly distributed across the continent because the 397 taxa are truly concentrated in certain portions of geographic space. We showed that the data for the 398 leaf-nosed bats and hummingbirds were non-randomly distributed (Fig. 1D) due to the availability of 399 many records in the Andean region in Ecuador and Colombia (Fig. 1G and H  information, such as regional experts, in addition to the available data itself. 418 Notwithstanding the fact that the data for some taxa might be more geographically representative 419 than the data-driven heuristics suggest, it is not possible to conclude that the available data for any 420 of the taxon groups are free of bias. There are no data held in GBIF for the vast majority of known 421 bee and hoverfly species (Fig. 1B), perhaps because the few experts in the field tend to focus on a 422 particular subset of species, or because focus has shifted to other taxa (e.g. hummingbirds) in recent 423 years. Furthermore, for all taxa except perhaps bees, rare species are overrepresented in the 424 available data (Fig. 1C) another; if the data were sampled from the former portion in one period, and the latter portion in 434 the next, then one might come to the artefactual conclusion that the species is in decline. Our 435 results clearly demonstrate the need for analysts to properly scrutinise such data before using them 436 to draw inferences about trends in species' distributions. 437 The mobilisation of historic records is the most direct (and arguably cost-effective) way to 438 understand biodiversity change over the last few hundred years (Nelson and Ellis, 2019; Page et al.,  439 2015). However, to our knowledge, there have been no explicit comparisons of the utility of 440 available data for a given inferential goal before and after the mobilisation of such records. We 441 identified two recent mobilisation efforts that increased the quantity of data on bee occurrences in 442 Chile approximately sixfold. The addition of these records had a mixed effect on sampling biases in 443 the available data: a larger fraction of bee species are represented in the post-digitization data 444 across decades, and more grid cells had been sampled in more decades; however, across decades 445 there are stronger biases towards rare species and decades two to four . Whilst perhaps 446 intuitive to some, the point that more data does not necessarily equal less bias is an important one, 447 and has the potential to be overlooked given the abundance of records now available to ecologists. 448 In terms of estimates of temporal trends in bee distributions in Chile, the addition of the newly-449 mobilised data had only a modest effect. This is indicated by fairly strong correlations between the 450 predictions from the models fitted to the pre-digitisation data and those fitted to the full dataset 451 (Fig. 3). It is not clear whether the newly-mobilised data improved the accuracy of the models. We 452 looked at the predictions for B. terrestris which is known to have expanded widely since its 453 introduction in the 1990s. The RR and RR+site models do predict an expansion of B. terrestris, but 454 those predictions are roughly identical regardless of whether they are based on the pre-digitisation 455 data or the full dataset. Given the tendency towards recording of rare species and lack of new 456 records in the later decades within the full dataset, this may indicate undersampling of B. terrestris 457 relative to other bee species. Ideally, we would also have tested whether the models were able to 458 detect a decline in species' distributions. However, to do so we would need to identify a species for 459 which there is clear evidence of a range decline independent of GBIF data. Whilst some species are 460 known to be declining in terms of population size (e.g., Morales et al., 2013), we were not able to 461 confidently identify a species that should be declining in terms of occupied 1⁰ cells. Based on the 462 predictions for B. terrestris alone, it is not possible to conclude that the mobilisation of historic 463 records improves our ability to estimate trends in species' distributions in this case. 464 Targets for data mobilisation have previously been defined in terms of data quantity. would be to target the mobilisation of data that would be most informative for some inferential 468 goal. Studies like ours could be used as "gap analyses" to establish where best to target new 469 mobilisation efforts along the axes of space, time and taxonomy. Such studies could also inform 470 decisions on where best to focus future adaptive or targeted sampling effort and for which taxa. 471 However, we acknowledge that there will always be trade-offs between the mobilisation or sampling 472 strategy (e.g. to reduce bias), funding, logistics, the availability of experts (particularly taxonomists) 473 and local interests. 474 There remain substantial gaps in knowledge about the status of pollinating species worldwide, and 475 the effectiveness of measures to protect them, with evidence largely biased toward Europe and 476