Comparing ecological and evolutionary variability within datasets
Communicated by J. Lindström.

Many key questions in evolutionary ecology require the use of variance ratios such as heritability, repeatability, and individual resource specialization. These ratios allow researchers to understand how phenotypic variation is structured into genetic and non-genetic components, to identify how much organisms vary in the resources they use, and to determine how functional traits structure species communities. Understanding how evolutionary and ecological processes differ among populations and environments therefore often requires the comparison of these ratios across groups (i.e., populations, sexes, species). Inference based on comparisons of ratios can be limited, however. Variance ratios can remain the same across groups despite very different values in the numerator and denominator variances. Moreover, evolutionary ecologists are most often interested in differences in specific variance components among groups rather than in differences in variance ratios per se. Recommendations for how to infer whether groups differ in variance are not clear in the literature. Using simulations, we show how questions regarding the estimation of variance components and their differences among groups can be answered with linear mixed models (LMMs). Frequentist and Bayesian frameworks have similar abilities to identify differences in variance components. However, variance differences at higher levels of organization can be difficult to detect with low sample sizes. We provide tools to conduct power analyses to determine the sample sizes necessary to detect differences in variance of a given magnitude. We conclude by supplying guidelines for how to report and draw inferences based on comparisons of variance components and variance ratios.

Significance statement
Many critical questions in ecology and evolution use variance ratios, such as repeatability, heritability, or individual resource specialization, to make inferences about ecological and evolutionary processes.
In many cases, these inferences rely on the comparison of variance ratios among datasets (populations, sexes, or environments). In this article, we show that current approaches of drawing inferences about group differences from comparisons of ratios are inappropriate because ratios can differ due to differences in the numerator, denominator, or both. We investigate how questions regarding differences in variance ratios and their constituent variance components can be evaluated using linear mixed model (LMM) approaches, provide guidance for appropriate sampling schemes under different scenarios, and discuss common pitfalls associated with the estimation of differences in variance components among datasets.


Introduction
Our understanding of many evolutionary and ecological processes is underpinned by the estimation of variance ratios (Table 1). For example, the reporting of repeatability has become pervasive in behavioral studies because it summarizes the amount of variation in behavior attributable to differences among individuals. Informally, these differences among individuals can be thought of as differences in their average behaviors. Repeatability can then be interpreted as how much of the overall variation is attributable to individual differences.
Use of variance ratios like repeatability spans a broad swath of evolutionary ecology (Table 1). This includes the best-known variance-standardized ratio, heritability, and extends to interest in community ecology regarding the distribution of functional trait variation expressed within versus among populations or species (Violle et al. 2012).
While useful for understanding the relative magnitude of variation, variance ratios can be highly misleading when compared between groups (Houle 1992; Wilson 2018). Comparisons of variance ratios are only narrowly interpretable because these ratios can differ when numerators differ, when denominators differ, or when both differ. Indeed, variance ratios can be equal despite having different numerator and denominator values. Put another way, differences between groups in ratios like repeatability are not informative as to absolute differences in the magnitudes of variation observed.
To further illustrate the inferential limits of variance ratios, consider the following scenario: researchers are studying the behaviors and dietary habits of two populations of the mythical Dahu (Dahu desterus; Fig. 1A) at different elevations. These elusive creatures have shorter hind-legs on their left side, thus only allowing for clockwise movement (Chartois and Claudel 1945; Jacquat 1995). While measuring aggressive interactions, researchers find no differences in means between populations and similar behavioral repeatabilities (τ = 0.8; Fig. 1B). Put another way, the same relative amount of variation is attributable to individuals in each population. The researchers notice, however, that there are large differences in the among- and within-individual variances of each population. Had researchers only examined repeatabilities and mean differences, they would inappropriately conclude that the populations are behaviorally equivalent. Instead, the actual variance estimates reveal that individuals from the high-altitude population are very distinct from one another in their aggressive tendencies while, at low altitude, individuals show little departure from the population average (Fig. 1B, C).
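The arithmetic behind this pitfall is easy to verify. The sketch below (Python; the variance values are hypothetical, chosen only to reproduce the τ = 0.8 of the example) shows two populations with identical repeatability even though every underlying variance component differs fourfold:

```python
# Hypothetical variance values (chosen to match tau = 0.8 in the dahu
# example; not taken from any real dataset).

def repeatability(v_among, v_within):
    """Proportion of total variance attributable to among-individual differences."""
    return v_among / (v_among + v_within)

low_alt = {"v_among": 0.2, "v_within": 0.05}   # individuals cluster near the mean
high_alt = {"v_among": 0.8, "v_within": 0.2}   # fourfold larger components

tau_low = repeatability(**low_alt)
tau_high = repeatability(**high_alt)
# Both ratios equal 0.8 even though each variance component differs fourfold.
```

Comparing only tau_low and tau_high would hide the fourfold difference in how much individuals actually vary.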
These researchers are also curious as to whether the harsher climate at the top of the mountain range leads to a narrower dietary breadth. They predict that individual resource specialization will be higher in the low elevation population, as D. desterus have more food options to choose from. To the researchers' surprise, they find much higher individual resource specialization in the high-altitude population: S1 = 0.2, S2 = 0.8. Upon examining the specific values of among- and within-individual variation in niche, they find that these differences are a result of the high elevation population having a much narrower total niche width (Fig. 1D), while the within-individual variation in niche preference is equal between populations. This means that it is the difference in diet preference among individuals that drives the difference between the two populations. With more varied resources available at low elevation, each individual can specialize along the total niche axis, yet the breadth of diet preference within individuals is the same between populations. For both traits, exclusive reliance on ratios would have led to either inappropriate or incomplete inferences (i.e., inappropriately concluding behavioral equivalence and incompletely recognizing the basis of differences in apparent specialization). Due to these problems with interpretations of variance ratios (Houle 1992; Dochtermann and Royauté 2019), what would be of greater use to researchers is to instead evaluate differences in specific variance components.

A statistical framework for comparing variance components
The statistical procedures necessary for the estimation of variance components and ratios within a single population have been the subject of much attention (e.g., mixed models for repeatability: Dingemanse and Dochtermann 2013; animal models for heritability: Wilson et al. 2010; individual niche specialization: Bolnick et al. 2002; Coblentz et al. 2017; functional trait variation: Nakagawa and Schielzeth 2012; Violle et al. 2012; Carmona et al. 2016). There is also a long history in quantitative genetics regarding the comparison of variances and covariance structures among groups (Shaw 1991; Arnold and Phillips 1999; Roff 2002; Roff et al. 2012; Aguirre et al. 2014). Unfortunately, these quantitative genetic approaches have been poorly disseminated across fields (but see Dochtermann and Roff 2010; White et al. 2020). Here we describe and investigate methods for detecting differences in variance components among groups. Specifically, we compare the strengths and weaknesses of three statistical approaches: comparison of confidence intervals, model comparison with AIC, and Bayesian estimation of the difference in variance components. While this selection of methods encompasses very different philosophical approaches to data analysis, all three are routinely used in the estimation of repeatability and other ratios.
We consider a scenario where a phenotypic attribute, y, is measured repeatedly for individual organisms occupying one of two different environments (E1 and E2) and in which variation occurs among and within individuals (V I and V W, respectively). In the following sections, we focus on differences in individual variation and repeatability. Note, however, that this scenario can also be expanded to the comparison of diet specialization for individuals occupying different environments or to how functional traits vary among and within species in two different environments.
An easy way to compare these variance components and their ratios (τ = V I / (V I + V W)) is to estimate the variance components for each environment in separate statistical models. We can then test for differences in variances and ratios by environment based on whether the estimates' confidence intervals overlap. While straightforward, this method suffers from two key limitations. First, basing inference on the overlap of 95% confidence intervals is overly conservative (Barr 1969), especially when sample size is low. What is relevant for drawing inferences is instead whether the confidence interval for the difference in variances excludes 0, and this difference cannot be directly estimated from the approach we have described. However, statistical significance can still be assessed by comparing the overlap of the 83% confidence intervals for the variance components, a threshold that provides a better approximation of a test at α = 0.05 for the null hypothesis of no difference (Schenker and Gentleman 2001; Austin and Hux 2002; MacGregor-Fors and Payton 2013; Hector 2021). Second, by estimating variance components in separate statistical models, the hierarchical structure of the data (i.e., the variance components nested within the environments) is broken. As a result, potential average differences in the traits of interest are not appropriately tested.
Instead, we suggest that a more appropriate procedure is the use of a linear mixed model (LMM) where the among- and within-individual variances are estimated for each environment within the same statistical model. This statistical model can be described by the following equation:

y ij = β 0 + β 1 Env j + ID 0i + e 0ij (1)

where y ij describes the phenotypic trait for the ith individual and jth observation, ID 0i is the deviation from an overall intercept, β 0, for the ith individual, and β 1 represents the regression coefficient for the fixed effect of environment (here a contrast coefficient). The random intercepts (ID 0i) and residuals (e 0ij) both follow a multivariate normal distribution, ID 0i ∼ N(0, Ω ID) and e 0ij ∼ N(0, Ω e), where Ω ID and Ω e are the variance-covariance matrices at the among- and within-individual levels, respectively.
The diagonal elements of these matrices represent the among- and within-individual variances in each environment, E 1 and E 2. The off-diagonal elements represent the cross-environment correlation (set to 0 if individuals are only ever evaluated in one of the two environments). This formulation has the advantage of allowing considerable flexibility in the specification of the statistical models considered (Dingemanse and Dochtermann 2013). LMMs are now available in most statistical software, and their generalized extensions can accommodate non-normal error distributions (Table 2). Upon fitting LMMs, several methods are available to determine whether a variance ratio or the components of the ratio differ by environment. Specific hypotheses about which variance component differs across environments can be easily tested via model comparison. For example, a model where only the among-individual variance differs by environment can be compared to a null model where the among- and within-individual variances are kept constant across environments. These models can be estimated within a frequentist framework via restricted maximum likelihood or within a Bayesian framework, and suitable decision criteria can be used to determine which model best fits the data. In the case of restricted maximum likelihood estimation, it is also possible to use likelihood ratio tests to compare these models. Note, however, that the proper degrees of freedom to apply to each model are unclear, and additional care should be taken when using this method (Pinheiro and Bates 2006; Santostefano et al. 2016). We recommend calculating these degrees of freedom by considering each variance component as a full parameter for more conservative testing (see also the tutorial in ESM3).
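The logic of such model comparison can be made concrete without mixed-model software. The sketch below is a deliberately stripped-down Python analogue (our construction, not the article's procedure, which uses nlme and MCMCglmm in R and retains the random-effect structure): two Gaussian samples are compared under a shared-variance model versus a heterogeneous-variance model, and the fits are ranked by AIC.

```python
import math
import random

def gaussian_loglik(xs, mu, var):
    """Log-likelihood of the sample under Normal(mu, var)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

rng = random.Random(1)
e1 = [rng.gauss(0.0, 1.0) for _ in range(200)]  # environment 1: true variance 1
e2 = [rng.gauss(0.0, 2.0) for _ in range(200)]  # environment 2: true variance 4

mu1 = sum(e1) / len(e1)
mu2 = sum(e2) / len(e2)
v1 = sum((x - mu1) ** 2 for x in e1) / len(e1)
v2 = sum((x - mu2) ** 2 for x in e2) / len(e2)
v_pool = (len(e1) * v1 + len(e2) * v2) / (len(e1) + len(e2))

# "Equal" model: two means, one shared variance (3 parameters).
ll_equal = gaussian_loglik(e1, mu1, v_pool) + gaussian_loglik(e2, mu2, v_pool)
# "Heterogeneous" model: two means, one variance per environment (4 parameters).
ll_hetero = gaussian_loglik(e1, mu1, v1) + gaussian_loglik(e2, mu2, v2)

aic_equal = aic(ll_equal, 3)
aic_hetero = aic(ll_hetero, 4)  # lower AIC here: heterogeneity is supported
```

The same comparison extends to nested models where only the among- or only the within-individual variance is allowed to differ.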
In many cases, researchers are also interested in whether the difference in variance components has a biologically meaningful effect. In other words, when asking whether variance components vary between environments, we are mostly interested in the magnitude of the difference in these components across environments. While model comparison of LMMs can help us understand whether a statistically detectable difference is observable across environments, the magnitude of the difference can only be determined by examining the difference in variance components among environments: ΔV, estimated as V E2 − V E1 in our case. When the trait of interest is expressed in standard deviation units (i.e., mean centered and scaled to the standard deviation of the dataset across all populations and environments), this difference can be considered an effect size for the magnitude of the difference among variance components, thus making comparisons across studies possible (Royauté et al. 2015; Hamilton et al. 2017; Royauté and Dochtermann 2017). Note that ΔV could also be expressed on a ratio scale (V E2 / V E1) or on a log-additive scale (log(V E2) − log(V E1)). We will return to the topic of statistical significance vs. appropriate effect sizes later in the paper. For now, we simply consider ΔV on an additive scale with data expressed in standard deviation units because it allows the most straightforward interpretation and functions in cases where a variance component is zero or approaching zero. ΔV can be calculated from the maximum likelihood estimates in a frequentist framework, but calculating the uncertainty around this estimate is not straightforward and requires additional steps such as bootstrapping. In a Bayesian framework, the calculation is much simpler given that the distribution of ΔV can be estimated directly by taking the difference of the posterior distributions of V E2 and V E1. The posterior mode of ΔV can then be interpreted as the estimated strength of the difference, with credible intervals representing the precision around this estimate.

Table 2 lists packages and software that allow testing for differences in variance components using linear mixed models (LMMs), along with the parameter estimation method (maximum likelihood (ML), restricted maximum likelihood (REML), hierarchical likelihood (H-ML), or a Bayesian framework) and the inference method (likelihood ratio tests (LRT), AIC, bootstrapping, or credible intervals for ΔV). This list is not comprehensive and is instead based on widely used commercial software and R packages.
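In a Bayesian fit, computing ΔV amounts to a few lines of post-processing on the MCMC draws. The Python sketch below uses simulated stand-in draws rather than real MCMCglmm output (gamma draws are just a convenient way to mimic positive, right-skewed posterior samples), and an equal-tailed interval for simplicity (the article uses highest-posterior-density intervals, which differ slightly):

```python
import random

rng = random.Random(42)

# Stand-ins for MCMC output: hypothetical posterior draws of the
# among-individual variance in each environment.
post_v_e1 = [rng.gammavariate(50, 0.01) for _ in range(4000)]  # centred near 0.5
post_v_e2 = [rng.gammavariate(50, 0.03) for _ in range(4000)]  # centred near 1.5

# The posterior of DeltaV is simply the element-wise difference of draws.
delta_v = [v2 - v1 for v1, v2 in zip(post_v_e1, post_v_e2)]

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    s = sorted(draws)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi

lower, upper = credible_interval(delta_v)
# An interval excluding 0 indicates a statistically detectable difference.
```

With real output, post_v_e1 and post_v_e2 would simply be the chains for the two variance components from the fitted model.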
In summary, approaches based on LMMs and their generalized extensions allow great flexibility and are well suited to questions about how phenotypic trait variation is structured at multiple levels of organization. In the next section, we describe the performance of LMMs in detecting differences in variance components.

Methods
The simulations described below focus on interpretation in the context of behavioral repeatability. However, it is worth noting again that inferences about the ability to estimate and detect differences in variances generalize to the components of the ratios described in Table 1.

Fig. 2
Scenarios used in simulations detailing how differences or a lack of difference in repeatability (right-side column) can arise from different patterns in the underlying variance components (left-side column; exact values can be found in Table S1). Scenarios A-C correspond to cases where the total variation differs between two environments (E1 and E2) due to differences in the among-individual variance, the within-individual variance, or both; scenarios D and E correspond to cases where repeatability is equal between environments.

Data simulations
To compare the performance of statistical procedures for detecting differences in variance components and variance ratios, we performed a series of simulations based on the scenarios illustrated in Fig. 2. In these scenarios, a phenotypic attribute y is measured in two different environments (E1 and E2) and variation occurs among and within individuals (V I and V W, respectively). In scenarios A through C, the repeatability (τ) differs by an equal amount between the two environments (∆τ = 0.3), but the underlying driver of this difference is a difference in the among-individual variance (A), in the within-individual variance (B), or in both the among- and within-individual variances (C). Note that for scenario C, the total variance remains the same between environments. In scenarios D and E, we explore cases where the variance ratios are equal among environments, either because all variance components are equal as well (D) or in spite of differences in all variance components (E) (see Table S1 for exact values for all parameters). Using the R statistical environment (R Core Team 2020), we generated 500 datasets for each of the following combinations:

• Sample size varying from 20 to 200 individuals by increments of 20 for each environment (sample size was equal between the two environments)
• Number of repeated measures per individual varying from 2 to 6 by increments of 1
• Five different scenarios of known differences in variance ratios as described in Fig. 2 and Table S1
Each dataset was simulated by sampling from a Gaussian distribution for the random (among-individual values) and the error (within-individual) terms. This resulted in a total of 125,000 datasets on which we tested three different statistical procedures to detect differences in variance components and variance ratios. We provide all R code for data generation and analysis in the Electronic Supplementary Materials (ESM1).
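The data-generating process can be sketched as follows (a Python analogue of the R code in ESM1; the variance values shown are illustrative, and the exact values used in the simulations are in Table S1):

```python
import random

def simulate_dataset(n_ind, n_reps, v_among, v_within, mean=0.0, rng=None):
    """Repeated measures y_ij = mean + ID_i + e_ij, with Gaussian
    among-individual (ID_i) and within-individual (e_ij) terms."""
    rng = rng or random.Random()
    data = []
    for i in range(n_ind):
        id_i = rng.gauss(0.0, v_among ** 0.5)       # among-individual deviation
        for j in range(n_reps):
            data.append((i, j, mean + id_i + rng.gauss(0.0, v_within ** 0.5)))
    return data

rng = random.Random(2020)
# A scenario-A-style pair: among-individual variance differs, within equal.
env1 = simulate_dataset(100, 4, v_among=0.2, v_within=0.5, rng=rng)
env2 = simulate_dataset(100, 4, v_among=0.8, v_within=0.5, rng=rng)
```

Each record is an (individual, repeat, value) triple, which is the long format expected by mixed-model software.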

Comparison of confidence interval overlap from separate mixed models
We first compared the overlap of 83% confidence intervals for variance components estimated from separate linear mixed models: one mixed model for environment 1 and one for environment 2. These models are a simplified version of the one presented in Eq. (1):

y ij = β 0 + ID 0i + e 0ij, with ID 0i ∼ N(0, V ID) and e 0ij ∼ N(0, V e) (2)

The individuals in the environment of interest are included as random effects, and no additional fixed effects are needed. Upon fitting these models, we computed 83% confidence intervals for the among- and within-individual variances. Datasets where these intervals did not overlap were considered statistically different.
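For balanced designs, this per-environment decomposition can be illustrated without mixed-model software: with k repeats per individual, the one-way ANOVA mean squares satisfy E[MS_within] = V W and E[MS_among] = V W + k·V I, and an 83% interval can be approximated by bootstrapping whole individuals. The Python sketch below is an illustrative stand-in for the REML fits actually used (the estimator choice and bootstrap settings are our assumptions, not the article's procedure):

```python
import random

def anova_components(groups):
    """Among- and within-individual variance estimates from one-way ANOVA
    mean squares, for balanced data (same number of repeats per individual)."""
    n = len(groups)
    k = len(groups[0])
    means = [sum(g) / k for g in groups]
    grand = sum(means) / n
    ms_among = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(sum((x - m) ** 2 for x in g) / (k - 1)
                    for g, m in zip(groups, means)) / n
    v_within = ms_within
    v_among = max(0.0, (ms_among - ms_within) / k)  # truncated at zero
    return v_among, v_within

def bootstrap_ci(groups, level=0.83, n_boot=500, rng=None):
    """Percentile bootstrap interval for the among-individual variance,
    resampling whole individuals to respect the hierarchical structure."""
    rng = rng or random.Random()
    draws = sorted(anova_components([rng.choice(groups) for _ in groups])[0]
                   for _ in range(n_boot))
    return (draws[int((1 - level) / 2 * n_boot)],
            draws[int((1 + level) / 2 * n_boot) - 1])

# Simulate one environment: 100 individuals, 4 repeats, V_I = 0.6, V_W = 0.4.
rng = random.Random(7)
groups = []
for _ in range(100):
    id_i = rng.gauss(0.0, 0.6 ** 0.5)
    groups.append([id_i + rng.gauss(0.0, 0.4 ** 0.5) for _ in range(4)])

v_i_hat, v_w_hat = anova_components(groups)
ci_lo, ci_hi = bootstrap_ci(groups, rng=random.Random(8))
```

Running the same procedure on a second environment and checking whether the two 83% intervals overlap reproduces the decision rule described above.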

Frequentist LMM with AIC model comparison
Our second approach was to fit the LMM described above and test for differences in the among- and within-individual variances. Specifically, we specified four different mixed models corresponding to the four possibilities by which variance components may differ between environments (Royauté et al.): no differences (model 1), a difference in the among-individual variance only (model 2), a difference in the within-individual variance only (model 3), or differences in both (model 4). For each dataset combination, we then compared each model's Akaike Information Criterion (AIC) value. AIC allows comparison of the relative fit of statistical models, with lower AIC values indicating better support relative to competing models. These simulations and this analytical framework are similar to previously used approaches (Shaw 1991; Jenkins 2011; Tüzün et al. 2017). Models were specified using the nlme package for mixed models (Pinheiro and Bates 2006) with restricted maximum likelihood (REML).

Bayesian LMM and difference in variance components
We next fit a mixed model in which the among- and within-individual variances were allowed to differ between environments (as in model 4 described above) to each randomly generated dataset. We calculated the posterior mode for the difference in variance components (∆V = V E2 − V E1) and estimated the 95% credible intervals based on the highest posterior density of this distribution. Credible intervals excluding 0 were taken to indicate statistically detectable differences in variance components among environments. All models were run with the MCMCglmm package (Hadfield 2010) using default iteration settings to shorten computing time (13,000 iterations, 3,000 burn-in iterations, and a thinning interval of 10 iterations). We used priors that were minimally informative for the variance components (see ESM1 and ESM3 for prior specification and a discussion of priors).

Probability of correct model identification, precision, bias, and accuracy estimations
We calculated the probability of detecting the model with the correct difference in variance components (hereafter the probability of correct model identification), precision, relative bias, and accuracy under each scenario and sampling design to compare the performance of maximum likelihood and Bayesian mixed models. For method 1 (overlap of 83% intervals), we assigned values of 1 when significant differences in variance components were detected in the direction predicted by the data-generating process, and 0 otherwise. For method 2, we calculated the probability of correct model identification as the proportion of times the model with the lowest AIC matched the generating model. For method 3, we determined whether a given model detected a difference in variance components based on whether the 95% credible intervals of the ΔV posterior distribution overlapped 0. As in method 1, we then assigned values of 0 or 1 based on whether the detected difference matched the data-generating process of the corresponding scenario. We calculated the probability of correct model identification as the proportion of analyzed datasets in which we detected differences in the direction predicted by each scenario and statistical method. Precision, indicating the similarity of the results produced by simulations within a given scenario, was calculated as the difference between the 25% and 75% quantiles of the estimates. For relative bias (in %), we calculated the mean difference between the expected value and the value observed in each of the 500 simulations, for each statistical approach and scenario. Finally, we report the root mean square error (RMSE) for each scenario and sample size. This metric quantifies how close estimates are to the expected values and serves as an estimate of the accuracy of each statistical approach by scenario.
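These performance metrics are simple to compute from a vector of estimates. The Python sketch below mirrors the definitions given above (function names are ours, and the quantile indexing is deliberately crude, which is adequate for large simulation batches):

```python
def precision_iqr(estimates):
    """Precision: spread between the 25% and 75% quantiles of the estimates."""
    s = sorted(estimates)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]

def relative_bias(estimates, truth):
    """Relative bias (in %): mean deviation from the generating value."""
    return 100.0 * (sum(estimates) / len(estimates) - truth) / truth

def rmse(estimates, truth):
    """Root mean square error: accuracy of the estimates around the truth."""
    return (sum((e - truth) ** 2 for e in estimates) / len(estimates)) ** 0.5

# Toy batch of five estimates of a variance component whose true value is 0.6.
batch = [0.5, 0.6, 0.7, 0.55, 0.65]
```

In the actual analyses, each batch would hold the 500 estimates produced for one scenario and sampling design.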

Results
The probability of correctly detecting differences in variance components did not differ substantially between frequentist and Bayesian methods of estimation (Fig. 3). The highest probability of correct model identification was observed for cases where the variance ratio differed as a result of changes to the within-individual variance (scenario B) or when variation remained equal between environments (scenario D). The statistical power to differentiate between the alternative scenarios (A, C, and E) was lower, especially with small sample sizes and few repeated measures (Fig. 3). Importantly, no statistical method outperformed all others across scenarios. Our results are consistent with previous simulations showing that the among-individual variance component is particularly difficult to estimate at small sample sizes (Dingemanse and Dochtermann 2013).
In scenarios B and D, the correct differences among variance components were identified > 80% of the time, even at low sample sizes (Fig. 3). In all other scenarios, this threshold was only reached with high sample sizes and a high number of repeated measures. For scenarios C and E (where the variance ratio differs as a result of among-individual variance, and where the variance ratio remains the same despite changes to both among- and within-individual variance, respectively), datasets with only 2 repeated measures per individual never achieved a probability of identifying the generating model above 0.8, even with sample sizes above 200 units per environment (i.e., a minimum of 800 total measurements; Fig. 3). Increasing the number of repeated measures only marginally alleviated the problem. For example, in scenario C, only datasets with 4 or more repeated measures per individual reached statistical power above 0.8 with sample sizes above 120 individuals per environment, which is more than many ecological or evolutionary studies can provide under realistic conditions.
Note that for AIC model comparison, we calculated power as the proportion of times the best model corresponded to the generating model. A more conservative approach is to calculate the proportion of times the best model is at least 2 AIC units lower than the second-best model, a common threshold for considering models statistically distinct (Burnham and Anderson 1998). When using this more conservative threshold (Fig. S1), datasets generated according to scenarios A and D were never statistically distinguishable from non-generating models, although the correct model was consistently ranked as the best model. This discrepancy likely arises because, when the generating model does not include differences in within-individual variability (scenarios A and D), sampling error is erroneously identified as heterogeneity. At smaller sample sizes, this error is greater on average and thus detectable. At larger sample sizes, this sampling error is smaller but more easily detected and therefore manifests as a difference between groups. To address this, in addition to measures of variance differences like the ΔV statistic described above, researchers should also compare mean-standardized variance estimates, such as the coefficient of variation or Houle's evolvability, between groups (Houle 1992; Hansen et al. 2011; Dochtermann and Royauté 2019).
The comparison of relative bias, precision, and accuracy among statistical methods produced mixed results. On average, Bayesian LMMs consistently underestimated the among-individual variance in scenarios where it differed between environments (scenarios A, C, and E), resulting in bias at small sample sizes (Fig. S2). However, Bayesian LMMs also had higher precision and accuracy than maximum likelihood (Figs. S3, S4). This means that Bayesian estimates tend to be more conservative than maximum likelihood estimates regarding the magnitude of the among-individual variance, but that they nonetheless more closely matched the simulation conditions.

Discussion
Comparing variability across datasets is important for many questions in evolutionary ecology (e.g., Table 1). However, variance ratios are not sufficient to address questions about how variation is expressed across environments, populations, or sexes. The inability to determine why groups differ based on ratios is in addition to the numerous conceptual and theoretical problems inherent to the estimation of variance ratios (Houle 1992;Hansen et al. 2011). Instead, many questions require the direct comparison of variances.

What are appropriate sample sizes for detecting differences in variance?
Our simulations show that regardless of the statistical method used, comparing variance components across groups is a "data hungry" endeavor. Scenarios where the among-individual variance differed between environments were particularly hard to detect at low sample sizes. Note that our objective was not to provide a full exploration of parameter space. Instead, we focused on a subset of scenarios that are likely to be common in ecology and evolution (Fig. 2). Based on our simulations, the probability of detecting differences in variance components will depend in large part on the ability to estimate the among-individual variance component (V I). In the most complex case, where differences occur among and within individuals (scenario E), researchers would require a minimum of 1,600 observations to correctly detect differences (i.e., 200 individuals measured 4 times in each environment). This is far higher than the sample sizes needed for single populations, where moderate repeatabilities need only ∼100 observations to be estimated with > 0.8 power (at least 25 individuals measured 4 times to detect a repeatability of 0.3; see Dingemanse and Dochtermann 2013).
Given these challenges, we recommend that researchers conduct power calculations prior to the experiment whenever possible (see R code for a priori power analyses in ESM2 and an R Markdown tutorial in ESM3). If this is not possible, a simple sampling rule is to estimate the sample size needed to detect the lowest among-individual variance value of interest (see, for example, Martin et al. 2011; van de Pol 2012; Dingemanse and Dochtermann 2013) and to multiply that sample size by the number of experimental groups involved.
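A priori power analyses of this kind are simulation-based: generate data under the effect size of interest, apply the detection rule, and count the proportion of detections. The article supplies R code for this in ESM2; the sketch below is a deliberately simplified Python analogue of our own construction that targets only the among-individual variance, using the variance of individual means (which estimates V I + V W / k) and a Monte Carlo critical value rather than a mixed-model test:

```python
import math
import random

def var_of_means(n_ind, k, v_i, v_w, rng):
    """Simulate one environment; return the variance of individual mean scores
    (this variance estimates V_I + V_W / k)."""
    means = []
    for _ in range(n_ind):
        id_i = rng.gauss(0.0, v_i ** 0.5)
        means.append(id_i + sum(rng.gauss(0.0, v_w ** 0.5)
                                for _ in range(k)) / k)
    grand = sum(means) / n_ind
    return sum((m - grand) ** 2 for m in means) / (n_ind - 1)

def power_sim(n_ind, k, v_i1, v_i2, v_w, nsim=300, alpha=0.05, seed=0):
    """Monte Carlo power for detecting a difference in among-individual
    variance via the |log ratio| of variances of individual means."""
    rng = random.Random(seed)

    def stat(va, vb):
        return abs(math.log(var_of_means(n_ind, k, va, v_w, rng)
                            / var_of_means(n_ind, k, vb, v_w, rng)))

    # Critical value from the null distribution (both environments at v_i1).
    null = sorted(stat(v_i1, v_i1) for _ in range(nsim))
    crit = null[int((1 - alpha) * nsim)]
    # Power: proportion of simulated experiments exceeding the critical value.
    return sum(stat(v_i1, v_i2) > crit for _ in range(nsim)) / nsim

# e.g. power_sim(100, 4, v_i1=0.2, v_i2=0.8, v_w=0.5) yields high power
# for this large difference, while smaller differences require larger n_ind.
```

Sweeping n_ind and k over a grid of candidate designs identifies the smallest design that reaches the target power.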

How to report results? Statistical significance vs. effect sizes
Given the issues discussed above, how should researchers interested in ecological and evolutionary variation design their studies and report their findings? We suggest that researchers report their results in a manner that focuses on the magnitude of the difference in variability between experimental groups rather than solely focusing on statistical significance.
To this effect, we believe that reporting the results of the full model rather than just the most parsimonious model will be most appropriate in most cases (i.e., model 4 in our conceptual example). This is because model selection only gives information on whether differences among groups are statistically detectable. In contrast, questions regarding the magnitude and precision of the estimated differences are answerable only with interpretation of the most complete statistical model (see tutorial in ESM3).
In addition to presenting results of the full model, we suggest that measures of effect size for the differences in variance components also be presented. As reported above, ΔV provides a simple metric to estimate the magnitude of these differences, but it is by no means the only one. In our theoretical example, the mean trait value did not differ between environments, but in many cases mean and variance are related. In such cases, comparisons based on Houle's (1992) I value or on coefficients of variation for each component, as opposed to the variance components themselves, can be preferable (Hansen et al. 2011; Dochtermann and Royauté 2019). Effect sizes based on the coefficient of variation can also be calculated within an LMM framework as described by Nakagawa et al. (2015) (see also Carmona et al. 2016 and Fontana et al. 2018 for approaches relevant to functional trait diversity).
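The two log-ratio effect sizes used in this literature are straightforward to compute. The sketch below is our Python transcription of the commonly used approximations attributed to Nakagawa et al. (2015), including the small-sample bias correction; consult the original for the full derivation and sampling variances:

```python
import math

def ln_vr(sd1, n1, sd2, n2):
    """Log variability ratio (lnVR), group 1 as reference, with a
    small-sample bias correction."""
    return math.log(sd2 / sd1) + 1 / (2 * (n2 - 1)) - 1 / (2 * (n1 - 1))

def ln_cvr(mean1, sd1, n1, mean2, sd2, n2):
    """Log coefficient-of-variation ratio (lnCVR): appropriate when group
    means differ, so variability is judged relative to the mean."""
    return (math.log((sd2 / mean2) / (sd1 / mean1))
            + 1 / (2 * (n2 - 1)) - 1 / (2 * (n1 - 1)))

# With equal sample sizes the bias corrections cancel, so doubling the SD
# gives lnVR = ln(2).
effect = ln_vr(1.0, 50, 2.0, 50)
```

lnVR is the appropriate choice when means do not differ (as in the dahu aggression example), while lnCVR accounts for mean-variance relationships.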
We provide a synthetic guide to which statistical tests and effect sizes are most appropriate depending on the nature of the dataset in Fig. 4A. Returning to our dahu example, an appropriate analysis of the difference in aggression variance would follow the tables and figures from Fig. 4B. While we limited our conceptual example to comparisons between two environments, the LMM approach we propose is by no means restricted to two-group comparisons. For example, Jenkins (2011) used model comparison to tease apart the relative influence of sex, species, and their interaction on the expression of behavioral variation in kangaroo rats. Similarly, Coblentz et al. (2017) showed how model selection combined with Bayesian GLMMs can allow the comparison of indices of diet specialization within and among species. In both cases, model selection can provide a first pass at whether differences in variance components are detectable among groups, while specific pairwise comparisons of effect sizes (using ΔV or other metrics) allow discernment of the most pronounced differences in variance components. Regardless of the statistical approach used, we suggest researchers clearly outline the direction and, when possible, the magnitude of the expected effects in their predictions.
Finally, our conceptual examples focus exclusively on the case of "well-behaved" data with normal error distributions. While these comparisons can be made with generalized extensions of LMMs (i.e., GLMMs), researchers must take extra precautions when calculating and comparing the within-individual variances (i.e., the residual variance). Indeed, in the case of non-Gaussian data, the residual variance depends on both the link function used and how the software deals with overdispersion (additive vs. multiplicative overdispersion). Nakagawa and Schielzeth (2010) provide a very useful and extensive guide explaining how the correct residual variance can be calculated.

Fig. 4 A Flowchart showing decision rules regarding how to test for differences in variance components, which metrics to report, and which effect sizes can be calculated, along with their definitions in table format. B Reporting example based on the simulated case study in Fig. 1B, C. The first table uses REML model selection with AIC to compare the support for different hypotheses for how variance components of aggression may differ between the low and high elevation populations. The best model is one where among- and within-individual variances are higher in the high elevation population. The second table compares all components by environment (posterior medians and 95% credible intervals estimated from a Bayesian mixed model with model 4; note that frequentist confidence intervals can also be reported using non-parametric bootstrapping as shown in ESM3). Finally, because aggression does not differ on average between populations, lnVR is an appropriate metric to report the effect size for the difference in variance between populations.

Conclusions
Variance ratios are straightforward metrics for describing various ecological and evolutionary processes. However, comparing these ratios across studies or groups can be misleading if little attention is paid to the specific variance components making up those ratios. More importantly, a lack of difference in these ratios does not mean that variation is expressed equally among groups. Given these limitations, we advocate for techniques that allow the estimation of differences in each variance component rather than focusing solely on variance ratios. The statistical tools allowing comparison of trait variation have become increasingly sophisticated and now allow very precise questions to be asked: specifically, how trait variation is generated and how it differs among groups. Despite the availability of these tools, however, researchers interested in ecological and evolutionary variation must remain careful in their study designs. As our simulations show, scenarios involving differences in among-individual variance are particularly difficult to detect without substantial sample sizes. We hope the statistical approaches and tools for power analysis presented here will allow for appropriate comparisons of trait variation in ecological and evolutionary studies.