Nested Anova (STAN)

This tutorial will focus on the use of Bayesian estimation to fit simple linear regression models. BUGS (Bayesian inference Using Gibbs Sampling) is an algorithm and supporting language (resembling R) dedicated to performing the Gibbs sampling implementation of Markov Chain Monte Carlo (MCMC) method. Dialects of the BUGS language are implemented within three main projects:

  1. OpenBUGS - written in component pascal.

  2. JAGS - (Just Another Gibbs Sampler) - written in C++.

  3. STAN - a dedicated Bayesian modelling framework written in C++ and implementing Hamiltonian MCMC samplers.

Whilst the above programs can be used stand-alone, they do offer the rich data pre-processing and graphical capabilities of R, and thus, they are best accessed from within R itself. As such there are multiple packages devoted to interfacing with the various software implementations:

  • R2OpenBUGS - interfaces with OpenBUGS

  • R2jags - interfaces with JAGS

  • rstan - interfaces with STAN

This tutorial will demonstrate how to fit models in STAN (Gelman, Lee, and Guo (2015)) using the package rstan (Stan Development Team (2018)) as interface, which also requires to load some other packages.

Overview

Introduction

When single sampling units are selected amongst highly heterogeneous conditions, it is unlikely that these single units will adequately represent the populations and repeated sampling is likely to yield very different outcomes. For example, if we were investigating the impacts of fuel reduction burning across a highly heterogeneous landscape, our ability to replicate adequately might be limited by the number of burn sites available.

Alternatively, sub-replicates within each of the sampling units (e.g. sites) can be collected (and averaged) so as to provided better representatives for each of the units and ultimately reduce the unexplained variability of the test of treatments. In essence, the sub-replicates are the replicates of an additional nested factor whose levels are nested within the main treatment factor. A nested factor refers to a factor whose levels are unique within each level of the factor it is nested within and each level is only represented once. For example, the fuel reduction burn study design could consist of three burnt sites and three un-burnt (control) sites each containing four quadrats (replicates of site and sub-replicates of the burn treatment). Each site represents a unique level of a random factor (any given site cannot be both burnt and un-burnt) that is nested within the fire treatment (burned or not).

A nested design can be thought of as a hierarchical arrangement of factors (hence the alternative name hierarchical designs) whereby a treatment is progressively sub-replicated. As an additional example, imagine an experiment designed to comparing the leaf toughness of a number of tree species. Working down the hierarchy, five individual trees were randomly selected within (nested within) each species, three branches were randomly selected within each tree, two leaves were randomly selected within each branch and the force required to shear the leaf material in half (transversely) was measured in four random locations along the leaf. Clearly any given leaf can only be from a single branch, tree and species. Each level of sub-replication is introduced to further reduce the amount of unexplained variation and thereby increasing the power of the test for the main treatment effect. Additionally, it is possible to investigate which scale has the greatest (or least, etc) degree of variability - the level of the species, individual tree, branch, leaf, leaf region etc.

  • Nested factors are typically random factors, of which the levels are randomly selected to represent all possible levels (e.g. sites). When the main treatment effect (often referred to as Factor A) is a fixed factor, such designs are referred to as a mixed model nested ANOVA, whereas when Factor A is random, the design is referred to as a Model II nested ANOVA.

  • Fixed nested factors are also possible. For example, specific dates (corresponding to particular times during a season) could be nested within season. When all factors are fixed, the design is referred to as a Model I mixed model.

  • Fully nested designs (the topic of this chapter) differ from other multi-factor designs in that all factors within (below) the main treatment factor are nested and thus interactions are un-replicated and cannot be tested. Indeed, interaction effects (interaction between Factor A and site) are assumed to be zero.

Linear models (frequentist)

The linear models for two and three factor nested design are:

\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk},\]

\[ y_{ijkl} = \mu + \alpha_i + \beta_{j(i)} + gamma_{k(j(i))} + \epsilon_{ijkl},\]

where \(\mu\) is the overall mean, \(\alpha\) is the effect of Factor A, \(\beta\) is the effect of Factor B, \(\gamma\) is the effect of Factor C and \(\epsilon\) is the random unexplained or residual component.

Linear models (Bayesian)

So called “random effects” are modelled differently from “fixed effects” in that rather than estimate their individual effects, we instead estimate the variability due to these “random effects.” Since technically all variables in a Bayesian framework are random, some prefer to use the terms ‘fixed effects’ and ‘varying effects.’ As random factors typically represent “random” selections of levels (such as a set of randomly selected sites), incorporated in order to account for the dependency structure (observations within sites are more likely to be correlated to one another - not strickly independent) to the data, we are not overly interested in the individual differences between levels of the ‘varying’ (random) factor. Instead (in addition to imposing a separate correlation structure within each nest), we want to know how much variability is attributed to this level of the design. The linear models for two and three factor nested design are:

\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}, \;\;\; \epsilon_{ijk} \sim N(0, \sigma^2), \;\;\; \beta_{j(i)} \sim N(0, \sigma^2_{B}) \]

\[ y_{ijkl} = \mu + \alpha_i + \beta_{j(i)} + \gamma_{k(j(i))} + \epsilon_{ijkl}, \;\;\; \epsilon_{ijkl} \sim N(0, \sigma^2), \;\;\; \beta_{j(i)} \sim N(0, \sigma^2_{B}) \;\;\; \gamma_{k(j(i))} \sim N(0, \sigma^2_C) \]

where \(\mu\) is the overall mean, \(\alpha\) is the effect of Factor A, \(\beta\) is the variability of Factor B (nested within Factor A), \(\gamma\) is the variability of Factor C (nested within Factor B) and \(\epsilon\) is the random unexplained or residual component that is assumed to be normally distributed with a mean of zero and a constant amount of standard deviation (\(\sigma^2\)). The subscripts are iterators. For example, the \(i\) represents the number of effects to be estimated for Factor A. Thus the first formula can be read as the \(k\)-th observation of \(y\) is drawn from a normal distribution (with a specific level of variability) and mean proposed to be determined by a base mean (\(\mu\) - mean of the first treatment across all nests) plus the effect of the \(i\)-th treatment effect plus the variabilitythe model proposes that, given a base mean (\(\mu\)) and knowing the effect of the \(i\)-th treatment (factor A) and which of the \(j\)-th nests within the treatment the \(k\)-th observation from Block \(j\) (factor B) within treatment effect.

Null hypotheses

Separate null hypotheses are associated with each of the factors, however, nested factors are typically only added to absorb some of the unexplained variability and thus, specific hypotheses tests associated with nested factors are of lesser biological importance. Hence, rather than estimate the effects of random effects, we instead estimate how much variability they contribute.

Factor A: the main treatment effect (fixed)

  • \(H_0(A): \mu_1=\mu_2=\ldots=\mu_i=\mu\) (the population group means are all equal). That is, that the mean of population \(1\) is equal to that of population \(2\) and so on, and thus all population means are equal to one another - no effect of the factor on the response. If the effect of the \(i\)-th group is the difference between the \(i\)-th group mean and the mean of the first group (\(\alpha_i=\mu_i-\mu_1\)) then the \(H_0\) can alternatively be written as:

  • \(H_0(A) : \alpha_1=\alpha_2=\ldots=\alpha_i=0\) (the effect of each group equals zero). If one or more of the \(\alpha_i\) are different from zero (the response mean for this treatment differs from the overall response mean), there is evidence that the null hypothesis is not true indicating that the factor does affect the response variable.

Factor A: the main treatment effect (random)

  • \(H_0(A) : \sigma^2_{\alpha}=0\) (population variance equals zero). There is no added variance due to all possible levels of A.

Factor B: the nested effect (random)

  • \(H_0(B) : \sigma^2_{\beta}=0\) (population variance equals zero). There is no added variance due to all possible levels of B within the (set or all possible) levels of A.

Factor B: the nested effect (fixed)

  • \(H_0(B): \mu_{1(1)}=\mu_{2(1)}=\ldots=\mu_{j(i)}=\mu\) (the population group means of B (within A) are all equal).

  • \(H_0(B): \beta_{1(1)}=\beta_{2(1)}=\ldots=\beta_{j(i)}=0\) (the effect of each chosen B group equals zero).

Analysis of variance

Analysis of variance sequentially partitions the total variability in the response variable into components explained by each of the factors (starting with the factors lowest down in the hierarchy - the most deeply nested) and the components unexplained by each factor. Explained variability is calculated by subtracting the amount unexplained by the factor from the amount unexplained by a reduced model that does not contain the factor. When the null hypothesis for a factor is true (no effect or added variability), the ratio of explained and unexplained components for that factor (F-ratio) should follow a theoretical F-distribution with an expected value less than 1. The appropriate unexplained residuals and therefore the appropriate F-ratios for each factor differ according to the different null hypotheses associated with different combinations of fixed and random factors in a nested linear model (see Table below).

> fact_anova_table
      df        MS         F-ratio (B random)    Var comp (B random)        
A     "a-1"     "MS A"     "(MS A)/(MS B'(A))"   "((MS A) - (MS B'(A)))/nb" 
B'(A) "(b-1)a"  "MS B'(A)" "(MS B'(A))/(MS res)" "((MS B'(A)) - (MS res))/n"
Res   "(n-1)ba" "MS res"   ""                    ""                         
      F-ratio (B fixed)     Var comp (B fixed)         
A     "(MS A)/(MS res)"     "((MS A) - (MS res))/nb"   
B'(A) "(MS B'(A))/(MS res)" "((MS B'(A)) - (MS res))/n"
Res   ""                    ""                         

The corresponding R syntax is given below.

> #A fixed/random, B random (balanced)
> summary(aov(y~A+Error(B), data))
> VarCorr(lme(y~A,random=1|B, data))
> 
> #A fixed/random, B random (unbalanced)
> anova(lme(y~A,random=1|B, data), type='marginal')
> 
> #A fixed/random, B fixed(balanced)
> summary(aov(y~A+B, data))
> 
> #A fixed/random, B fixed (unbalanced)
> contrasts(data$B) <- contr.sum
> Anova(aov(y~A/B, data), type='III')

Variance components

As previously alluded to, it can often be useful to determine the relative contribution (to explaining the unexplained variability) of each of the factors as this provides insights into the variability at each different scales. These contributions are known as Variance components and are estimates of the added variances due to each of the factors. For consistency with leading texts on this topic, I have included estimated variance components for various balanced nested ANOVA designs in the above table. However, variance components based on a modified version of the maximum likelihood iterative model fitting procedure (REML) is generally recommended as this accommodates both balanced and unbalanced designs. While there are no numerical differences in the calculations of variance components for fixed and random factors, fixed factors are interpreted very differently and arguably have little clinical meaning (other to infer relative contribution). For fixed factors, variance components estimate the variance between the means of the specific populations that are represented by the selected levels of the factor and therefore represent somewhat arbitrary and artificial populations. For random factors, variance components estimate the variance between means of all possible populations that could have been selected and thus represents the true population variance.

Assumptions

An F-distribution represents the relative frequencies of all the possible F-ratio’s when a given null hypothesis is true and certain assumptions about the residuals (denominator in the F-ratio calculation) hold. Consequently, it is also important that diagnostics associated with a particular hypothesis test reflect the denominator for the appropriate F-ratio. For example, when testing the null hypothesis that there is no effect of Factor A (\(H_0(A):\alpha_i=0\)) in a mixed nested ANOVA, the means of each level of Factor B are used as the replicates of Factor A. As with single factor anova, hypothesis testing for nested ANOVA assumes the residuals are:

  • normally distributed. Factors higher up in the hierarchy of a nested model are based on means (or means of means) of lower factors and thus the Central Limit Theory would predict that normality will usually be satisfied for the higher level factors. Nevertheless, boxplots using the appropriate scale of replication should be used to explore normality. Scale transformations are often useful.

  • equally varied. Boxplots and plots of means against variance (using the appropriate scale of replication) should be used to explore the spread of values. Residual plots should reveal no patterns. Scale transformations are often useful.

  • independent of one another - this requires special consideration so as to ensure that the scale at which sub-replicates are measured is still great enough to enable observations to be independent.

Unbalanced nested designs

Designs that incorporate fixed and random factors (either nested or factorial), involve F-ratio calculations in which the denominators are themselves random factors other than the overall residuals. Many statisticians argue that when such denominators are themselves not statistically significant (at the \(0.25\) level), there are substantial power benefits from pooling together successive non-significant denominator terms. Thus an F-ratio for a particular factor might be recalculated after pooling together its original denominator with its denominators denominator and so on. The conservative \(0.25\) is used instead of the usual 0.05 to reduce further the likelihood of Type II errors (falsely concluding an effect is non-significant - that might result from insufficient power).

For a simple completely balanced nested ANOVA, it is possible to pool together (calculate their mean) each of the sub-replicates within each nest (site) and then perform single factor ANOVA on those aggregates. Indeed, for a balanced design, the estimates and hypothesis for Factor A will be identical to that produced via nested ANOVA. However, if there are an unequal number of sub-replicates within each nest, then the single factor ANOVA will be less powerful that a proper nested ANOVA. Unbalanced designs are those designs in which sample (subsample) sizes for each level of one or more factors differ. These situations are relatively common in biological research, however such imbalance has some important implications for nested designs.

Firstly, hypothesis tests are more robust to the assumptions of normality and equal variance when the design is balanced. Secondly (and arguably, more importantly), the model contrasts are not orthogonal (independent) and the sums of squares component attributed to each of the model terms cannot be calculated by simple additive partitioning of the total sums of squares. In such situations, exact F-ratios cannot be constructed (at least in theory), variance components calculations are more complicated and significance tests cannot be computed. The denominator MS in an F-ratio is determined by examining the expected value of the mean squares of each term in a model. Unequal sample sizes result in expected means squares for which there are no obvious logical comparators that enable the impact of an individual model term to be isolated. The severity of this issue depends on which scale of the sub-sampling hierarchy the unbalance(s) occurs as well whether the unbalance occurs in the replication of a fixed or random factor. For example, whilst unequal levels of the first nesting factor (e.g. unequal number of burn vs un-burnt sites) has no effect on F-ratio construction or hypothesis testing for the top level factor (irrespective of whether either of the factors are fixed or random), unequal sub-sampling (replication) at the level of a random (but not fixed) nesting factor will impact on the ability to construct F-ratios and variance components of all terms above it in the hierarchy. There are a number of alternative ways of dealing with unbalanced nested designs. All alternatives assume that the imbalance is not a direct result of the treatments themselves. Such outcomes are more appropriately analysed by modelling the counts of surviving observations via frequency analysis.

  • Split the analysis up into separate smaller simple ANOVA’s each using the means of the nesting factor to reflect the appropriate scale of replication. As the resulting sums of squares components are thereby based on an aggregated dataset the analyses then inherit the procedures and requirements of single ANOVA.
  • Adopt mixed-modelling techniques.

We note that, in a Bayesian framework, issues of design balance essentially evaporate.

Linear mixed effects models

Although the term “mixed-effects” can be used to refer to any design that incorporates both fixed and random predictors, its use is more commonly restricted to designs in which factors are nested or grouped within other factors. Typical examples include nested, longitudinal (measurements repeated over time) data, repeated measures and blocking designs. Furthermore, rather than basing parameter estimations on observed and expected mean squares or error strata (as outline above), mixed-effects models estimate parameters via maximum likelihood (ML) or residual maximum likelihood (REML). In so doing, mixed-effects models more appropriately handle estimation of parameters, effects and variance components of unbalanced designs (particularly for random effects). Resulting fitted (or expected) values of each level of a factor (for example, the expected population site means) are referred to as Best Linear Unbiased Predictors (BLUP’s). As an acknowledgement that most estimated site means will be more extreme than the underlying true population means they estimate (based on the principle that smaller sample sizes result in greater chances of more extreme observations and that nested sub-replicates are also likely to be highly intercorrelated), BLUP’s are less spread from the overall mean than are simple site means. In addition, mixed-effects models naturally model the “within-block” correlation structure that complicates many longitudinal designs.

Whilst the basic concepts of mixed-effects models have been around for a long time, recent computing advances and adoptions have greatly boosted the popularity of these procedures. Linear mixed effects models are currently at the forefront of statistical development, and as such, are very much a work in progress - both in theory and in practice. Recent developments have seen a further shift away from the traditional practices associated with degrees of freedom, probability distribution and p-value calculations. The traditional approach to inference testing is to compare the fit of an alternative (full) model to a null (reduced) model (via an F-ratio). When assumptions of normality and homogeneity of variance apply, the degrees of freedom are easily computed and the F-ratio has an exact F-distribution to which it can be compared. However, this approach introduces two additional problematic assumptions when estimating fixed effects in a mixed effects model. Firstly, when estimating the effects of one factor, the parameter estimates associated with other factor(s) are assumed to be the true values of those parameters (not estimates). Whilst this assumption is reasonable when all factors are fixed, as random factors are selected such that they represent one possible set of levels drawn from an entire population of possible levels for the random factor, it is unlikely that the associated parameter estimates accurately reflect the true values. Consequently, there is not necessarily an appropriate F-distribution. Furthermore, determining the appropriate degrees of freedom (nominally, the number of independent observations on which estimates are based) for models that incorporate a hierarchical structure is only possible under very specific circumstances (such as completely balanced designs). Degrees of freedom is a somewhat arbitrary defined concept used primarily to select a theoretical probability distribution on which a statistic can be compared. Arguably, however, it is a concept that is overly simplistic for complex hierarchical designs. Most statistical applications continue to provide the “approximate” solutions (as did earlier versions within R). However, R linear mixed effects development leaders argue strenuously that given the above shortcomings, such approximations are variably inappropriate and are thus omitted.

Markov chain Monte Carlo (MCMC) sampling methods provide a Bayesian-like alternative for inference testing. Markov chains use the mixed model parameter estimates to generate posterior probability distributions of each parameter from which Monte Carlo sampling methods draw a large set of parameter samples. These parameter samples can then be used to calculate highest posterior density (HPD) intervals (also known as Bayesian credible intervals). Such intervals indicate the interval in which there is a specified probability (typically \(95\)%) that the true population parameter lies. Furthermore, whilst technically against the spirit of the Bayesian philosophy, it is also possible to generate P values on which to base inferences.

Data generation

Imagine we has designed an experiment in which we intend to measure a response (\(y\)) to one of treatments (three levels; “a1,” “a2” and “a3”). The treatments occur at a spatial scale (over an area) that far exceeds the logistical scale of sampling units (it would take too long to sample at the scale at which the treatments were applied). The treatments occurred at the scale of hectares whereas it was only feasible to sample y using 1m quadrats. Given that the treatments were naturally occurring (such as soil type), it was not possible to have more than five sites of each treatment type, yet prior experience suggested that the sites in which you intended to sample were very uneven and patchy with respect to \(y\). In an attempt to account for this inter-site variability (and thus maximize the power of the test for the effect of treatment, you decided to employ a nested design in which 10 quadrats were randomly located within each of the five replicate sites per three treatments. As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped.

> library(plyr)
> set.seed(123)
> nTreat <- 3
> nSites <- 15
> nSitesPerTreat <- nSites/nTreat
> nQuads <- 10
> site.sigma <- 12
> sigma <- 5
> n <- nSites * nQuads
> sites <- gl(n=nSites,k=nQuads, lab=paste0('S',1:nSites))
> A <- gl(nTreat, nSitesPerTreat*nQuads, n, labels=c('a1','a2','a3'))
> a.means <- c(40,70,80)
> ## the site means (treatment effects) are drawn from normal distributions
> ## with means of 40, 70 and 80 and standard deviations of 12
> A.effects <- rnorm(nSites, rep(a.means,each=nSitesPerTreat),site.sigma)
> #A.effects <- a.means %*% t(model.matrix(~A, data.frame(A=gl(nTreat,nSitesPerTreat,nSites))))+rnorm(nSites,0,site.sigma)
> Xmat <- model.matrix(~sites -1)
> lin.pred <- Xmat %*% c(A.effects)
> ## the quadrat observations (within sites) are drawn from
> ## normal distributions with means according to the site means
> ## and standard deviations of 5
> y <- rnorm(n,lin.pred,sigma)
> data.nest <- data.frame(y=y, A=A, Sites=sites,Quads=1:length(y))
> head(data.nest)  #print out the first six rows of the data set
         y  A Sites Quads
1 42.20886 a1    S1     1
2 35.76354 a1    S1     2
3 23.44121 a1    S1     3
4 36.78107 a1    S1     4
5 30.91034 a1    S1     5
6 27.93517 a1    S1     6
> 
> library(ggplot2)
> ggplot(data.nest, aes(y=y, x=1)) + geom_boxplot() + facet_grid(.~Sites)

Exploratory data analysis

Normality and Homogeneity of variance

> #Effects of treatment
> boxplot(y~A, ddply(data.nest, ~A+Sites,numcolwise(mean, na.rm=T)))

> 
> #Site effects
> boxplot(y~Sites, ddply(data.nest, ~A+Sites+Quads,numcolwise(mean, na.rm=T)))

> 
> ## with ggplot2
> ggplot(ddply(data.nest, ~A+Sites,numcolwise(mean, na.rm=T)), aes(y=y, x=A)) +
+   geom_boxplot()

Conclusions:

  • there is no evidence that the response variable is consistently non-normal across all populations - each boxplot is approximately symmetrical.

  • there is no evidence that variance (as estimated by the height of the boxplots) differs between the five populations. More importantly, there is no evidence of a relationship between mean and variance - the height of boxplots does not increase with increasing position along the y-axis. Hence it there is no evidence of non-homogeneity.

Obvious violations could be addressed either by:

  • transform the scale of the response variables (to address normality, etc). Note transformations should be applied to the entire response variable (not just those populations that are skewed).

Model fitting

For non-hierarchical linear models, uniform priors on variance (standard deviation) parameters seem to work reasonably well. Gelman and others (2006) warns that the use of the inverse-gamma family of non-informative priors are very sensitive to ϵ particularly when variance is close to zero and this may lead to unintentionally informative priors. When the number of groups (treatments or varying/random effects) is large (more than \(5\)), Gelman and others (2006) advocated the use of either uniform or half-cauchy priors. Yet when the number of groups is low, Gelman and others (2006) indicates that uniform priors have a tendency to result in inflated variance estimates. Consequently, half-cauchy priors are generally recommended for variances.

Full parameterisation

\[ y_{ijk} \sim N(\mu_{ij}, \sigma^2), \;\;\; \mu_{ij}=\alpha_0 + \alpha_i + \beta_{j(i)}, \]

where \(\beta_{ij)} \sim N(0, \sigma^2_B)\), \(\alpha_0, \alpha_i \sim N(0, 1000000)\), and \(\sigma^2, \sigma^2_B \sim \text{Cauchy(0, 25)}\). The full parameterisation, shows the effects parameterisation in which there is an intercept (\(\alpha_0\)) and two treatment effects (\(\alpha_i\), where \(i\) is \(1,2\)).

Matrix parameterisation

\[ y_{ijk} \sim N(\mu_{ij}, \sigma^2), \;\;\; \mu_{ij}=\boldsymbol \alpha \boldsymbol X + \beta_{j(i)}, \]

where \(\beta_{ij} \sim N(0, \sigma^2_B)\), \(\boldsymbol \alpha \sim MVN(0, 1000000)\), and \(\sigma^2, \sigma^2_B \sim \text{Cauchy(0, 25)}\). The full parameterisation, shows the effects parameterisation in which there is an intercept (\(\alpha_0\)) and two treatment effects (\(\alpha_i\), where \(i\) is \(1,2\)). The matrix parameterisation is a compressed notation, In this parameterisation, there are three alpha parameters (one representing the mean of treatment a1, and the other two representing the treatment effects (differences between a2 and a1 and a3 and a1). In generating priors for each of these three alpha parameters, we could loop through each and define a non-informative normal prior to each (as in the Full parameterisation version). However, it turns out that it is more efficient (in terms of mixing and thus the number of necessary iterations) to define the priors from a multivariate normal distribution. This has as many means as there are parameters to estimate (\(3\)) and a \(3\times3\) matrix of zeros and \(100\) in the diagonals.

\[ \boldsymbol \mu = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \;\;\; \sigma^2 \sim \begin{bmatrix} 1000000 & 0 & 0 \\ 0 & 1000000 & 0 \\ 0 & 0 & 1000000 \end{bmatrix}. \]

Hierarchical parameterisation

\[ y_{ijk} \sim N(\beta_{i(j)}, \sigma^2), \;\;\; \beta_{i(j)}\sim N(\mu_i, \sigma^2_B), \]

where \(\mu_i = \boldsymbol \alpha \boldsymbol X\), \(\alpha_i \sim N(0, 1000000)\), and \(\sigma^2, \sigma^2_B \sim \text{Cauchy(0, 25)}\). In the heirarchical parameterisation, we are indicating two residual layers - one representing the variability in the observed data between individual observations (within sites) and the second representing the variability between site means (within the three treatments).

Full means parameterisation

> rstanString="
+ data{
+    int n;
+    int nA;
+    int nB;
+    vector [n] y;
+    int A[n];
+    int B[n];
+ }
+ 
+ parameters{
+   real alpha[nA];
+   real<lower=0> sigma;
+   vector [nB] beta;
+   real<lower=0> sigma_B;
+ }
+  
+ model{
+     real mu[n];
+ 
+     // Priors
+     alpha ~ normal( 0 , 100 );
+     beta ~ normal( 0 , sigma_B );
+     sigma_B ~ cauchy( 0 , 25 );
+     sigma ~ cauchy( 0 , 25 );
+     
+     for ( i in 1:n ) {
+         mu[i] = alpha[A[i]] + beta[B[i]];
+     }
+     y ~ normal( mu , sigma );
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString, con = "fullModel.stan")

Arrange the data as a list (as required by STAN). As input, STAN will need to be supplied with: the response variable, the predictor matrix, the number of predictors, the total number of observed items. This all needs to be contained within a list object. We will create two data lists, one for each of the hypotheses.

> data.nest.list <- with(data.nest, list(y=y, A=as.numeric(A), B=as.numeric(Sites),
+   n=nrow(data.nest), nB=length(levels(Sites)),nA=length(levels(A))))

Define the nodes (parameters and derivatives) to monitor and the chain parameters.

> params <- c("alpha","sigma","sigma_B")
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)

Start the STAN model (check the model, load data into the model, specify the number of chains and compile the model). Load the rstan package.

> library(rstan)

Now run the STAN code via the rstan interface.

> data.nest.rstan.c <- stan(data = data.nest.list, file = "fullModel.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'fullModel' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 1.44 seconds (Warm-up)
Chain 1:                0.825 seconds (Sampling)
Chain 1:                2.265 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'fullModel' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 1.447 seconds (Warm-up)
Chain 2:                0.698 seconds (Sampling)
Chain 2:                2.145 seconds (Total)
Chain 2: 
> 
> print(data.nest.rstan.c, par = c("alpha", "sigma", "sigma_B"))
Inference for Stan model: fullModel.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1] 42.06    0.15 5.24 31.81 38.65 42.09 45.31 53.39  1275    1
alpha[2] 69.73    0.15 5.57 58.33 66.37 69.84 73.06 81.46  1311    1
alpha[3] 83.14    0.16 5.48 71.63 79.81 83.18 86.56 94.09  1106    1
sigma     5.04    0.01 0.31  4.48  4.82  5.02  5.23  5.69  1870    1
sigma_B  11.58    0.08 2.75  7.59  9.67 11.10 12.98 18.17  1247    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:11:02 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest.rstan.c.df <-as.data.frame(extract(data.nest.rstan.c))
> head(data.nest.rstan.c.df)
   alpha.1  alpha.2  alpha.3    sigma  sigma_B      lp__
1 43.15204 70.32489 83.74677 5.015235  7.69266 -354.6193
2 41.99303 61.68272 83.19149 5.498385 14.33031 -356.2989
3 44.80492 80.03676 90.67930 4.512057 11.10260 -359.9276
4 42.89050 76.05822 93.23336 5.185964 12.10384 -357.2439
5 40.25460 70.21282 82.32774 4.813780 10.92056 -354.4821
6 40.75613 72.59147 81.86612 4.952488  9.72872 -358.6908

Full effect parameterisation

> rstan2String="
+ data{
+    int n;
+    int nB;
+    vector [n] y;
+    int A2[n];
+    int A3[n];
+    int B[n];
+ }
+ 
+ parameters{
+   real alpha0;
+   real alpha2;
+   real alpha3;
+   real<lower=0> sigma;
+   vector [nB] beta;
+   real<lower=0> sigma_B;
+ }
+  
+ model{
+     real mu[n];
+ 
+     // Priors
+     alpha0 ~ normal( 0 , 100 );
+     alpha2 ~ normal( 0 , 100 );
+     alpha3 ~ normal( 0 , 100 );
+     beta ~ normal( 0 , sigma_B );
+     sigma_B ~ cauchy( 0 , 25 );
+     sigma ~ cauchy( 0 , 25 );
+     
+     for ( i in 1:n ) {
+         mu[i] = alpha0 + alpha2*A2[i] + 
+                alpha3*A3[i] + beta[B[i]];
+     }
+     y ~ normal( mu , sigma );
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstan2String, con = "full2Model.stan")

Arrange the data as a list (as required by STAN). As input, STAN will need to be supplied with: the response variable, the predictor matrix, the number of predictors, the total number of observed items. This all needs to be contained within a list object. We will create two data lists, one for each of the hypotheses.

> A2 <- ifelse(data.nest$A=='a2',1,0)
> A3 <- ifelse(data.nest$A=='a3',1,0)
> data.nest.list <- with(data.nest, list(y=y, A2=A2, A3=A3, B=as.numeric(Sites),
+    n=nrow(data.nest), nB=length(levels(Sites))))

Define the nodes (parameters and derivatives) to monitor and the chain parameters.

> params <- c("alpha0","alpha2","alpha3","sigma","sigma_B")
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)

Now run the STAN code via the rstan interface.

> data.nest.rstan.c2 <- stan(data = data.nest.list, file = "full2Model.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'full2Model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 4.735 seconds (Warm-up)
Chain 1:                3.205 seconds (Sampling)
Chain 1:                7.94 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'full2Model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 4.246 seconds (Warm-up)
Chain 2:                3.033 seconds (Sampling)
Chain 2:                7.279 seconds (Total)
Chain 2: 
> 
> print(data.nest.rstan.c2, par = c("alpha0", "alpha2", "alpha3", "sigma", "sigma_B"))
Inference for Stan model: full2Model.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha0  42.14    0.18 5.28 32.35 38.73 41.95 45.37 53.07   821    1
alpha2  27.75    0.23 7.18 13.76 23.20 27.86 32.36 42.00   977    1
alpha3  40.82    0.26 7.41 26.03 36.15 40.72 45.54 55.23   784    1
sigma    5.04    0.01 0.30  4.49  4.82  5.03  5.24  5.68  1365    1
sigma_B 11.61    0.10 2.87  7.58  9.70 11.15 13.02 18.04   885    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:11:46 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest.rstan.c2.df <-as.data.frame(extract(data.nest.rstan.c2))
> head(data.nest.rstan.c2.df)
    alpha0   alpha2   alpha3    sigma   sigma_B      lp__
1 39.70299 33.61554 38.67780 5.984734  8.194675 -361.6994
2 47.68870 17.00806 25.96021 5.075493 10.263042 -354.7169
3 41.26990 29.39069 46.25732 5.113701  8.065909 -359.5336
4 38.02844 38.40668 49.88790 5.127820 13.155391 -355.3177
5 40.43790 18.58691 46.44096 4.763335 12.664118 -355.2254
6 46.51676 26.35108 36.54911 4.474970 11.370100 -360.1367

Matrix parameterisation

> rstanString2="
+ data{
+    int n;
+    int nB;
+    int nA;
+    vector [n] y;
+    matrix [n,nA] X;
+    int B[n];
+    vector [nA] a0;
+    matrix [nA,nA] A0;
+ }
+ 
+ parameters{
+   vector [nA] alpha;
+   real<lower=0> sigma;
+   vector [nB] beta;
+   real<lower=0> sigma_B;
+ }
+  
+ model{
+     real mu[n];
+ 
+     // Priors
+     //alpha ~ normal( 0 , 100 );
+     alpha ~ multi_normal(a0,A0);
+     beta ~ normal( 0 , sigma_B );
+     sigma_B ~ cauchy( 0 , 25);
+     sigma ~ cauchy( 0 , 25 );
+     
+     for ( i in 1:n ) {
+         mu[i] = dot_product(X[i],alpha) + beta[B[i]];
+     }
+     y ~ normal( mu , sigma );
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString2, con = "matrixModel.stan")

Arrange the data as a list (as required by STAN). As input, STAN will need to be supplied with: the response variable, the predictor matrix, the number of predictors, the total number of observed items. This all needs to be contained within a list object. We will create two data lists, one for each of the hypotheses.

> X <- model.matrix(~A, data.nest)
> nA <- ncol(X)
> data.nest.list <- with(data.nest, list(y=y, X=X, B=as.numeric(Sites),
+    n=nrow(data.nest), nB=length(levels(Sites)), nA=nA,
+    a0=rep(0,nA), A0=diag(100000,nA)))

Define the nodes (parameters and derivatives) to monitor and the chain parameters.

> params <- c("alpha","sigma","sigma_B")
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)

Now run the STAN code via the rstan interface.

> data.nest.rstan.m <- stan(data = data.nest.list, file = "matrixModel.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'matrixModel' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 8.255 seconds (Warm-up)
Chain 1:                4.682 seconds (Sampling)
Chain 1:                12.937 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'matrixModel' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 8.74 seconds (Warm-up)
Chain 2:                5.977 seconds (Sampling)
Chain 2:                14.717 seconds (Total)
Chain 2: 
> 
> print(data.nest.rstan.m, par = c("alpha", "sigma", "sigma_B"))
Inference for Stan model: matrixModel.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1] 42.16    0.20 5.65 31.37 38.49 42.17 45.58 52.75   771    1
alpha[2] 27.40    0.30 7.96 11.55 22.56 27.49 32.23 42.92   704    1
alpha[3] 41.11    0.26 7.92 25.22 36.27 41.25 46.15 56.31   908    1
sigma     5.03    0.01 0.31  4.46  4.83  5.01  5.23  5.67  1848    1
sigma_B  11.71    0.09 2.71  7.72  9.89 11.28 13.00 18.46   926    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:12:48 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest.rstan.m.df <-as.data.frame(extract(data.nest.rstan.m))
> head(data.nest.rstan.m.df)
   alpha.1  alpha.2  alpha.3    sigma   sigma_B      lp__
1 33.94608 30.18616 45.50708 5.005223 14.545831 -357.5156
2 38.68898 23.87072 38.06314 5.022923  9.121265 -354.3067
3 42.52249 31.77733 25.79672 5.520427 13.070609 -364.6733
4 42.87262 32.61743 34.14891 4.884315 12.579670 -356.7221
5 40.65653 25.51010 47.30654 4.735719 16.704468 -361.4840
6 49.58963 17.14219 35.23524 4.610314 11.579346 -355.0129

Hierarchical parameterisation

> rstanString3="
+ data{
+    int n;
+    int nA;
+    int nSites;
+    vector [n] y;
+    matrix [nSites,nA] X;
+    matrix [n,nSites] Z;
+ }
+ 
+ parameters{
+    vector[nA] beta;
+    vector[nSites] gamma;
+    real<lower=0> sigma;
+    real<lower=0> sigma_S;
+    
+ }
+  
+ model{
+     vector [n] mu_site;
+     vector [nSites] mu;
+ 
+     // Priors
+     beta ~ normal( 0 , 1000 );
+     gamma ~ normal( 0 , 1000 );
+     sigma ~ cauchy( 0 , 25 );
+     sigma_S~ cauchy( 0 , 25 );
+ 
+     mu_site = Z*gamma;
+     y ~ normal( mu_site , sigma );
+     mu = X*beta;
+     gamma ~ normal(mu, sigma_S);
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString3, con = "hierarchicalModel.stan")

Arrange the data as a list (as required by STAN). As input, STAN will need to be supplied with: the response variable, the predictor matrix, the number of predictors, the total number of observed items. This all needs to be contained within a list object. We will create two data lists, one for each of the hypotheses.

> dt.A <- ddply(data.nest,~Sites,catcolwise(unique))
> X<-model.matrix(~A, dt.A)
> Z<-model.matrix(~Sites-1, data.nest)
> data.nest.list <- list(y=data.nest$y, X=X, Z=Z, n=nrow(data.nest),
+   nSites=nrow(X),nA=ncol(X))

Define the nodes (parameters and derivatives) to monitor and the chain parameters.

> params <- c("beta","sigma","sigma_S")
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)

Now run the STAN code via the rstan interface.

> data.nest.rstan.h <- stan(data = data.nest.list, file = "hierarchicalModel.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'hierarchicalModel' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.761 seconds (Warm-up)
Chain 1:                0.317 seconds (Sampling)
Chain 1:                1.078 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'hierarchicalModel' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.718 seconds (Warm-up)
Chain 2:                0.333 seconds (Sampling)
Chain 2:                1.051 seconds (Total)
Chain 2: 
> 
> print(data.nest.rstan.h, par = c("beta", "sigma", "sigma_S"))
Inference for Stan model: hierarchicalModel.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta[1] 42.29    0.11 5.32 31.76 38.75 42.34 45.78 52.66  2552    1
beta[2] 27.53    0.15 7.52 12.26 22.77 27.29 32.41 42.24  2637    1
beta[3] 40.96    0.15 7.68 26.05 36.02 40.77 46.02 56.51  2719    1
sigma    5.05    0.01 0.30  4.51  4.84  5.04  5.23  5.68  2990    1
sigma_S 11.61    0.06 2.79  7.77  9.58 11.14 12.95 18.55  2473    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:13:20 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest.rstan.h.df <-as.data.frame(extract(data.nest.rstan.h))
> head(data.nest.rstan.h.df)
    beta.1   beta.2   beta.3    sigma   sigma_S      lp__
1 40.18971 36.18208 40.68880 4.860507  8.009102 -354.4892
2 31.50034 41.79442 50.65135 4.933646 12.951002 -364.8473
3 37.78913 35.81777 47.67498 5.196665  9.753561 -355.2075
4 46.70426 24.76463 44.75701 4.907866 11.683086 -359.7312
5 44.43432 24.81058 38.15300 5.117181 10.045637 -352.9194
6 44.37251 19.98415 33.45183 4.833427  9.668305 -353.2098

If you want to include finite-population standard deviations in the model you can use the following code.

> rstanString4="
+ data{
+    int n;
+    int nA;
+    int nSites;
+    vector [n] y;
+    matrix [nSites,nA] X;
+    matrix [n,nSites] Z;
+ }
+ 
+ parameters{
+    vector[nA] beta;
+    vector[nSites] gamma;
+    real<lower=0> sigma;
+    real<lower=0> sigma_S;
+    
+ }
+ 
+ model{
+     vector [n] mu_site;
+     vector [nSites] mu;
+ 
+     // Priors
+     beta ~ normal( 0 , 1000 );
+     gamma ~ normal( 0 , 1000 );
+     sigma ~ cauchy( 0 , 25 );
+     sigma_S~ cauchy( 0 , 25 );
+ 
+     mu_site = Z*gamma;
+     y ~ normal( mu_site , sigma );
+     mu = X*beta;
+     gamma ~ normal(mu, sigma_S);
+ }
+ 
+ generated quantities {
+     vector [n] mu_site;
+     vector [nSites] mu;
+     vector [n] y_err;
+     real sd_y;
+     vector [nSites] mu_site_err;
+     real sd_site;
+     real sd_A;
+     
+     mu_site = Z*gamma;
+     y_err = mu_site - y;
+     sd_y = sd(y_err);
+ 
+     mu = X*beta;
+     mu_site_err = mu - gamma;
+     sd_site = sd(mu_site_err);
+ 
+     sd_A = sd(beta);
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString4, con = "SDModel.stan")
> 
> #data list
> dt.A <- ddply(data.nest,~Sites,catcolwise(unique))
> X<-model.matrix(~A, dt.A)
> Z<-model.matrix(~Sites-1, data.nest)
> data.nest.list <- list(y=data.nest$y, X=X, Z=Z, n=nrow(data.nest),
+    nSites=nrow(X),nA=ncol(X))
> 
> #parameters and chain details
> params <- c('beta','sigma','sigma_S','sd_A','sd_site','sd_y')
> adaptSteps = 1000
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)
> 
> data.nest.rstan.SD <- stan(data = data.nest.list, file = "SDModel.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'SDModel' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.696 seconds (Warm-up)
Chain 1:                0.316 seconds (Sampling)
Chain 1:                1.012 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'SDModel' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.731 seconds (Warm-up)
Chain 2:                0.32 seconds (Sampling)
Chain 2:                1.051 seconds (Total)
Chain 2: 
> 
> print(data.nest.rstan.SD, par = c('beta','sigma','sigma_S','sd_A','sd_site','sd_y'))
Inference for Stan model: SDModel.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta[1] 42.24    0.11 5.36 31.51 38.96 42.28 45.54 52.77  2361    1
beta[2] 27.42    0.15 7.52 11.92 23.03 27.50 32.20 41.74  2508    1
beta[3] 40.91    0.15 7.59 26.02 36.09 41.04 45.56 56.39  2707    1
sigma    5.04    0.01 0.31  4.46  4.82  5.01  5.23  5.69  3342    1
sigma_S 11.61    0.05 2.69  7.72  9.68 11.13 13.03 18.14  2421    1
sd_A    10.20    0.12 4.48  2.69  7.22  9.64 12.73 20.54  1445    1
sd_site 10.68    0.03 1.11  9.24  9.96 10.43 11.14 13.51  1139    1
sd_y     4.99    0.00 0.10  4.85  4.92  4.98  5.05  5.22  1337    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:13:52 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest.rstan.SD.df <-as.data.frame(extract(data.nest.rstan.SD))
> head(data.nest.rstan.SD.df)
    beta.1   beta.2   beta.3    sigma   sigma_S      sd_A   sd_site     sd_y
1 40.43674 29.25310 45.21689 4.663609  9.517728  8.193149  9.983161 4.929392
2 49.89501 15.10486 29.68431 4.579810  9.530417 17.470868 10.813172 4.996209
3 37.77238 35.19504 46.26184 5.175441 11.006254  5.790620  9.896594 5.267506
4 36.21905 33.77648 49.31786 4.644749  9.820680  8.357426 11.617789 5.059987
5 43.52928 24.90145 40.77302 5.411379  7.547366 10.054018  9.158198 5.074078
6 46.50217 23.16053 34.67047 4.614323 10.378371 11.671189 10.579336 4.939073
       lp__
1 -352.2536
2 -356.7218
3 -361.6852
4 -360.1793
5 -357.3132
6 -354.1919

Data generation - second example

Now imagine a similar experiment in which we intend to measure a response (\(y\)) to one of treatments (three levels; “a1,” “a2” and “a3”). As with the previous design, we decided to establish a nested design in which there are sub-replicate (\(1\)m Quadrats) within each Site. In the current design, we have decided to further sub-replicate. Within each of the \(5\) Quadrats, we are going to randomly place \(2\times10\)cm pit traps. Now we have Sites nested within Treatments, Quadrats nested within Sites AND, Pits nested within Sites. The latter of these (Pits nested within Sites) are the observations (\(y\)). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped.

> set.seed(123)
> nTreat <- 3
> nSites <- 15
> nSitesPerTreat <- nSites/nTreat
> nQuads <- 5
> nPits <- 2
> site.sigma <- 10 # sd within between sites within treatment
> quad.sigma <- 10
> sigma <- 7.5
> n <- nSites * nQuads * nPits
> sites <- gl(n=nSites,n/nSites,n, lab=paste("site",1:nSites))
> A <- gl(nTreat, n/nTreat, n, labels=c('a1','a2','a3'))
> a.means <- c(40,70,80)
> 
> #A<-gl(nTreat,nSites/nTreat,nSites,labels=c('a1','a2','a3'))
> a<-gl(nTreat,1,nTreat,labels=c('a1','a2','a3'))
> a.X <- model.matrix(~a, expand.grid(a))
> a.eff <- as.vector(solve(a.X,a.means))
> site.means <- rnorm(nSites,a.X %*% a.eff,site.sigma)
> 
> A <- gl(nTreat,nSites/nTreat,nSites,labels=c('a1','a2','a3'))
> A.X <- model.matrix(~A, expand.grid(A))
> #a.X <- model.matrix(~A, expand.grid(A=gl(nTreat,nSites/nTreat,nSites,labels=c('a1','a2','a3'))))
> site.means <- rnorm(nSites,A.X %*% a.eff,site.sigma)
> 
> SITES <- gl(nSites,(nSites*nQuads)/nSites,nSites*nQuads,labels=paste('site',1:nSites))
> sites.X <- model.matrix(~SITES-1)
> quad.means <- rnorm(nSites*nQuads,sites.X %*% site.means,quad.sigma)
> 
> #SITES <- gl(nSites,1,nSites,labels=paste('site',1:nSites))
> #sites.X <- model.matrix(~SITES-1)
> #quad.means <- rnorm(nSites*nQuads,sites.X %*% site.means,quad.sigma)
> 
> QUADS <- gl(nSites*nQuads,n/(nSites*nQuads),n,labels=paste('quad',1:(nSites*nQuads)))
> quads.X <- model.matrix(~QUADS-1)
> #quads.eff <- as.vector(solve(quads.X,quad.means))
> #pit.means <- rnorm(n,quads.eff %*% t(quads.X),sigma)
> pit.means <- rnorm(n,quads.X %*% quad.means,sigma)
> 
> PITS <- gl(nPits*nSites*nQuads,1, n, labels=paste('pit',1:(nPits*nSites*nQuads)))
> data.nest1<-data.frame(Pits=PITS,Quads=QUADS,Sites=rep(SITES,each=2), A=rep(A,each=nQuads*nPits),y=pit.means)
> #data.nest1<-data.nest1[order(data.nest1$A,data.nest1$Sites,data.nest1$Quads),]
> head(data.nest1)  #print out the first six rows of the data set
   Pits  Quads  Sites  A        y
1 pit 1 quad 1 site 1 a1 61.79607
2 pit 2 quad 1 site 1 a1 56.24699
3 pit 3 quad 2 site 1 a1 42.40885
4 pit 4 quad 2 site 1 a1 52.06672
5 pit 5 quad 3 site 1 a1 73.71286
6 pit 6 quad 3 site 1 a1 62.50529
> 
> ggplot(data.nest1, aes(y=y, x=1)) + geom_boxplot() + facet_grid(.~Quads)

Exploratory data analysis

Normality and Homogeneity of variance

> #Effects of treatment
> boxplot(y~A, ddply(data.nest1, ~A+Sites,numcolwise(mean, na.rm=T)))

> 
> #Site effects
> boxplot(y~Sites, ddply(data.nest1, ~A+Sites+Quads,numcolwise(mean, na.rm=T)))

> 
> #Quadrat effects
> boxplot(y~Quads, ddply(data.nest1, ~A+Sites+Quads+Pits,numcolwise(mean, na.rm=T)))

Conclusions:

  • there is no evidence that the response variable is consistently non-normal across all populations - each boxplot is approximately symmetrical.

  • there is no evidence that variance (as estimated by the height of the boxplots) differs between the five populations. More importantly, there is no evidence of a relationship between mean and variance - the height of boxplots does not increase with increasing position along the \(y\)-axis. Hence it there is no evidence of non-homogeneity.

  • it is a little difficult to assess normality/homogeneity of variance of quadrats since there are only two pits per quadrat. Nevertheless, there is no suggestion that variance increases with increasing mean.

Obvious violations could be addressed either by:

  • transform the scale of the response variables (to address normality, etc). Note transformations should be applied to the entire response variable (not just those populations that are skewed).

Model fitting

Frequentist for comparison

> library(nlme)
> d.lme <- lme(y ~ A, random=~1|Sites/Quads,data=data.nest1)
> summary(d.lme)
Linear mixed-effects model fit by REML
  Data: data.nest1 
       AIC      BIC   logLik
  1137.994 1155.937 -562.997

Random effects:
 Formula: ~1 | Sites
        (Intercept)
StdDev:    10.38248

 Formula: ~1 | Quads %in% Sites
        (Intercept) Residual
StdDev:    8.441615 7.161178

Fixed effects:  y ~ A 
               Value Std.Error DF  t-value p-value
(Intercept) 41.38646   5.04334 75 8.206160  0.0000
Aa2         21.36271   7.13236 12 2.995181  0.0112
Aa3         39.14584   7.13236 12 5.488483  0.0001
 Correlation: 
    (Intr) Aa2   
Aa2 -0.707       
Aa3 -0.707  0.500

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.11852493 -0.54600763 -0.03428569  0.53382444  2.26256381 

Number of Observations: 150
Number of Groups: 
           Sites Quads %in% Sites 
              15               75 
> 
> anova(d.lme)
            numDF denDF  F-value p-value
(Intercept)     1    75 446.9152  <.0001
A               2    12  15.1037   5e-04

Full effect parameterisation

> rstanString="
+ data{
+    int n;
+    int nSite;
+    int nQuad;
+    vector [n] y;
+    int A2[n];
+    int A3[n];
+    int Site[n];
+    int Quad[n];
+ }
+ 
+ parameters{
+   real alpha0;
+   real alpha2;
+   real alpha3;
+   real<lower=0> sigma;
+   vector [nSite] beta_Site;
+   real<lower=0> sigma_Site;
+   vector [nQuad] beta_Quad;
+   real<lower=0> sigma_Quad;
+ }
+  
+ model{
+     real mu[n];
+ 
+     // Priors
+     alpha0 ~ normal( 0 , 100 );
+     alpha2 ~ normal( 0 , 100 );
+     alpha3 ~ normal( 0 , 100 );
+     beta_Site~ normal( 0 , sigma_Site );
+     sigma_Site ~ cauchy( 0 , 25 );
+     beta_Quad~ normal( 0 , sigma_Quad );
+     sigma_Quad ~ cauchy( 0 , 25 );
+     sigma ~ cauchy( 0 , 25 );
+     
+     for ( i in 1:n ) {
+         mu[i] = alpha0 + alpha2*A2[i] + 
+                alpha3*A3[i] + beta_Site[Site[i]] + beta_Quad[Quad[i]];
+     }
+     y ~ normal( mu , sigma );
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString, con = "fullModel2.stan")
> 
> A2 <- ifelse(data.nest1$A=='a2',1,0)
> A3 <- ifelse(data.nest1$A=='a3',1,0)
> data.nest.list <- with(data.nest1, list(y=y, A2=A2, A3=A3, Site=as.numeric(Sites),
+    n=nrow(data.nest1), nSite=length(levels(Sites)),
+    nQuad=length(levels(Quads)), Quad=as.numeric(Quads)))
> 
> params <- c('alpha0','alpha2','alpha3','sigma','sigma_Site', 'sigma_Quad')
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)
> 
> data.nest1.rstan.f <- stan(data = data.nest.list, file = "fullModel2.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'fullModel2' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 6.208 seconds (Warm-up)
Chain 1:                5.699 seconds (Sampling)
Chain 1:                11.907 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'fullModel2' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 6.936 seconds (Warm-up)
Chain 2:                4.237 seconds (Sampling)
Chain 2:                11.173 seconds (Total)
Chain 2: 
> 
> print(data.nest1.rstan.f, par = c('alpha0','alpha2','alpha3','sigma','sigma_Site', 'sigma_Quad'))
Inference for Stan model: fullModel2.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha0     41.38    0.15 5.37 30.55 37.98 41.41 44.77 52.25  1350    1
alpha2     21.39    0.19 7.75  6.39 16.32 21.37 26.38 36.77  1589    1
alpha3     39.26    0.22 8.14 23.39 34.32 39.16 44.34 55.86  1331    1
sigma       7.29    0.01 0.60  6.21  6.86  7.25  7.66  8.56  1980    1
sigma_Site 11.34    0.07 2.94  6.85  9.33 10.90 12.94 18.37  1899    1
sigma_Quad  8.52    0.03 1.13  6.38  7.74  8.45  9.27 10.77  1405    1

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:14:51 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest1.rstan.f.df <-as.data.frame(extract(data.nest1.rstan.f))
> head(data.nest1.rstan.f.df)
    alpha0   alpha2   alpha3    sigma sigma_Site sigma_Quad      lp__
1 45.17073 19.74204 35.44779 7.074432   8.002146   8.342248 -603.1710
2 36.74594 18.49752 48.94935 7.008208  10.685535   9.273360 -611.0495
3 38.31165 21.47771 49.82054 7.815280   8.589523   8.578517 -605.6948
4 38.67027 31.15159 39.55512 6.697657  13.424702   9.653050 -609.0281
5 34.00744 17.64337 45.82909 7.422725  11.481667  10.765264 -627.5098
6 40.61961 25.48412 38.63614 7.260439   8.503316   7.401149 -594.0790

Matrix parameterisation

> rstanString2="
+ data{
+    int n;
+    int nSite;
+    int nQuad;
+    int nA;
+    vector [n] y;
+    matrix [n,nA] X;
+    int Site[n];
+    int Quad[n];
+    vector [nA] a0;
+    matrix [nA,nA] A0;
+ }
+ 
+ parameters{
+   vector [nA] alpha;
+   real<lower=0> sigma;
+   vector [nSite] beta_Site;
+   real<lower=0> sigma_Site;
+   vector [nQuad] beta_Quad;
+   real<lower=0> sigma_Quad;
+ }
+  
+ model{
+     real mu[n];
+ 
+     // Priors
+     //alpha ~ normal( 0 , 100 );
+     alpha ~ multi_normal(a0,A0);
+     beta_Site ~ normal( 0 , sigma_Site );
+     sigma_Site ~ cauchy( 0 , 25);
+   beta_Quad ~ normal( 0 , sigma_Quad );
+     sigma_Quad ~ cauchy( 0 , 25);
+     sigma ~ cauchy( 0 , 25 );
+     
+     for ( i in 1:n ) {
+         mu[i] = dot_product(X[i],alpha) + beta_Site[Site[i]] + beta_Quad[Quad[i]];
+     }
+     y ~ normal( mu , sigma );
+ }
+ 
+ "
> 
> ## write the model to a text file
> writeLines(rstanString2, con = "matrixModel2.stan")
> 
> X <- model.matrix(~A, data.nest)
> nA <- ncol(X)
> data.nest.list <- with(data.nest1, list(y=y, X=X, Site=as.numeric(Sites),
+    Quad=as.numeric(Quads),
+    n=nrow(data.nest1), nSite=length(levels(Sites)),
+    nQuad=length(levels(Quads)),
+    nA=nA,
+    a0=rep(0,nA), A0=diag(100000,nA)))
> 
> params <- c('alpha','sigma','sigma_Site', 'sigma_Quad')
> burnInSteps = 3000
> nChains = 2
> numSavedSteps = 3000
> thinSteps = 1
> nIter = burnInSteps+ceiling((numSavedSteps * thinSteps)/nChains)
> 
> data.nest1.rstan.m2 <- stan(data = data.nest.list, file = "matrixModel2.stan", chains = nChains, pars = params, iter = nIter, warmup = burnInSteps, thin = thinSteps)

SAMPLING FOR MODEL 'matrixModel2' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 1: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 1: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 1: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 1: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 1: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 1: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 1: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 1: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 1: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 1: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 1: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 9.665 seconds (Warm-up)
Chain 1:                12.864 seconds (Sampling)
Chain 1:                22.529 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'matrixModel2' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4500 [  0%]  (Warmup)
Chain 2: Iteration:  450 / 4500 [ 10%]  (Warmup)
Chain 2: Iteration:  900 / 4500 [ 20%]  (Warmup)
Chain 2: Iteration: 1350 / 4500 [ 30%]  (Warmup)
Chain 2: Iteration: 1800 / 4500 [ 40%]  (Warmup)
Chain 2: Iteration: 2250 / 4500 [ 50%]  (Warmup)
Chain 2: Iteration: 2700 / 4500 [ 60%]  (Warmup)
Chain 2: Iteration: 3001 / 4500 [ 66%]  (Sampling)
Chain 2: Iteration: 3450 / 4500 [ 76%]  (Sampling)
Chain 2: Iteration: 3900 / 4500 [ 86%]  (Sampling)
Chain 2: Iteration: 4350 / 4500 [ 96%]  (Sampling)
Chain 2: Iteration: 4500 / 4500 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 11.383 seconds (Warm-up)
Chain 2:                7.552 seconds (Sampling)
Chain 2:                18.935 seconds (Total)
Chain 2: 
> 
> print(data.nest1.rstan.m2, par = c('alpha','sigma','sigma_Site', 'sigma_Quad'))
Inference for Stan model: matrixModel2.
2 chains, each with iter=4500; warmup=3000; thin=1; 
post-warmup draws per chain=1500, total post-warmup draws=3000.

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha[1]   41.35    0.15 5.97 28.78 37.74 41.51 45.02 53.05  1643 1.00
alpha[2]   21.37    0.25 8.25  5.77 16.01 21.27 26.55 37.46  1117 1.01
alpha[3]   39.19    0.22 8.42 22.50 33.77 39.03 44.54 55.95  1497 1.00
sigma       7.30    0.02 0.60  6.22  6.88  7.26  7.66  8.57  1465 1.00
sigma_Site 11.51    0.09 3.23  6.77  9.35 11.00 13.13 19.06  1240 1.00
sigma_Quad  8.52    0.03 1.15  6.41  7.72  8.47  9.26 10.92  1424 1.00

Samples were drawn using NUTS(diag_e) at Fri Dec 11 09:16:06 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> 
> data.nest1.rstan.m2.df <-as.data.frame(extract(data.nest1.rstan.m2))
> head(data.nest1.rstan.m2.df)
   alpha.1  alpha.2  alpha.3    sigma sigma_Site sigma_Quad      lp__
1 43.11339 25.20017 38.48582 6.425179   7.402032   7.991526 -610.6453
2 42.06806 21.99089 43.87533 8.086784  13.840811   9.164063 -640.4495
3 48.75651 18.25023 31.56387 7.253844  12.568228  11.137956 -620.1871
4 47.97542 27.88142 33.31643 6.983402  18.774030   9.553736 -617.8734
5 37.97879 37.97889 39.74896 7.316680  16.485006   9.229776 -629.8107
6 31.03042 26.85564 57.37791 6.391101  17.110902   7.842159 -596.1107

References

Gelman, Andrew, Daniel Lee, and Jiqiang Guo. 2015. “Stan: A Probabilistic Programming Language for Bayesian Inference and Optimization.” Journal of Educational and Behavioral Statistics 40 (5): 530–43.
Gelman, Andrew, and others. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3): 515–34.
Stan Development Team. 2018. RStan: The R Interface to Stan.” http://mc-stan.org/.
Assistant Professor in Statistics

Related