Confounders, Colliders, Mediators. What to adjust for

Hello everybody, here we are with my usual monthly update. Today, similarly to last time, I would like to talk a bit about classical statistical analysis problems, such as those I find myself routinely explaining to my students. To all health economists who were looking for some more HTA-related content: I once again apologise and ask you to tune in next time in case today’s topic is not of interest to you. The main reason why I wanted to go back to this topic is that I am currently teaching a classical stats course for 3rd-year medical students, who seem to be a bit confused about some basic stats terminology related to regression analysis. As usual, I take the opportunity to write here in order to also help myself make these concepts as clear as possible, so that I can communicate them effectively to my students. I hope some of you find this topic as interesting as I do.
So, the topic today is the distinction between different types of independent variables that are often used as covariates within a regression framework. More specifically, there is quite a large literature on the distinction between three general categories of covariates: confounders, colliders, and mediators (plus a fourth, related type, moderators, which I will also touch upon at the end). Based on this classification, there has also been quite a lot of discussion on whether one should adjust for such variables when estimating the influence of the main determinant on the outcome, depending on the type of analysis and research question.
Let’s see if I am able to make these concepts clear and provide some general suggestions on whether an adjustment for these variables is needed in a standard statistical analysis. As usual, let me start with an example, mostly inspired by another publicly available simulation example that can be found here.
Causal relationships between variables
Consider a directed acyclic graph (DAG) representation of how a researcher thinks the variables in their model are connected to each other, where an arrow between two variables implies a causal effect of the first on the second. For example, let’s consider the task of performing an analysis of observational data where, next to the exposure X and the outcome Y, the graph contains:
- A confounder variable “Con” (Confounder),
- A collider variable “Col” (Collider),
- A mediator variable “M” (Mediator)
Depending on which effect the researcher wants to focus on, a decision has to be made on which of these variables to adjust for; the remaining variables in the graph (A1-A3) can safely be left out, given that they are not related to both X and Y.
Here is the R code I used to generate the graph:
library(dagitty)
library(ggdag)
library(ggrepel)
library(dplyr)
set.seed(1234)
# DAG: X is the exposure and Y the outcome; Con is a confounder, Col a collider,
# M a mediator, and A1-A3 are each related to only one of X and Y
g <- dagify(Y ~ X,                # causal effect of interest
            X ~ A1,               # A1 affects only the exposure
            A2 ~ X,               # A2 is affected only by the exposure
            Y ~ M, M ~ X,         # M mediates part of the effect of X on Y
            Col ~ X, Col ~ Y,     # Col is affected by both X and Y (collider)
            Y ~ A3,               # A3 affects only the outcome
            Y ~ Con, X ~ Con,     # Con affects both X and Y (confounder)
            exposure = "X", outcome = "Y",
            coords = data.frame(x = c(5,1,1,1,3,3,5,3),
                                y = c(1,1,2,0,2,0,0,1.5),
                                name = c("Y","X","A1","A2","M","Col","A3","Con")))
g %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(col="grey90") +
geom_dag_edges() +
geom_dag_text(label=c("A1","A3","Con","M","X","Y","A2","Col"),col = "black") +
theme_dag()
Now, let’s see how we can identify each type of variable and decide on whether we should adjust for it or not in our analysis.
Confounders
In the vaccination example above, age is a so-called confounding variable, as it affects both the exposure (whether somebody gets vaccinated) and the outcome. Let’s illustrate the same logic with another made-up example on party preference and car usage:
library(tidyverse)
# Made-up survey data: shares (in %) of respondents by frequency of car usage,
# split by party preference and place of residence. "Weight" is the share of each
# party's voters living in the city vs. the countryside.
d <- data.frame("Party_preference" = rep(c(rep("Conservatives", 4), rep("Greens", 4)), 2),
                "Place_of_residence" = c(rep("City", 8), rep("Countryside", 8)),
                "Car_usage" = c(22,33,26,19,18,31,28,23,58,23,12,7,52,34,6,8),
                "Frequency" = rep(c("Daily", "Weekly", "Less often", "Never"), 4),
                "Weight" = c(.30,.30,.30,.30,.75,.75,.75,.75,.70,.70,.70,.70,.25,.25,.25,.25))
d$Frequency <- factor(d$Frequency, levels = unique(d$Frequency))
d$Car_usage_weighted <- d$Car_usage * d$Weight
ggplot(d, aes(x=Frequency,y=Car_usage_weighted,fill=Party_preference)) +
geom_col() + theme_minimal() +
scale_fill_manual(values=c("darkgrey","darkgreen")) +
xlab("Frequency of car usage") + ylab("Share in %") +
ggtitle("Frequency of car usage by party preference",subtitle = "") +
facet_wrap(~Party_preference)
ggplot(d, aes(x=Frequency,y=Car_usage,fill=Party_preference)) +
geom_col() + theme_minimal() +
scale_fill_manual(values=c("darkgrey","darkgreen")) +
xlab("Frequency of car usage") + ylab("Share in %") +
ggtitle("Frequency of car usage by party preference and place of residence",subtitle = "") +
facet_wrap(~Party_preference+Place_of_residence)
In this example, it is clear that Conservative voters more often use their car “daily”, whereas Green voters are more likely to rarely or “never” use a car. This is supported by a clear correlation in the observational data. However, does it also reflect a causal relationship? For all we know, voters of Green parties may be mostly urban, well-educated people, whereas Conservative voters may mostly live in rural areas. For various reasons, then, we can assume that place of residence (urban/rural) affects voting preferences. On the other hand, living in an urban agglomeration is also associated with better access to public transport, meaning it is more likely that you can get to work by bus, tram or even bike, as opposed to living in a rural area with larger commuting distances and fewer bus or train connections between places. So we could argue that place of residence is a confounding factor that needs to be controlled for.
We can see from the second plot that the survey results are much less clear about a causal effect from party preference on car usage when controlling for place of residence. By “controlling for”, we here simply divide our total survey respondents into groups of rural vs. urban dwellers and carry out the analysis separately within these groups. As the results show, Green voters who live in the countryside are similar to rural Conservative voters in terms of their car usage in this fictional example. Conversely, urban Conservatives also less frequently use their car compared with rural Conservatives. So here we have a spurious correlation that is actually due to a compositional effect: because the city population hosts proportionally more Greens than Conservatives, the Greens are overall less likely to use their cars on a daily basis. Controlling for place of residence, this effect almost completely disappears. Does that mean we can be absolutely sure that there is no direct causal effect here? In general, I would always refrain from using causal language in observational studies. You need to convince your audience that you have considered all important confounding variables, and that they are in fact confounding variables, apart from addressing the usual statistical issues (is your sample size large enough, etc.), and then you could come to the “tentative” conclusion that there does not “appear to be” a direct effect here.
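The same point can be made numerically. Here is a small sketch of my own using the data frame d from above: the pooled comparison weights each party’s city/countryside shares by where its voters live, while the stratified comparison simply looks within each place of residence.
# Pooled comparison: weight each party's city/countryside "Daily" shares by where its voters live
d %>%
  filter(Frequency == "Daily") %>%
  group_by(Party_preference) %>%
  summarise(daily_share_pooled = sum(Car_usage * Weight))
# -> roughly 47% of Conservatives vs. 27% of Greens use the car daily

# Stratified comparison: within each place of residence, the two parties look much more alike
d %>%
  filter(Frequency == "Daily") %>%
  select(Place_of_residence, Party_preference, Car_usage)
# -> City: 22% vs. 18%; Countryside: 58% vs. 52%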
To conclude, it is important to think about whether there might be factors that affect both the exposure X and the outcome Y. If such confounders exist, you need to adjust for them; otherwise the estimated association between X and Y will be biased.
Colliders
Let’s get back to our example of smoking and Covid infections in hospital patients. The researcher wants to know whether smoking affects the probability of catching Covid. However, the data are from hospital patients, so they would also like to control for the factor hospitalisation (yes/no). Now, we know that Covid can get you into the hospital, but we also know that heavy smoking causes diseases that require inpatient treatment independently of Covid. Therefore, in a DAG representation of the analysis problem, you could imagine arrows pointing from both smoking and a Covid infection towards hospitalisation: hospitalisation is a collider.
This means that, in a representative sample of the total population, symptomatic Covid infections might actually be more prevalent among smokers (or there might be no difference), but here in the hospital smokers are over-represented for other reasons, such as lung cancer, which deflates their share of Covid patients relative to the non-smokers, who less often suffer from such diseases and among whom Covid therefore accounts for a larger share of admissions. We can easily reproduce this with some simulated data. In the following code example, we give smokers and non-smokers in the general population exactly the same risk (0.5%) of being hospitalised due to Covid. Then we assign a higher risk of being hospitalised for other reasons to the smokers: 7% versus 5% for non-smokers.
set.seed(1234)
# 100,000 people, 20,000 of them smokers; everybody has the same 0.5% risk of a Covid hospitalisation
population <- data.frame(smoking = c(rep("smoker", 20000), rep("non_smoker", 80000)),
                         covid_hospitalisation = rbinom(100000, 1, .005))
# smokers have a higher risk (7% vs. 5%) of being hospitalised for other reasons
population$other_hospitalisation[population$smoking=="smoker"] <- rbinom(20000, 1, .07)
population$other_hospitalisation[population$smoking=="non_smoker"] <- rbinom(80000, 1, .05)
# a person is in hospital if hospitalised for either reason
population$in_hospital <- population$covid_hospitalisation | population$other_hospitalisation
Let’s check that in the general population, both groups have an equal proportion of Covid hospitalisation:
test_covid_pop <- t.test(covid_hospitalisation~smoking, population)
test_covid_pop
#
# Welch Two Sample t-test
#
# data: covid_hospitalisation by smoking
# t = 1.2437, df = 32434, p-value = 0.2136
# alternative hypothesis: true difference in means between group non_smoker and group smoker is not equal to 0
# 95 percent confidence interval:
# -0.0003887495 0.0017387495
# sample estimates:
# mean in group non_smoker mean in group smoker
# 0.005275 0.004600
We see that among both smokers and non-smokers, roughly 0.5% are hospitalised due to Covid, and the difference between the two groups is nowhere near statistically significant (p = 0.21). Now, let’s restrict the data to hospital patients only and run the same comparison:
hospital <- population[which(population$in_hospital),]
test_covid_hosp <- t.test(covid_hospitalisation~smoking, hospital)
test_covid_hosp
#
# Welch Two Sample t-test
#
# data: covid_hospitalisation by smoking
# t = 4.2995, df = 3073.5, p-value = 1.765e-05
# alternative hypothesis: true difference in means between group non_smoker and group smoker is not equal to 0
# 95 percent confidence interval:
# 0.01780052 0.04764717
# sample estimates:
# mean in group non_smoker mean in group smoker
# 0.09438604 0.06166220
Here we see that the non-smokers have a higher proportion of Covid patients (about 9.4% vs. 6.2% among the smokers), even though we simulated the data such that smoking has no influence whatsoever on the risk of a Covid hospitalisation. This is collider (or selection) bias at work.
Whenever the researcher has a variable that is affected by both X and Y, that variable is a collider. Adjusting for it, or selecting the sample based on it as we did with the hospital patients, does not remove bias but creates it. Colliders should therefore be left out of the analysis.
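To make the same point without filtering the sample, here is a small extra check of my own (not part of the original simulation): “adjusting” for the collider by adding in_hospital as a covariate in a regression on the full population induces the same spurious negative association between smoking and Covid hospitalisation.
# Unadjusted: no association between smoking and Covid hospitalisation in the population
lm_no_collider <- lm(covid_hospitalisation ~ smoking, data = population)
# "Adjusting" for the collider: a spurious negative coefficient for smoking appears
lm_collider    <- lm(covid_hospitalisation ~ smoking + in_hospital, data = population)
summary(lm_no_collider)$coefficients
summary(lm_collider)$coefficients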
Mediators
Let’s assume that households with children have, on average, a lower household income than childless couples. One of the obvious reasons for this correlation is that mothers or fathers often reduce their working hours to part-time (or take maternity/paternity leave) while their children are still young. Working less of course translates into a lower household income. Let’s look at the consequences in another (made-up) example dataset:
set.seed(1234)
# number of children under 6 per household (2,500 households in total)
x = c(rep(0,1000),rep(1,500),rep(2,600),rep(3,300),rep(4,100))
# m is kept in the data frame but not used in the analysis below
m = rexp(2500,.2) *(-1) + 40
# weekly working hours: close to full-time for childless households,
# on average lower (and more variable) the more young children there are
f = c(rexp(1000,.2) *(-1) + 40,rnorm(500,31,9),rnorm(600,30,10),rnorm(300,28,7),rnorm(100,25,5))
# monthly household income: roughly 100 Euro per weekly working hour
e = f* rnorm(2500,100,25)
d = data.frame(x,m,f,e)
ggplot(d, aes(x=factor(x),y=e)) + geom_boxplot() + theme_minimal() + ylim(0,8000) +
xlab("Number of children <6 year old") + ylab("Household income in Euro") +
ggtitle("Monthly income by number of children") +
geom_smooth(method='lm',aes(group=1))
In this dataset, there is a clear negative correlation between the number of children in a household and the monthly income. Let’s assume other factors such as the number of adults living in the household are held constant. Now, let’s factor in the working time, broadly distinguishing between full-time and part-time workers:
ggplot(d[d$f>=35,], aes(x=factor(x),y=e)) + geom_boxplot() + theme_minimal() + ylim(0,8000) +
xlab("Number of children <6 year old") + ylab("Household income in Euro") +
ggtitle("Monthly income by number of children",subtitle = "only full-time (35+ hours/week)") +
geom_smooth(method='lm',aes(group=1))
ggplot(d[d$f<=20,], aes(x=factor(x),y=e)) + geom_boxplot() + theme_minimal() + ylim(0,4000) +
xlab("Number of children <6 year old") + ylab("Household income in Euro") +
ggtitle("Monthly income by number of children",subtitle = "only part-time (<= 50%)") +
geom_smooth(method='lm',aes(group=1))
Two things are to be noted about these two graphs. First, if you look at the scales of the y-axis, it is obvious that part-time workers earn less than full-time workers. Second, the association between the number of children and income is negative in neither of the two groups; if anything, it is positive, such that persons with more children earn on average just as much (or even a bit more) as childless persons, provided they work the same number of hours. Now what did we find here? Controlling for the number of working hours, having small children in the household apparently does not lead to lower earnings (in this made-up example). Rather, our variable “working hours” fully explains why households with small children earn less than childless couples. Importantly, this is conceptually a different case from the confounder example above about car usage among Green party voters. We did not reveal a correlation to be spurious here; rather, we found the mechanism behind it, i.e. the reason why households with small children earn less.
Having children in the household affects your working hours, which in turn affect your household income. So “working hours” is one of the (or perhaps the only) causal pathways connecting the number of children (X) and household income (Y): it is a mediator.
By contrast, in the vaccination example, if we control for the confounders (age, etc.) and no longer find a direct effect, then we would say that there is in fact apparently no causal effect of X on Y at all; the correlation was spurious. With a mediator, the total effect of X on Y is real, and the mediator only explains how it comes about. Instead of splitting the data into full-time and part-time workers, we can also see this by comparing two regression models:
lm_mediators <- lm(e~x,d)
summary(lm_mediators)
#
# Call:
# lm(formula = e ~ x, data = d)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3201.7 -738.8 -46.2 704.9 5578.1
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3497.93 30.82 113.51 <2e-16 ***
# x -245.42 18.16 -13.52 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1090 on 2498 degrees of freedom
# Multiple R-squared: 0.06814, Adjusted R-squared: 0.06777
# F-statistic: 182.7 on 1 and 2498 DF, p-value: < 2.2e-16
We see that an additional child lowers your income on average by about 245 Euros, and this effect is statistically significant. If you add working hours to the equation,
lm_mediators2 <- lm(e~x + f,d)
summary(lm_mediators2)
#
# Call:
# lm(formula = e ~ x + f, data = d)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3212.1 -489.0 -10.3 499.8 2935.6
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 39.495 77.932 0.507 0.612
# x -3.571 14.299 -0.250 0.803
# f 99.995 2.156 46.369 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 798.8 on 2497 degrees of freedom
# Multiple R-squared: 0.4993, Adjusted R-squared: 0.4989
# F-statistic: 1245 on 2 and 2497 DF, p-value: < 2.2e-16
the effect will disappear. Showing both models with and without the mediator variable lets you (and your audience) quickly recognize that, first, having children is in general associated with lower income, and, second, the effect is caused by parents working fewer hours.
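As a further, optional check that is not part of the original analysis, the classic product-of-coefficients decomposition makes the same point numerically for this linear example: the total effect splits (exactly, for OLS) into the small direct effect and the large indirect effect that runs through working hours.
# effect of children on working hours (first leg of the mediated path)
a <- coef(lm(f ~ x, data = d))["x"]
# effect of working hours on income, holding the number of children constant (second leg)
b <- coef(lm_mediators2)["f"]

total    <- coef(lm_mediators)["x"]    # total effect of children on income
direct   <- coef(lm_mediators2)["x"]   # direct effect after adjusting for working hours
indirect <- a * b                      # indirect (mediated) effect

c(total = unname(total), direct = unname(direct), indirect = unname(indirect))
# for linear models estimated on the same data, total = direct + indirect holds exactly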
Moderators
A moderator is a variable that, if present, alters the effect of X on Y, typically by strengthening or weakening it in certain sub-groups.
Here are the key differences between a moderator variable and a mediator or confounder: a moderator has no causal connection to the exposure X. Unlike a mediator, it is not affected by X, and unlike a confounder, it does not affect X; it only changes the strength (or even the direction) of the effect of X on Y.
So by these rules, moderator variables do not have to be included when assessing the association between two variables. Leaving one out does not bias the results, as failing to adjust for a confounding variable would. You simply gain more insight into the mechanism generating the data if you consider the moderator: you realise that your data are heterogeneous and that the treatment works differently for some sub-groups than for others. This is important, e.g. when assessing the efficacy of a drug in treating an illness, where the effect may vary by factors such as pre-existing conditions, other medication, or genetic factors.
If you theoretically suspect that there might be heterogeneous treatment effects and you want to assess whether a moderator is present, you can do so, again, either by simply filtering the data (e.g. to males or females only) and checking whether you get different statistics in these sub-groups, or, if you have multiple variables and a more complex model, by including an interaction effect in a regression (e.g. gender × treatment), as in the sketch below.
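Here is a minimal sketch of the interaction approach with made-up data; the variable names, effect sizes and the choice of gender as the moderator are purely hypothetical.
set.seed(1234)
n <- 1000
sim <- data.frame(gender    = sample(c("female", "male"), n, replace = TRUE),
                  treatment = rbinom(n, 1, 0.5))
# the treatment raises the outcome by 2 points for women but only by 0.5 for men
sim$outcome <- 10 + sim$treatment * ifelse(sim$gender == "female", 2, 0.5) + rnorm(n)

# a significant treatment:gender interaction term indicates a moderator effect
summary(lm(outcome ~ treatment * gender, data = sim))$coefficients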
How do I “control”/”adjust” for variables?
Quick summary of what I have discussed so far:
- If you have data from a randomised experiment, you likely do not need to worry about confounders, colliders, etc.
- If you have observational data, by contrast, start your analysis by drawing a diagram of the hypothesised influences between your variables of interest (X and Y) as well as other potentially relevant variables.
- Variables that point to both X and Y are confounders that must be controlled for in the analysis. Failing to do so will result in biased statistical results.
- Mechanisms (X affects M, M affects Y) can be included if you want to shed light on what part of the effect of X on Y is mediated by M, but the total effect of X on Y is given by an analysis without M.
- Other types of variables are at best unnecessary, at worst they induce new bias (through collider variables). A quick DAG-based check of which variables to adjust for is sketched below.
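Incidentally, if you have encoded your assumptions in a DAG as at the beginning of this post, the dagitty package can derive the required adjustment set for you. For the graph g defined above, this should return only the confounder Con (a quick sketch, assuming g is still in your workspace):
library(dagitty)   # already loaded at the beginning
# minimal set of variables to adjust for to identify the total effect of X on Y
adjustmentSets(g, exposure = "X", outcome = "Y")
# expected output: { Con }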
So how do you “control for” or (synonyms) “adjust for” or “hold constant” confounding variables? Here are three widely employed strategies (only very briefly mentioned - this post is becoming too long!).
Partition data into sub-groups
This is the easiest approach and almost self-explanatory; see the graphs and code for the example of car usage among Green party voters above. If you want to make sure that a confounding variable is not biasing your results, you simply filter your data such that the confounding variable is constant within the sub-set, as in the snippet below. This option is viable if you have a small number of confounders (say, only one or two) and these variables have only a few distinct values.
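In code this is nothing more than sub-setting; for instance, with the car-usage survey data from the confounder section (there stored in a data frame called d, before that name was re-used later for the income data):
d[d$Place_of_residence == "City", ]          # urban respondents only
d[d$Place_of_residence == "Countryside", ]   # rural respondents only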
Multivariable regression
The second, and probably most common, strategy is to include the confounders as additional independent variables in a regression model, so that the coefficient of the exposure X is estimated while holding the confounders constant; a small simulated illustration follows below. There are of course many variants of and advancements over the standard linear model which can help you identify causal effects (or tentatively claim that you might have identified them) when you have special data structures. For instance, if you have longitudinal data on multiple individuals (or countries, etc.), also known as time-series cross-sectional or panel data, there are special regression models that can make use of both the temporal and the inter-individual variation. If you are less concerned with identifying the causal effect of X on Y and more with making good predictions, the exact causal role of each covariate matters less.
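As a minimal illustration of what holding the confounder constant means in a regression, here is a simulated sketch of my own, mirroring the Con -> X, Con -> Y structure of the DAG above (all numbers are made up):
set.seed(1234)
n   <- 5000
con <- rnorm(n)                 # confounder
x   <- 0.8 * con + rnorm(n)     # exposure, partly driven by the confounder
y   <- 1.5 * con + rnorm(n)     # outcome: affected by the confounder, but NOT by x

coef(lm(y ~ x))                 # biased: a spurious "effect" of x appears
coef(lm(y ~ x + con))           # adjusted: the coefficient of x is close to its true value of zero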
Propensity-score matching
Matching means that you do not use all your data; instead, for your selected group of “treatment” units, you try to find counterparts that are as similar as possible in all other respects except for the treatment. Note that matching is no perfect cure for multi-collinearity between the treatment and the confounders: you still need some degree of “overlap” of the distributions (i.e. not all poor countries should have high fertility and vice versa), otherwise you will not find any matching partners except for, again, the unusual outliers. A bare-bones illustration of the idea follows below.
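To give at least a flavour of the mechanics, here is a nearest-neighbour sketch of my own on simulated data; it is a toy example rather than a full workflow, and in practice you would typically use a dedicated package such as MatchIt.
set.seed(1234)
n   <- 2000
z   <- rnorm(n)                           # confounder
tr  <- rbinom(n, 1, plogis(z))            # treatment is more likely for high z
y   <- 2 * tr + 3 * z + rnorm(n)          # true treatment effect = 2
dat <- data.frame(y, tr, z)

# naive comparison is biased upwards because treated units tend to have higher z
mean(dat$y[dat$tr == 1]) - mean(dat$y[dat$tr == 0])

# estimate propensity scores and match each treated unit to the closest control (with replacement)
ps       <- predict(glm(tr ~ z, data = dat, family = binomial), type = "response")
treated  <- which(dat$tr == 1)
controls <- which(dat$tr == 0)
matched  <- sapply(treated, function(i) controls[which.min(abs(ps[controls] - ps[i]))])

# the matched comparison recovers (roughly) the true effect of 2
mean(dat$y[treated]) - mean(dat$y[matched])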
Conclusion
I hope this post was helpful to you and, most importantly, to myself in order to have a clearer idea of what the differences between the different types of covariates (confounders, colliders, mediators and moderators) are, and when you should, or should not, adjust for them.