library(tidyverse)
library(ivreg)
library(plm)
library(jtools)
library(sandwich)
library(lmtest)
library(fixest)
library(kableExtra)
library(ivDiag)
<- read_csv(here::here('posts',
lyme 'macdonald_2018_breakdown',
'data',
'MacDonaldetal_JAE_2018.csv'))
<- lyme %>% filter(is.na(DensityResIndex) == FALSE) lyme_na
When it comes to insect and invert borne zoonotic diseases, edge forest habitat is thought of in the literature to be high risk. But because the worlds forests are only getting more fragmented, the area of edge forest is increasing. Lyme disease particularly, has been found to be a major offender, as the land type lends itself well to strong tick vectors like deer.
This has been a strong motivator for studying Lyme disease incidence. If the wildlife-urban interface (WUI) is getting larger, it’s likely that more people will choose to live in those areas, and we may expect the rate of Lyme disease to rise. It’s for this reason that habitat fragmentation has logically been used as an indicator for incidence. But studies on this topic, while generally supporting this idea, have been inconsistent.
MacDonald et al. (2018), suggests that the reason for this inconsistency could be an endogeneity bias, in which as much as habitat fragmentation influences Lyme prevalence, it also works the other way around, in which Lyme incidence influences where people choose to live, which ironically can lead to more fragmentation. This violates a very important assumption that makes most statistical models work, in which the predictor variables must not be correlated with the error term. Because this is the method used in most other studies on this topic, this suggest some bias in the literature.
They therefore try a different approach, and instead use an instrumental variable (IV) model. An IV is a variable that correlates with the predictor variable in a model. We refer to this as endogeneity, and the variable as our endogenous variable. This method can be used to predict Lyme disease by proximity, as habitat fragmentation on it’s own is unreliable.
MacDonald et al. (2018) utilizes this concept by including a density restriction index, which indicates lot size regulation in an area. The thing about the instrumental variable approach to modeling is that it’s a 2 stage process where first, the instrument is regressed on our habitat fragmentation (our endogenous variable), and second, the the endogenous variable is regressed on the dependent variable (Lyme disease incidence). This is a very robust way performing this analysis.
Using this strategy, the paper attempts to model Lyme disease incidence in the WUI of the US northeastern deciduous forests. Along, with and IV approach, it uses a fixed effects approach and checks for robustness with several F-tests.
Workflow
Variable names
Variable | Role in Model | Description |
---|---|---|
DensityResIndex | IV | Density restriction index, indicates the average lot size regulation for a given observation |
PercWuiPop | Endogenous Variable | Percent of county population residing in WUI |
LogLyme0103_1113 | Dependent Variable | Lyme disease incidence |
MeanPatchArea | Exogenous Covariate | Average forest patch area by county |
EdgeDensity | Exogenous Covariate | Forest edge density by county |
PercLandForest | Exogenous Covariate | % forested area in county |
StateYrGroup | Fixed Variable | Identifier for each observation based on state and year |
AgStatDistrict | Clustering Variable (for CMSE) | USDA agricultural statistics districts |
First and second stage IV regression
A general overview of the methods employed in the paper…
The instrument is first regressed on the endogenous variable
<- lm(PercWuiPop ~ DensityResIndex,
stage_1 data = lyme_na)
Then the first stage fitted values are regressed against the dependant variable.
<- lm(LogLyme0103_1113 ~ PercWuiPop,
stage_2 data = lyme_na)
Both stages include covariates, although the strength of them will be tested later on.
<- lm(PercWuiPop ~ DensityResIndex +
stage_1 + EdgeDensity + PercLandForest,
MeanPatchArea data = lyme_na)
<- lm(LogLyme0103_1113 ~ PercWuiPop +
stage_2 + EdgeDensity + PercLandForest,
MeanPatchArea data = lyme_na)
The ivreg
package can run both the stage 1 and stage 2 regressions intertwined in a single function, making this process easy.
<- ivreg(LogLyme0103_1113 ~ MeanPatchArea + EdgeDensity + PercLandForest | # covariates
lyme_iv | # endogenous variable
PercWuiPop # instrumental variable
DensityResIndex, data = lyme_na)
summary(lyme_iv)
Call:
ivreg(formula = LogLyme0103_1113 ~ MeanPatchArea + EdgeDensity +
PercLandForest | PercWuiPop | DensityResIndex, data = lyme_na)
Residuals:
Min 1Q Median 3Q Max
-4.5836 -0.9984 0.1557 1.0503 3.3776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.789588 0.186272 9.607 < 2e-16 ***
PercWuiPop 0.037167 0.008414 4.417 1.22e-05 ***
MeanPatchArea -0.042153 0.017051 -2.472 0.0138 *
EdgeDensity -0.002510 0.007850 -0.320 0.7493
PercLandForest 0.007982 0.018260 0.437 0.6622
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 509 62.97 1.35e-14 ***
Wu-Hausman 1 508 17.09 4.17e-05 ***
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.459 on 509 degrees of freedom
Multiple R-Squared: -0.005559, Adjusted R-squared: -0.01346
Wald test: 28.41 on 4 and 509 DF, p-value: < 2.2e-16
Fixed effects
The authors also ran the second stage as a fixed effects model. This approach is one where you group your data to account for unobserved differences between them. It does this by using a dummy variable (0 or 1) for each class for a group to indicate which data are from which class.
Here, the models were fixed by year or state & year.
<- lm(LogLyme0103_1113 ~ PercWuiPop + as_factor(StateYrGroup) - 1, data = lyme_na)
fixed_effects
summary(fixed_effects)
Call:
lm(formula = LogLyme0103_1113 ~ PercWuiPop + as_factor(StateYrGroup) -
1, data = lyme_na)
Residuals:
Min 1Q Median 3Q Max
-4.2041 -0.6909 -0.0239 0.6520 4.9930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PercWuiPop 0.017941 0.001989 9.022 < 2e-16 ***
as_factor(StateYrGroup)1 3.454231 0.420940 8.206 2.04e-15 ***
as_factor(StateYrGroup)2 2.879821 0.421244 6.836 2.43e-11 ***
as_factor(StateYrGroup)3 2.920626 0.648817 4.501 8.44e-06 ***
as_factor(StateYrGroup)4 3.983158 0.648824 6.139 1.72e-09 ***
as_factor(StateYrGroup)5 0.598705 0.340043 1.761 0.0789 .
as_factor(StateYrGroup)6 2.711908 0.342733 7.913 1.70e-14 ***
as_factor(StateYrGroup)7 2.458755 0.328560 7.483 3.39e-13 ***
as_factor(StateYrGroup)8 3.046376 0.328395 9.277 < 2e-16 ***
as_factor(StateYrGroup)9 2.471677 0.322881 7.655 1.04e-13 ***
as_factor(StateYrGroup)10 3.369002 0.321252 10.487 < 2e-16 ***
as_factor(StateYrGroup)11 1.505099 0.178967 8.410 4.53e-16 ***
as_factor(StateYrGroup)12 2.742846 0.179086 15.316 < 2e-16 ***
as_factor(StateYrGroup)13 0.968439 0.455144 2.128 0.0339 *
as_factor(StateYrGroup)14 3.004004 0.456960 6.574 1.26e-10 ***
as_factor(StateYrGroup)15 2.364315 0.267596 8.835 < 2e-16 ***
as_factor(StateYrGroup)16 3.100802 0.266760 11.624 < 2e-16 ***
as_factor(StateYrGroup)17 0.980690 0.188115 5.213 2.74e-07 ***
as_factor(StateYrGroup)18 2.117406 0.187057 11.320 < 2e-16 ***
as_factor(StateYrGroup)19 1.807620 0.198379 9.112 < 2e-16 ***
as_factor(StateYrGroup)20 2.624127 0.197436 13.291 < 2e-16 ***
as_factor(StateYrGroup)21 0.442376 0.381141 1.161 0.2463
as_factor(StateYrGroup)22 2.936382 0.383114 7.665 9.74e-14 ***
as_factor(StateYrGroup)23 1.525581 0.185235 8.236 1.64e-15 ***
as_factor(StateYrGroup)24 3.005449 0.185549 16.198 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.123 on 489 degrees of freedom
Multiple R-squared: 0.8878, Adjusted R-squared: 0.8821
F-statistic: 154.8 on 25 and 489 DF, p-value: < 2.2e-16
The paper states that the fixed-effects had very little to know effect on the estimate, only changing it by +-0.002.
F-test and Robustness Checks
In the most general sense, a F-test looks for statistical significance by comparing variances. But in a modeling context, a F-test can be used to estimate how much the addition of 1 or more variables improves the model. We do this by comparing our restricted and unrestricted models.
The general equation:
\[ F = \frac{\text{SSE}_R - \text{SSE}_{UR} / e}{\text{SSE}_{UR} / (n - k)} \]
Where:
- \(\text{SSE}_{UR}\) is the full model with every variable.
- \(\text{SSE}_{R}\) is a model with one or more variables excluded. The excluded variables are being tested for their strength.
- \(\text e\) is the restrictions or the total number of omitted variables
- \(\text k\) is the total number of variables in the regression, including the intercept.
- \(\text n\) is the sample size.
They performed this test on the first stage of the IV approach. To accomplish this manually in R…
# get unrestricted model
<- lm(PercWuiPop ~ DensityResIndex
ur_model + MeanPatchArea + EdgeDensity + PercLandForest
data = lyme_na)
,
# sse unrestricted
<- sum((fitted(ur_model) - lyme_na$PercWuiPop)^2)
sse_ur
# restricted models
<- lm(PercWuiPop ~ MeanPatchArea + EdgeDensity + PercLandForest
r_model data = lyme_na)
,
# sse restricted
<- sum((fitted(r_model) - lyme_na$PercWuiPop)^2)
sse_r
# non-robust f test
<- ((sse_r - sse_ur) / 1) /
F_std / (nrow(lyme_na) - 5))
(sse_ur
cat("F value: ", F_std, "\n")
F value: 62.96987
This can also be accomplished with an ANOVA…
<- anova(r_model, ur_model)
f_stat cat("F value with ANOVA: ", f_stat$F[2])
F value with ANOVA: 62.96987
A F-value above 10 indicates a strong instrument, and this F is significantly above 10.
You would then obtain a critical value.
- \(\text ResDF_{UR}\) is derived from sample size of dataset - total num of variables in model
- \(\text ResDF_{R}\) is derived from the previously calculated Residual Df, minus the number of restrictions.
- \(\text {Df}_1\) is therefore derived from ResDF_UR - variables in model.
- \(\text {Df}_2\) is k - \(\text {Df}_1\) - 1
We will just use the qf()
function to calculate it.
With a df2 this large, it would be approximately 2.20. But we can also calculate it with qf()
:
<- summary(ur_model)$coefficients %>% nrow()
df1
cat("F distribution critical value: ", qf(0.95, f_stat$Df[2], df1 - f_stat$Df[2] - 1))
F distribution critical value: 10.12796
The problem with doing the F-test this way, is that it’s not robust. It assumes homoscedasticity and i.i.d, which because the data has both a time and spatial component, likely isn’t true. To account for this, the paper also performs a more robust version of the F-test called the effective F, from Olea & Pflueger (2013). This test balances the instrumental variables and endogenous variables by a weighted covariance matrix. The matrix accounts for unequal variance and clustering. Although clustering was already accounted for as the error term for these models are clustered mean squared errors (CMSE).
\(F_{\text{eff}} = \frac{(Y'X) \hat{W}^{-1}_2 (Z'X)}{K \cdot n}\)
\(\text Z\) is an instrumental variable vector. \(\text X\) is an endogenous variable vector. \(\text Z', X'\) are transposes of the prior variables (as in row to column, or vise versa) \(\hat{W}^{-1}_2\) is the covariance matrix \(\text K\) is number of instruments \(\text n\) is sample size
The crux of effective F is the \(\tau\) parameter. A higher tau means lower critical values and less strict criteria. For lower \(\tau\)s, the opposite is true. A generally accepted \(\tau\) is 10%.
This can be done simply with eff_F()
from the ivDiag
package:
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", controls = c("PercLandForest", "EdgeDensity", "MeanPatchArea"), cl = "AgStatDistrict")
effective_f
cat("Effective F Stat:", effective_f)
Effective F Stat: 11.4094
Clustered Mean Squared Errors
I had mentioned earlier in this blog that clustering should not be a problem when doing the non-robust F-test. This is because this study uses clustered mean squared errors (CMSE) as it’s error term. We can calculate them with vcovCL()
from the sandwich
package, and get a summary/hypothesis test with the coeftest()
from the lmtest
package.
You funnel in a model and a clustering variable within the dataset:
# Calculate clustered standard errors
<- vcovCL(ur_model, cluster = ~ AgStatDistrict)
clustered_se # Hypothesis testing with clustered standard errors
coeftest(ur_model, vcov = clustered_se)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.42576 3.44729 -1.2838 0.1997827
DensityResIndex 23.08206 6.83351 3.3778 0.0007867 ***
MeanPatchArea -0.14486 0.37884 -0.3824 0.7023420
EdgeDensity 0.35138 0.18181 1.9326 0.0538361 .
PercLandForest 0.80993 0.45888 1.7650 0.0781565 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results
This study created a variety of different models and tested different combinations of covariates. The models used were:
- OLS
- Year Fixed Effects
- State * Year Fixed Effects
- IV
- IV with Year Fixed Effects
And the different combinations of covariates were: * None (Bivariate model) * % Forest & Edge Density * % Forest & Mean Patch Size * Mean Patch Size & Edge Density * All covariates
Here, we run all models…
Code
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## OLS ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Bivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- lm(LogLyme0103_1113 ~ PercWuiPop, data = lyme)
ols_biv <- ols_biv %>%
ols_biv_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(ols_biv, vcov = .)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Multivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# %forest + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + EdgeDensity + PercLandForest, data = lyme)
ols_multi_1 <- ols_multi_1 %>%
ols_multi_1_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(ols_multi_1, vcov = .)
# %Forest + patch area
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + PercLandForest, data = lyme)
ols_multi_2 <- ols_multi_2 %>%
ols_multi_2_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(ols_multi_2, vcov = .)
# mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity, data = lyme)
ols_multi_3 <- ols_multi_3 %>%
ols_multi_3_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(ols_multi_3, vcov = .)
# %Forest + mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity + PercLandForest, data = lyme)
ols_multi_4 <- ols_multi_4 %>%
ols_multi_4_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(ols_multi_4, vcov = .)
# Combine all models into list ~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- list(ols_biv = ols_biv_clust[2,1:2],
ols_models ols_multi_1 = ols_multi_1_clust[2:nrow(ols_multi_1_clust),1:2],
ols_multi_2 = ols_multi_2_clust[2:nrow(ols_multi_2_clust),1:2],
ols_multi_3 = ols_multi_3_clust[2:nrow(ols_multi_3_clust),1:2],
ols_multi_4 = ols_multi_4_clust[2:nrow(ols_multi_4_clust),1:2])
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## YEAR FE ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Bivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- lm(LogLyme0103_1113 ~ PercWuiPop + as_factor(Year) - 1, data = lyme)
yearfe_biv <- yearfe_biv %>%
yearfe_biv_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(yearfe_biv, vcov = .)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Multivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# %forest + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + EdgeDensity + PercLandForest + as_factor(Year) - 1, data = lyme)
yearfe_multi_1 <- yearfe_multi_1 %>%
yearfe_multi_1_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(yearfe_multi_1, vcov = .)
# %Forest + patch area
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + PercLandForest + as_factor(Year) - 1, data = lyme)
yearfe_multi_2 <- yearfe_multi_2 %>%
yearfe_multi_2_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(yearfe_multi_2, vcov = .)
# mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity + as_factor(Year) - 1, data = lyme)
yearfe_multi_3 <- yearfe_multi_3 %>%
yearfe_multi_3_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(yearfe_multi_3, vcov = .)
# %Forest + mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity + PercLandForest + as_factor(Year) - 1, data = lyme)
yearfe_multi_4 <- yearfe_multi_4 %>%
yearfe_multi_4_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(yearfe_multi_4, vcov = .)
# Combine all models into list ~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- list(yearfe_biv = yearfe_biv_clust[1, c("Estimate", "Std. Error")],
year_fe_models yearfe_multi_1 = yearfe_multi_1_clust[1:3, c("Estimate", "Std. Error")],
yearfe_multi_2 = yearfe_multi_2_clust[1:3, c("Estimate", "Std. Error")],
yearfe_multi_3 = yearfe_multi_3_clust[1:3, c("Estimate", "Std. Error")],
yearfe_multi_4 = yearfe_multi_4_clust[1:4, c("Estimate", "Std. Error")])
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## STATE + YEAR FE ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Bivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- lm(LogLyme0103_1113 ~ PercWuiPop + as_factor(StateYrGroup) - 1, data = lyme)
stateyearfe_biv <- stateyearfe_biv %>%
stateyearfe_biv_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(stateyearfe_biv, vcov = .)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Multivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# %forest + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + EdgeDensity + PercLandForest + as_factor(StateYrGroup) - 1, data = lyme)
stateyearfe_multi_1 <- stateyearfe_multi_1 %>%
stateyearfe_multi_1_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(stateyearfe_multi_1, vcov = .)
# %Forest + patch area
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + PercLandForest + as_factor(StateYrGroup) - 1, data = lyme)
stateyearfe_multi_2 <- stateyearfe_multi_2 %>%
stateyearfe_multi_2_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(stateyearfe_multi_2, vcov = .)
# mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity + as_factor(StateYrGroup) - 1, data = lyme)
stateyearfe_multi_3 <- stateyearfe_multi_3 %>%
stateyearfe_multi_3_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(stateyearfe_multi_3, vcov = .)
# %Forest + mean patch + edge
<- lm(LogLyme0103_1113 ~ PercWuiPop + MeanPatchArea + EdgeDensity + PercLandForest + as_factor(StateYrGroup) - 1, data = lyme)
stateyearfe_multi_4 <- stateyearfe_multi_4 %>%
stateyearfe_multi_4_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(stateyearfe_multi_4, vcov = .)
# Combine all models into list ~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- list(stateyearfe_biv = stateyearfe_biv_clust[1, c("Estimate", "Std. Error")],
stateyearfe_models stateyearfe_multi_1 = stateyearfe_multi_1_clust[1:3, 1:2],
stateyearfe_multi_2 = stateyearfe_multi_2_clust[1:3, 1:2],
stateyearfe_multi_3 = stateyearfe_multi_3_clust[1:3, 1:2],
stateyearfe_multi_4 = stateyearfe_multi_4_clust[1:4, 1:2])
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## IV ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Bivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- ivreg(LogLyme0103_1113 ~ PercWuiPop |
iv_biv
DensityResIndex,data = lyme_na)
<- iv_biv %>%
iv_biv_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_biv, vcov = .)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Multivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# %forest + edge
<- ivreg(LogLyme0103_1113 ~ EdgeDensity + PercLandForest |
iv_multi_1 |
PercWuiPop
DensityResIndex, data = lyme_na)
<- iv_multi_1 %>%
iv_multi_1_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_multi_1, vcov = .)
# %Forest + patch area
<- ivreg(LogLyme0103_1113 ~ MeanPatchArea + PercLandForest |
iv_multi_2 |
PercWuiPop
DensityResIndex, data = lyme_na)
<- iv_multi_2 %>%
iv_multi_2_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_multi_2, vcov = .)
# mean patch + edge
<- ivreg(LogLyme0103_1113 ~ MeanPatchArea + EdgeDensity |
iv_multi_3 |
PercWuiPop
DensityResIndex, data = lyme_na)
<- iv_multi_3 %>%
iv_multi_3_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_multi_3, vcov = .)
# %Forest + mean patch + edge
<- ivreg(LogLyme0103_1113 ~ MeanPatchArea + EdgeDensity + PercLandForest |
iv_multi_4 |
PercWuiPop
DensityResIndex, data = lyme_na)
<- iv_multi_4 %>%
iv_multi_4_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_multi_4, vcov = .)
# Combine all models into list ~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- list(iv_biv = iv_biv_clust[1, c("Estimate", "Std. Error")],
iv_models iv_multi_1 = iv_multi_1_clust[2:4, 1:2],
iv_multi_2 = iv_multi_2_clust[2:4, 1:2],
iv_multi_3 = iv_multi_3_clust[2:4, 1:2],
iv_multi_4 = iv_multi_4_clust[2:5, 1:2])
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## IV + YEAR FE ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Bivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- ivreg(LogLyme0103_1113 ~ PercWuiPop + as_factor(Year) - 1 |
iv_yearfe_biv + as_factor(Year) - 1,
DensityResIndex data = lyme_na
)<- iv_yearfe_biv %>%
iv_yearfe_biv_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_yearfe_biv, vcov = .)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Multivariate models ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# %forest + edge
<- ivreg(
iv_yearfe_multi_1 ~ PercWuiPop + PercLandForest + EdgeDensity + as_factor(Year) - 1 |
LogLyme0103_1113 + PercLandForest + EdgeDensity + as_factor(Year) - 1,
DensityResIndex data = lyme_na
)<- iv_yearfe_multi_1 %>%
iv_yearfe_multi_1_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_yearfe_multi_1, vcov = .)
# %Forest + patch area
<- ivreg(
iv_yearfe_multi_2 ~ PercWuiPop + PercLandForest + MeanPatchArea + as_factor(Year) - 1 |
LogLyme0103_1113 + PercLandForest + MeanPatchArea + as_factor(Year) - 1,
DensityResIndex data = lyme_na
)<- iv_yearfe_multi_2 %>%
iv_yearfe_multi_2_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_yearfe_multi_2, vcov = .)
# mean patch + edge
<- ivreg(
iv_yearfe_multi_3 ~ PercWuiPop + EdgeDensity + MeanPatchArea + as_factor(Year) - 1 |
LogLyme0103_1113 + EdgeDensity + MeanPatchArea + as_factor(Year) - 1 ,
DensityResIndex data = lyme_na
)<- iv_yearfe_multi_3 %>%
iv_yearfe_multi_3_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_yearfe_multi_3, vcov = .)
# %Forest + mean patch + edge
<- ivreg(
iv_yearfe_multi_4 ~ PercWuiPop + PercLandForest + EdgeDensity + MeanPatchArea + as_factor(Year) - 1|
LogLyme0103_1113 + PercLandForest + EdgeDensity + MeanPatchArea + as_factor(Year) - 1,
DensityResIndex data = lyme_na
)<- iv_yearfe_multi_4 %>%
iv_yearfe_multi_4_clust vcovCL(cluster = ~ AgStatDistrict) %>%
coeftest(iv_yearfe_multi_4, vcov = .)
# Combine all models into list ~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- list(iv_yearfe_biv = iv_yearfe_biv_clust[1, c("Estimate", "Std. Error")],
iv_yearfe_models iv_yearfe_multi_1 = iv_yearfe_multi_1_clust[1:3, 1:2],
iv_yearfe_multi_2 = iv_yearfe_multi_2_clust[1:3, 1:2],
iv_yearfe_multi_3 = iv_yearfe_multi_3_clust[1:3, 1:2],
iv_yearfe_multi_4 = iv_yearfe_multi_4_clust[1:4, 1:2])
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~~
## FUNCTION TO CREATE MODELS ----
## ~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- function(models, type_label){
make_table
= data_frame()
table_output
for(i in 1:length(models)){
if(length(models[[i]]) < 3){
<- table_output %>%
table_output bind_rows(models[[i]]) %>%
mutate(model = type_label,
type = "bivariate",
covariate = "PercWuiPop",
num = 0)
}else{
<- table_output %>%
table_output bind_rows(
%>% as_tibble()
models[[i]] %>%
mutate(model = type_label,
type = "multivariate",
covariate = names(models[[i]][,1]),
num = i - 1
)
)
}
}
return(table_output)
}
<- make_table(ols_models, "OLS") %>%
ols_table select(-model) %>%
rename(Std_Error = "Std. Error") %>%
mutate(Estimate = Estimate %>% signif(2),
Std_Error = Std_Error %>% signif(2))
<- make_table(year_fe_models, "Year FE") %>%
year_fe_table select(-model) %>%
rename(Std_Error = "Std. Error") %>%
mutate(Estimate = Estimate %>% signif(2),
Std_Error = Std_Error %>% signif(2))
<- make_table(stateyearfe_models, "State + Year FE") %>%
stateyearfe_table select(-model) %>%
rename(Std_Error = "Std. Error") %>%
mutate(Estimate = Estimate %>% signif(2),
Std_Error = Std_Error %>% signif(2))
<- make_table(iv_models, "IV") %>%
iv_table select(-model) %>%
rename(Std_Error = "Std. Error") %>%
mutate(Estimate = Estimate %>% signif(2),
Std_Error = Std_Error %>% signif(2))
<- make_table(iv_yearfe_models, "IV + Year FE") %>%
iv_yearfe_table select(-model) %>%
rename(Std_Error = "Std. Error") %>%
mutate(Estimate = Estimate %>% signif(2),
Std_Error = Std_Error %>% signif(2))
Table creation for every model
# A tibble: 14 × 5
Estimate Std_Error type covariate num
<dbl> <dbl> <chr> <chr> <dbl>
1 0.017 0.0034 bivariate PercWuiPop 0
2 0.0051 0.0038 multivariate PercWuiPop 1
3 0.03 0.0061 multivariate EdgeDensity 1
4 -0.013 0.009 multivariate PercLandForest 1
5 0.0066 0.0037 multivariate PercWuiPop 2
6 -0.059 0.013 multivariate MeanPatchArea 2
7 0.057 0.012 multivariate PercLandForest 2
8 0.0052 0.0037 multivariate PercWuiPop 3
9 -0.014 0.0077 multivariate MeanPatchArea 3
10 0.026 0.0048 multivariate EdgeDensity 3
11 0.005 0.0038 multivariate PercWuiPop 4
12 -0.022 0.02 multivariate MeanPatchArea 4
13 0.023 0.0094 multivariate EdgeDensity 4
14 0.0095 0.023 multivariate PercLandForest 4
# A tibble: 14 × 5
Estimate Std_Error type covariate num
<dbl> <dbl> <chr> <chr> <dbl>
1 0.016 0.0035 bivariate PercWuiPop 0
2 0.0046 0.0038 multivariate PercWuiPop 1
3 0.03 0.0061 multivariate EdgeDensity 1
4 -0.012 0.0091 multivariate PercLandForest 1
5 0.0061 0.0038 multivariate PercWuiPop 2
6 -0.06 0.013 multivariate MeanPatchArea 2
7 0.058 0.012 multivariate PercLandForest 2
8 0.0047 0.0038 multivariate PercWuiPop 3
9 -0.014 0.0078 multivariate MeanPatchArea 3
10 0.027 0.0049 multivariate EdgeDensity 3
11 0.0045 0.0038 multivariate PercWuiPop 4
12 -0.022 0.019 multivariate MeanPatchArea 4
13 0.023 0.0095 multivariate EdgeDensity 4
14 0.01 0.022 multivariate PercLandForest 4
# A tibble: 14 × 5
Estimate Std_Error type covariate num
<dbl> <dbl> <chr> <chr> <dbl>
1 0.018 0.0038 bivariate PercWuiPop 0
2 0.0057 0.0038 multivariate PercWuiPop 1
3 0.029 0.0057 multivariate EdgeDensity 1
4 -0.0069 0.011 multivariate PercLandForest 1
5 0.0066 0.0038 multivariate PercWuiPop 2
6 -0.061 0.013 multivariate MeanPatchArea 2
7 0.064 0.014 multivariate PercLandForest 2
8 0.0064 0.0039 multivariate PercWuiPop 3
9 -0.011 0.0089 multivariate MeanPatchArea 3
10 0.027 0.0049 multivariate EdgeDensity 3
11 0.0055 0.0038 multivariate PercWuiPop 4
12 -0.028 0.019 multivariate MeanPatchArea 4
13 0.02 0.0086 multivariate EdgeDensity 4
14 0.021 0.023 multivariate PercLandForest 4
# A tibble: 14 × 5
Estimate Std_Error type covariate num
<dbl> <dbl> <chr> <chr> <dbl>
1 1.4 0.4 bivariate PercWuiPop 0
2 0.038 0.014 multivariate PercWuiPop 1
3 0.0093 0.0089 multivariate EdgeDensity 1
4 -0.031 0.014 multivariate PercLandForest 1
5 0.037 0.013 multivariate PercWuiPop 2
6 -0.038 0.019 multivariate MeanPatchArea 2
7 0.0031 0.026 multivariate PercLandForest 2
8 0.037 0.014 multivariate PercWuiPop 3
9 -0.035 0.012 multivariate MeanPatchArea 3
10 0.00038 0.011 multivariate EdgeDensity 3
11 0.037 0.014 multivariate PercWuiPop 4
12 -0.042 0.022 multivariate MeanPatchArea 4
13 -0.0025 0.011 multivariate EdgeDensity 4
14 0.008 0.027 multivariate PercLandForest 4
# A tibble: 14 × 5
Estimate Std_Error type covariate num
<dbl> <dbl> <chr> <chr> <dbl>
1 0.037 0.0087 bivariate PercWuiPop 0
2 0.038 0.014 multivariate PercWuiPop 1
3 -0.031 0.014 multivariate PercLandForest 1
4 0.0095 0.0089 multivariate EdgeDensity 1
5 0.037 0.013 multivariate PercWuiPop 2
6 0.0038 0.026 multivariate PercLandForest 2
7 -0.038 0.019 multivariate MeanPatchArea 2
8 0.037 0.014 multivariate PercWuiPop 3
9 0.0007 0.011 multivariate EdgeDensity 3
10 -0.035 0.012 multivariate MeanPatchArea 3
11 0.037 0.014 multivariate PercWuiPop 4
12 0.0084 0.027 multivariate PercLandForest 4
13 -0.0023 0.011 multivariate EdgeDensity 4
14 -0.042 0.022 multivariate MeanPatchArea 4
These tables contain all estimates and errors for every tested model.
A series of robust checks were also performed on the first stage for 5 different models containing the different combinations of covariates specified previously. These tests are the non-robust F test, the Effective F test, and the Specification test. Although here I will just perform the Effective F Weak Instruments Test.
Code
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~ Model 1: No Covariates ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", cl = "AgStatDistrict", FE = "as_factor(Year)")
no_covariates
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~ Model 2: % Forest + Edge Density ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", controls = c("PercLandForest", "EdgeDensity"), cl = "AgStatDistrict")
forest_edge
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~ Model 3: % Forest + Mean Patch Area ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", controls = c("PercLandForest", "MeanPatchArea"), cl = "AgStatDistrict")
forest_patch
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~ Model 4: Edge Density + Mean Patch Area ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", controls = c("EdgeDensity", "MeanPatchArea"), cl = "AgStatDistrict")
edge_patch
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## ~ Model 5: All Covariates ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<- eff_F(lyme_na, Y = "LogLyme0103_1113", D = "PercWuiPop", Z = "DensityResIndex", controls = c("PercLandForest", "EdgeDensity", "MeanPatchArea"), cl = "AgStatDistrict")
all_covariates
#....................Printing the F statistics...................
cat(" No covariates effective F stat: ", no_covariates, "**", "\n",
"% forest + edge density effective F stat: ", forest_edge, "\n",
"% forest + Mean Patch Area effective F stat: ", forest_patch, "\n",
"Edge density + mean match area effective F stat: ", edge_patch, "\n",
"All covariates effective F stat: ", all_covariates, "\n"
)
No covariates effective F stat: 23.6541 **
% forest + edge density effective F stat: 11.9416
% forest + Mean Patch Area effective F stat: 10.5339
Edge density + mean match area effective F stat: 12.8009
All covariates effective F stat: 11.4094
The only model that is significant to \(\alpha < 0.05\) is the no covariates model. The other models would be significant if we accept \(\tau < 0.30\), but that is very low precision.
The other robustness tests were not performed because I could not replicate their values, likely due to my own error. This being said, we can still check the F statistic of each of the models, as it is build into summary()
.
Code
<- lm(PercWuiPop ~ DensityResIndex + # instrumental variable
no_covariates_f as_factor(Year), # fixed effects
data = lyme_na) %>%
summary()
<- lm(PercWuiPop ~ DensityResIndex + # instrumental variable
forest_edge_f + EdgeDensity + # covariates
MeanPatchArea as_factor(Year), # fixed effects
data = lyme_na) %>%
summary()
<- lm(PercWuiPop ~ DensityResIndex + # instrumental variable
forest_patch_f + PercLandForest + # covariates
MeanPatchArea as_factor(Year), # fixed effects
data = lyme_na) %>%
summary()
<- lm(PercWuiPop ~ DensityResIndex + # instrumental variable
edge_patch_f + EdgeDensity + # covariates
MeanPatchArea as_factor(Year), # fixed effects
data = lyme_na) %>%
summary()
<- lm(PercWuiPop ~ DensityResIndex + # instrumental variable
all_covariates_f + EdgeDensity + PercLandForest + # covariates
MeanPatchArea as_factor(Year), # fixed effects
data = lyme_na) %>%
summary()
cat(" No covariates non-robust F stat: ", no_covariates_f$fstatistic[1], "**", "\n",
"% forest + edge density non-robust F stat: ", forest_edge_f$fstatistic[1], "\n",
"% forest + Mean Patch Area non-robust F stat: ", forest_patch_f$fstatistic[1], "\n",
"Edge density + mean match area non-robust F stat: ", edge_patch_f$fstatistic[1], "\n",
"All covariates non-robust F stat: ", all_covariates_f$fstatistic[1]
)
No covariates non-robust F stat: 37.18896 **
% forest + edge density non-robust F stat: 141.0745
% forest + Mean Patch Area non-robust F stat: 140.3725
Edge density + mean match area non-robust F stat: 141.0745
All covariates non-robust F stat: 116.9124
These F statistics are likely very overinflated, but are significantly above 10.
Conclusion
The study seems pretty rock solid in terms of validating the assumptions of the model, as CMSE accounts for clustering and biases, the effective F accounts for potential heteroscedasticity, and the fixed effects account for any autocorrelation issues (even though it didn’t have much of an effect). Furthermore, the IV approach potentially mitigated any endogeneity bias.
The only mark against the study is that the effective F was not significant when adding the covariates at a \(\tau\) of 10%. It was significant without them, but omitting them could lead of omitted variable bias.
This study possibly suggests that lot size regulations are a good instrument to measure Lyme incidence. At the very least, it suggests that this methodology could be more effective than previously used ones, as the endogeneity bias makes directly using habitat fragmentation as a predictor unreliable.
Citation
@online{paul_cohen2025,
author = {Paul Cohen, Joshua and Overbye, Amanda and Mull, Josh},
title = {MacDonald Et Al. (2018) {Breakdown}},
date = {2025-03-23},
url = {https://silkiemoth.github.io/},
langid = {en}
}