Skip to content

add warning for collinearity #411

@victorrssx

Description

@victorrssx

After running a panel model (5475 units, 13 years) with stats::lm(), I've been trying to estimate heteroskedasticity robust standard errors with estimatr::lm_robust(). It turns out that some coefficients estimated are (very) different in each function - which shouldn't happen (only se, t.test and p-values should change).

df <- readRDS(url("https://github.com/victorrssx/victorrssx/raw/main/df.RDS")) %>% 
      dplyr::mutate(ano = as.factor(ano))


lm_mod <- stats::lm(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df)
summary(lm_mod)

    Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value             Pr(>|t|)    
    (Intercept)         26.827      0.382   70.25 < 0.0000000000000002 ***
    pos2018              2.927      0.538    5.44     0.00000005226909 ***
    ano2011              0.220      0.540    0.41              0.68385    
    ano2012              1.182      0.536    2.20              0.02754 *  
    ano2013              0.892      0.537    1.66              0.09640 .  
    ano2014              1.960      0.534    3.67              0.00025 ***
    ano2015              2.985      0.533    5.60     0.00000002131005 ***
    ano2016              5.556      0.530   10.48 < 0.0000000000000002 ***
    ano2017              6.835      0.528   12.95 < 0.0000000000000002 ***
    ano2018              4.841      0.531    9.12 < 0.0000000000000002 ***
    ano2019             -1.633      0.520   -3.14              0.00169 ** 
    ano2020              0.280      0.518    0.54              0.58944    
    ano2021             -0.272      0.517   -0.53              0.59925    
    ano2022                 NA         NA      NA                   NA    
    pos2018:res_ou       0.862      0.643    1.34              0.18005    
    pos2018:ti           2.874      0.629    4.57     0.00000498294524 ***
    pos2018:res_ou:ti    9.824      1.365    7.20     0.00000000000063 ***
            
lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018 + ano, data = df, se_type = "HC1")
summary(lm_mod)

Standard error type:  HC1 

Coefficients:
                            Estimate             Std. Error     t value  Pr(>|t|)
(Intercept)                   26.930              10407.875  0.00258749 0.9979355
pos2018            2842202543499.796 324275566161917504.000  0.00000876 0.9999930
ano2011                        0.222                196.496  0.00112810 0.9990999
ano2012                        1.179                330.843  0.00356436 0.9971561
ano2013                        0.890                397.685  0.00223694 0.9982152
ano2014                        1.957                125.512  0.01558929 0.9875621
ano2015                        2.982                396.852  0.00751471 0.9940042
ano2016                        5.553                470.073  0.01181333 0.9905746
ano2017                        6.832                355.406  0.01922433 0.9846622
ano2018                        4.838                355.006  0.01362761 0.9891271
ano2019           -2842202543498.518 264769891052461504.000 -0.00001073 0.9999914
ano2020           -2842202543497.091 374441170828414656.000 -0.00000759 0.9999939
ano2021           -2842202543497.189 324275566161908160.000 -0.00000876 0.9999930
ano2022           -2842202543496.979 324275566161925760.000 -0.00000876 0.9999930
pos2018:res_ou                 0.863                  5.743  0.15017708 0.8806255
pos2018:ti                     2.874                  0.677  4.24755884 0.0000216
pos2018:res_ou:ti              9.852               2739.434  0.00359654 0.9971304

Specifically, the problem happens sharply on pos2018, ano2019, ano2020, ano2021 which are all dummies. The first is set to 1 for years after 2018 and 0 otherwise; the remainder are year-specific dummies. Even setting fixed_effects = ~ ano produce strange results.

lm_mod <- estimatr::lm_robust(tx_hom_tot ~ pos2018 + res_ou:pos2018 + ti:pos2018 + res_ou:ti:pos2018, data = df, se_type = "HC1", fixed_effects = ~ ano)
summary(lm_mod) 

Standard error type:  HC1 

Coefficients: (1 not defined because the design matrix is rank deficient)
                  Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
pos2018                 NA         NA      NA         NA       NA       NA    NA
pos2018:res_ou       0.862      0.707    1.22 0.22251985   -0.523     2.25 50693
pos2018:ti           2.874      0.640    4.49 0.00000709    1.620     4.13 50693
pos2018:res_ou:ti    9.824      2.252    4.36 0.00001292    5.409    14.24 50693

When using lmtest::coeftest(lm_mod, vcov. = plm::vcovHC(lm_mod, type = "HC1")) it seems to produce correct results. So why is this happening with lm_robust? Could it be due to the interpretation of the NAs in the data by the function?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions