notebooks/4-factors-regression-margins.Rmd

---
author: "Maciej Beręsewicz"
title: "Regression and marginal effects"
output: 
  html_notebook: 
    toc: yes
---

Setting for notebook

```{r}
library(reticulate)
use_python("/usr/local/anaconda3/bin/python")
```

```{r}
knitr::opts_chunk$set(engine.path = list(
  python = "/usr/local/anaconda3/bin/python",
  julia = "/Applications/Julia-1.3.app/Contents/Resources/julia/bin/"
))
```


# Regression in R

## Regression with categorical data

```{r}
install.packages("margins")
library(margins)
```

```{r}
mtcars
```

Let's assume that we would like to build a model where 

            mpg ~ gear 

and we assume that gear is categorical

$$
mpg = \beta_0 + \beta_1 \times gear
$$

```{r}
m1 <- lm(formula = mpg ~ gear, data = mtcars)
summary(m1)
```

```{r}
table(mtcars$gear)
```

To inform R that gear is categorical variable we need to use function factor 

```{r}
mtcars$gear_f <- factor(mtcars$gear)
mtcars
```

Now, we will include gear_f into the model

$$
mpg = \beta_0 + \beta_1 \times (gear = 4) + \beta_2 \times (gear = 5)
$$

```{r}
m2 <- lm(formula = mpg ~ gear_f, data = mtcars)
summary(m2)
```

```{r}
aggregate(mpg ~ gear, data = mtcars, FUN = mean)
```

How to change the contrast

```{r}
m3 <- lm(mpg ~ gear_f, data  = mtcars, contrast = list(gear_f = "contr.SAS"))
summary(m3)
```

In order to change the reference level we can use function relevel (reference level).

```{r}
m4 <- lm(formula = mpg ~ relevel(x = gear_f, ref = "4"), data = mtcars)
summary(m4)
```

How to check what is the reference level 

```{r}
levels(mtcars$gear_f)
```

```{r}
mtcars$gear_f2 <- relevel(mtcars$gear_f, ref = "4")
levels(mtcars$gear_f2)
```

Now, we will use contr.sum to compare to overall mean

```{r}
m5 <- lm(formula = mpg ~ gear_f, data = mtcars, contrast = list(gear_f = "contr.sum"))
summary(m5)
```

```{r}
result <- aggregate(mpg ~ gear, data = mtcars, FUN = mean)[, "mpg"]
result
mean(result)
```

## Regression -- marginal effects

Margial effects

x1*x2 = x1 + x2 + x1:x2

```{r}
m6_1 <- lm(mpg ~ wt + gear_f + am, data = mtcars)
m6_2 <- lm(mpg ~ wt + I(wt^2) + gear_f + am, data = mtcars)
m6_3 <- lm(mpg ~ wt + I(wt^2) + gear_f*am, data = mtcars)

summary(m6_2)
```

Calculate **Average marginal effects** for model m6_1

```{r}
margins(m6_1)
summary(margins(m6_2))
```

Average marginal effect for model 

$$
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3 X_1X_2
$$

$$
\frac{\partial Y}{\partial X_1} = \beta_1 + \beta_3 X_2
$$

$$
\frac{\partial Y_i}{\partial X_{i1}} = \beta_1 + \beta_3 X_{i2}
$$

and then we calculate Average Marginal Effects as follows

$$
AME = \frac{\sum_{i} \hat{\beta}_1+ \hat{\beta}_3X_{i2}}{n}
$$

Marginal effects at means (MEMs) as

$$
MEMs = \hat{\beta}_1 + \hat{\beta}_3 \bar{X}_{2}
$$

# Regression in Python

```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
```

Get the same data as in R

```{python}
df = sm.datasets.get_rdataset("mtcars").data
df.head()
```

Build the same model as in R i.e mpg ~ gear, data = mtcars

```{python}
model1 = smf.ols(formula='mpg ~ gear', data=df)
result1 = model1.fit()
print(result1.summary())
```

Now, we would like to treat gear as factor, so we can use `astype('category')`  to transform the data

```{python}
df["gear_f"]=df["gear"].astype('category')
model1 = smf.ols(formula='mpg ~ gear_f', data=df)
result1 = model1.fit()
print(result1.summary())
```

We see the following output

```
gear_f[T.4]     8.4267      1.823      4.621      0.000       4.697      12.156
gear_f[T.5]     5.2733      2.431      2.169      0.038       0.301      10.246
```

Which indicate, that python uses treatment contrasts. If we would like to do that manualy, we can use `C(X, treatment(reference=l))` notation that is based on patsy package https://www.statsmodels.org/devel/contrasts.html.

```{python}
model1 = smf.ols(formula='mpg ~ C(gear, Treatment(reference = 5))', data=df)
result1 = model1.fit()
print(result1.summary())
```

For more contrasts see https://www.statsmodels.org/devel/contrasts.html.


## Regression -- marginal effects

Let assume the following model mpg ~ wt + I(wt^2) + gear_f + am for which we would like to calculate marginal effects


```{python}
model3 = smf.ols(formula='mpg ~ wt + np.power(wt,2) + gear_f', data=df)
result3 = model3.fit()
print(result3.summary())
```

Unfortunately, there is no easy way to do so in statmodels. Statmodels offers `get_margeff` for GEE and discrete models (i.e. logistic regresso). There is no package like `margins`.

The way around is to calculat by ourselves the derevative with respect to variable that we are interested in. So, in the case of `wt`, the first derevative would be

$$
\frac{\partial \text{ mpg}}{\partial \text{ wt}} = -13.1277 + 1.1760*2*\text{wt}
$$
To do so, we need to get parameter estimates

```{python}
result3.params
```

```{python}
df["partial_wt"] = result3.params['wt'] + result3.params['np.power(wt, 2)']*2*df['wt']
df.head()
```

Then, we calculate AME so average

```{python}
np.mean(df.partial_wt)
```

Note that the result differ from the one obtained through `margins` package because we did not use numerical (symbolic) differentiation. To closely follow `margins` one should implement this approach with `symPy` module for symbolic differentiation (https://scipy-lectures.org/packages/sympy.html)

# Regression in Julia

TBA