notebooks/2-handling-categorical-data.Rmd

---
author: "Maciej Beręsewicz"
title: "Handling categorical data in R"
output: 
  html_notebook: 
    number_sections: yes
    toc: yes
---


# Packages

```{r}
library(tidyverse)
library(forcats)
library(haven)
library(reticulate)
use_python("/usr/local/anaconda3/bin/python")
```

# Settings

R setup 

```{r}
knitr::opts_chunk$set(engine.path = list(
  python = "/usr/local/anaconda3/bin/python",
  julia = "/Applications/Julia-1.3.app/Contents/Resources/julia/bin/"
))
```


# R


## factors / characters

We use `base::factor` function to create factor 

```{r}
data <- 1:10
data_f <- as.factor(data)
data_f
```

Using `base::levels` we get levels

```{r}
levels(data_f)
```

Now, create a factor with levels and labels defined

```{r}
f1 <- factor(x = c(1,2,3,1,2,3,-9), 
             levels = c(1,2,3,-9, 0),
             labels = c("Yes", "No", "Don't know", "N/A", "NULL"),
             ordered = FALSE)
f1
```

Note that `as.numeric` replaces original data!!! (no -9 values)

```{r}
as.numeric(f1)
```

Function `as.character` converts replaces numbers with labels.

```{r}
as.character(f1)
```

Using table

```{r}
table(f1)
```

To drop levels with 0 observations (not recommended for initial analysis) we can use `table(., exclude = "level")`

```{r}
table(f1, exclude = "NULL")
```

or `base::droplevels` function

```{r}
table(droplevels(f1))
```

or `xtabs(., drop.unused.levels = TRUE)`

```{r}
xtabs(~f1, drop.unused.levels = TRUE)
```

We can change the reference class (i.e. first level)  with `relevel` function.

```{r}
relevel(f1, ref = "No")
```

In R we can create an ordered factor (for ordered categorical data). 

```{r}
f2 <- factor(x = mtcars$cyl,
             levels = c(4, 6, 8),
             labels = c("Small", "Medium", "Big"),
             ordered = TRUE)

f2
```

Note `<` symbols in `levels`.


## other R packages

### haven

```{r}
x <- labelled(x = c(1, 2, 1, 2, 10, 9), labels = c(Unknown = 9, Refused = 10))
x
```

```{r}
as.numeric(x)
```

```{r}
as_factor(x)
```

### forcats

Main functions

+ `fct_reorder()`: Reordering a factor by another variable.
+ `fct_infreq()`: Reordering a factor by the frequency of values.
+ `fct_relevel()`: Changing the order of a factor by hand.
+ `fct_lump()`: Collapsing the least/most frequent values of a factor into “other”.

Other functions -- https://forcats.tidyverse.org/reference/index.html


```{r}
fct_inorder(f = f2, ordered = F) ## in orde of appearance
```

```{r}
fct_infreq(f = f2, ordered = F) ## by frequency
```

```{r}
fct_count(f2) ## frequency table (result: data.frame)
```

```{r}
fct_lump(f2, n = 1, other_level = "other levels") ## aggregates based frequency
```

```{r}
fct_collapse(f2,
             other  = c("Big", "Medium"))

```

About missing levels

```{r}
f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))
table(f1)

f2 <- fct_explicit_na(f1, na_level = "Missing")
table(f2)
```

### Other useful functions


```{r}
cut(1:10, breaks = c(0,5,10), right = T)
```

### Further reading

-- expss -- https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html

# Python

Categorical data can be handled by pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).

```{python}
import pandas as pd
import numpy as np
```

## Pandas and categorical data

Create dummy variable `pd.Series` where we define that it is of `category` type.

```{python}
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s
```

If we would like to convert character column to category one we need to use `astype` method. Below we do the following task:

1. create a pandas DataFrame with `pd.DataFrame`
2. Create new column and indicate its type `astype('category')` which is similar with R `as.factor`


```{python}
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype('category')
df
```

We can also create numeric variable and convert it to `category`


```{python}
df = pd.DataFrame({"A": np.arange(0,10,1)})
df["B"] = df["A"].astype('category')
df
```

Further, we may use `cut` function which works similarly as the `cut` function in R


```{python}
df["group"] = pd.cut(df.A, [0,5,10], right = True, include_lowest = True, labels = ['0-5','5-10'])
df
```

Finally, there is a function `pd.Categorical` which explicitly creates factor / categorical variable.

```{python}
new_col = pd.Categorical(values= np.repeat([1,2,3], 3), categories=[1,2,3], ordered=True)
new_col
```


# Julia

Julia contains CategoricalArrays package that enables working with factors

```{julia}
using DataFrames
using CategoricalArrays
using FreqTables
```

Let's create a vector of 1,2,3 repeated 3 times which then we can convert to Categorical Array using `CategoricalArray` function from the `CategoricalArrays` package 

```{julia}
x = repeat([9,7,3], outer=3)
simple_cat_array = CategoricalArray(x)
```

We may see levels using `levels` function
```{julia}
levels(simple_cat_array)
```

Further, we may change order of levels with new one using `levels!` which indicate mutating in place function

```{julia}
levels!(simple_cat_array, [9,7,3]);
levels(simple_cat_array)
```

Next, we can create DataFrame with categorical variables

```{julia}
df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
               B = ["X", "X", "X", "Y", "Y", "Y"])
```

```{julia}
categorical!(df, :A)
```


Other useful functions can be found http://juliadata.github.io/CategoricalArrays.jl/stable/using.html