README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# Overview

<!-- badges: start -->
[![R-CMD-check](https://github.com/ncn-foreigners/singleRcapture/workflows/R-CMD-check/badge.svg)](https://github.com/ncn-foreigners/singleRcapture/actions)
[![Codecov test coverage](https://codecov.io/gh/ncn-foreigners/singleRcapture/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ncn-foreigners/singleRcapture?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/singleRcapture)](https://CRAN.R-project.org/package=singleRcapture)
[![CRAN total downloads](https://cranlogs.r-pkg.org/badges/grand-total/singleRcapture)](https://cran.r-project.org/package=singleRcapture)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/singleRcapture)](https://cran.r-project.org/package=singleRcapture)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8436043.svg)](https://doi.org/10.5281/zenodo.8436043)
<!-- badges: end -->

Capture-recapture type experiments are used to estimate the total population size
in situations when observing only a part of such population is feasible. In recent
years these types of experiments have seen more interest.

Single source models are distinct from other capture-recapture models because
we cannot estimate the population size based on how many units were observed
in two or three sources which is the standard approach.

Instead in single source models we utilize count data regression models on positive
distributions (i.e. on counts greater than 0) where the dependent variable
is the number of times a particular unit was observed in source data.

This package aims to implement already existing and introduce new methods
of estimating population size from single source to simplify the research process.

Currently we've implemented most of the frequentist approaches used in literature such as:

- Zero truncated Poisson, geometric and negative binomial regression.
- Zero truncated one inflated and one inflated zero truncated Poisson and geometric models. (Negative binomial is currently in development.)
- Zero one truncated Poisson geometric and negative binomial models.
- Generalized Chao and Zelterman's models based on logistic regression.
- Three types of bootstrap parametric, semi-parametric and non parametric.
- And a wide range of additional functionalities associated with (vector) generalized linear models relevant to the topic.

## Installation

You can install the current version of singleRcapture from main branch [GitHub](https://github.com/ncn-foreigners/singleRcapture) with:

```{r, eval=FALSE}
# install.packages("devtools")
remotes::install_github("ncn-foreigners/singleRcapture")
```

or install the stable version from [CRAN](https://cran.r-project.org/package=singleRcapture) with:

```{r, eval=FALSE}
install.packages(singleRcapture)
```

### Examples

The main function of this package is `estimatePopsize` which fitts regression
on specified distribution and then uses fitted regression to estimate the
population size. 

Lets look at a model from 2003 publication :
Point and interval estimation of the population size using the truncated 
Poisson regression model Heijden, Peter GM van der et al.
The call to `estimatePopsize` will look very similar to anyone who used the 
`stats::glm` function:

```{r example}
library(singleRcapture)
model <- estimatePopsize(
  formula = capture ~ gender + age + nation, # specify formula
  data = netherlandsimmigrant,
  popVar = "analytic", # specify 
  model = "ztpoisson", # distribution used
  method = "IRLS", # fitting method one of three currently supported
  controlMethod = controlMethod(silent = TRUE) # ignore convergence at half step warning
)
summary(model) # a summary method for singleR class with standard glm-like output and population size estimation resutls
```

We implemented a method for `plot` function to visualise the model fit and other
useful diagnostic information. One of which is `rootogram`, a type of plot
that compares fitted and observed marginal frequencies:

```{r plot, fig.show="hold", out.width="75%"}
plot(model, plotType = "rootogram")
```

The possible values for `plotType` argument are:

- `qq` - the normal quantile-quantile plot for pearson residuals.
- `marginal` - a `matplot` comparing fitted and observed marginal frequencies.
- `fitresid` - plot of linear predictor values contrasted with pearson residuals.
- `bootHist` - histogram of bootstrap sample.
- `rootogram` - rootogram, example presented above.
- `dfpopContr` - contrasting two deletion effects to identify presence of influential observations.
- `dfpopBox` - boxplot of results from dfpopsize function see its documentation.
- `scaleLoc` - scale-location plot.
- `cooks` - plot of `cooks.values` for distributions for which it is defined.
- `hatplot` - plot of `hatvalues`.
- `strata` - plot of confidence intervals for selected such populations.

a user can also pass arguments to specify additional information such as plot 
title, subtitle etc. similar to calling `plot` on some data. For more info
check `plot.singleR` method documentation.

As we have seen there are some significant differences between fitted and
observed marginal frequencies. To check our intuition let's perform
goodness of fit test between fitted and observed marginal frequencies.

To do it we call a `summary` function of `marginalFreq` function which
computes marginal frequencies for the fitted `singleR` class object:

```{r}
summary(marginalFreq(model), df = 2, dropl5 = "group")
```

Finally let us check if we have any influential observations. We will do this
by comparing the deletion effect of every observation on population size estimate
by removing it entirely from the model (from population size estimate and regression)
and by only omitting it in pop size estimation (this is what is called the
contribution of an observation). If observation is not influential these two 
actions should have the approximately the same effect:

```{r, fig.show="hold", out.width="75%"}
plot(model, plotType = "dfpopContr")
```

it is easy to deduce from the plot above that we have influential observations in
our dataset (one in particular).  

Lastly `singleRcapture` offers some posthoc procedures for example a function 
`stratifyPopsize` that estimates sizes of user specified sub populations and
returns them in a `data.frame`:

```{r}
stratifyPopsize(model, alpha = c(.01, .02, .03, .05), # different significance level for each sub population
    strata = list(
    "Females from Surinam" = netherlandsimmigrant$gender == "female" & netherlandsimmigrant$nation == "Surinam",
    "Males from Turkey" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$nation == "Turkey",
    "Younger males" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$age == "<40yrs",
    "Older males" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$age == ">40yrs"
))
```

`strata` argument may be specified in various ways for example: 

```{r}
stratifyPopsize(model, strata = ~ gender / age)
```

The package was designed with convenience in mind, for example it is possible
to specify that weights provided on call are to be interpreted as number of 
occurrences of units in each row:

```{r}
df <- netherlandsimmigrant[, c(1:3,5)]
df$ww <- 0
### this is dplyr::count without dependencies
df <- aggregate(ww ~ ., df, FUN = length)
summary(estimatePopsize(
  formula = capture ~ nation + age + gender, 
  data = df, 
  model = ztpoisson, 
  weights = df$ww,
  controlModel = controlModel(weightsAsCounts = TRUE)
))
```

Methods such as regression diagnostics will be adjusted (values of weights will
be reduced instead of rows being removed etc.)

We also included option to use common non standardargument such as 
significance levels different from usual 5%:

```{r boot}
set.seed(123)
modelInflated <- estimatePopsize(
    formula = capture ~ gender + age,
    data = netherlandsimmigrant,
    model = "oiztgeom",
    method = "IRLS",
    # control parameters for population size estimation check documentation of controlPopVar
    controlPopVar = controlPopVar(
        alpha = .01, # significance level 
    )
)
summary(modelInflated)
```

and the option to estimate standard error of population size estimate by 
bootstrap, models with more than one distribution parameter being dependent 
on covariates and some non standard link functions for example:

```{r}
modelInflated2 <- estimatePopsize(
    formula = capture ~ age,
    data = netherlandsimmigrant,
    popVar = "bootstrap",
    model = oiztgeom(omegaLink = "cloglog"),
    method = "IRLS",
    controlPopVar = controlPopVar(
        B = 500,# number of boostrap samples
        alpha = .01, # significance level 
        # type of bootstrap see documentation for estimatePopsize
        bootType = "semiparametric",
        # control regression fitting on bootstrap samples
        bootstrapFitcontrol = controlMethod(
          epsilon = .Machine$double.eps, 
          silent = TRUE, 
          stepsize = 2
        )
    ),
    controlModel = controlModel(omegaFormula = ~ gender) # put covariates on omega i.e. the inflation parameter
)
popSizeEst(modelInflated2)
```

the results are significantly different
(the warning issued concerns the second derivative test for existence of local 
minimum, here it was inconclusive but we manually checked that fitting process
found the optimal regression coefficients it's here to provide more information
to the user):

```{r, fig.show="hold", out.width="75%"}
plot(modelInflated2, 
     plotType = "bootHist", 
     labels = TRUE, 
     ylim = c(0, 175),
     breaks = 15)
```

and information criteria support the second model:

```{r,echo=FALSE}
cat(" First model: AIC = ", AIC(modelInflated), " BIC = ", BIC(modelInflated),
    "\nSecond model: AIC = ", AIC(modelInflated2), " BIC = ", BIC(modelInflated2), 
    "\n", sep = "")
```

## Funding

Work on this package is supported by the the National Science Center, OPUS 20 grant no. 2020/39/B/HS4/00941.