-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
253 lines (206 loc) · 9.97 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Overview
<!-- badges: start -->
[![R-CMD-check](https://github.com/ncn-foreigners/singleRcapture/workflows/R-CMD-check/badge.svg)](https://github.com/ncn-foreigners/singleRcapture/actions)
[![Codecov test coverage](https://codecov.io/gh/ncn-foreigners/singleRcapture/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ncn-foreigners/singleRcapture?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/singleRcapture)](https://CRAN.R-project.org/package=singleRcapture)
[![CRAN total downloads](https://cranlogs.r-pkg.org/badges/grand-total/singleRcapture)](https://cran.r-project.org/package=singleRcapture)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/singleRcapture)](https://cran.r-project.org/package=singleRcapture)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8436043.svg)](https://doi.org/10.5281/zenodo.8436043)
<!-- badges: end -->
Capture-recapture type experiments are used to estimate the total population size
in situations when observing only a part of such population is feasible. In recent
years these types of experiments have seen more interest.
Single source models are distinct from other capture-recapture models because
we cannot estimate the population size based on how many units were observed
in two or three sources which is the standard approach.
Instead in single source models we utilize count data regression models on positive
distributions (i.e. on counts greater than 0) where the dependent variable
is the number of times a particular unit was observed in source data.
This package aims to implement already existing and introduce new methods
of estimating population size from single source to simplify the research process.
Currently we've implemented most of the frequentist approaches used in literature such as:
- Zero truncated Poisson, geometric and negative binomial regression.
- Zero truncated one inflated and one inflated zero truncated Poisson and geometric models. (Negative binomial is currently in development.)
- Zero one truncated Poisson geometric and negative binomial models.
- Generalized Chao and Zelterman's models based on logistic regression.
- Three types of bootstrap parametric, semi-parametric and non parametric.
- And a wide range of additional functionalities associated with (vector) generalized linear models relevant to the topic.
## Installation
You can install the current version of singleRcapture from main branch [GitHub](https://github.com/ncn-foreigners/singleRcapture) with:
```{r, eval=FALSE}
# install.packages("devtools")
remotes::install_github("ncn-foreigners/singleRcapture")
```
or install the stable version from [CRAN](https://cran.r-project.org/package=singleRcapture) with:
```{r, eval=FALSE}
install.packages(singleRcapture)
```
### Examples
The main function of this package is `estimatePopsize` which fitts regression
on specified distribution and then uses fitted regression to estimate the
population size.
Lets look at a model from 2003 publication :
Point and interval estimation of the population size using the truncated
Poisson regression model Heijden, Peter GM van der et al.
The call to `estimatePopsize` will look very similar to anyone who used the
`stats::glm` function:
```{r example}
library(singleRcapture)
model <- estimatePopsize(
formula = capture ~ gender + age + nation, # specify formula
data = netherlandsimmigrant,
popVar = "analytic", # specify
model = "ztpoisson", # distribution used
method = "IRLS", # fitting method one of three currently supported
controlMethod = controlMethod(silent = TRUE) # ignore convergence at half step warning
)
summary(model) # a summary method for singleR class with standard glm-like output and population size estimation resutls
```
We implemented a method for `plot` function to visualise the model fit and other
useful diagnostic information. One of which is `rootogram`, a type of plot
that compares fitted and observed marginal frequencies:
```{r plot, fig.show="hold", out.width="75%"}
plot(model, plotType = "rootogram")
```
The possible values for `plotType` argument are:
- `qq` - the normal quantile-quantile plot for pearson residuals.
- `marginal` - a `matplot` comparing fitted and observed marginal frequencies.
- `fitresid` - plot of linear predictor values contrasted with pearson residuals.
- `bootHist` - histogram of bootstrap sample.
- `rootogram` - rootogram, example presented above.
- `dfpopContr` - contrasting two deletion effects to identify presence of influential observations.
- `dfpopBox` - boxplot of results from dfpopsize function see its documentation.
- `scaleLoc` - scale-location plot.
- `cooks` - plot of `cooks.values` for distributions for which it is defined.
- `hatplot` - plot of `hatvalues`.
- `strata` - plot of confidence intervals for selected such populations.
a user can also pass arguments to specify additional information such as plot
title, subtitle etc. similar to calling `plot` on some data. For more info
check `plot.singleR` method documentation.
As we have seen there are some significant differences between fitted and
observed marginal frequencies. To check our intuition let's perform
goodness of fit test between fitted and observed marginal frequencies.
To do it we call a `summary` function of `marginalFreq` function which
computes marginal frequencies for the fitted `singleR` class object:
```{r}
summary(marginalFreq(model), df = 2, dropl5 = "group")
```
Finally let us check if we have any influential observations. We will do this
by comparing the deletion effect of every observation on population size estimate
by removing it entirely from the model (from population size estimate and regression)
and by only omitting it in pop size estimation (this is what is called the
contribution of an observation). If observation is not influential these two
actions should have the approximately the same effect:
```{r, fig.show="hold", out.width="75%"}
plot(model, plotType = "dfpopContr")
```
it is easy to deduce from the plot above that we have influential observations in
our dataset (one in particular).
Lastly `singleRcapture` offers some posthoc procedures for example a function
`stratifyPopsize` that estimates sizes of user specified sub populations and
returns them in a `data.frame`:
```{r}
stratifyPopsize(model, alpha = c(.01, .02, .03, .05), # different significance level for each sub population
strata = list(
"Females from Surinam" = netherlandsimmigrant$gender == "female" & netherlandsimmigrant$nation == "Surinam",
"Males from Turkey" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$nation == "Turkey",
"Younger males" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$age == "<40yrs",
"Older males" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$age == ">40yrs"
))
```
`strata` argument may be specified in various ways for example:
```{r}
stratifyPopsize(model, strata = ~ gender / age)
```
The package was designed with convenience in mind, for example it is possible
to specify that weights provided on call are to be interpreted as number of
occurrences of units in each row:
```{r}
df <- netherlandsimmigrant[, c(1:3,5)]
df$ww <- 0
### this is dplyr::count without dependencies
df <- aggregate(ww ~ ., df, FUN = length)
summary(estimatePopsize(
formula = capture ~ nation + age + gender,
data = df,
model = ztpoisson,
weights = df$ww,
controlModel = controlModel(weightsAsCounts = TRUE)
))
```
Methods such as regression diagnostics will be adjusted (values of weights will
be reduced instead of rows being removed etc.)
We also included option to use common non standardargument such as
significance levels different from usual 5%:
```{r boot}
set.seed(123)
modelInflated <- estimatePopsize(
formula = capture ~ gender + age,
data = netherlandsimmigrant,
model = "oiztgeom",
method = "IRLS",
# control parameters for population size estimation check documentation of controlPopVar
controlPopVar = controlPopVar(
alpha = .01, # significance level
)
)
summary(modelInflated)
```
and the option to estimate standard error of population size estimate by
bootstrap, models with more than one distribution parameter being dependent
on covariates and some non standard link functions for example:
```{r}
modelInflated2 <- estimatePopsize(
formula = capture ~ age,
data = netherlandsimmigrant,
popVar = "bootstrap",
model = oiztgeom(omegaLink = "cloglog"),
method = "IRLS",
controlPopVar = controlPopVar(
B = 500,# number of boostrap samples
alpha = .01, # significance level
# type of bootstrap see documentation for estimatePopsize
bootType = "semiparametric",
# control regression fitting on bootstrap samples
bootstrapFitcontrol = controlMethod(
epsilon = .Machine$double.eps,
silent = TRUE,
stepsize = 2
)
),
controlModel = controlModel(omegaFormula = ~ gender) # put covariates on omega i.e. the inflation parameter
)
popSizeEst(modelInflated2)
```
the results are significantly different
(the warning issued concerns the second derivative test for existence of local
minimum, here it was inconclusive but we manually checked that fitting process
found the optimal regression coefficients it's here to provide more information
to the user):
```{r, fig.show="hold", out.width="75%"}
plot(modelInflated2,
plotType = "bootHist",
labels = TRUE,
ylim = c(0, 175),
breaks = 15)
```
and information criteria support the second model:
```{r,echo=FALSE}
cat(" First model: AIC = ", AIC(modelInflated), " BIC = ", BIC(modelInflated),
"\nSecond model: AIC = ", AIC(modelInflated2), " BIC = ", BIC(modelInflated2),
"\n", sep = "")
```
## Funding
Work on this package is supported by the the National Science Center, OPUS 20 grant no. 2020/39/B/HS4/00941.