(cache) 統計的問題を回避するためのデータ解析のプロトコル (Zuur et al. 2010): 4 データの中にゼロがたくさんあるか？

この記事では統計的問題を回避するためのデータ解析のプロトコル (Zuur et al. 2010)で扱われているゼロ過剰問題を取り扱っている。

uribo.hatenablog.com

離散値の整数かならるカウントデータの多くはポアソン分布に従うことが一般的である。しかし、ある生息地における生物の観察数やスポーツにおける試合の得点など、０を多く含むデータが存在する。

そうしたデータについて統計モデルを適用する場合、ポアソン分布や負の二項分布を仮定した一般化線形モデル GLMなどを行うと、ポアソン分布で期待されるよりも過剰（あるいは過少）にデータが観測されることがあり推定がうまくいかないことがある。そのデータのように０の割合が多いデータに対して有効なモデルがゼロ過剰なポアソン分布モデル Zero-inflated Poisson Distribution: ZIPモデルである。

📉 カウントデータにポアソン分布を用いる

"Analysis of Categorical Data with R (Christopher R. Bilder and Thomas M. Loughin 2014)"から例題として出されているデータセット(Tauber et al. 1996)を利用させてもらう。このデータセットはGalerucella nymphaeaeという甲虫のオスとメスのペアを異なる温度条件で飼育し、産卵数を観察したものである。

このデータから温度に対する産卵数の効果を検討してみたい。

library(data.table)
library(dplyr)

beetle_egg <- fread(input = "BeetleEggCrowding.txt",
                        data.table = FALSE) %>% 
  dplyr::filter(TRT == "I")

beetle_egg$NumEggs %>% {
  print(class(.))
  print(head(.))
}

## [1] "integer"
## [1] 8 0 3 5 1 3

まずはGLMでポアソン分布を当てはめてみる

poi_mod <- beetle_egg %>% glm(NumEggs ~ Temp, 
                              family = poisson(),
                              data   = .)
summary(poi_mod)

## 
## Call:
## glm(formula = NumEggs ~ Temp, family = poisson(), data = .)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.947  -2.662  -1.259   1.569   5.174  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16026    0.91570  -0.175   0.8611
## Temp         0.06787    0.04034   1.682   0.0925
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 418.28  on 69  degrees of freedom
## Residual deviance: 415.43  on 68  degrees of freedom
## AIC: 566.09
## 
## Number of Fisher Scoring iterations: 6

mu_poi <- poi_mod %>% predict() %>% exp() 
zero_poi <- -mu_poi %>% exp() %>% sum()
round(zero_poi, digits = 2)

## [1] 1.47

実際の０の観察値（27）に対し、０の期待値が過少に評価されてしまった。

📈 ゼロ過剰なポアソン分布

Rのパッケージでは

Which is the best R package for zero-Inflated count data? - ResearchGate

という話題がでるほどたくさんあるのだが、ここでは最尤推定法を利用する{pscl}とMCMCサンプリングによるベイズ推定を行う（どちらも {rstan} のラッパーパッケージ）{brms}と{glmmstan}の３つのパッケージを試してみる。

library(pscl)
library(brms)
library(glmmstan)

📦 {pscl}

zip_mod <- beetle_egg %>% zeroinfl(NumEggs ~ Temp | 1, dist = "poisson", data = .)
#  distを明示しない場合と一緒
zip_mod %>% {
  print(summary(.))
  zero_zip <<- sum(predict(object = ., type = "prob")[,1]) %>% round(digits = 2)
    # type... prob, count, zero
}

## 
## Call:
## zeroinfl(formula = NumEggs ~ Temp | 1, data = ., dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0998 -1.0318 -0.6531  0.8632  3.2015 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.44898    0.92249  -1.571 0.116245
## Temp         0.14700    0.04061   3.620 0.000295
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.4720     0.2466  -1.914   0.0556
## 
## Number of iterations in BFGS optimization: 8 
## Log-likelihood:  -188 on 3 Df

zero_zip

## [1] 27.02

{pscl}のzeroinfl()で真の０の度数に近い値が得られた。

先ほどのポアソン分布を当てはめたモデルと比べてみても

vuong(zip_mod, poi_mod)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A      p-value
## Raw                    4.620874 model1 > model2 0.0000019106
## AIC-corrected          4.571228 model1 > model2 0.0000024244
## BIC-corrected          4.515414 model1 > model2 0.0000031597

ZIPモデルのほうが妥当であるというけっかが得られた。

📦 {brms}

続いて{brms}でZIPモデルを適用する。

zip_mod_brms <- beetle_egg %>% 
  brm(NumEggs ~ Temp, 
      data     = ., 
      seed     = 71,
      n.chains = 3,
      n.iter   = 10000,
      n.warmup = 8000,
      family   = "zero_inflated_poisson")

summary(zip_mod_brms)

##  Family: zero_inflated_poisson (log) 
## Formula: NumEggs ~ Temp 
##    Data: . (Number of observations: 70) 
## Samples: 3 chains, each with n.iter = 10000; n.warmup = 8000; n.thin = 1; 
##          total post-warmup samples = 6000
##    WAIC: 466.74
##  
## Fixed Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    -2.01      1.03    -3.90     0.03        528    1
## Temp          0.17      0.05     0.08     0.25        526    1
## 
## Samples were drawn using NUTS(diag_e). For each parameter, Eff.Sample is a 
## crude measure of effective sample size, and Rhat is the potential scale 
## reduction factor on split chains (at convergence, Rhat = 1).

📦 {glmmstan}

最後に{glmmstan}でZIPモデルを適用する。{glmmstan}は最近のバージョンでZIPモデルを始め、より多くの分布に対応したとのこと。

zip_mod_glmmstan <- beetle_egg %>% 
  glmmstan(NumEggs ~ Temp, 
           data   = ., 
           family = "zipoisson",
           chains = 3,
           iter   = 10000,
           warmup = 8000)

print(zip_mod_glmmstan, digits = 2, pars = "beta")

## Inference for Stan model: NumEggs~Temp [zipoisson].
## 3 chains, each with iter=10000; warmup=8000; thin=1; 
## post-warmup draws per chain=2000, total post-warmup draws=6000.
## 
##          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## beta[1] -1.46    0.03 0.93 -3.24 -2.10 -1.46 -0.82  0.31  1329    1
## beta[2]  0.15    0.00 0.04  0.07  0.12  0.15  0.18  0.23  1324    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Nov 25 21:35:30 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Stanのモデルコードについてはもう少し検討する必要があるみたいだが、これらのパッケージを利用してゼロが過剰なポアソン分布のデータに対応することができた。

☕ 雑

RでZIPモデルを行うパッケージ。怪しいものもある。上のものから気に入っている。下の３つは検証していない。

{pscl}... 良い。
{glmmstan}... 良い。{rstan}のラッパー
{brms}... Stanを利用した線形モデル
{blme}... {lme4}のベイズ拡張。ランダム効果を取り組んだモデルに使える
{glmmADMB} CRANにもGitHubにもないどこか怪しいパッケージ
{COZIGAM} アーカイブされている
{gamlss.dist}... 使えなさそう

おまけ。

## 負の二項分布
negbim_mod <- beetle_egg %>% glm.nb(NumEggs ~ Temp, 
                      data = ., 
                      init.theta = 0.61, 
                      link = log)
summary(negbim_mod)

## 
## Call:
## glm.nb(formula = NumEggs ~ Temp, data = ., init.theta = 0.4684948827, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4773  -1.4185  -0.4396   0.4459   1.2875  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16026    2.78026  -0.058    0.954
## Temp         0.06787    0.12320   0.551    0.582
## 
## (Dispersion parameter for Negative Binomial(0.4685) family taken to be 1)
## 
##     Null deviance: 74.169  on 69  degrees of freedom
## Residual deviance: 73.865  on 68  degrees of freedom
## AIC: 342.43
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.468 
##           Std. Err.:  0.106 
## 
##  2 x log-likelihood:  -336.429

📚 参考

Zuur et al. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3--14.
カウントデータの統計学
Analysis of Categorical Data with R
R Data Analysis Examples: Zero-Inflated Poisson Regression
StanでZero-inflated Poissonモデル：Taglibro de H：So-netブログ
Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models | Datavore Consulting