BigQueryで統計量を出す時に使うクエリメモ

この記事は最終更新日から3年以上が経過しています。

これは何か

window関数を駆使するとUDFを使わなくとも簡潔に統計量が出せる。がよく書き方を忘れるのでメモ

基本

-- 平均
AVG(expr)

-- 中央値0.1%誤差
NTH(501, QUANTILES(expr, 1001))

-- 最頻値
TOP(expr, 1), COUNT(*) as count

-- 分散
VARIANCE(expr) 
-- 母集団の分散 (分母が n-1 の方)
VARIANCE_POP(expr)

-- 標準偏差
STDDEV(expr)
-- 母集団の標準偏差 (分母が n-1 の方)
STDDEV_POP(expr)

-- 変動係数
STDDEV(expr)/AVG(expr)

-- ピアソン相関係数
CORR(expr1, expr2)

情報量

Public DatasetsのUSA Namesから、ジェンダー別に名前毎の情報量

select
    gender,
    name,
    number,
    -log2(p_name) as bits
from (
    select
        gender,
        name,
        number,
        ratio_to_report(number) over (partition by gender) as p_name,
    from [bigquery-public-data:usa_names.usa_1910_2013]
    where state = 'CA' 
)

エントロピー

Public DatasetsのUSA Namesから、ジェンダー毎に名前のエントロピー

select
    state,
    gender,
    sum(p_name * -log2(p_name)) as entropy
from (
    select
        state,
        gender,
        name,
        number,
        ratio_to_report(number) over (partition by gender) as p_name,
    from [bigquery-public-data:usa_names.usa_1910_2013]
)
group by state, gender
order by entropy

KL-Divergence

KLDivergence=iP(i)log2P(i)Q(i)

KaggleのClick-Through Rate Predictionのデータを例に、広告枠毎にhourlyのインプレッション数とクリック数の確率分布の距離を取る

select
    site_id,
    sum(p_imp * log2(p_imp/(p_click + 0.0001))) as KLD_imp_and_click
from (
    select 
        site_id,
        hour,
        imp,
        click,
        ratio_to_report(imp) over (partition by site_id) as p_imp,
        ratio_to_report(click) over (partition by site_id) as p_click
    from (
        select 
            site_id,
            hour,
            count(*) as imp,
            sum(click) as click,
        from [hagino3000-bq-sandbox:ctr_competition.train]
        group by site_id, hour
        order by site_id, hour
    )
)
group by site_id
order by KLD_imp_and_click

時系列データの移動平均と差分系列

同じくKaggleのClick-Through Rate Predictionで広告枠ごとの3時間移動平均。欠損値を補完したい場合は軸を別途作ってJoinする事。

select
  site_id,
  hour,
  avg(imp) over (partition by site_id
                 order by hour
                 ROWS BETWEEN 3 preceding AND CURRENT ROW) as rolling_mean
from (
  select 
      site_id,
      hour,
      count(*) as imp
  from [hagino3000-bq-sandbox:ctr_competition.train]
      group by site_id, hour
)
order by site_id, hour

差分系列

select
    site_id,
    hour,
    imp,
    imp - ifnull(previous_imp, imp) as residual
from (
    select
        site_id,
        hour,
        imp,
        lag(imp, 1) over (partition by site_id order by hour) as previous_imp
    from (
        select 
            site_id,
            hour,
            count(*) as imp
        from [hagino3000-bq-sandbox:ctr_competition.train]
        group by site_id, hour
    )
)
order by site_id, hour

正規化した中央絶対偏差

ロバストな標準偏差の推定値

MADN=median(|ximean(x)|)0.674
select 
    a.stn as stn,
    a.year as year,
    a.mo as month,
    a.da as day,
    a.temp as temp, 
    avg(a.temp) over (partition by a.stn) as temp_mu,
    stddev(a.temp) over (partition by a.stn) as temp_std,
    b.median as temp_median,
    b.madn as temp_madn
from [bigquery-public-data:noaa_gsod.gsod1929] as a
inner join 
(
    select 
        org.stn as stn,
        agg.median as median,
        nth(501, quantiles(abs(org.temp - agg.median), 1001))/0.675 as madn,
    from [bigquery-public-data:noaa_gsod.gsod1929] as org
    inner join 
    (
        select 
            stn,  nth(501, quantiles(temp, 1001)) as median
        from [bigquery-public-data:noaa_gsod.gsod1929]
        group by stn
    ) as agg on org.stn = agg.stn
    group by stn, median
) as b on a.stn = b.stn

Z値

BigQuery public datasetのNOAA Weather Dataから、ステーション毎に気温のZ値

select
    stn,
    year,
    mo,
    da,
    temp,
    (temp - mu)/std AS Z
from (
    select 
        stn,
        year,
        mo,
        da,
        temp, 
        avg(temp) over (partition by stn) as mu,
        stddev(temp) over (partition by stn) as std
    from [bigquery-public-data:noaa_gsod.gsod1929]
)
order by stn, year, mo, da
ユーザー登録して、Qiitaをもっと便利に使ってみませんか。
  1. あなたにマッチした記事をお届けします
    ユーザーやタグをフォローすることで、あなたが興味を持つ技術分野の情報をまとめてキャッチアップできます
  2. 便利な情報をあとで効率的に読み返せます
    気に入った記事を「ストック」することで、あとからすぐに検索できます
コメント
この記事にコメントはありません。
あなたもコメントしてみませんか :)
すでにアカウントを持っている方は
ユーザーは見つかりませんでした