(cache)Python DatasourceでJSONをいい感じにパースして可視化していこう

この記事は Redash Advent Calendar 2017 の10日目の記事です。

Redashでは「Python Datasource」というものが使え、Redash上でPythonスクリプトを書くことができます。
これを使えば、Redash上でほぼなんでもできるようになります。

今回やるのは、「JSONのパース」です。

BigQueryでスキーマレスなデータを扱う

Redashの用途として、BigQueryのデータを解析・グラフ化というものがあると思います。

BigQueryのデータ内にJSONを入れ、そのJSONを解析していい感じにグラフ化したいという要望がきたとして
スキーマレスなJSONが入っていた場合、BigQueryでの解析は難しいです。

例えば以下のようなJSONがあるとします。

{"id": 1, "name": "Taro", "record": "[{\"Mathematics\": 80},{\"English\": 90}]"}
{"id": 2, "name": "Hanako", "record": "[{\"Mathematics\": 65},{\"English\": 70}]"}
{"id": 3, "name": "Pochi", "record": "[{\"Mathematics\": 100},{\"English\": 60}]"}
{"id": 4, "name": "Ken", "record": "[{\"Mathematics\": 40},{\"English\": 80}]"}

上記JSONをBigQueryにインポートしてRedashで普通に実行すると以下のようになります。

select
  id
  ,name
  ,record
from
  samplejson
order by
  id

ここから「各生徒の数学と英語の合計点を算出せよ」みたいな要件がある場合、こんな感じのSQLを書かないといけません。

select
  id
  ,name
  ,record
  ,SAFE_CAST(JSON_EXTRACT_SCALAR(record, '$.0.Mathematics') AS INT64) + SAFE_CAST(JSON_EXTRACT_SCALAR(record, '$.1.English') AS INT64) as result
from
  samplejson
order by
  id

かなり危ういSQLになってます。

今回のサンプルのJSONはわざとややこしい感じにしましたが、
こういう構造のJSONや、また配列があったりネストされているJSONもあったりします。
こういう場合、BigQueryが用意している JSON_EXTRACT_SCALAR などでは正直つらいです。
なのでこういうのはPythonでやってしまいたいと思います。

Python Datasourceを使う

Python Datasourceは「Datasource」の管理画面から設定を行えます。

では先ほどのJSONをパースし、表に出す処理をPythonで書いてみます。

import json

# Redashで保存したクエリーの結果をこれで参照できる
# query id = 8は上記のテーブルをそのまま出したもの
q_res = get_query_result(8)

result = {}
for row in q_res["rows"]:
    j = json.loads(row["record"])
    math = 0
    eng  = 0
    
    for record in j:
        if "Mathematics" in record:
            math = record["Mathematics"]
        if "English" in record:
            eng  = record["English"]

    add_result_row(result, {
        "id": row["id"],
        "name": row["name"],
        "result": math + eng
    })
add_result_column(result, 'id', '', 'integer')
add_result_column(result, 'name', '', 'string')
add_result_column(result, 'result', '', 'integer')

これを実行すると以下のようになります。

いい感じになりました。
Pythonでできることはなんでも出来るので、ネストされてようが配列だろうが好きに扱えます。

生のJSONを表にする

Pythonが書けるんだから、もうなんでも表にできます。
今回はAWSのEC2の料金表をRedashで表示してみます。
（本来は料金表をDBに入れてそれを操作したほうがよいです）

AWSの料金表はJSONで取れるので、それをいい感じにパースします。

# EC2の料金表
import json, urllib2

url = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-northeast-1/index.json"

r = urllib2.urlopen(url)
root = json.loads(r.read())
r.close()

result = {}
p_key_list = []
for p_key in root["products"]:
    if root["products"][p_key]["productFamily"] != "Compute Instance":
        continue
    
    if root["products"][p_key]["attributes"]["tenancy"] != "Shared":
        continue
    
    if root["products"][p_key]["attributes"]["operatingSystem"] != "Linux":
        continue
    p_key_list.append(p_key)

for p_key in p_key_list:
    pricePerUnit = root["terms"]["OnDemand"][p_key][p_key + ".JRTCKXETXF"]["priceDimensions"][p_key + ".JRTCKXETXF.6YS6EN2CT7"]["pricePerUnit"]["USD"]
    pricePerMonth = float(pricePerUnit) * 24.0 * 31.0
    add_result_row(result, {
        "sku": p_key,
        "instanceType": root["products"][p_key]["attributes"]["instanceType"],
        "vcpu" : root["products"][p_key]["attributes"]["vcpu"],
        "memory" : root["products"][p_key]["attributes"]["memory"],
        "storage" : root["products"][p_key]["attributes"]["storage"],
        "price/hour":pricePerUnit,
        "price/month": pricePerMonth
    })

add_result_column(result, 'sku', '', 'string')
add_result_column(result, 'instanceType', '', 'string')
add_result_column(result, 'vcpu', '', 'string')
add_result_column(result, 'memory', '', 'string')
add_result_column(result, 'storage', '', 'string')
add_result_column(result, 'price/hour', '', 'float')
add_result_column(result, 'price/month', '', 'float')

はい、いい感じに表にできました。

最後に

このようにBigQueryでスキーマレスなJSONをパースしてみたり、ただのJSONをパースできたりします。
今回はJSONを焦点に当てましたが、どんなフォーマットでもPythonが解釈できればRedashで表にすることができます。

こんな感じで身の回りのものをRedashで可視化しやすくなってきたので、ぜひ利用してきましょう！

では！