Prometheus casual talk1

1. Hadoop, Fluentd cluster monitoring with Prometheus and Grafana 2016/06/14 @wyukawa Prometheus Casual Talks #1 #prometheuscasual

2. Agenda •  Prometheus History •  Prometheus Feature •  Prometheus Architecture •  My use case

3. History •  Started in 2012 by ex-Google Site Reliability Engineers •  Written in Go •  Inspired by Google’s Borgmon – Borgmon monitors Borg •  Public announcement in January 2015 http://www.slideshare.net/FabianReinartz/prometheus-a-next-gen-monitoring-system-3

4. Features •  pull architecture – easy flow control – not easy to get through a firewall •  Cloud Monitoring as a Service uses push model •  multi-dimensional data model •  powerful query language •  alert

5. pull architecture https://prometheus.io/docs/introduction/overview/

6. node_exporter example •  http://host:9100/metrics
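
To make the scrape target concrete, here is a minimal sketch of reading the text format an exporter serves at /metrics. The sample text is a hypothetical illustration, not real node_exporter output, and the parser is deliberately simplified: it assumes no timestamps and no spaces after the value.

```python
import re

# Hypothetical sample of a /metrics response (illustrative values only).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 12345.6
node_cpu{cpu="cpu0",mode="user"} 234.5
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# one label pair: name="value" (escaped characters are kept as written)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore anything it cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each label set becomes its own time series, which is what the multi-dimensional data model on the next slide refers to.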

7. multi-dimensional data model •  metric types – counter – gauge – histogram – summary https://prometheus.io/docs/concepts/metric_types/

8. How to handle counter metric •  Do you use reset? http://www.robustperception.io/how-does-a-prometheus-counter-work/ No! Use the rate/irate/increase functions! 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100) http://www.robustperception.io/understanding-machine-cpu-usage/
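
Why no reset is needed can be sketched in a few lines. This is a simplified illustration of the idea behind increase/rate, not Prometheus's actual implementation (which also extrapolates to the window boundaries). The function names and sample data are hypothetical.

```python
def counter_increase(samples):
    """Total increase over (timestamp, value) samples of a counter.

    A drop in value is treated as a counter reset (for example, a process
    restart), so the counter is assumed to have restarted from 0.
    """
    total = 0.0
    prev = None
    for _ts, value in samples:
        if prev is not None:
            if value >= prev:
                total += value - prev
            else:
                # Reset detected: everything counted since the restart is new.
                total += value
        prev = value
    return total


def counter_rate(samples):
    """Per-second average increase over the sampled window, or None."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed


if __name__ == "__main__":
    # Hypothetical scrapes every 15s; the process restarts before t=30.
    scrapes = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
    print(counter_increase(scrapes), counter_rate(scrapes))
```

Because the reset is handled at query time, the exporter can expose the raw, ever-growing counter and never needs to zero it.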

9. powerful query language sum by(status) (rate(http_response_status_total[1m])) ALERT DiskWillFillIn4Hours IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0 FOR 5m LABELS { severity="page" } http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
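
The idea behind the predict_linear alert can be sketched as a least-squares line fitted to the last hour of samples and extended four hours ahead. This is an illustrative approximation under that assumption, not Prometheus's code. The function name mirrors the PromQL function only for readability, and the data is made up.

```python
def predict_linear(samples, now, seconds_ahead):
    """Fit value = intercept + slope * t by least squares and
    return the predicted value at time now + seconds_ahead.

    samples is a list of (timestamp_seconds, value) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + seconds_ahead)


if __name__ == "__main__":
    # Hypothetical free-bytes samples over one hour, shrinking steadily.
    free = [(t, 10000.0 - t * 1000.0 / 600.0) for t in range(0, 3601, 600)]
    predicted = predict_linear(free, now=3600, seconds_ahead=4 * 3600)
    # The alert condition from the slide: predicted free space below zero.
    print("disk will fill in 4 hours:", predicted < 0)
```

Alerting on the trend instead of a fixed threshold is what reduces noise: a disk that is 95% full but stable does not page anyone.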

10. Alert •  Alertmanager is responsible for alert handling •  very young compared to Prometheus itself •  very promising •  aim to have as few alerts as possible – repeat_interval: 4hours

11. My use case •  At first I maintained file_sd_configs manually •  Now I use promgen! •  Exporters are run by supervisord/systemd •  Monitor middleware and machines – Hadoop – Fluentd – Elasticsearch
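
For context on the manual file_sd_configs step: Prometheus reads target files containing a JSON list of groups, each with "targets" and optional "labels". The sketch below generates such a file body. The host names, the port, and the "service" label are hypothetical, and this is not how promgen itself works.

```python
import json


def build_target_groups(hosts_by_service, port):
    """Build file_sd target groups: one group per service.

    hosts_by_service maps a service name to a list of host names.
    The "service" label name is an illustrative choice.
    """
    groups = []
    for service in sorted(hosts_by_service):
        targets = ["%s:%d" % (host, port) for host in hosts_by_service[service]]
        groups.append({"targets": targets, "labels": {"service": service}})
    return groups


def render_file_sd(groups):
    """Serialize target groups as the JSON text Prometheus would read."""
    return json.dumps(groups, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical inventory; 9100 is the node_exporter port from slide 6.
    inventory = {
        "hadoop": ["nn1.example.com", "nn2.example.com"],
        "fluentd": ["fluentd1.example.com"],
    }
    print(render_file_sd(build_target_groups(inventory, 9100)))
```

Writing the rendered text to a path listed under file_sd_configs lets Prometheus pick up target changes without a restart, which is the part that gets tedious to do by hand.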

12. monitoring hadoop/hive •  Developers usually use jmx_exporter to monitor Java middleware •  But I implemented namenode/resourcemanager/jstat exporters because I wanted to, and because I don’t want to restart the daemons •  https://github.com/wyukawa/hadoop_exporter •  https://github.com/wyukawa/jstat_exporter

13. Namenode block monitoring Grafana Annotation Alerts are also Prometheus metrics, so Grafana can show them as annotations

14. Resourcemanager job monitoring

15. Hiveserver2 jvm monitoring https://issues.apache.org/jira/browse/HIVE-13374

16. Fluentd buffer monitoring •  fluent-plugin-prometheus enables buffer monitoring

17. access log count •  fluent-plugin-prometheus can count access logs, but sampling is needed because of high CPU usage (Flink/Storm/… may be necessary)

18. HTTP status count Even when the real 4xx/5xx count is not 0, the graph may show 0 because of sampling

19. HTTP status percentage

20. fluentd_exporter •  I implemented fluentd_exporter because I want to monitor fluentd CPU usage http://d.hatena.ne.jp/wyukawa/20160603/1464934228

21. elasticsearch_exporter https://github.com/elastic/elasticsearch/issues/18635

22. My impression •  Prometheus has a powerful query language, but it is sometimes difficult to understand – sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job) •  Grafana is also great, but link sharing is a little weak
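
The query above may be easier to read as plain code. The left side has one series per (status, job) and the right side one per job. Matching while ignoring the status label pairs many left series with one right series, and each status rate is divided by its job's total. A minimal sketch with hypothetical job names and rates:

```python
from collections import defaultdict


def status_ratio(rates):
    """Divide each (job, status) rate by the total rate of its job.

    rates maps (job, status) to a per-second request rate. The result
    maps the same keys to the share of that status within the job,
    which is the many-to-one division the PromQL query expresses.
    """
    totals = defaultdict(float)
    for (job, _status), rate in rates.items():
        totals[job] += rate  # right-hand side: sum(...) by (job)
    return {
        (job, status): rate / totals[job]
        for (job, status), rate in rates.items()
        if totals[job] > 0  # a job with no traffic yields no result
    }


if __name__ == "__main__":
    # Hypothetical per-second rates for two jobs.
    observed = {
        ("web", "200"): 90.0,
        ("web", "500"): 10.0,
        ("api", "200"): 5.0,
    }
    for key, share in sorted(status_ratio(observed).items()):
        print(key, share)
```

Multiplying each share by 100 gives the HTTP status percentage graph from slide 19.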