Hadoop, Fluentd cluster monitoring
with Prometheus and Grafana
2016/06/14
@wyukawa
Prometheus Casual Talks #1
#promet...
Prometheus casual talk1

Published in: Data & Analytics

  1. Hadoop, Fluentd cluster monitoring with Prometheus and Grafana
     2016/06/14 @wyukawa
     Prometheus Casual Talks #1 #prometheuscasual
  2. Agenda
     • Prometheus History
     • Prometheus Feature
     • Prometheus Architecture
     • My use case
  3. History
     • Started in 2012 by ex-Google Site Reliability Engineers
     • Written in Go
     • Inspired by Google’s Borgmon
       – Borgmon monitors Borg
     • Public announcement in January 2015
     http://www.slideshare.net/FabianReinartz/prometheus-a-next-gen-monitoring-system-3
  4. Features
     • pull architecture
       – easy flow control
       – not easy to get through firewall
     • Cloud Monitoring as a Service uses push model
     • multi dimensional data model
     • powerful query language
     • alert
  5. pull architecture
     https://prometheus.io/docs/introduction/overview/
  6. node_exporter example
     • http://host:9100/metrics
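The /metrics endpoint on slide 6 returns plain text in the Prometheus exposition format. As a rough illustration of what a scraper reads, here is a minimal sketch that parses a hard-coded sample instead of contacting a host. The sample lines, the `parse_metrics` name, and the simplifications (no timestamps, no escaped-label handling beyond the basics) are assumptions for illustration, not output captured from the talk.

```python
import re

# Illustrative sample only; real node_exporter output is much longer.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 12345.6
node_cpu{cpu="cpu0",mode="user"} 234.5
"""

# metric_name, optional {labels}, then a single value (timestamps are not handled).
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each sample is a metric name plus a set of labels, which is the multi dimensional data model the next slide refers to.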
  7. multi dimensional data model
     • metric types
       – counter
       – gauge
       – histogram
       – summary
     https://prometheus.io/docs/concepts/metric_types/
  8. How to handle counter metric
     • Do you use reset?
     http://www.robustperception.io/how-does-a-prometheus-counter-work/
     No! use rate/irate/increase function!
     100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
     http://www.robustperception.io/understanding-machine-cpu-usage/
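Slide 8 says not to reset counters and to use rate/irate/increase instead. The idea is that a counter only goes up, so any drop between two samples can be read as a process restart. The sketch below shows that idea in a simplified form; it is not Prometheus's implementation (which also extrapolates to the window boundaries), and the function names and sample data are assumptions.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples sorted by time.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from 0, so the new value is counted as the increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # reset detected
    return total


def rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return increase(samples) / elapsed


# The counter restarts between t=15 and t=30 (130 -> 5).
window = [(0, 100.0), (15, 130.0), (30, 5.0), (45, 20.0)]
print(increase(window), rate(window))
```

Because the reset is handled at query time, the exporter never needs to zero its counters itself.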
  9. powerful query language
     sum by (status) (rate(http_response_status_total[1m]))
     ALERT DiskWillFillIn4Hours
       IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
       FOR 5m
       LABELS { severity="page" }
     http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
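The DiskWillFillIn4Hours rule on slide 9 fires when a straight line fitted to the last hour of free-space samples goes below zero four hours ahead. A minimal sketch of that calculation using ordinary least squares follows. It assumes the prediction starts from the last sample's timestamp, which is a simplification; the function name mirrors the PromQL function, but the code and the sample data are illustrative only.

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = intercept + slope * t and extrapolate seconds_ahead.

    samples: list of (timestamp_seconds, value), at least two distinct times.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]  # assumption: predict from the last sample
    return intercept + slope * (last_t + seconds_ahead)


# One hour of samples, one every 10 minutes, losing 0.1 units per second.
free = [(t, 1000.0 - 0.1 * t) for t in range(0, 3601, 600)]
predicted = predict_linear(free, 4 * 3600)
print(predicted, predicted < 0)
```

Alerting on the trend rather than on a fixed threshold is what reduces noise: a disk that is nearly full but stable does not fire.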
  10. Alert
     • Alertmanager has the role
     • very young compared to Prometheus itself
     • very promising
     • aim to have as few alerts as possible
       – repeat_interval: 4hours
  11. My use case
     • At first I use file_sd_configs manually
     • Now I use promgen!
     • Exporters are executed by supervisord/systemd
     • Monitor middlewares and machines
       – Hadoop
       – Fluentd
       – Elasticsearch
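Slide 11 mentions maintaining file_sd_configs by hand before moving to promgen. A file_sd target file is a JSON (or YAML) list of groups, each with a `targets` list and optional `labels`. The sketch below generates such a file from a small inventory. The host names, job names, and the Fluentd port are hypothetical; only port 9100 for node_exporter comes from the slides.

```python
import json


def build_file_sd(groups):
    """Turn {job_name: ["host:port", ...]} into file_sd target groups."""
    return [
        {"targets": sorted(targets), "labels": {"job": job}}
        for job, targets in sorted(groups.items())
    ]


def render_file_sd(groups):
    """Serialize the target groups as JSON for a file_sd_configs file."""
    return json.dumps(build_file_sd(groups), indent=2)


# Hypothetical inventory; replace with real hosts.
inventory = {
    "node": ["hadoop01.example.com:9100", "hadoop02.example.com:9100"],
    "fluentd": ["fluentd01.example.com:24231"],
}
print(render_file_sd(inventory))
```

Prometheus watches the files listed under file_sd_configs and reloads them when they change, so regenerating the file is enough to add or remove targets.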
  12. monitoring hadoop/hive
     • Developers always use jmx_exporter to monitor Java middleware
     • But I implemented namenode/resourcemanager/jstat exporters because I wanted to and because I don’t want to restart the daemons
     • https://github.com/wyukawa/hadoop_exporter
     • https://github.com/wyukawa/jstat_exporter
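Slide 12 describes writing standalone exporters that run next to the daemon instead of inside it. The general shape of such an exporter is a small HTTP server that collects values on each scrape and renders them as text. The sketch below uses only the Python standard library and is not the code of hadoop_exporter or jstat_exporter; `collect`, the metric names, and the values are hypothetical placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render {metric_name: number} as gauge lines in exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical placeholder: a real exporter would query the daemon here."""
    return {"example_missing_blocks": 0, "example_heap_used_bytes": 123456789}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Serve /metrics until interrupted; call this from a process manager."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve(9999)` (the port is arbitrary) under supervisord or systemd gives Prometheus a target to scrape without touching the monitored daemon.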
  13. Namenode block monitoring
     Grafana Annotation
     Alerts are also Prometheus metrics, so Grafana can show alerts as annotations
  14. Resourcemanager job monitoring
  15. Hiveserver2 jvm monitoring
     https://issues.apache.org/jira/browse/HIVE-13374
  16. Fluentd buffer monitoring
     • fluent-plugin-prometheus enables buffer monitoring
  17. access log count
     • fluent-plugin-prometheus can count access logs, but sampling is needed because of high CPU usage (Flink/Storm/… may be necessary)
  18. HTTP status count
     Although the actual 4xx/5xx count is not 0, the graph may show 0 because of sampling
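The caveat on slide 18 can be made concrete with a little arithmetic. The slides do not say how the sampling is done, so this sketch assumes each log line is kept independently with a fixed probability; under that assumption, the chance that every error line is dropped is (1 - rate) raised to the number of errors. The function names are illustrative.

```python
def prob_all_missed(error_count, sample_rate):
    """Probability that none of error_count lines survive sampling.

    Assumes each line is kept independently with probability sample_rate.
    """
    return (1.0 - sample_rate) ** error_count


def estimate_total(sampled_count, sample_rate):
    """Scale a sampled count back up to an estimate of the true count."""
    return sampled_count / sample_rate


# With 1% sampling, 5 real errors are missed entirely about 95% of the time.
print(prob_all_missed(5, 0.01))
print(estimate_total(3, 0.01))
```

Rare statuses are therefore the ones most distorted by sampling, while high-volume statuses such as 200 remain well estimated.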
  19. HTTP status percentage
  20. fluentd_exporter
     • I implemented fluentd_exporter because I want to monitor Fluentd CPU usage
     http://d.hatena.ne.jp/wyukawa/20160603/1464934228
  21. elasticsearch_exporter
     https://github.com/elastic/elasticsearch/issues/18635
  22. My impression
     • Prometheus has a powerful query language, but it is sometimes difficult to understand
       – sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job)
     • Grafana is also great, but link sharing is a little weak
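The query on slide 22 divides each (status, job) rate by the total rate of the same job, which is how an HTTP status percentage can be computed. The many-to-one match (several statuses against one per-job total) is what `ignoring(status) group_left` expresses. The sketch below performs the same calculation on a plain dictionary; the function name and the input numbers are made up for illustration.

```python
from collections import defaultdict


def status_share(rates):
    """Map (job, status) -> fraction of that job's total request rate.

    rates: dict mapping (job, status) to requests per second.
    """
    totals = defaultdict(float)
    for (job, _status), value in rates.items():
        totals[job] += value  # right-hand side: sum by (job)
    return {
        (job, status): value / totals[job]
        for (job, status), value in rates.items()
        if totals[job] > 0  # skip jobs with no traffic
    }


rates = {("web", "200"): 9.0, ("web", "500"): 1.0, ("api", "200"): 4.0}
for key, share in sorted(status_share(rates).items()):
    print(key, share)
```

Every status row of a job is matched to that job's single total, which is the "many" side that `group_left` permits.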
