1.
Masahiro Nakagawa
Apr 18, 2015
Game Server meetup #4
Fluentd / Embulk
For easy and reliable data transfer
2.
Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> Living at OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc…)
> etc…
4.
What’s Fluentd?
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Various gem-based plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
13.
Why JSON / MessagePack? (1)
> Schema on Write (Traditional MPP DB)
> Writing data with a schema to improve query performance
> Pros
> minimum query overhead
> Cons
> Need to design schema and workload beforehand
> Data load is an expensive operation
14.
Why JSON / MessagePack? (2)
> Schema on Read (Hadoop)
> Writing data without a schema and mapping a schema at query time
> Pros
> Robust against schema and workload changes
> Data load is a cheap operation
> Cons
> High overhead at query time
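The trade-off between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SCHEMA`, `write_with_schema`, `write_without_schema`, and `read_with_schema` are hypothetical names, and a plain list stands in for the storage of an MPP DB or Hadoop.

```python
import json

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"user_id": int, "path": str}

def write_with_schema(table, record):
    """Schema on write: validate and shape the record before storing it."""
    row = []
    for column, expected in SCHEMA.items():
        value = record[column]  # raises KeyError if a column is missing
        if not isinstance(value, expected):
            raise TypeError(f"{column} must be {expected.__name__}")
        row.append(value)
    table.append(tuple(row))

def write_without_schema(log, record):
    """Schema on read: store the raw record as-is, so loading is cheap."""
    log.append(json.dumps(record))

def read_with_schema(log):
    """Schema on read: map the schema at query time and pay the cost here."""
    for line in log:
        record = json.loads(line)
        yield tuple(record.get(column) for column in SCHEMA)

table, log = [], []
event = {"user_id": 1, "path": "/index.html"}
changed = {"user_id": 2, "path": "/about", "referer": "/index.html"}  # new field

write_with_schema(table, event)
write_without_schema(log, event)
write_without_schema(log, changed)  # accepted even though the shape changed
rows = list(read_with_schema(log))
```

The schema-free write path never rejects a record, which is why a JSON-like record suits a collector that sits in front of many different destinations.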
16.
Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins
> Read / receive data
> Parse data
> Filter data
> Buffer data
> Format data
> Write / send data
17.
Core (Common Concerns)
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins (Use Case Specific)
> Read / receive data
> Parse data
> Filter data
> Buffer data
> Format data
> Write / send data
18.
Event structure (log message)
✓ Time
> second unit by default
> from data source
✓ Tag
> for message routing
> where is it from?
✓ Record
> JSON format
> MessagePack internally
> schema-free
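As a rough illustration of the three parts of an event, the sketch below models one in Python. The `Event` type, the `make_event` helper, and the sample record are hypothetical, and JSON stands in for MessagePack here only because MessagePack is not in the Python standard library.

```python
import json
from time import time as unix_now
from typing import Any, Dict, NamedTuple

class Event(NamedTuple):
    tag: str                 # used for message routing, tells where the event is from
    time: int                # Unix time in seconds
    record: Dict[str, Any]   # schema-free, JSON-like payload

def make_event(tag, record, timestamp=None):
    """Build an event, taking the time from the data source when it is given."""
    ts = int(unix_now()) if timestamp is None else int(timestamp)
    return Event(tag, ts, record)

# 1429315200 is just an example Unix time in seconds.
event = make_event("backend.apache", {"method": "GET", "code": 200},
                   timestamp=1429315200)

# Serialize as a [tag, time, record] array.
encoded = json.dumps([event.tag, event.time, event.record])
```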
20.
Configuration and operation
> No central / master node
> include helps share configuration
> Operation depends on your environment
> Use your own daemon management
> Treasure Data uses Chef
> Apache-like syntax
23.
Treasure Agent (td-agent)
> Treasure Data distribution of Fluentd
> Includes Ruby, popular plugins, etc.
> Treasure Agent 2 is current stable
> Updated core components
> We recommend using v2, not v1
> Latest version is 2.2.0 with fluentd v0.12
31.
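[Diagram] Data sources → Fluentd → destinations
> Inputs: Access logs (Apache), App logs (Frontend, Backend), System logs (syslogd), Databases
> Fluentd: buffering / processing / routing
> Outputs: Alerting (Nagios), Analysis (MongoDB, MySQL, Hadoop), Archiving (Amazon S3)
> M x N → M + N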
35.
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag backend.apache
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to MongoDB
<match backend.*>
  type mongo
  database fluent
  collection test
</match>
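The way a tag such as `backend.apache` reaches the `<match backend.*>` block can be sketched as follows. This is a simplified model, not Fluentd's implementation: it handles only exact tag parts and `*` (one tag part), and the `ROUTES` table, `compile_pattern`, and `route` are hypothetical names.

```python
import re

def compile_pattern(pattern):
    """Turn a match pattern into a regex; '*' stands for exactly one tag part."""
    parts = []
    for part in pattern.split("."):
        parts.append(r"[^.]+" if part == "*" else re.escape(part))
    return re.compile(r"\.".join(parts))

# Hypothetical routing table mirroring the <match> block in the config above.
ROUTES = [("backend.*", "mongo")]

def route(tag):
    """Return the output of the first pattern that matches the tag, else None."""
    for pattern, output in ROUTES:
        if compile_pattern(pattern).fullmatch(tag):
            return output
    return None

matched = route("backend.apache")          # handled by the mongo output
unmatched = route("frontend.apache")       # no <match> block applies
too_deep = route("backend.apache.error")   # '*' covers one tag part only
```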
37.
Less Simple Forwarding
- At-most-once / At-least-once
- HA (failover)
- Load-balancing
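The delivery semantics listed above can be sketched with a fake transport. Everything here is an assumption made for illustration: `send_at_least_once`, `fake_send`, and the server names are hypothetical, and plain round-robin stands in for whatever balancing policy the forwarder really uses.

```python
import itertools

def send_at_least_once(send, servers, chunk, max_rounds=3):
    """Retry until some server acknowledges the chunk, failing over in order."""
    for _ in range(max_rounds):
        for server in servers:
            try:
                if send(server, chunk):  # True means the server acked
                    return server
            except ConnectionError:
                continue  # node is down: fail over to the next one
    raise RuntimeError("chunk not acknowledged; keep it in the buffer")

# Fake transport: the primary is down and the standby loses its first ack.
received = {"primary": [], "standby": []}
lost_acks = {"standby": 1}

def fake_send(server, chunk):
    if server == "primary":
        raise ConnectionError("primary is down")
    received[server].append(chunk)
    if lost_acks.get(server, 0) > 0:
        lost_acks[server] -= 1
        return False  # data arrived, but the ack was lost
    return True

acked_by = send_at_least_once(fake_send, ["primary", "standby"], "chunk-1")

# Load-balancing sketch: rotate over the servers for successive chunks.
balancer = itertools.cycle(["node-a", "node-b"])
targets = [next(balancer) for _ in range(3)]
```

Because the first ack is lost, the standby receives `chunk-1` twice: at-least-once delivery trades possible duplicates for no data loss, while at-most-once would send a single time and ignore the missing ack.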
38.
Near realtime and batch combo!
> Hot data
> All data
39.
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type webhdfs
    host namenode
    port 50070
    path /path/on/hdfs/
  </store>
</match>
40.
CEP for Stream Processing
Norikra is a SQL based CEP engine: http://norikra.github.io/
42.
Fluentd on Kubernetes / GCE
> Kubernetes
> Google Compute Engine
> https://cloud.google.com/logging/docs/install/compute_install
43.
Treasure Data
> Components: Frontend, Job Queue, Worker, Hadoop, Presto
> Applications push metrics to Fluentd (via local Fluentd)
> Fluentd sums up data per minute (partial aggregation)
> Datadog for realtime monitoring
> Treasure Data for historical analysis
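Partial aggregation by minute can be sketched as below. This is not the plugin setup Treasure Data runs; `aggregate_per_minute` and the sample metrics are hypothetical, and the events are assumed to be `(unix_time, metric_name, value)` tuples.

```python
from collections import defaultdict

def aggregate_per_minute(events):
    """Sum metric values into one-minute buckets keyed by (minute_start, name)."""
    sums = defaultdict(int)
    for ts, name, value in events:
        minute_start = ts - ts % 60  # truncate the timestamp to the minute
        sums[(minute_start, name)] += value
    return dict(sums)

events = [
    (120, "jobs", 1),
    (150, "jobs", 2),    # same minute as the first event
    (185, "jobs", 4),    # next minute
    (130, "errors", 1),
]
totals = aggregate_per_minute(events)
```

Only the per-minute sums need to be forwarded downstream, which keeps the volume sent to the monitoring service small.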
44.
Cookpad
✓ Over 100 RoR servers (2012/2/4)
> Hundreds of app servers (Rails app + td-agent) send event logs to Treasure Data
> Logs are available after several mins.
> Daily/Hourly batch to Google Spreadsheet and MySQL
> KPI visualization
> Feedback rankings
> Unlimited scalability
> Flexible schema
> Realtime
> Less performance impact
49.
fluent-bit
> Made for Embedded Linux
> OpenEmbedded & Yocto Project
> Intel Edison, RasPi & BeagleBone Black boards
> https://github.com/fluent/fluent-bit
> Standalone application or Library mode
> Built-in plugins
> input: cpu, kmsg, output: fluentd
> First release at the end of Mar 2015
50.
fluentd-forwarder
> Forwarding agent written in Go
> Focuses on forwarding logs to Fluentd
> Works on Windows
> Bundles TCP input/output and TD output
> No flexible plugin mechanism
> We plan to add some inputs/outputs
> Similar products
> fluent-agent-lite, fluent-agent-hydra, ik
51.
fluentd-ui
> Manage Fluentd instance via Web UI
> https://github.com/fluent/fluentd-ui
53.
The problems at Treasure Data
> Treasure Data Service on the Cloud
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
54.
Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
55.
The problems of bulk load
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
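One common way to get idempotent retrying is to record which tasks have already been committed and skip them on retry. The sketch below illustrates that idea only; it is not Embulk's actual mechanism, `load_tasks` and `flaky_load` are hypothetical, and each task's load is assumed to be atomic (all or nothing).

```python
def load_tasks(tasks, load, committed):
    """Load each task once; tasks already in `committed` are skipped on retry."""
    for task_id, records in tasks.items():
        if task_id in committed:
            continue
        load(records)            # may raise; nothing is committed in that case
        committed.add(task_id)   # in practice this set would be persisted

# Fake destination whose load fails once on the second file.
destination = []
failures = {"remaining": 1}

def flaky_load(records):
    if records == ["c", "d"] and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise IOError("temporary failure")
    destination.extend(records)

tasks = {"file-1": ["a", "b"], "file-2": ["c", "d"]}
committed = set()
try:
    load_tasks(tasks, flaky_load, committed)
except IOError:
    load_tasks(tasks, flaky_load, committed)  # retry the whole job
```

After the retry, `file-1` is not loaded a second time, so the destination holds each record exactly once.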
66.
Other cases
> Treasure Data
> Embulk worker for automatic import
> Web services
> Restore existing logs to Elasticsearch
> Business / Batch systems
> Database to Database
> etc…
67.
Check: treasuredata.com
Cloud service for the entire data pipeline