Netflix and Containers - Titus


Slides from the Advanced Amazon Web Services (AWS) Meetup on 2016/01/28. We gave an overview of how Netflix uses containers, covering the technologies we are working on for the runtime (Titus) and the developer experience (Newt). We talked about how the Titus container management system differs from others and about our journey with Docker, Mesos, Netflix Fenzo and eventually Amazon Elastic Container Service (ECS).

Published in: Technology

  1. 1. Netflix and Containers: Titus Overview, January 2016. Andrew Spyker, Cloud Platform Engineer
  2. 2. About Netflix ● 75M+ members ● #NetflixEverywhere (worldwide) ● 42.5B hours watched in 2015 ● > ⅓ of NA internet download traffic ● 1000’s of microservices ● Many 10’s of thousands of VMs ● 3 regions across the world ● 2000+ employees
  3. 3. About me ● Cloud platform technologies ○ Distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar ● Container cloud ○ Resource management and scheduling, making Docker containers operational in Amazon EC2/ECS ● Open Source ○ Organize @NetflixOSS meetups & internal group ● Performance ○ Assist across Netflix, but focused mainly on cloud platform perf With Netflix for ~1 year. Previously at IBM. @aspyker, ispyker.blogspot.com
  4. 4. Team members @aspyker @amit_joshee Andrew Leung @podila Andrei Ushakov @william thurston @timbozarth @dzapata
  5. 5. Agenda ● Why Containers for Netflix? ● Container runtime platform ● Container development experience
  6. 6. Why containers operationally? Case 1: I have a job I want to run reliably and efficiently, but I don’t want to manage clusters myself. Case 2: I have lots of services and I want to reduce the number of VMs I need to manage, while keeping isolation between process instances.
  7. 7. History - Project Titan ● Container management system ○ Predominantly batch processing system ● Higher level frameworks drive tasks ○ General workflow engine ○ DAG-based data processing ○ Misc reports, big data processing stages, interactive notebooks ● Tech ○ Rudimentary scheduling with Dynamo storage ○ Proven Docker execution environment ○ Using Mesos and Fenzo
  8. 8. History - Project Mantis ● Real time operational intelligence for streaming experience ○ Ad hoc and perpetual stream processing ● Tech ○ Proven scheduling with Cassandra (C*) storage ○ Mantis fatjars deployed in cgroups ○ Using Mesos and Fenzo
  9. 9. Fenzo overview ● A generic, plugin-based scheduling library for Apache Mesos frameworks ● Features ○ Heterogeneous resources matched with varied tasks ○ Autoscaling of underlying cluster ○ Plugins for constraints and fitness ○ Support for fast (ms) scheduling rate ○ Visibility of scheduling actions ● github.com/Netflix/Fenzo
  10. 10. Fenzo: fitness and constraints plugins. Fitness value (0.0 - 1.0) ● Degree of fitness - first fit, best fit, worst fit ○ Real world tradeoff between perfection and speed ● Composable evaluators ● e.g., bin packing. Constraints ● Hard constraints filter appropriate resources ● Soft constraints specify preferences ● e.g., zone balancing, instance type preferences
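The fitness and constraint ideas on the Fenzo slide can be illustrated with a minimal Python sketch. This is not the Fenzo API (Fenzo is a Java library); every name here (`Host`, `Task`, `fits`, `bin_packing_fitness`, `zone_preference`, `pick_host`) is hypothetical, and combining the scores by multiplication is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    zone: str
    total_cpus: float
    used_cpus: float = 0.0


@dataclass
class Task:
    name: str
    cpus: float


def fits(task, host):
    """Hard constraint: filter out hosts that lack enough free CPU."""
    return host.total_cpus - host.used_cpus >= task.cpus


def bin_packing_fitness(task, host):
    """Fitness in [0.0, 1.0]: the host's CPU utilization after placement.

    Fuller hosts score higher, so tasks get packed onto fewer hosts.
    """
    return (host.used_cpus + task.cpus) / host.total_cpus


def zone_preference(host, preferred_zone):
    """Soft constraint: a preference weight, not a filter."""
    return 1.0 if host.zone == preferred_zone else 0.5


def pick_host(task, hosts, preferred_zone):
    """Apply the hard constraint, then rank survivors by combined score."""
    candidates = [h for h in hosts if fits(task, h)]
    if not candidates:
        return None  # nothing fits; a real system might scale the cluster up
    return max(
        candidates,
        key=lambda h: bin_packing_fitness(task, h)
        * zone_preference(h, preferred_zone),
    )


hosts = [
    Host("a", "zone-1", total_cpus=32, used_cpus=16),
    Host("b", "zone-1", total_cpus=32, used_cpus=0),
    Host("c", "zone-2", total_cpus=32, used_cpus=30),
]
chosen = pick_host(Task("job-1", cpus=4), hosts, preferred_zone="zone-1")
print(chosen.name)
```

In this sketch host `c` is removed by the hard constraint (only 2 free CPUs), and host `a` beats host `b` because placing the task there leaves it more fully packed.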
  11. 11. Project Titus ● Mantis (Scheduling, Job Mgmt) + Titan (Docker execution) = Titus (Andromedon) ● Titan API -> Mantis job mgmt/scheduler -> Titan executor ● Rolled out Q4 2015, took over all jobs in Jan 2016
  12. 12. Why Titus? ● Many other container management & scheduling systems, why build another? ● Key unique values ○ Deeply support Amazon (not trying to abstract IaaS) ○ Narrow focus (just container management) ○ Deep integration with existing Netflix systems ○ Complex job scheduling reqs and scale/reliability
  13. 13. Current Titus Numbers ● Autoscaling 100’s of r3.8xl’s (32 vCPU, 244G) ● Peak ○ thousands of cores, tens of TBs of memory ● thousands of containers/day ● < 100 different images
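As a rough sanity check on the numbers above, the per-instance figures from the slide (32 vCPU, 244G) can be multiplied out. The instance count of 100 is an assumption standing in for "100's", the slide's "244G" is treated as decimal gigabytes, and `cluster_capacity` is a hypothetical helper.

```python
VCPUS_PER_INSTANCE = 32        # r3.8xl figure from the slide
MEMORY_GB_PER_INSTANCE = 244   # r3.8xl figure from the slide


def cluster_capacity(instance_count):
    """Return (total vCPUs, total memory in TB) for a cluster of r3.8xl's."""
    vcpus = instance_count * VCPUS_PER_INSTANCE
    memory_tb = instance_count * MEMORY_GB_PER_INSTANCE / 1000
    return vcpus, memory_tb


# Assumed count: 100 instances, the low end of "100's".
vcpus, memory_tb = cluster_capacity(100)
print(vcpus, memory_tb)
```

With 100 instances this gives 3,200 vCPUs and about 24.4 TB of memory, which is consistent with "thousands of cores, tens of TBs of memory".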
  14. 14. Also in containers ● Already ○ Long running data pipeline service style routing tier ■ 850 c3.4xl instances with ~10K long running containers ○ Mantis cgroups ■ 1000’s of cores running varied stream processing jobs ● Soon ○ Media encoding (10’s of thousands of cores) ○ Service style (potentially VERY large)
  15. 15. Titus high level architecture [Diagram. Components shown: Titus UI, Titus UI (CI/CD), Rhea, Titus API, Titus Master (Job Management & Scheduler, Fenzo), Cassandra, Zookeeper, Mesos Master, EC2 Autoscaling API, Docker Registry, S3; each Mesos Agent host runs the mesos agent, executor, docker, containers, a metrics agent, a logging agent, and zfs]
  16. 16. Titus User Console
  17. 17. Titus Spinnaker Integration ● Spinnaker is our CI/CD system ● Titus integration coming soon
  18. 18. Titus API (today) ● POST http://titusapi/v2/jobs ● GET http://titusapi/v2/jobs/JOBID ● GET http://titusapi/v2/tasks/TASKID ● Example: JOB Titus-12345 with tasks (Index = 0, Num = 2), (Index = 1, Num = 3), (Index = 2, Num = 4), (Index = 1, Num = 5); the task with Index = 1, Num = 5 is named titus-12345-worker-1-5
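The endpoints and the example task name on the Titus API (today) slide can be sketched in Python as follows. The host `titusapi` and the `/v2` paths come from the slide; the helper names (`job_url`, `task_url`, `parse_task_name`) are hypothetical, and the `<job>-worker-<index>-<num>` naming pattern is inferred from the single example `titus-12345-worker-1-5`, so treat it as an assumption. Request and response bodies are not shown on the slide, so none are invented here.

```python
import re

BASE_URL = "http://titusapi/v2"  # host and version as shown on the slide

JOBS_URL = f"{BASE_URL}/jobs"    # POST here to submit a job


def job_url(job_id):
    """URL for GET-ing a single job."""
    return f"{BASE_URL}/jobs/{job_id}"


def task_url(task_id):
    """URL for GET-ing a single task."""
    return f"{BASE_URL}/tasks/{task_id}"


# Assumed pattern, inferred from the one example on the slide.
TASK_NAME = re.compile(r"^(?P<job>.+)-worker-(?P<index>\d+)-(?P<num>\d+)$")


def parse_task_name(name):
    """Split a task name into (job id, task index, task number)."""
    match = TASK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized task name: {name!r}")
    return match.group("job"), int(match.group("index")), int(match.group("num"))


print(job_url("titus-12345"))
print(parse_task_name("titus-12345-worker-1-5"))
```

Under this reading, a task index can appear more than once within a job (Index = 1 has both Num = 3 and Num = 5 on the slide), while the number distinguishes the individual task.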
  19. 19. Titus API (coming) ● Disparate use cases in a single API ○ Going beyond batch to service, stream and cron ● SLA based on job attributes ○ For batch, completion time ○ For service, user focused SLA (autoscaling, etc.) ● Ownership and cost accounting/metering ○ Group costs to owner and teams ● Aligned with existing continuous deployment system ○ Apps, clusters, ASGs in Spinnaker
  20. 20. Titus Operational Views. Also APIs for ● cluster state ● cluster rolling updates ● leadership ● Titus app managed through Spinnaker
  21. 21. Dependency Versions (as of 1/16) Docker ● Registry - 2.0.1 ● Engine - 1.9.1 ○ Plus Netflix logging driver Mesos ● 0.24.1 Using Netflix C*, Zookeeper shared services
  22. 22. Container Agent Features (existing) ● Volumes with quota ○ Using ZFS with snapshots and S3 archival ● Logging ○ Streaming live stdout/err logs ○ Rotation & shipping stdout/err & app logs to S3 ● Networking ○ IP per container integration with VPC ● Metrics ○ cgroup metrics tagged by job/task id and image
  23. 23. Container Agent Features (planned) ● Networking/Security ○ Extend driver to support security groups & IAM Roles ● Volume Drivers ○ Persistent volumes as required by EBS/EFS ● Isolation ○ Beyond CPU, Memory, Disk - Networking I/O Bandwidth ● Security ○ Host and container security hardening (AppArmor/SELinux) ● Insight ○ Performance (Vector) and ad hoc debugging (ssh)
  24. 24. Unique Titus Scheduler Technology ● Job managers are separate from resource allocation ○ Less monolithic, more extensible ● Fenzo benefits ○ Bin packing, autoscaling, fitness/constraint configurability ○ Visibility into current state of the cluster ● Mesos reconciliation and task heartbeats ● Rate limiting of failing jobs and agents ● Thresholds and alerts for key aspects ○ Queue depth, idle hosts, etc.
  25. 25. Integration with Netflix Infrastructure ● Goal: Make containers work with existing cloud systems (designed for virtual machines) vs. replace ● Areas ○ Service registration and discovery (Eureka) ○ IPC (Ribbon) ○ Continuous Delivery (Spinnaker) ○ Telemetry (Atlas) ○ Reliability (Chaos, Performance Insight)
  26. 26. Path to ECS ● Why we are considering ECS ○ Resource/cluster mgmt is undifferentiated heavy lifting ○ Expect ECS to have strong integration w/ EC2/AWS ● Have prototyped a Titus/Fenzo ECS port ○ Using our job mgmt/scheduling on top of ECS ● Working with the ECS team to add in ○ Simpler start task API (w/o defining a task first) ○ Event stream to power real time scheduling info ○ Extensibility in ECS events, resource types
  27. 27. Why containers for developers? Case 1: I want a consistent local development and cloud deployment experience (in both directions). Case 2: I want to specify what it means to run my process, not integrate into a one size fits most VM image.
  28. 28. Developer Experience (coming) [Diagram: Titus]
  29. 29. Developer experience. NEWT ● One stop shop for creation, development, deployment of containers. Netflix Docker base layers ● Already integrated with runtime expectations ● Continuously rebuilt with small and controlled common support. Netflix Docker build tools ● Extend our bakery to produce Docker images and run locally ● More advanced image creation tools ○ Multi-inheritance, guaranteed metadata, metrics
  30. 30. We’re hiring. Come advance containers at Netflix! Senior Software Engineer Container Platform - https://jobs.netflix.com/jobs/860487
  31. 31. Questions?