Library Operating System for Linux #netdev01
Library operating system with mainline Linux kernel at netdev0.1, Ottawa, Feb., 2015

Transcript

  • 1. Library Operating System with Mainline Linux Network Stack. Hajime Tazaki, Ryo Nakamura, Yuji Sekiya. netdev0.1, Feb. 2015
  • 2. Motivation. Why kernel space? Packets were expensive in the 1970s. Why not userspace? The technology has matured over decades and the cost has dropped; userspace offers network stack personalization and a stack controllable by userspace utilities.
  • 3. Userspace network stacks. There are a lot of userspace network stacks. Written from scratch: mTCP, Mirage, lwIP. Ported: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?). Each is motivated by its own problems (specialized NIC, cloud, high-speed apps). Writing a network stack is a 1-week DIY project, but writing an operable network stack is a decades-long DIY project (which is not DIY).
  • 4. Questions. How to benefit from a mature network stack in userspace? How to trivially introduce your idea into a network stack (xxTCP, IPvX, etc.)? How to flexibly test your code with a complex scenario?
  • 5. The answer. Use the Linux network stack as-is, as a userspace library (library operating system).
  • 6. This talk introduces a library operating system for Linux and its implementation, with a couple of useful use cases.
  • 7. Outlook (design). Hardware-independent arch (arch/lib) with 3 components: host backend layer, kernel layer, POSIX layer. https://github.com/libos-nuse/net-next-nuse
  • 8. Outlook (cont’d). [Figure: layered architecture. Application on top of the POSIX glue layer; kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling) with bottom halves/rcu/timer/interrupt and struct net_device; host backend layer (scheduler, netdev, clock source).] Steps: 1) build the Linux source tree with glue code as a library; 2) put in a backend (vNIC, clock source, scheduler) and bind it; 3) add POSIX glue code; 4) applications magically run.
  • 9. Kernel glue code. https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/sched.c

        void schedule(void)
        {
            lib_task_wait();
        }

        signed long schedule_timeout(signed long timeout)
        {
            u64 ns;
            struct SimTask *self;

            if (timeout == MAX_SCHEDULE_TIMEOUT) {
                lib_task_wait();
                return MAX_SCHEDULE_TIMEOUT;
            }
            lib_assert(timeout >= 0);
            ns = ((__u64)timeout) * (1000000000 / HZ);
            self = lib_task_current();
            lib_event_schedule_ns(ns, &trampoline, self);
            lib_task_wait();
            /* we know that we are always perfectly on time. */
            return 0;
        }
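The tick-to-nanosecond conversion and the "always perfectly on time" comment in schedule_timeout() can be illustrated outside the kernel. The following is a minimal Python model, not the LibOS code: HZ = 250 is an assumed value (the slide does not state it), and `ticks_to_ns`, `VirtualClock` and `schedule_ns` are hypothetical names. It shows that when wakeups are driven by an event queue on a simulated clock, the clock simply jumps to each deadline, so a sleeper is never late.

```python
import heapq
import itertools

HZ = 250  # assumed tick rate; the slide does not give the configured value
NSEC_PER_SEC = 1_000_000_000


def ticks_to_ns(timeout_ticks):
    """Mirror `ns = timeout * (1000000000 / HZ)` with C-style integer division."""
    if timeout_ticks < 0:
        raise ValueError("timeout must be non-negative")
    return timeout_ticks * (NSEC_PER_SEC // HZ)


class VirtualClock:
    """Hypothetical stand-in for a simulated clock with an event queue."""

    def __init__(self):
        self.now_ns = 0
        self._events = []
        self._seq = itertools.count()  # tie-breaker keeps ordering deterministic

    def schedule_ns(self, delay_ns, callback):
        deadline = self.now_ns + delay_ns
        heapq.heappush(self._events, (deadline, next(self._seq), callback))

    def run(self):
        while self._events:
            deadline, _, callback = heapq.heappop(self._events)
            self.now_ns = deadline  # time jumps to the deadline: no lateness
            callback(self.now_ns)


clock = VirtualClock()
wakeups = []
clock.schedule_ns(ticks_to_ns(3), wakeups.append)
clock.schedule_ns(ticks_to_ns(1), wakeups.append)
clock.run()
print(wakeups)
```

Events fire in deadline order regardless of the order in which they were scheduled, which is the property a deterministic simulator relies on.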
  • 10. POSIX glue code. https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/lib/nuse-glue.c

        int nuse_socket(int domain, int type, int protocol)
        {
            lib_update_jiffies();
            struct socket *kernel_socket = malloc(sizeof(struct socket));
            int ret, real_fd;

            memset(kernel_socket, 0, sizeof(struct socket));
            ret = lib_sock_socket(domain, type, protocol, &kernel_socket);
            if (ret < 0)
                errno = -ret;
            (snip)
            lib_softirq_wakeup();
            return real_fd;
        }
        weak_alias(nuse_socket, socket);
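The `errno = -ret` line in nuse_socket() reflects a convention gap between the two layers: kernel functions report failure as a negative errno value, while POSIX callers expect errno to hold the positive code. A minimal Python sketch of that translation follows. `to_posix_result` and `fake_sock_socket` are hypothetical names, and returning -1 on failure is an assumption based on the usual POSIX convention, since that part of the function is snipped on the slide.

```python
import errno
import os


def to_posix_result(kernel_ret):
    """Translate a kernel-style return value into a (value, errno) pair.

    Kernel code signals failure with a negative errno; a POSIX wrapper
    conventionally returns -1 and stores the positive code in errno.
    """
    if kernel_ret < 0:
        return -1, -kernel_ret
    return kernel_ret, 0


def fake_sock_socket(domain_supported):
    """Hypothetical stand-in for lib_sock_socket(): 0 or -EAFNOSUPPORT."""
    return 0 if domain_supported else -errno.EAFNOSUPPORT


value, err = to_posix_result(fake_sock_socket(False))
print(value, errno.errorcode[err], os.strerror(err))
```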
  • 11. Implementations (instances). Direct Code Execution (DCE): network simulator integration (ns-3), for more testing. Network Stack in Userspace (NUSE): gives the Linux network stack a new platform, for ad-hoc network stacks.
  • 12. Direct Code Execution. ns-3 integration: deterministic scheduler; single-process model virtualization; dlmopen(3)-like virtualization; full control over multiple network stacks.
  • 13. Execution (DCE).

        main() => dlmopen(ping,liblinux.so)
               => main() => socket(2) => dce_socket()
               => (do whatever)
  • 16. Network Stack in Userspace. Userspace network stack running on a Linux (POSIX) platform. Network stack personalization. Full features by design (full stack): ARP/ND, UDP/TCP (all cc algorithms), SCTP, DCCP, QDISC, XFRM, netfilter, etc.
  • 17. [Figure: NUSE architecture. Application on top of the POSIX glue layer (system call hijack); kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling) with bottom halves/rcu/timer/interrupt and struct net_device; host backend layer (NUSE) with scheduler, netdev and clock source; netdev backends RAW, DPDK, netmap, ... down to the NIC. Also shown: application master process, slave processes, rump syscall proxy, rump server.]
  • 18. Execution (NUSE).

        LD_PRELOAD=libnuse-linux.so ping www.google.com

        ping(8) => socket(2) => nuse_socket() => raw(7) => (network)
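The LD_PRELOAD invocation on slide 18 can also be prepared from a script. The sketch below only builds the argument list and environment; it does not start the process, because libnuse-linux.so may not be present on the reader's system. `preload_command` is a hypothetical helper, not part of NUSE. Existing LD_PRELOAD entries are kept after the new library, which the dynamic linker searches first.

```python
import os
import shlex


def preload_command(library, argv, base_env=None):
    """Return (argv, env) for running argv with `library` preloaded."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Earlier entries take precedence, so the new library goes first.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return list(argv), env


argv, env = preload_command(
    "libnuse-linux.so", ["ping", "www.google.com"], base_env={}
)
print(f"LD_PRELOAD={env['LD_PRELOAD']} {shlex.join(argv)}")
# To launch for real (requires the library to exist):
#   import subprocess; subprocess.run(argv, env=env, check=True)
```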
  • 19. When is it useful? Ad-hoc network stack (network stack personalization), e.g. LD_PRELOAD=liblinux-mptcp.so firefox. Bundling with kernel bypasses: Intel DPDK / netmap / PF_RING / etc. Debugging/testing with ns-3.
  • 20. Testing workflow. 1. Write/modify code (patches). 2. Write test code (incl. packet exchanges). 3. If PASS, accept the pull request; else, reject it.
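The accept/reject rule in the testing workflow can be written down as a small gate. This is a sketch only: `run_gate` and the two test names are hypothetical, and the lambdas stand in for real packet-exchange scenarios such as the ns-3 tests described in this deck.

```python
def run_gate(tests):
    """Run each named test; return ("accept", []) only if all of them pass."""
    failures = []
    for name, test in tests.items():
        try:
            passed = bool(test())
        except Exception:
            passed = False  # a crashing test counts as a failure
        if not passed:
            failures.append(name)
    return ("accept" if not failures else "reject"), failures


# Hypothetical checks standing in for packet-exchange scenarios.
tests = {
    "ping6_reaches_peer": lambda: True,
    "handoff_keeps_session": lambda: False,
}
print(run_gate(tests))
```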
  • 21. Continuous integration (CI). http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/
  • 22. T1) Write a patch. http://patchwork.ozlabs.org/patch/436351/

        Fixes: de3b7a06dfe1 ("xfrm6: Fix transport header offset in _decode_session6.")
        Signed-off-by: Hajime Tazaki <tazaki@sfc.wide.ad.jp>
        ---
         net/ipv6/xfrm6_policy.c | 1 +
         1 file changed, 1 insertion(+)

        diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
        index 48bf5a0..8d2d01b 100644
        --- a/net/ipv6/xfrm6_policy.c
        +++ b/net/ipv6/xfrm6_policy.c
        @@ -200,6 +200,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)

         #if IS_ENABLED(CONFIG_IPV6_MIP6)
                 case IPPROTO_MH:
        +            offset += ipv6_optlen(exthdr);
                     if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - skb->data)) {
                         struct ip6_mh *mh;
  • 23. T2) Write a test. As an ns-3 scenario in C++ or Python: create a topology, configure nodes, run and check results (e.g., ping6). http://code.nsnam.org/thehajime/ns-3-dce-umip/file/tip/test/dce-umip-test.cc

                     +-----------+
                     |    HA     |
                     +-----------+
                          |sim0
               +----------+------------+
               |sim0                   |sim0
      sim2+----+---+              +----+---+
     - - -|  AR1   |              |  AR2   |
          +---+----+              +----+---+
              |sim1                    |sim1
              |                        |
            sim0                     sim0
         +----+------+ (Movement) +----+-----+
         |    MR     |  <=====>   |    MR    |
         +-----------+            +----------+
              |sim1                    |sim1
         +---------+              +---------+
         |   MNN   |              |   MNN   |
         +---------+              +---------+
  • 24. Test scenario script.

        #!/usr/bin/python

        from ns.dce import *
        from ns.core import *

        nodes = NodeContainer()
        nodes.Create (100)
        dce = DceManagerHelper()
        dce.SetNetworkStack ("liblinux.so")
        dce.Install (nodes)

        app = DceApplicationHelper()
        app.SetBinary ("ping6")
        app.Install (nodes)
        (snip)

        NS_TEST_ASSERT_MSG_EQ (m_pingStatus, true, "Umip test " << m_testname
                               << " did not return successfully: " << g_testError)

        Simulator.Stop (Seconds(1000.0))
        Simulator.Run ()
  • 25. Performance of NUSE. 10G Ethernet, back-to-back. Measured: transmission and IP forwarding. Compared: native Linux, raw socket, tap, dpdk, netmap.
  • 26. Performance: setup. [Figure: Tx, NUSE and Rx nodes connected by 10G links; tools shown: ping, flowgen, vnstat (packet count).]

                  NUSE node                            Tx/Rx nodes
        CPU       Xeon E5-2650v2 @ 2.60GHz (16 core)   Xeon L3426 @ 1.87GHz (8 core)
        Memory    32GB                                 4GB
        NIC       Intel X520                           Intel X520
        OS        host: 3.13.0-32, nuse: 3.17.0-rc1    host: 3.13.0-32
  • 27. Host Tx. [Figure: diagram of the NUSE and Rx nodes, plus two bar charts over dpdk, native, netmap, raw and tap: throughput (1024-byte UDP) in Mbps, axis 0 to 6000, and ping RTT in ms, axis 0 to 1.] native: ping A.B.C.D; others: ./nuse ping A.B.C.D
  • 28. L3 Routing: Sender->NUSE->Receiver. [Figure: diagram of the Tx, NUSE and Rx nodes, plus two bar charts over dpdk, native, netmap, raw and tap: throughput (1024-byte UDP) in Mbps, axis 0 to 6000, and ping RTT in ms, axis 0 to 1.]
  • 29. Alternatives. UML/LKL (1proc/1vm, no POSIX i/f); containers (can’t change kernel); scratch-based (mTCP, Mirage); rumpkernel (in NetBSD).
  • 30. Limitations. Ad-hoc kernel glue is required: when a member of a struct is changed, LibOS needs to follow it. Performance drawbacks on NUSE: adapt known techniques (mTCP).
  • 31. (Not) Conclusions. An abstraction with multiple benefits. Conservative: reuse as much of the past decades' effort as possible with a small amount of effort. Planning to post an RFC for upstreaming.
  • 32. github: https://github.com/libos-nuse/net-next-nuse DCE: http://bit.ly/ns-3-dce twitter: @thehajime
  • 33. Backups
  • 34. Bug reproducibility. [Figure: topology with Home Agent, AP1 and AP2 (Wi-Fi), a mobile node performing handoff, and a correspondent node; ping6 traffic.]

        (gdb) b mip6_mh_filter if dce_debug_nodeid()==0
        Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
        <continue>
        (gdb) bt 4
        #0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
        #1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
            at net/ipv6/raw.c:199
        #2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135)
            at net/ipv6/raw.c:232
        #3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
  • 35. Debugging. Memory error detection among distributed nodes in a single process using Valgrind.

        ==5864== Memcheck, a memory error detector
        ==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
        ==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright in
        ==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
        ==5864==
        ==5864== Conditional jump or move depends on uninitialised value(s)
        ==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
        ==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
        ==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
        ==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
        ==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
        ==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
        ==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
        ==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
        ==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
        ==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
        ==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
        ==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manage
        ==5864==  Uninitialised value was created by a stack allocation
        ==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
        ==5864==
  • 36. Fine-grained parameter coverage. Code coverage measurement with DCE, with fine-grained network, node and protocol parameters.
  • 37. 1) Kernel build. Build the kernel source tree with the patch:

        make menuconfig ARCH=sim
        make library ARCH=sim
        ➔ libnuse-linux-3.17-rc1.so
  • 38. Example: how the timer works. [Figure: elements shown are add_timer(), TIMER_SOFTIRQ, timer_list, run_timer_softirq(), timer handler, and a timer thread (timer_create(2)).]
  • 39. Tx callgraph. sendmsg() (socket API) => lib_sock_sendmsg() (NUSE) => sock_sendmsg() => ip_send_skb() => ip_finish_output2() => dst_neigh_output() => neigh_resolve_output() => arp_solicit() => dev_queue_xmit() (existing kernel code from sock_sendmsg() to here) => lib_dev_xmit() (NUSE) => nuse_vif_raw_write()
  • 40. Rx callgraph. vNIC rx: start_thread() (pthread) => nuse_netdev_rx_trampoline() => nuse_vif_raw_read() (NUSE) => lib_dev_rx() => netif_rx() (ex-kernel). softirq rx: start_thread() (pthread) => do_softirq() (NUSE) => net_rx_action() => process_backlog() => __netif_receive_skb_core() => ip_rcv() (ex-kernel).
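The Rx callgraph lists two pthread entry points, one for the vNIC and one for the softirq. A minimal Python model of that split follows, assuming the hand-off between them is a FIFO backlog (netif_rx() queues a packet and process_backlog() later drains it). The function names `vnic_rx` and `softirq_rx` and the upper-casing "processing" step are illustrative only, not NUSE code.

```python
import queue
import threading

backlog = queue.Queue()  # models the backlog that netif_rx() fills
delivered = []           # packets that reached the modelled ip_rcv() step
STOP = object()          # sentinel telling the softirq thread to exit


def vnic_rx(frames):
    """vNIC rx thread: read frames from the device and queue them."""
    for frame in frames:
        backlog.put(frame)
    backlog.put(STOP)


def softirq_rx():
    """softirq rx thread: drain the backlog and pass packets up the stack."""
    while True:
        frame = backlog.get()
        if frame is STOP:
            break
        delivered.append(frame.upper())  # placeholder for protocol processing


frames = ["syn", "ack", "data"]
threads = [
    threading.Thread(target=softirq_rx),
    threading.Thread(target=vnic_rx, args=(frames,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(delivered)
```

With a single producer and a FIFO queue, packets are delivered in arrival order even though the two threads run concurrently.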