eBPF and BCC


1. Introduction to eBPF

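BPF (Berkeley Packet Filter) was first proposed in 1992 as a way to filter packets without copying them from kernel space to user space. It originally consisted of simple bytecode injected from user space into the kernel, where a verifier checked it (to avoid kernel crashes or security problems) before it was attached to a socket and run against every received packet. A few years later it was ported to Linux and used by a small number of applications (tcpdump, for example, uses BPF). Its simple language and the in-kernel just-in-time (JIT) compiler made BPF a high-performance tool.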

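In 2013, Alexei Starovoitov thoroughly reworked BPF, adding new features and improving performance. The new version was named eBPF ("extended BPF"). It introduced features such as maps and tail calls, and the JIT compiler was rewritten.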

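For compatibility, the original BPF was kept and renamed classic BPF (cBPF). Compared with cBPF, eBPF has a much wider range of applications, and it has brought exciting changes to areas such as kernel tracing, application performance tuning and monitoring, and traffic control.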

The architecture of eBPF is shown in Figure 1 (taken from the article "eBPF 简史", i.e. "A Brief History of eBPF").

ebpf_arch.png

Figure 1: Architecture of eBPF

The name eBPF is something of a misnomer; Brendan Gregg has suggested that VKIS (Virtual Kernel Instruction Set) might be more fitting. He offers the following analogy for eBPF:

eBPF does to Linux what JavaScript does to HTML. (Sort of.) So instead of a static HTML website, JavaScript lets you define mini programs that run on events like mouse clicks, which are run in a safe virtual machine in the browser. And with eBPF, instead of a fixed kernel, you can now write mini programs that run on events like disk I/O, which are run in a safe virtual machine in the kernel. In reality, eBPF is more like the v8 virtual machine that runs JavaScript, rather than JavaScript itself. eBPF is part of the Linux kernel.

Programming in eBPF directly is incredibly hard, the same as coding in v8 bytecode. But no one codes in v8: they code in JavaScript, or often a framework on top of JavaScript (jQuery, Angular, React, etc). It's the same with eBPF. People will use it and code in it via frameworks. For tracing, the main ones are bcc and bpftrace.

Source: http://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html

1.1. Front-ends for eBPF

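Writing eBPF bytecode by hand is clearly impractical, so we need tools that generate it. Table 1 lists the options; of these, bcc and bpftrace are the ones worth recommending.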

Table 1: Front-ends for eBPF
Front end | Difficulty | Pros | Cons | References
BPF bytecode | Brutal | Precise control | Insanely difficult | Kernel source: struct bpf_insn prog in samples/bpf/sock_example.c
C | Hard | Build stand-alone binaries | Difficult | Kernel source: samples/bpf/tracex1_kern.c and samples/bpf/tracex1_user.c
perf | Hard | Use perf's capabilities: custom events, stack walking | Difficult, not yet well documented | See: http://www.brendangregg.com/ebpf.html#perf
bcc | Moderate | Custom output, Python libraries, large community, production use (e.g. Facebook, Netflix) | Verbose | See: http://www.brendangregg.com/ebpf.html#bcc
bpftrace | Easy | Powerful one-liners, many capabilities, growing community, production use (e.g. Netflix) | Limited control of code and output | See: http://www.brendangregg.com/ebpf.html#bpftrace
ply | Easy | Powerful one-liners, small binary, for embedded | Limited control of code and output | Github: github.com/iovisor/ply

Reference: http://www.brendangregg.com/ebpf.html

2. BCC (BPF Compiler Collection)

bcc provides a collection of ready-to-use tools built on eBPF (see Figure 2). For installation instructions on various systems, see: https://github.com/iovisor/bcc/blob/master/INSTALL.md

ebpf_bcc_tracing_tools_2017.gif

Figure 2: Linux bcc/BPF tracing tools

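The tools themselves are at https://github.com/iovisor/bcc/tree/master/tools. In that repository each tool has a matching example file (named <tool>_example.txt, for instance oomkill_example.txt) that gives a quick overview of how to use it.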

Below is a list of these tools (Ubuntu adds a -bpfcc suffix to their names; most of them are Python scripts):

/usr/sbin/bashreadline-bpfcc         /usr/sbin/javaflow-bpfcc        /usr/sbin/rubyobjnew-bpfcc
/usr/sbin/biolatency-bpfcc           /usr/sbin/javagc-bpfcc          /usr/sbin/rubystat-bpfcc
/usr/sbin/biosnoop-bpfcc             /usr/sbin/javaobjnew-bpfcc      /usr/sbin/runqlat-bpfcc
/usr/sbin/biotop-bpfcc               /usr/sbin/javastat-bpfcc        /usr/sbin/runqlen-bpfcc
/usr/sbin/bitesize-bpfcc             /usr/sbin/javathreads-bpfcc     /usr/sbin/slabratetop-bpfcc
/usr/sbin/bpflist-bpfcc              /usr/sbin/killsnoop-bpfcc       /usr/sbin/softirqs-bpfcc
/usr/sbin/btrfsdist-bpfcc            /usr/sbin/llcstat-bpfcc         /usr/sbin/solisten-bpfcc
/usr/sbin/btrfsslower-bpfcc          /usr/sbin/mdflush-bpfcc         /usr/sbin/sslsniff-bpfcc
/usr/sbin/cachestat-bpfcc            /usr/sbin/memleak-bpfcc         /usr/sbin/stackcount-bpfcc
/usr/sbin/cachetop-bpfcc             /usr/sbin/mountsnoop-bpfcc      /usr/sbin/statsnoop-bpfcc
/usr/sbin/capable-bpfcc              /usr/sbin/mysqld_qslower-bpfcc  /usr/sbin/syncsnoop-bpfcc
/usr/sbin/cobjnew-bpfcc              /usr/sbin/nfsdist-bpfcc         /usr/sbin/syscount-bpfcc
/usr/sbin/cpudist-bpfcc              /usr/sbin/nfsslower-bpfcc       /usr/sbin/tcpaccept-bpfcc
/usr/sbin/cpuunclaimed-bpfcc         /usr/sbin/nodegc-bpfcc          /usr/sbin/tcpconnect-bpfcc
/usr/sbin/dbslower-bpfcc             /usr/sbin/nodestat-bpfcc        /usr/sbin/tcpconnlat-bpfcc
/usr/sbin/dbstat-bpfcc               /usr/sbin/offcputime-bpfcc      /usr/sbin/tcplife-bpfcc
/usr/sbin/dcsnoop-bpfcc              /usr/sbin/offwaketime-bpfcc     /usr/sbin/tcpretrans-bpfcc
/usr/sbin/dcstat-bpfcc               /usr/sbin/oomkill-bpfcc         /usr/sbin/tcptop-bpfcc
/usr/sbin/deadlock_detector-bpfcc    /usr/sbin/opensnoop-bpfcc       /usr/sbin/tcptracer-bpfcc
/usr/sbin/deadlock_detector.c-bpfcc  /usr/sbin/phpcalls-bpfcc        /usr/sbin/tplist-bpfcc
/usr/sbin/execsnoop-bpfcc            /usr/sbin/phpflow-bpfcc         /usr/sbin/trace-bpfcc
/usr/sbin/ext4dist-bpfcc             /usr/sbin/phpstat-bpfcc         /usr/sbin/ttysnoop-bpfcc
/usr/sbin/ext4slower-bpfcc           /usr/sbin/pidpersec-bpfcc       /usr/sbin/vfscount-bpfcc
/usr/sbin/filelife-bpfcc             /usr/sbin/profile-bpfcc         /usr/sbin/vfsstat-bpfcc
/usr/sbin/fileslower-bpfcc           /usr/sbin/pythoncalls-bpfcc     /usr/sbin/wakeuptime-bpfcc
/usr/sbin/filetop-bpfcc              /usr/sbin/pythonflow-bpfcc      /usr/sbin/xfsdist-bpfcc
/usr/sbin/funccount-bpfcc            /usr/sbin/pythongc-bpfcc        /usr/sbin/xfsslower-bpfcc
/usr/sbin/funclatency-bpfcc          /usr/sbin/pythonstat-bpfcc      /usr/sbin/zfsdist-bpfcc
/usr/sbin/funcslower-bpfcc           /usr/sbin/reset-trace-bpfcc     /usr/sbin/zfsslower-bpfcc
/usr/sbin/gethostlatency-bpfcc       /usr/sbin/rubycalls-bpfcc
/usr/sbin/hardirqs-bpfcc             /usr/sbin/rubyflow-bpfcc

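2.1. execsnoop (monitor new process creation)

execsnoop monitors new processes as they are started through the exec() family of system calls (only those started after the tool itself is launched, of course) and prints their command-line arguments. For example: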

$ sudo execsnoop
PCOMM            PID    PPID   RET ARGS
ls               6220   6193     0 /bin/ls --color=auto -l
sshd             6222   1151     0 /usr/sbin/sshd -D -R
ping             6226   6193     0 /bin/ping www.bing.com
curl             6230   6193     0 /usr/bin/curl www.baidu.com
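
When this output is captured to a file or a pipe, it can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the default column layout shown above (PCOMM, PID, PPID, RET, ARGS) and command names without spaces; ExecEvent and parse_execsnoop are names made up for this illustration and are not part of bcc.

```python
from typing import List, NamedTuple


class ExecEvent(NamedTuple):
    """One row of execsnoop output (hypothetical record type)."""
    pcomm: str
    pid: int
    ppid: int
    ret: int
    args: str


def parse_execsnoop(text: str) -> List[ExecEvent]:
    """Parse execsnoop's default column output into records.

    Assumes the command name contains no whitespace.
    """
    events = []
    for line in text.splitlines():
        # The first four columns are whitespace-separated; ARGS is the rest.
        fields = line.split(None, 4)
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "PCOMM":
            continue
        pcomm, pid, ppid, ret = fields[:4]
        args = fields[4] if len(fields) > 4 else ""
        events.append(ExecEvent(pcomm, int(pid), int(ppid), int(ret), args))
    return events


SAMPLE = """\
PCOMM            PID    PPID   RET ARGS
ls               6220   6193     0 /bin/ls --color=auto -l
sshd             6222   1151     0 /usr/sbin/sshd -D -R
ping             6226   6193     0 /bin/ping www.bing.com
curl             6230   6193     0 /usr/bin/curl www.baidu.com
"""

events = parse_execsnoop(SAMPLE)
# Commands launched by the process with PID 6193 (the shell in the sample).
children_of_shell = [e.pcomm for e in events if e.ppid == 6193]
print(children_of_shell)  # ['ls', 'ping', 'curl']
```

Grouping by PPID like this is one quick way to see which parent process is responsible for a burst of short-lived commands.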

2.2. tcpconnect (monitor active TCP connections)

tcpconnect monitors active TCP connections, that is, connections initiated with the connect() system call. For example:

$ sudo tcpconnect
PID    COMM         IP SADDR            DADDR            DPORT
25333  recordProgra 4  127.0.0.1        127.0.0.1        28527
25338  curl         4  100.66.3.172     52.22.109.254    80
25340  curl         4  100.66.3.172     31.13.73.36      80
25342  curl         4  100.66.3.172     104.20.25.153    80
25344  curl         4  100.66.3.172     50.56.53.173     80
25365  recordProgra 4  127.0.0.1        127.0.0.1        28527
26119  ssh          6  ::1              ::1              22
25388  recordProgra 4  127.0.0.1        127.0.0.1        28527
25220  ssh          6  fe80::8a3:9dff:fed5:6b19 fe80::8a3:9dff:fed5:6b19 22
[...]

2.3. tcpaccept (monitor passive TCP connections)

tcpaccept monitors passive TCP connections, that is, connections received through the accept() system call. For example:

# tcpaccept
PID    COMM         IP RADDR            LADDR            LPORT
2287   sshd         4  11.16.213.254    100.66.3.172     22
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
2287   sshd         6  ::1              ::1              22
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
2287   sshd         6  fe80::8a3:9dff:fed5:6b19 fe80::8a3:9dff:fed5:6b19 22
4057   redis-server 4  127.0.0.1        127.0.0.1        28527
[...]
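
The difference between the active side (reported by tcpconnect) and the passive side (reported by tcpaccept) can be seen in a small loopback experiment. The sketch below uses only Python's standard socket module; it does not call bcc, and the variable names are chosen for this illustration. Run while both tools are tracing, one would expect it to produce one line in each tool's output, though that has not been verified here.

```python
import socket

# Passive side: a listening socket on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick a free port
listener.listen(1)
server_addr = listener.getsockname()

# Active side: connect() starts the TCP handshake. The kernel completes it
# and queues the connection on the listener's backlog.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server_addr)
client_addr = client.getsockname()

# Passive side again: accept() hands over the queued connection.
conn, peer_addr = listener.accept()

print("active :", client_addr, "->", server_addr)
print("passive:", peer_addr, "->", server_addr)

conn.close()
client.close()
listener.close()
```

The address that accept() returns for the peer is the client's local address, which corresponds to the RADDR column of tcpaccept and the SADDR column of tcpconnect.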

2.4. oomkill (monitor the OOM killer)

oomkill monitors the system's OOM (Out Of Memory) killer. See: https://github.com/iovisor/bcc/blob/master/tools/oomkill_example.txt

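2.5. gethostlatency (monitor DNS resolution time)

gethostlatency traces the library functions getaddrinfo(), gethostbyname(), and gethostbyname2() (these are libc functions rather than system calls) and reports how long each name resolution takes. For example: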

$ sudo gethostlatency
TIME      PID    COMM          LATms HOST
06:10:24  28011  wget          90.00 www.iovisor.org
06:10:28  28127  wget           0.00 www.iovisor.org
06:10:41  28404  wget           9.00 www.netflix.com
06:10:48  28544  curl          35.00 www.netflix.com.au
06:11:10  29054  curl          31.00 www.plumgrid.com
06:11:16  29195  curl           3.00 www.facebook.com
06:11:25  29404  curl          72.00 foo
06:11:28  29475  curl           1.00 foo
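
For comparison, the same latency can be measured from inside a single program. The sketch below times one call to Python's socket.getaddrinfo(), which in CPython is a wrapper around the C library's getaddrinfo(); resolve_latency_ms is a hypothetical helper name, and "localhost" is used only so that the example does not depend on network access. Unlike gethostlatency, this measures one process only and includes Python's own call overhead.

```python
import socket
import time
from typing import Optional


def resolve_latency_ms(host: str) -> Optional[float]:
    """Return the time taken to resolve `host` in milliseconds.

    Returns None if the lookup fails.
    """
    start = time.perf_counter()
    try:
        socket.getaddrinfo(host, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return (time.perf_counter() - start) * 1000.0


latency = resolve_latency_ms("localhost")
if latency is None:
    print("lookup failed")
else:
    print(f"localhost resolved in {latency:.2f} ms")
```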

2.6. mysqld_qslower (monitor slow MySQL queries)

MySQL itself can log slow queries. The bcc tool mysqld_qslower offers an alternative way to monitor them. See: http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html

2.7. funccount (count function calls)

funccount counts function calls. See: https://github.com/iovisor/bcc/blob/master/tools/funccount_example.txt

The functions to count are selected with a pattern, which follows these rules:

func            -- probe a kernel function
lib:func        -- probe a user-space function in the library 'lib'
/path:func      -- probe a user-space function in binary '/path'
p::func         -- same thing as 'func'
p:lib:func      -- same thing as 'lib:func'
t:cat:event     -- probe a kernel tracepoint
u:lib:probe     -- probe a USDT tracepoint
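
To make the rules concrete, here is a simplified sketch of how such a pattern could be split into a probe type, a target, and a name. It is an illustration of the rules listed above, not funccount's actual parser; Probe and parse_pattern are hypothetical names, wildcards are not handled, and paths that themselves contain ':' are assumed not to occur.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    kind: str    # "p" (function), "t" (tracepoint), or "u" (USDT)
    target: str  # library, binary path, or tracepoint category; "" = kernel
    name: str    # function, event, or probe name


def parse_pattern(pattern: str) -> Probe:
    """Split a funccount-style pattern according to the rules above."""
    parts = pattern.split(":")
    if len(parts) == 1:
        # "func": a kernel function
        return Probe("p", "", parts[0])
    if len(parts) == 2:
        # "lib:func" or "/path:func": a user-space function
        return Probe("p", parts[0], parts[1])
    if len(parts) == 3 and parts[0] in ("p", "t", "u"):
        # Explicit type prefix: "p::func", "p:lib:func", "t:cat:event", ...
        return Probe(parts[0], parts[1], parts[2])
    raise ValueError(f"unrecognized pattern: {pattern!r}")


for example in ("vfs_read", "c:malloc", "/bin/bash:readline",
                "p::vfs_read", "t:block:block_rq_insert",
                "u:pthread:mutex_entry"):
    print(f"{example:28} {tuple(parse_pattern(example))}")
```

Note that "vfs_read" and "p::vfs_read" yield the same result, matching the "same thing as" entries in the list.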

2.8. funcslower (monitor functions slower than a threshold)

funcslower reports function calls whose execution time exceeds a threshold. See: https://github.com/iovisor/bcc/blob/master/tools/funcslower_example.txt

2.9. trace (monitor function calls, printing arguments and return values)

trace monitors function calls and prints their arguments and return values. See: https://github.com/iovisor/bcc/blob/master/tools/trace_example.txt

2.10. Tips

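2.10.1. Tracing Go functions

To keep functions from disappearing through inlining, pass -gcflags="-l" to go build.

The Go compiler does not follow the System V AMD64 ABI, so bcc tools may print wrong values for function arguments. That ABI passes the first six integer arguments of a function in the registers rdi, rsi, rdx, rcx, r8, and r9. The system-call convention shown below (in the format of the syscall(2) man page) is the same except that r10 takes the place of rcx: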

       arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────────
[...]
       x86_64        rdi   rsi   rdx   r10   r8    r9    -

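Go, however, passed arguments on the stack at the time of writing. (Go 1.17 and later use a register-based calling convention on amd64, but it still differs from the System V ABI.)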

For more tips on tracing Go programs with bcc tools, see: http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
A temporary workaround is described here: https://github.com/iovisor/bcc/issues/934

3. References

Author: cig01

Created: <2019-01-11 Fri>

Last updated: <2019-10-25 Fri>

Creator: Emacs 27.1 (Org mode 9.4)