Contributing to BCC
Rather than writing an article about Linux system tracing/profiling, I believe it would be a lot more beneficial to write about my experience contributing a couple of tools to the BCC project.
There are a large number of great blogs and videos about Linux tracing and profiling, and I will link to them at the end of this post.
The landscape for Linux system profiling and tracing has long been rather fragmented. Numerous solutions have been proposed and developed; below are just a few that are being actively developed and supported:
In this post I am specifically focused on BCC, a frontend for eBPF, which IMHO is the most promising of the lot. eBPF has been in the mainline kernel since 3.18 (though you really want 4.1 or later, which added the ability to attach BPF programs to kprobes), it is very actively developed, and it is gaining industry-wide support. It has extremely low overhead, is production safe (it will not cause a kernel panic) and is incredibly flexible. The only downside at the moment is that the barrier to entry is still high: the eBPF program has to be written in C, compiled into BPF bytecode using LLVM and loaded into the kernel via the bpf syscall. This is where BCC comes in. It is an attempt to make the powers of eBPF more accessible to system engineers, administrators and anyone interested in understanding how their application runs on the Linux kernel.
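To give a sense of how much BCC lowers that barrier, below is a minimal sketch, closely modelled on BCC's own hello-world example: the eBPF program is a short block of C embedded as a string, and the Python frontend compiles it, loads it and attaches it to a kprobe with a few library calls.

#!/usr/bin/env python
# Minimal BCC sketch: the C below is compiled to eBPF bytecode by BCC/LLVM
# and loaded via the bpf() syscall behind the scenes.
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)
# Resolve the clone syscall's kernel function name for the running kernel
# and attach our eBPF function to it as a kprobe.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()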
The BCC project has a number of useful tools that help you analyze various portions of the Linux system. Take a look at the diagram below.
BCC already has numerous tools that help you debug various layers of your stack, from the very top (application-level tracing using USDT and uprobes) to the lowest levels of the kernel (kprobes, static tracepoints and hardware performance counters).
I decided that in order to get more familiar with this framework, I needed to write and contribute a few tools back to the project. I am hoping this blog will help any potential developers navigate the process of doing just that.
Picking a tool
Looking at the diagram above, it's pretty clear that a lot of the critical parts of the kernel and application layers already have good tools that help you introspect those areas. For the rest, there are general tools like trace and funclatency that can get you a majority of the way there. So picking which areas to cover needed a bit of thought.
The industry that I work in is heavily reliant on large NAS-based storage devices that can scale out. These NAS boxes need to be extremely high performance, as they front traffic from users as well as compute clusters that are reading large files and performing lots of IOPS. All of these NAS boxes are exported to the users and compute clusters via NFS mounts, so tuning, debugging and troubleshooting NFS operations is fairly standard in my industry.
For a while now, collecting performance metrics for NFS has meant looking at nfsstat on the client and relying on the expensive NAS servers to export some useful data. Debugging performance problems on the client usually went nowhere: you could switch on nfs_debug in the kernel (via /proc/sys/sunrpc/nfs_debug) and look at the debug output, but that usually meant parsing debug messages out of log files and trying to aggregate the data as a post-processing step. This is fraught with problems, as the overhead of printing these messages is large enough to skew any results. So for the most part we relied on server-side metrics to try and diagnose a problem with a client.
Looking at the current suite of filesystem-specific tools in BCC, {btrfs, xfs, zfs, ext4}[dist | slower], it was clear that adding a suite of NFS tools would be a good idea. So I set out to develop nfsdist and nfsslower.
Learning how to write a BCC tool
The structure of a BCC tool consists of two parts: a C component that is compiled into eBPF bytecode and loaded into the kernel, and a userspace component that reads data via eBPF maps and presents formatted output to the user. The BCC project provides three frontend languages for the userspace component (C, Lua and Python), but the actual kernel bytecode needs to be written in C. This C code can be embedded in the frontend language as a block of text or kept in a separate C file. The BCC library provides a good API to attach to tracepoints, kprobes, uprobes and USDT probes. It also contains API calls to access the trace data via maps or via simple prints issued to the trace pipe. All of this can be learned from the user docs and by reading example code, and a small sketch of the structure follows below.
Given that I was just adding support for an additional filesystem, I could lift most of the code I needed from the existing ext4slower and ext4dist tools.
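To make that structure concrete, here is a small sketch (not one of the shipped tools): the embedded C defines a map and a probe that fills it in, and the Python side attaches the probe and formats the output. The probed function, vfs_read, is just an illustrative choice.

#!/usr/bin/env python
# Kernel half: embedded C compiled to eBPF. Userspace half: Python reading
# the results back out of a BPF map.
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);    // map: pid -> number of calls seen

int trace_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_read", fn_name="trace_entry")

sleep(5)
# Userspace side: walk the map and present formatted output.
for pid, count in sorted(b["counts"].items(), key=lambda kv: kv[1].value):
    print("PID %-6d calls %d" % (pid.value, count.value))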
Kprobes vs Static Tracepoints
The nfsslower and nfsdist tools profile common NFS operations and provide different forms of analysis of their performance. The nfsslower tool analyzes all NFS read, write, open and getattr operations and outputs information about any calls with a latency greater than 10 ms (this threshold can be changed via a command line argument). The nfsdist tool analyzes all NFS read, write and getattr operations and prints out a histogram of latency per call. Both tools time each read, write and getattr operation and use an eBPF map structure to store the data.
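Condensed down to a single operation, the timing pattern both tools rely on looks roughly like the sketch below (the real tools cover several operations and carry extra context): store a start timestamp in a hash keyed by thread ID when the operation begins, compute the delta on return, and feed it into a log2 histogram map, which is essentially what nfsdist reports.

#!/usr/bin/env python
# Rough sketch of the nfsdist-style timing pattern for a single operation.
# nfs_file_read is used here as the read entry point; finding the right
# functions to probe is covered later in this post.
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);     // thread id -> entry timestamp
BPF_HISTOGRAM(dist);           // log2 latency histogram

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;    /* missed the entry probe */
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="nfs_file_read", fn_name="trace_entry")
b.attach_kretprobe(event="nfs_file_read", fn_name="trace_return")

sleep(10)
b["dist"].print_log2_hist("usecs")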
Given that we are timing the call from when it hits the NFS filesystem layer, and not from higher up the stack, we need to identify when a read, write or getattr call is actually triggered. This is where you have two options: tracepoints or kprobes.
Tracepoints - defined places in code (in this case kernel code) where you can attach a probe. These points are part of the kernel source and, once defined, are expected to be supported and maintained across kernel versions.
Kprobes - a debugging framework in the kernel that allows you to insert a probe into pretty much any kernel function/instruction. This lets you do things like probe function calls and function returns.
As a rule of thumb, when writing BCC tools that need to examine some portion of the kernel, the preference is to use tracepoints. Kprobes are only used if there are no available tracepoints for the kind of data you are trying to probe. Since tracepoints are part of the kernel source and maintained across versions, you can be fairly confident that your tool will work with future versions of the kernel. Kprobes, on the other hand, are less reliable, since kernel developers might choose to modify or remove certain kernel functions, which could end up breaking your tool.
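In BCC the two attachment styles look like the sketch below: a TRACEPOINT_PROBE is attached automatically when the program is loaded, while a kprobe is attached from Python to a named kernel function (and therefore breaks if that function is renamed or removed). The nfs_getattr_enter tracepoint used here shows up in the listings further down.

#!/usr/bin/env python
# Sketch contrasting the two attachment mechanisms.
from bcc import BPF

bpf_text = """
// Stable interface: the nfs:nfs_getattr_enter tracepoint.
TRACEPOINT_PROBE(nfs, nfs_getattr_enter) {
    bpf_trace_printk("getattr enter\\n");
    return 0;
}

// Kprobe handler: attached from Python to a specific kernel function.
int trace_nfs_read(struct pt_regs *ctx) {
    bpf_trace_printk("nfs read\\n");
    return 0;
}
"""

b = BPF(text=bpf_text)    # the TRACEPOINT_PROBE is attached automatically
b.attach_kprobe(event="nfs_file_read", fn_name="trace_nfs_read")
b.trace_print()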
You can get a list of available tracepoints by:
sudo perf list | grep Tracepoint
or
cat /sys/kernel/debug/tracing/available_events
In order to inspect NFS read, write and getattr calls, we need to trace the start and end of each of these operations. These operations can span multiple function calls and might also be partly asynchronous. The ideal scenario would be if the kernel developers had placed tracepoints in just the right places for read, write and getattr calls, but that would be too easy.
Looking at the available tracepoints, it was clear that, apart from getattr, all the other calls lacked appropriate enter and exit tracepoints.
getattr
root@ubuntu:/sys/kernel/debug/tracing# cat available_events | grep getattr
nfs4:nfs4_getattr
nfs4:nfs4_cb_getattr
nfs:nfs_getattr_enter
nfs:nfs_getattr_exit
syscalls:sys_exit_sched_getattr
syscalls:sys_enter_sched_getattr
read
root@ubuntu:/sys/kernel/debug/tracing# cat available_events | grep nfs | grep read
nfs4:nfs4_readlink
nfs4:nfs4_readdir
nfs4:nfs4_read
nfs4:nfs4_pnfs_read
write
root@ubuntu:/sys/kernel/debug/tracing# cat available_events | grep nfs | grep write
nfs4:nfs4_write
nfs4:nfs4_pnfs_write
nfs:nfs_writeback_page_enter
nfs:nfs_writeback_page_exit
nfs:nfs_writeback_inode_enter
nfs:nfs_writeback_inode_exit
open
root@ubuntu:/sys/kernel/debug/tracing# cat available_events | grep nfs | grep open
nfs4:nfs4_open_reclaim
nfs4:nfs4_open_expired
nfs4:nfs4_open_file
nfs4:nfs4_cached_open
nfs4:nfs4_test_open_stateid
nfs:nfs_atomic_open_enter
nfs:nfs_atomic_open_exit
In the absence of appropriate tracepoints, I needed to find the right functions to attach kprobes to. I also needed to work out which NFS version I wanted my tools to support. This meant spending a fair amount of time reading the kernel source and running tests.
The process of identifying the correct functions to trace meant reading the kernel source, coming up with a few candidates and then testing them with ftrace. The function and function_graph tracers were invaluable in this process. The function tracer records when a function is called, and the function_graph tracer traces a function and all of its children, giving you a call graph.
Below is the output of running the function_graph tracer on nfs_file_read. This output is incredibly helpful in working out how far up or down the call graph you need to trace. Too far down and you might miss important latency measurements; too far up and you might be capturing calls from the VFS layer that are not necessarily NFS related. Picking the right functions means getting really familiar with the Linux source and looking at call graphs to see how a running system behaves.
echo 0 > tracing_on
echo nfs_file_read > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
cat trace_pipe | tee /tmp/nfs_file_read.log
echo 0 > tracing_on
cat /tmp/nfs_file_read.log
------------------------------------------
4) systemd-29689 => cp-31422
------------------------------------------
4) | nfs_file_read [nfs]() {
4) | nfs_start_io_read [nfs]() {
4) | down_read() {
4) 0.136 us | _cond_resched();
4) 1.189 us | }
4) 2.110 us | }
4) | nfs_revalidate_mapping [nfs]() {
4) | nfs_mapping_need_revalidate_inode [nfs]() {
4) | nfs_check_cache_invalid [nfs]() {
4) | nfs4_have_delegation [nfsv4]() {
4) 0.101 us | nfs4_is_valid_delegation [nfsv4]();
4) 0.951 us | }
4) 2.013 us | }
4) 2.842 us | }
4) 0.106 us | _cond_resched();
4) 0.125 us | _raw_spin_lock();
4) 5.415 us | }
4) | generic_file_read_iter() {
4) 0.104 us | _cond_resched();
4) | pagecache_get_page() {
4) 0.183 us | find_get_entry();
4) 0.985 us | }
Coding & Testing
As I mentioned earlier, most of the code for these tools was lifted from the existing suite of xfs, zfs and ext4 tools. Other than small modifications to ensure the filename was being reported correctly, coding these tools was very quick.
Testing was a little more challenging. Given that I needed to test NFS operations, I had to set up an NFS server and client. Starting out, most of my development was done on an Ubuntu 16.04 Linux VM running on VMware Fusion on my MacBook; spinning up a client and a server VM was fairly easy and was good enough to get started. I later switched to using a bare-metal Linux machine as the client for my final tests. This was because I had noticed kernel lockups when doing more extensive ftrace function_graph testing earlier on, and I really wanted to ensure that any code I wrote would be robust enough to run in production.
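The filename handling was the main thing that needed attention: like the ext4 tools, the probes pull the name off the dentry hanging from the struct file that the read/write entry points receive. Below is a condensed, illustrative sketch of just that part, assuming the nfs_file_read(struct kiocb *iocb, struct iov_iter *to) signature; the merged tools combine this with the latency logic shown earlier.

#!/usr/bin/env python
# Sketch of reporting the filename from an NFS read probe. Assumes
# nfs_file_read(struct kiocb *iocb, ...); the struct file is iocb->ki_filp
# and the name lives on its dentry.
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/dcache.h>

struct event_t {
    u32 pid;
    char file[DNAME_INLINE_LEN];
};
BPF_PERF_OUTPUT(events);

int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb) {
    struct event_t event = {};
    struct file *fp = iocb->ki_filp;
    struct dentry *de = fp->f_path.dentry;
    struct qstr qs = de->d_name;

    event.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_probe_read(&event.file, sizeof(event.file), (void *)qs.name);
    events.perf_submit(ctx, &event, sizeof(event));
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="nfs_file_read", fn_name="trace_read_entry")

def print_event(cpu, data, size):
    event = b["events"].event(data)
    print("%-8d %s" % (event.pid, event.file.decode("utf-8", "replace")))

b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()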
Pull Request and Code Review
The BCC repository has a great guide to submitting new Python tools. I recommend getting on the BCC mailing list to ask questions, especially if you are thinking about writing new tools. The maintainers of BCC have a very clear idea of which tools should be included and are really helpful and responsive.
Submitting a pull request is fairly straightforward, and I would recommend paying close attention to all the documentation and examples that go along with your tool. It's important that you spell-check all your code and include appropriate tests in the smoke test script. You can look at my two pull requests as examples:
nfsslower - https://github.com/iovisor/bcc/pull/1331
nfsdist - https://github.com/iovisor/bcc/pull/1347
All done and on to the next one
Having finished writing some basic NFS profiling tools, I am planning to build out a more comprehensive NFS-top style tool that can summarize latency, throughput and IOPS for each PID. This tool will help identify clients that are putting exceptionally high NFS request load on the server and, as a result, slowing it down for everyone.
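Nothing from that tool exists yet, but the core will most likely be a wider per-PID aggregation map built on the same timing pattern shown earlier: sum operation counts and latency in the kernel, and have userspace dump and clear the table every interval, the way the existing *top tools do. A very rough sketch of that shape (field names and the single probed function are purely illustrative):

#!/usr/bin/env python
# Purely a sketch of the direction, not code from any merged tool: aggregate
# per-PID NFS read counts and latency in the kernel, then dump the table
# each interval.
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>

struct val_t {
    u64 ops;
    u64 lat_ns;
};
BPF_HASH(start, u32, u64);            // tid -> entry timestamp
BPF_HASH(stats, u32, struct val_t);   // pid -> aggregated counters

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u32 tid = id, pid = id >> 32;
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    struct val_t zero = {}, *val;
    val = stats.lookup_or_try_init(&pid, &zero);
    if (val) {
        val->ops++;
        val->lat_ns += bpf_ktime_get_ns() - *tsp;
    }
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="nfs_file_read", fn_name="trace_entry")
b.attach_kretprobe(event="nfs_file_read", fn_name="trace_return")

while True:
    sleep(5)
    stats = b["stats"]
    for pid, v in sorted(stats.items(), key=lambda kv: kv[1].lat_ns, reverse=True):
        print("PID %-6d ops %-8d avg lat %.2f ms" %
              (pid.value, v.ops, (v.lat_ns / v.ops) / 1e6 if v.ops else 0))
    stats.clear()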
Links
- BCC repo
- Brendan Gregg's blog - invaluable resource for all things related to systems performance
- kprobes
- ftrace
- tracepoints
- Cilium BPF docs
- eBPF blog post - 1, 2