Linux ftrace: trace your kernel functions(trace-cmd, KernelShark)

本文主要是介绍Linux ftrace: trace your kernel functions(trace-cmd, KernelShark)，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

《Documentation/ftrace.txt：ftrace - Function Tracer》

what’s ftrace?

first steps with ftrace

getting started with trace-cmd: let’s trace just one function

next ftrace trick: let’s trace a process!

“function graph” tracing

How to know what functions you can trace

one last thing: events!

how does ftrace work?

use ftrace more easily: brendan gregg’s tools & kernelshark

KernelShark

Table of Contents

Introduction

a new superpower

an index of ftrace articles

Hello! Today we’re going to talk about a debugging tool we haven’t talked about much before on this blog: ftrace. What could be more exciting than a new debugging tool?!

Better yet, ftrace isn’t new! It’s been around since Linux kernel 2.6, or about 2008. here’s the earliest documentation I found with some quick Gooogling. So you might be able to use it even if you’re debugging an older system!

I’ve known that ftrace exists for about 2.5 years now, but hadn’t gotten around to really learning it yet. I’m supposed to run a workshop tomorrow where I talk about ftrace, so today is the day we talk about it!

what’s ftrace?

ftrace is a Linux kernel feature that lets you trace Linux kernel function calls. Why would you want to do that? Well, suppose you’re debugging a weird problem, and you’ve gotten to the point where you’re staring at the source code for your kernel version and wondering what exactly is going on.

I don’t read the kernel source code very often when debugging, but occasionally I do! For example this week at work we had a program that was frozen and stuck spinning inside the kernel. Looking at what functions were being called helped us understand better what was happening in the kernel and what systems were involved (in that case, it was the virtual memory system)!

I think ftrace is a bit of a niche tool (it’s definitely less broadly useful and harder to use than strace) but that it’s worth knowing about. So let’s learn about it!

first steps with ftrace

Unlike strace and perf, ftrace isn’t a program exactly – you don’t just run ftrace my_cool_function. That would be too easy!

If you read Debugging the kernel using Ftrace it starts out by telling you to cd /sys/kernel/debug/tracing and then do various filesystem manipulations.

For me this is way too annoying – a simple example of using ftrace this way is something like

cd /sys/kernel/debug/tracing
echo function > current_tracer
echo do_page_fault > set_ftrace_filter
cat trace

This filesystem interface to the tracing system (“put values in these magic files and things will happen”) seems theoretically possible to use but really not my preference.

Luckily, team ftrace also thought this interface wasn’t that user friendly and so there is an easier-to-use interface called trace-cmd!!! trace-cmd is a normal program with command line arguments. We’ll use that! I found an intro to trace-cmd on LWN at trace-cmd: A front-end for Ftrace.

getting started with trace-cmd: let’s trace just one function

First, I needed to install trace-cmd with sudo apt-get install trace-cmd. Easy enough.

sudo apt-get install trace-cmd

if Centos/RedHat Linux

sudo yum install -y trace-cmd

For this first ftrace demo, I decided I wanted to know when my kernel was handling a page fault. When Linux allocates memory, it often does it lazily (“you weren’t really planning to use that memory, right?“). This means that when an application tries to actually write to memory that it allocated, there’s a page fault and the kernel needs to give the application physical memory to use.

Let’s start trace-cmd and make it trace the do_page_fault function!

$ sudo trace-cmd record -p function -l do_page_faultplugin 'function'
Hit Ctrl^C to stop recording

I ran it for a few seconds and then hit Ctrl+C. Awesome! It created a 2.5MB file called trace.dat. Let’s see what’s that file!

$ sudo trace-cmd reportchrome-15144 [000] 11446.466121: function:             do_page_faultchrome-15144 [000] 11446.467910: function:             do_page_faultchrome-15144 [000] 11446.469174: function:             do_page_faultchrome-15144 [000] 11446.474225: function:             do_page_faultchrome-15144 [000] 11446.474386: function:             do_page_faultchrome-15144 [000] 11446.478768: function:             do_page_faultCompositorTileW-15154 [001] 11446.480172: function:             do_page_faultchrome-1830  [003] 11446.486696: function:             do_page_faultCompositorTileW-15154 [001] 11446.488983: function:             do_page_faultCompositorTileW-15154 [001] 11446.489034: function:             do_page_faultCompositorTileW-15154 [001] 11446.489045: function:             do_page_fault

Or my test:

$ sudo trace-cmd record -p function -l schedule^C
$ trace-cmd report | more
cpus=2trace-cmd-8976  [001]   417.202648: function:             schedulekworker/1:1-7600  [001]   417.202665: function:             scheduletrace-cmd-8978  [001]   417.202690: function:             scheduletrace-cmd-8977  [000]   417.202695: function:             scheduletrace-cmd-8978  [001]   417.202696: function:             schedulesshd-1550  [000]   417.202804: function:             schedulercu_sched-9     [001]   417.205066: function:             schedulevmtoolsd-808   [001]   417.220343: function:             schedulexfsaild/dm-0-432   [001]   417.252104: function:             schedulemysqld-8948  [000]   417.282055: function:             schedulexfsaild/dm-0-432   [001]   417.302182: function:             scheduletrace-cmd-8978  [001]   417.302203: function:             scheduletrace-cmd-8977  [000]   417.303179: function:             schedulevmtoolsd-808   [001]   417.320707: function:             schedulekworker/1:1-7600  [001]   417.337222: function:             schedulexfsaild/dm-0-432   [001]   417.352262: function:             schedulekworker/1:1-7600  [001]   417.370696: function:             schedulercu_sched-9     [001]   417.371597: function:             schedulercu_sched-9     [001]   417.374586: function:             schedulercu_sched-9     [001]   417.377621: function:             schedulemysqld-8948  [000]   417.382627: function:             scheduleabrt-watch-log-738   [000]   417.382732: function:             schedulexfsaild/dm-0-432   [001]   417.402742: function:             scheduletrace-cmd-8978  [001]   417.402749: function:             schedule
...

This is neat – it shows me the process name (chrome), process ID (15144), CPU (000), and function that got traced.

By looking at the whole report, (sudo trace-cmd report | grep chrome) I can see that we traced for about 1.5 seconds and in that time Chrome had about 500 page faults. Cool! We have done our first ftrace!

next ftrace trick: let’s trace a process!

Okay, but just seeing one function is kind of boring! Let’s say I want to know everything that’s happening for one program. I use a static site generator called Hugo. What’s the kernel doing for Hugo?

Hugo’s PID on my computer right now is 25314, so I recorded all the kernel functions with:

sudo trace-cmd record --help # I read the help!
sudo trace-cmd record -p function -P 25314 # record for PID 25314

sudo trace-cmd report printed out 18,000 lines of output. If you’re interested, you can see all 18,000 lines here.

Or my test:

$ sudo ps  -ef | more
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 14:41 ?        00:00:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root          2      0  0 14:41 ?        00:00:00 [kthreadd]
root          3      2  0 14:41 ?        00:00:00 [ksoftirqd/0]# 追踪 systemd
$ sudo trace-cmd record -p function -P 1$ trace-cmd report | more
cpus=2
CPU:1 [143405 EVENTS DROPPED]grep-11442 [001]   547.533024: function:             up_readgrep-11442 [001]   547.533024: function:             _raw_spin_lockgrep-11442 [001]   547.533024: function:             do_set_ptegrep-11442 [001]   547.533024: function:                add_mm_counter_fastgrep-11442 [001]   547.533024: function:                page_add_file_rmapgrep-11442 [001]   547.533024: function:             unlock_pagegrep-11442 [001]   547.533024: function:                __smp_mb__after_atomicgrep-11442 [001]   547.533024: function:                page_waitqueuegrep-11442 [001]   547.533025: function:                __wake_up_bitgrep-11442 [001]   547.533025: function:             up_readgrep-11442 [001]   547.533025: function:             do_page_faultgrep-11442 [001]   547.533026: function:                __do_page_faultgrep-11442 [001]   547.533026: function:                   down_read_trylock
...

18,000 lines is a lot so here are some interesting excerpts.

This looks like what happens when the clock_gettime system call runs. Neat!

 compat_SyS_clock_gettimeSyS_clock_gettimeclockid_to_kclockposix_clock_realtime_getgetnstimeofday64__getnstimeofday64arch_counter_read__compat_put_timespec

This is something related to process scheduling:

 cpufreq_sched_irq_workwake_up_processtry_to_wake_up_raw_spin_lock_irqsavedo_raw_spin_lock_raw_spin_lockdo_raw_spin_lockwalt_ktime_clockktime_getarch_counter_readwalt_update_task_ravgexiting_task

Being able to see all these function calls is pretty cool, even if I don’t quite understand them.

“function graph” tracing

There’s another tracing mode called function_graph. This is the same as the function tracer except that it instruments both entering and exiting a function. Here’s the output of that tracer

sudo trace-cmd record -p function_graph -P 25314

Again, here’s a snipped (this time from the futex code)

             |      futex_wake() {|        get_futex_key() {|          get_user_pages_fast() {1.458 us   |            __get_user_pages_fast();4.375 us   |          }|          __might_sleep() {0.292 us   |            ___might_sleep();2.333 us   |          }0.584 us   |          get_futex_key_refs();|          unlock_page() {0.291 us   |            page_waitqueue();0.583 us   |            __wake_up_bit();5.250 us   |          }0.583 us   |          put_page();
+ 24.208 us  |        }

We see in this example that get_futex_key gets called right after futex_wake. Is that what really happens in the source code? We can check!! Here’s the definition of futex_wake in Linux 4.4 (my kernel version).

Or my test:

[rongtao@toa ~]$ sudo trace-cmd record -p function_graph -P 1plugin 'function_graph'
Hit Ctrl^C to stop recording
^CCPU 0: 1348372 events lost
CPU 1: 3638894 events lost
CPU0 data recorded at offset=0x44600010620928 bytes in size
CPU1 data recorded at offset=0xe6700037269504 bytes in size
[rongtao@toa ~]$ trace-cmd report | more
cpus=2
CPU:1 [72263 EVENTS DROPPED]systemd-1     [001]   730.363294: funcgraph_exit:       + 27.648 us  |                              }systemd-1     [001]   730.363294: funcgraph_entry:                   |                              sidtab_context_to_sid() {systemd-1     [001]   730.363294: funcgraph_entry:        0.025 us   |                                ebitmap_cmp();systemd-1     [001]   730.363295: funcgraph_entry:        0.029 us   |                                ebitmap_cmp();systemd-1     [001]   730.363295: funcgraph_exit:         0.515 us   |                              }systemd-1     [001]   730.363295: funcgraph_entry:        0.030 us   |                              kfree();systemd-1     [001]   730.363295: funcgraph_entry:        0.026 us   |                              ebitmap_destroy();systemd-1     [001]   730.363295: funcgraph_entry:        0.025 us   |                              ebitmap_destroy();systemd-1     [001]   730.363296: funcgraph_entry:        0.040 us   |                              kfree();systemd-1     [001]   730.363296: funcgraph_entry:        0.024 us   |                              kfree();systemd-1     [001]   730.363296: funcgraph_exit:       + 31.000 us  |                            }systemd-1     [001]   730.363296: funcgraph_exit:       + 31.236 us  |                          }systemd-1     [001]   730.363296: funcgraph_exit:       + 31.511 us  |                        }systemd-1     [001]   730.363297: funcgraph_exit:       + 31.750 us  |                      }systemd-1     [001]   730.363297: funcgraph_exit:       + 32.002 us  |                    }systemd-1     [001]   730.363297: funcgraph_entry:        0.025 us   |                    set_nlink();systemd-1     [001]   730.363297: funcgraph_exit:       + 32.495 us  |                  }

I’ll save you a click: it looks like this:

static int
futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{struct futex_hash_bucket *hb;struct futex_q *this, *next;union futex_key key = FUTEX_KEY_INIT;int ret;WAKE_Q(wake_q);if (!bitset)return -EINVAL;ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);

So the first function called in futex_wake really is get_futex_key! Neat! Reading the function trace was definitely an easier way to find that out than by reading the kernel code, and it’s nice to see how long all of the functions took.

How to know what functions you can trace

If you run sudo trace-cmd list -f you’ll get a list of all the functions you can trace. That’s pretty simple but it’s important.

[rongtao@toa ~]$ sudo trace-cmd list -f | grep epoll
SyS_epoll_create1
SyS_epoll_create
SyS_epoll_ctl
SyS_epoll_wait
SyS_epoll_pwait
compat_SyS_epoll_pwait[rongtao@toa ~]$ sudo trace-cmd list -f | grep eventfd
eventfd_poll
eventfd_signal
eventfd_ctx_fileget
eventfd_free
eventfd_ctx_remove_wait_queue
eventfd_fget
eventfd_release
eventfd_ctx_put
eventfd_ctx_get
eventfd_ctx_fdget
eventfd_show_fdinfo
eventfd_ctx_read
eventfd_read
eventfd_write
eventfd_file_create
SyS_eventfd2
SyS_eventfd

one last thing: events!

So, now we know how to trace functions in the kernel! That’s really cool!

There’s one more class of thing we can trace though! Some events don’t correspond super well to function calls. For example, you might want to knowwhen a program is scheduled on or off the CPU! You might be able to figure that out by peering at function calls, but I sure can’t.

So the kernel also gives you a few events so you can see when a few important things happen. You can see a list of all these events with sudo cat /sys/kernel/debug/tracing/available_events

I looked at all the sched_switch events. I’m not exactly sure what sched_switch is but it’s something to do with scheduling I guess.

sudo cat /sys/kernel/debug/tracing/available_events
sudo trace-cmd record -e sched:sched_switch
sudo trace-cmd report

The output looks like this:

 16169.624862:   Chrome_ChildIOT:24817 [112] S ==> chrome:15144 [120]16169.624992:   chrome:15144 [120] S ==> swapper/3:0 [120]16169.625202:   swapper/3:0 [120] R ==> Chrome_ChildIOT:24817 [112]16169.625251:   Chrome_ChildIOT:24817 [112] R ==> chrome:1561 [112]16169.625437:   chrome:1561 [112] S ==> chrome:15144 [120]

Or my test:

[rongtao@toa ~]$ sudo cat /sys/kernel/debug/tracing/available_events | grep epoll
syscalls:sys_exit_epoll_pwait
syscalls:sys_enter_epoll_pwait
syscalls:sys_exit_epoll_wait
syscalls:sys_enter_epoll_wait
syscalls:sys_exit_epoll_ctl
syscalls:sys_enter_epoll_ctl
syscalls:sys_exit_epoll_create
syscalls:sys_enter_epoll_create
syscalls:sys_exit_epoll_create1
syscalls:sys_enter_epoll_create1[rongtao@toa ~]$ sudo trace-cmd record -e syscalls:sys_enter_epoll_wait
Hit Ctrl^C to stop recording
^CCPU0 data recorded at offset=0x4460004096 bytes in size
CPU1 data recorded at offset=0x4470004096 bytes in size[rongtao@toa ~]$ trace-cmd report | more
cpus=2postgres-1578  [001]  1015.834627: sys_enter_epoll_wait: epfd: 0x00000003, events: 0x01d22528, maxevents: 0x00000001, timeout: 0x00002710dbus-daemon-744   [001]  1015.960879: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffc42237d70, maxevents: 0x00000040, timeout: 0xffffffffsystemd-1     [000]  1015.960928: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffef100f520, maxevents: 0x00000045, timeout: 0xffffffffauditd-696   [000]  1015.960982: sys_enter_epoll_wait: epfd: 0x0000000b, events: 0x557e58accde0, maxevents: 0x00000040, timeout: 0x0000e95fsystemd-journal-511   [001]  1015.961170: sys_enter_epoll_wait: epfd: 0x00000007, events: 0x7ffc56d33980, maxevents: 0x00000024, timeout: 0xffffffffsystemd-journal-511   [001]  1015.961330: sys_enter_epoll_wait: epfd: 0x00000007, events: 0x7ffc56d33980, maxevents: 0x00000024, timeout: 0xffffffffsystemd-journal-511   [001]  1015.961504: sys_enter_epoll_wait: epfd: 0x00000007, events: 0x7ffc56d33980, maxevents: 0x00000024, timeout: 0xffffffffsystemd-logind-734   [001]  1015.961799: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffca5567f30, maxevents: 0x0000000d, timeout: 0xffffffffsystemd-1     [000]  1015.961861: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffef100f520, maxevents: 0x00000045, timeout: 0xffffffffsystemd-1     [000]  1015.961895: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffef100f520, maxevents: 0x00000045, timeout: 0xffffffffdbus-daemon-744   [000]  1015.962083: sys_enter_epoll_wait: epfd: 0x00000004, events: 0x7ffc42237d70, maxevents: 0x00000040, timeout: 0xffffffff

so you can see it switching from PID 24817 -> 15144 -> kernel -> 24817 -> 1561 -> 15114. (all of these events are on the same CPU)

how does ftrace work?

ftrace is a dynamic tracing system. This means that when I start ftracing a kernel function, the function’s code gets changed. So – let’s suppose that I’m tracing that do_page_fault function from before. The kernel will insert some extra instructions in the assembly for that function to notify the tracing system every time that function gets called. The reason it can add extra instructions is that Linux compiles in a few extra NOP instructions into every function, so there’s space to add tracing code when needed.

This is awesome because it means that when I’m not using ftrace to trace my kernel, it doesn’t affect performance at all. When I do start tracing, the more functions I trace, the more overhead it’ll have.

(probably some of this is wrong, but this is how I think ftrace works anyway)

use ftrace more easily: brendan gregg’s tools & kernelshark

As we’ve seen in this post, you need to think quite a lot about what individual kernel functions / events do to use ftrace directly. This is cool, but it’s also a lot of work!

Brendan Gregg (our linux debugging tools hero) has repository of tools that use ftrace to give you information about various things like IO latency. They’re all in his perf-tools repository on GitHub.

The tradeoff here is that they’re easier to use, but you’re limited to things that Brendan Gregg thought of & decided to make a tool for. Which is a lot of things! :)

Another tool for visualizing the output of ftrace better is kernelshark. I haven’t played with it much yet but it looks useful. You can install it with sudo apt-get install kernelshark.

download：https://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git/snapshot/trace-cmd-kernelshark-v1.0.tar.gz

KernelShark

https://www.kernelshark.org/Documentation.html

Introduction

Graph Info Area
Graph Control Area
The Graph Window
Plot Title
List Area

The Graph View

Zooming In

Zooming Out
Graph Markers
Deselecting a marker
Graph Plots
Task Plots
CPU Plots

The List View

Selecting an event
Graph follows toggle

Filters

Event Filter Dialog
Task Filter Dialog
Advanced Event Filter
Multiple filters

Sessions

Introduction

KernelShark is a front end reader of trace-cmd(1) output. "trace-cmd record" and "trace-cmd extract" create a trace.dat (trace-cmd.dat(5)) file. kernelshark can read this file and produce a graph and list view of its data.

a new superpower

I’m really happy I took the time to learn a little more about ftrace today! Like any kernel tool, it’ll work differently between different kernel versions, but I hope that you find it useful one day.

an index of ftrace articles

Finally, here’s a list of a bunch of ftrace articles I found. Many of them are on LWN (Linux Weekly News), which is a pretty great source of writing on Linux. (you can buy a subscription!)

Debugging the kernel using Ftrace - part 1 (Dec 2009, Steven Rostedt)
Debugging the kernel using Ftrace - part 2 (Dec 2009, Steven Rostedt)
Secrets of the Linux function tracer (Jan 2010, Steven Rostedt)
trace-cmd: A front-end for Ftrace (Oct 2010, Steven Rostedt)
Using KernelShark to analyze the real-time scheduler (2011, Steven Rostedt)
Ftrace: The hidden light switch (2014, Brendan Gregg)
the kernel documentation: (which is quite useful) Documentation/ftrace.txt
documentation on events you can trace Documentation/events.txt
some docs on ftrace design for linux kernel devs (not as useful, but interesting) Documentation/ftrace-design.txt

这篇关于Linux ftrace: trace your kernel functions(trace-cmd, KernelShark)的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！