eBPF: 03. Anatomy of an eBPF Program

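The stages an eBPF program goes through on its way from source code to execution:

C (or Rust) source code is compiled to eBPF bytecode, and that bytecode is then either JIT-compiled or interpreted into native machine instructions.

An eBPF program is a set of eBPF bytecode instructions. Today the vast majority of eBPF code is written in C and then compiled to eBPF bytecode. This bytecode runs in the eBPF virtual machine inside the kernel.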

3.1 The eBPF Virtual Machine

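The eBPF virtual machine takes a program in the form of eBPF bytecode instructions and converts them into native machine instructions that run on the CPU.

In early implementations, the bytecode was interpreted inside the kernel. Today interpretation has largely been replaced by just-in-time (JIT) compilation. Compilation means that the conversion from bytecode to native machine instructions happens only once, when the program is loaded into the kernel.

eBPF bytecode consists of a set of instructions, and those instructions act on (virtual) eBPF registers.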

3.1.1 eBPF Registers

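The eBPF virtual machine uses 10 general-purpose registers, numbered 0 to 9. In addition, register 10 is used as the stack frame pointer (it can only be read, not written). While a BPF program executes, the values stored in these registers keep track of its state.

The definitions of BPF_REG_0 through BPF_REG_10 can be seen in the include/uapi/linux/bpf.h header in the Linux kernel source: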

/* Register numbers */
enum {
	BPF_REG_0 = 0,      // return value of a function
	BPF_REG_1,          // the context argument is loaded into this register before the eBPF program starts executing
	BPF_REG_2,
	BPF_REG_3,
	BPF_REG_4,
	BPF_REG_5,
	BPF_REG_6,
	BPF_REG_7,
	BPF_REG_8,
	BPF_REG_9,
	BPF_REG_10,         // stack frame pointer
	__MAX_BPF_REG,
};

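Before a function is called from eBPF code, its arguments are placed in BPF_REG_1 through BPF_REG_5 (if there are fewer than five arguments, not all of these registers are used).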
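The register roles described above can be summarized in a small lookup function. This is only an illustrative sketch: bpf_reg_role is a hypothetical helper, not a kernel API, and the "callee-saved" role for registers 6 to 9 comes from the kernel's BPF documentation rather than from the text above.

```c
/* Hypothetical helper (not a kernel API): maps an eBPF register number
 * to the role it plays under the eBPF calling convention. */
static const char *bpf_reg_role(int reg)
{
    if (reg == 0)
        return "return value";
    if (reg >= 1 && reg <= 5)
        return "function argument";
    if (reg >= 6 && reg <= 9)
        return "callee-saved";            /* assumption: per kernel BPF docs */
    if (reg == 10)
        return "read-only frame pointer";
    return "invalid";
}
```

For example, bpf_reg_role(1) describes the register that holds the context argument when the program starts.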

3.1.2 eBPF Instructions

include/uapi/linux/bpf.h defines a structure called bpf_insn, which represents a single BPF instruction.

struct bpf_insn {
	__u8	code;		/* opcode */
	__u8	dst_reg:4;	/* dest register */
	__u8	src_reg:4;	/* source register */
	__s16	off;		/* signed offset */
	__s32	imm;		/* signed immediate constant */
};

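code: every instruction contains an opcode, which defines the operation the instruction performs: for example, adding a value to the contents of a register, or jumping to a different instruction in the program.
- The Unofficial eBPF spec lists the valid instructions.
dst_reg and src_reg: different operations may involve up to two registers.
off and imm: depending on the operation, there may also be an offset value off and/or an "immediate" integer value imm.

The bpf_insn structure is 64 bits (8 bytes) long. Sometimes, however, an instruction needs more than 8 bytes. In those cases it uses the wide instruction encoding, which is 16 bytes long in total.

When loaded into the kernel, the bytecode of an eBPF program is represented by a series of bpf_insn structures. The verifier performs a number of checks on this information to make sure the code is safe to run.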
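To make the field layout concrete, here is a minimal user-space sketch that decodes one 8-byte instruction slot into the fields of bpf_insn. The decoded_insn type and decode_insn function are hypothetical names introduced for illustration; the sketch assumes the little-endian layout, in which dst_reg is the low nibble of the second byte.

```c
#include <stdint.h>

/* Illustrative mirror of struct bpf_insn with the two 4-bit register
 * fields unpacked into whole bytes. Not the kernel's own type. */
struct decoded_insn {
    uint8_t code;     /* opcode */
    uint8_t dst_reg;  /* destination register */
    uint8_t src_reg;  /* source register */
    int16_t off;      /* signed offset */
    int32_t imm;      /* signed immediate constant */
};

/* Decode one 8-byte instruction slot (assumes little-endian encoding). */
static struct decoded_insn decode_insn(const uint8_t b[8])
{
    struct decoded_insn d;

    d.code = b[0];
    d.dst_reg = b[1] & 0x0f;   /* low nibble */
    d.src_reg = b[1] >> 4;     /* high nibble */
    d.off = (int16_t)(b[2] | (b[3] << 8));
    d.imm = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
                      ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
    return d;
}
```

Feeding it the bytes 61 63 00 00 00 00 00 00 from the disassembly in section 3.4 yields opcode 0x61 with destination register 3 and source register 6, which matches r3 = *(u32 *)(r6 + 0x0).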

3.2 eBPF “Hello World” for a Network Interface

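A common convention:

To distinguish eBPF programs from user space C code that may live in the same source directory, eBPF programs are placed in files whose names end in bpf.c.

Example [hello.bpf.c]: an eBPF program that attaches to the XDP hook point on a network interface. You can think of the XDP event as being triggered the moment a network packet arrives on a (physical or virtual) network interface.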

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Global variable
int counter = 0;

// Define an eXpress Data Path (XDP) eBPF program
SEC("xdp")
int hello(struct xdp_md *ctx) {
    bpf_printk("Hello World %d", counter);
    counter++; 
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

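XDP_PASS: a verdict that tells the kernel to process the network packet as usual.
SEC("license"):
- A SEC() macro that defines the license string, which is a crucial requirement for eBPF programs.
- Some BPF helper functions in the kernel are defined as "GPL only". To use them, your BPF code must declare a GPL-compatible license.
- If the declared license is not compatible with the functions the program uses, the verifier refuses to load it.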
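As a rough illustration of what "GPL-compatible" means here, the sketch below compares a license string against the strings that, to the best of my knowledge, the kernel accepts as GPL-compatible (see include/linux/license.h in your kernel tree to confirm). The function name is hypothetical and this is a user-space model, not the kernel's code.

```c
#include <stdbool.h>
#include <string.h>

/* User-space model of the kernel's GPL-compatibility check.
 * Assumption: the accepted strings below match your kernel version. */
static bool is_gpl_compatible_sketch(const char *license)
{
    static const char *const compatible[] = {
        "GPL",
        "GPL v2",
        "GPL and additional rights",
        "Dual BSD/GPL",
        "Dual MIT/GPL",
        "Dual MPL/GPL",
    };

    for (size_t i = 0; i < sizeof(compatible) / sizeof(compatible[0]); i++) {
        if (strcmp(license, compatible[i]) == 0)
            return true;
    }
    return false;
}
```

Under this model the "Dual BSD/GPL" string used in hello.bpf.c counts as GPL-compatible, so the program may call GPL-only helpers such as the one behind bpf_printk.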

3.3 Compiling an eBPF Object File

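eBPF source code needs to be compiled into the machine instructions that the eBPF virtual machine understands: eBPF bytecode. The Clang compiler does this when given the -target bpf option.

The following excerpt from a Makefile performs the compilation: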

%.bpf.o: %.bpf.c
	clang \
	    -target bpf \
		-I/usr/include/$(shell uname -m)-linux-gnu \
		-g \
	    -O2 -o $@ -c $<

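This produces an object file named hello.bpf.o from the source code in hello.bpf.c.

The -g flag here is optional. It generates debug information, so that when you inspect the object file you can see the source code alongside the bytecode.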

3.4 Inspecting an eBPF Object File

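Use the file command to determine the contents of the file:

$ file hello.bpf.o
hello.bpf.o: ELF 64-bit LSB relocatable, eBPF, version 1 (SYSV), with debug_info, not stripped

This shows that it is an ELF (Executable and Linkable Format) file containing eBPF code, for a 64-bit platform with LSB (least significant byte first, i.e. little-endian) byte order.

If the -g flag was used in the compile step, the file includes debug information.

llvm-objdump can inspect this object file further and show the eBPF instructions in it: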

$ llvm-objdump -S hello.bpf.o

hello.bpf.o:    file format elf64-bpf

Disassembly of section xdp:

0000000000000000 <hello>:
;     bpf_printk("Hello World %d", counter);
       0:       18 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r6 = 0x0 ll
       2:       61 63 00 00 00 00 00 00 r3 = *(u32 *)(r6 + 0x0)
       3:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
       5:       b7 02 00 00 0f 00 00 00 r2 = 0xf
       6:       85 00 00 00 06 00 00 00 call 0x6
;     counter++; 
       7:       61 61 00 00 00 00 00 00 r1 = *(u32 *)(r6 + 0x0)
       8:       07 01 00 00 01 00 00 00 r1 += 0x1
       9:       63 16 00 00 00 00 00 00 *(u32 *)(r6 + 0x0) = r1
;     return XDP_PASS;
      10:       b7 00 00 00 02 00 00 00 r0 = 0x2
      11:       95 00 00 00 00 00 00 00 exit

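To the left of each line of bytecode is the offset of that instruction from the start of hello. llvm-objdump counts this offset in 8-byte units, and eBPF instructions are normally 8 bytes long, so the offset usually increases by one per instruction.

However, the first instruction in this program needs the 16-byte wide instruction encoding in order to set register 6 to a 64-bit value of 0. That is why the instruction on the second line of output is at offset 2.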
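The offsets in the listing can be reproduced with a short sketch that walks the opcodes and advances by two slots for a wide instruction and by one slot otherwise. The names insn_slots and slot_offsets are hypothetical; the sketch assumes that opcode 0x18 (the 64-bit immediate load) is the only opcode that uses the wide encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 0x18 (BPF_LD | BPF_IMM | BPF_DW) is the only wide opcode. */
#define OPCODE_LD_IMM64 0x18

/* Number of 8-byte slots occupied by an instruction with this opcode. */
static size_t insn_slots(uint8_t opcode)
{
    return opcode == OPCODE_LD_IMM64 ? 2 : 1;
}

/* Store the slot offset of each instruction in offsets[] and return
 * the total number of slots used by the program. */
static size_t slot_offsets(const uint8_t *opcodes, size_t n, size_t *offsets)
{
    size_t slot = 0;

    for (size_t i = 0; i < n; i++) {
        offsets[i] = slot;
        slot += insn_slots(opcodes[i]);
    }
    return slot;
}
```

Applied to the ten opcodes of hello (18, 61, 18, b7, 85, 61, 07, 63, b7, 95), this gives the offsets 0, 2, 3, 5, 6, 7, 8, 9, 10, 11 shown by llvm-objdump, for a total of 12 slots.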

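5: b7 02 00 00 0f 00 00 00 r2 = 0xf

The opcode is 0xb7. According to the Unofficial eBPF spec, the corresponding pseudocode is dst = imm, i.e. set the destination register to the immediate value.
0x02 is the register byte: the destination register is register 2.
0x0f is the immediate value, 15 in decimal.

So this instruction means: set Register 2 to the value 15.

10: b7 00 00 00 02 00 00 00 r0 = 0x2

Similarly, this instruction sets Register 0 to the value 2, which is the value of XDP_PASS.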
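The opcode byte itself is made up of smaller fields. The sketch below splits an ALU or jump opcode into its class, source and operation bits, using masks that mirror the BPF_CLASS(), BPF_SRC() and BPF_OP() macros in the kernel's UAPI headers. The macro and function names here are my own, chosen to avoid clashing with the real headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field masks for ALU/jump opcodes (mirroring BPF_CLASS, BPF_SRC, BPF_OP). */
#define OPC_CLASS(code) ((code) & 0x07)
#define OPC_SRC(code)   ((code) & 0x08)
#define OPC_OP(code)    ((code) & 0xf0)

#define CLASS_ALU64 0x07  /* 64-bit arithmetic (BPF_ALU64) */
#define SRC_IMM     0x00  /* operand is the imm field (BPF_K) */
#define OP_MOV      0xb0  /* move (BPF_MOV) */

/* True if the opcode means "dst = imm" on a 64-bit register. */
static bool is_mov64_imm(uint8_t code)
{
    return OPC_CLASS(code) == CLASS_ALU64 &&
           OPC_SRC(code) == SRC_IMM &&
           OPC_OP(code) == OP_MOV;
}
```

For 0xb7 the three fields are 0x07, 0x00 and 0xb0, so is_mov64_imm(0xb7) holds, whereas 0x07 (r1 += 0x1 in the listing) shares the class and source bits but has operation 0x00 (add).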

3.5 Loading the Program into the Kernel

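NOTE:

You may need to be root (or use sudo) to get the BPF privileges that bpftool requires.

Use bpftool to load the program into the kernel. This loads the eBPF program from the compiled object file and pins it to the path /sys/fs/bpf/hello.

$ bpftool prog load hello.bpf.o /sys/fs/bpf/hello

Check that the load succeeded:

$ ls /sys/fs/bpf
hello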

3.6 Inspecting the Loaded Program

List all programs loaded into the kernel:

$ bpftool prog list
...
174: xdp  name hello  tag d35b94b4c0c10efb  gpl
        loaded_at 2025-12-12T11:42:16+0800  uid 0
        xlated 96B  jited 71B  memlock 4096B  map_ids 90,91
        btf_id 187

Show the output as pretty-printed JSON:

$ bpftool prog show id 174 --pretty
{
    "id": 174,
    "type": "xdp",
    "name": "hello",
    "tag": "d35b94b4c0c10efb",
    "gpl_compatible": true,
    "loaded_at": 1765510936,
    "uid": 0,
    "orphaned": false,
    "bytes_xlated": 96,
    "jited": true,
    "bytes_jited": 71,
    "bytes_memlock": 4096,
    "map_ids": [90,91
    ],
    "btf_id": 187
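}

"uid": 0 means the program was loaded by root.
"bytes_xlated": 96 means the program contains 96 bytes of translated eBPF bytecode.
"jited": true, "bytes_jited": 71 means the program has been JIT-compiled, producing 71 bytes of machine code.
"bytes_memlock": 4096 means the program reserves 4096 bytes of memory that will not be paged out.
"map_ids": [90,91] means the program references the BPF maps with IDs 90 and 91 (these are related to the global variable).
"btf_id": 187 means there is a block of BTF information for the program. This information is only included in the object file if it was compiled with the -g flag.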

3.6.1 The BPF Program Tag

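The tag is a SHA (Secure Hashing Algorithm) hash of the program's instructions, which can be used as another identifier for the program.

NOTE:

The ID may change each time the program is loaded or unloaded, but the tag stays the same.

The following commands all refer to the same program, by ID, name, tag and pinned path respectively: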

bpftool prog show id 174
bpftool prog show name hello
bpftool prog show tag d35b94b4c0c10efb
bpftool prog show pinned /sys/fs/bpf/hello

3.6.2 The Translated Bytecode

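The bytes_xlated field tells us how many bytes of "translated" eBPF code there are. This is the eBPF bytecode after it has passed through the verifier (and possibly been modified by the kernel).

Dump the translated eBPF code: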

$ bpftool prog dump xlated name hello
int hello(struct xdp_md * ctx):
; bpf_printk("Hello World %d", counter);
   0: (18) r6 = map[id:24][0]+0
   2: (61) r3 = *(u32 *)(r6 +0)
   3: (18) r1 = map[id:25][0]+0
   5: (b7) r2 = 15
   6: (85) call bpf_trace_printk#-82848
; counter++;
   7: (61) r1 = *(u32 *)(r6 +0)
   8: (07) r1 += 1
   9: (63) *(u32 *)(r6 +0) = r1
; return XDP_PASS;
  10: (b7) r0 = 2
  11: (95) exit

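This looks very similar to the disassembly seen earlier in the llvm-objdump output. The offsets are the same and the instructions are similar. For example, the instruction at offset 5 is r2 = 15. The visible differences are that the placeholder 0x0 addresses now refer to maps, and call 0x6 has been resolved to bpf_trace_printk.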

3.6.3 The JIT-Compiled Machine Code

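The translated bytecode is quite low level, but it is not yet machine code. eBPF uses a JIT compiler to convert the eBPF bytecode into machine code that runs natively on the target CPU.

The bytes_jited field shows that after this conversion the program is 71 bytes long.

bpftool can produce an assembly-language dump of the JIT-compiled code. The listing below is arm64 assembly and is longer than 71 bytes, so it evidently comes from a different machine than the bpftool prog show output above: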

$ bpftool prog dump jited name hello
int hello(struct xdp_md * ctx):
bpf_prog_d35b94b4c0c10efb_hello:
; bpf_printk("Hello World %d", counter);
    0: hint #34
    4: stp x29, x30, [sp, #-16]!
    8: mov x29, sp
    c: stp x19, x20, [sp, #-16]!
    10: stp x21, x22, [sp, #-16]!
    14: stp x25, x26, [sp, #-16]!
    18: mov x25, sp
    1c: mov x26, #0
    20: hint #36
    24: sub sp, sp, #0
    28: mov x19, #-140733193388033
    2c: movk x19, #2190, lsl #16
    30: movk x19, #49152
    34: mov x10, #0
    38: ldr w2, [x19, x10]
    3c: mov x0, #-205419695833089
    40: movk x0, #709, lsl #16
    44: movk x0, #5904
    48: mov x1, #15
    4c: mov x10, #-6992
    50: movk x10, #29844, lsl #16
    54: movk x10, #56832, lsl #32
    58: blr x10
    5c: add x7, x0, #0
; counter++;
    60: mov x10, #0
    64: ldr w0, [x19, x10]
    68: add x0, x0, #1
    6c: mov x10, #0
    70: str w0, [x19, x10]
; return XDP_PASS;
    74: mov x7, #2
    78: mov sp, sp
    7c: ldp x25, x26, [sp], #16
    80: ldp x21, x22, [sp], #16
    84: ldp x19, x20, [sp], #16
    88: ldp x29, x30, [sp], #16
    8c: add x0, x7, #0
    90: ret

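Note:

When I ran this command myself it did not succeed. It returned the following error instead:

$ bpftool prog dump jited name hello
Error: No JIT disassembly support

Based on what I found, my first guess was that the kernel I was using (Ubuntu 24.04) had been built without this feature. However, the message is printed by bpftool itself when bpftool was built without a disassembler library (libbfd or LLVM), so the bpftool build is the more likely cause than the kernel configuration shown below.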

$ zcat /proc/config.gz | grep -E 'CONFIG_BPF_JIT|CONFIG_BPF_JIT_DISASM|CONFIG_DEBUG_INFO_BPF'
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y

# A value of 2 enables the JIT and also emits debug traces of the compiled code to the kernel log
# A value of 1 enables the JIT only
$ cat /proc/sys/net/core/bpf_jit_enable
1
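
The same check can be done from a program. The sketch below reads the sysctl file and maps its value to a description, following the kernel's sysctl documentation for bpf_jit_enable (0, 1 and 2). Both function names are hypothetical, and the descriptions are my own wording.

```c
#include <stdio.h>

/* Read an integer from a sysctl-style file; returns -1 on any failure. */
static int read_jit_enable(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Describe a bpf_jit_enable value (assumption: meanings per kernel docs). */
static const char *jit_enable_meaning(int value)
{
    switch (value) {
    case 0:
        return "JIT disabled, programs are interpreted";
    case 1:
        return "JIT enabled";
    case 2:
        return "JIT enabled, debug traces written to the kernel log";
    default:
        return "unknown value";
    }
}
```

Calling jit_enable_meaning(read_jit_enable("/proc/sys/net/core/bpf_jit_enable")) on the machine above would report that the JIT is enabled.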

3.7 Attaching to an Event

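The Hello World eBPF program is now loaded into the kernel, but it is not yet associated with any event, so nothing will trigger it to run. It needs to be attached to an event.

The type of an eBPF program must match the type of event it is attached to.

Use bpftool to attach the example eBPF program to the XDP event on a network interface:

$ bpftool net attach xdp tag d35b94b4c0c10efb dev eth0

List the network-attached BPF programs: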

$ bpftool net list
xdp:
eth0(2) driver id 174

tc:

flow_dissector:

netfilter:

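The output also shows other potential events in the network stack where eBPF programs can be attached, such as tc and flow_dissector.

Inspect the network interfaces: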

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:01:02:03:04:05 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 174 name hello tag d35b94b4c0c10efb jited

View the trace output:

$ cat /sys/kernel/debug/tracing/trace_pipe

Or:

$ bpftool prog tracelog

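3.8 Global Variables

An eBPF map is a data structure that can be accessed from an eBPF program or from user space. The same map can be accessed repeatedly by different runs of the same program, and multiple programs can access the same map. Because of these properties, eBPF maps can serve as global variables.

NOTE:

eBPF has only supported global variables since 2019.

List the maps loaded into the kernel: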

$ bpftool map list
165: array name hello.bss   flags 0x400
    key 4B value 4B max_entries 1 memlock 4096B
    btf_id 254
166: array name hello.rodata flags 0x80
    key 4B value 15B max_entries 1 memlock 4096B
    btf_id 254 frozen

Inspect the contents of a map.

In an object file compiled from a C program, the bss section typically holds global variables:

$ bpftool map dump name hello.bss
[{
        "value": {
            ".bss": [{
                    "counter": 11127
                }
            ]
        }
    }
]

bpftool can only pretty-print the field names in a map when BTF information is available, and that information is only included when compiling with the -g flag.

$ bpftool map dump name hello.rodata
[{
        "value": {
            ".rodata": [{
                "hello.____fmt": "Hello World %d"
                }
            ]
        }
    }
]
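
The 15B value size of hello.rodata, and the r2 = 15 seen in the bytecode, both follow from the length of the format string. A minimal check, assuming (as I understand the bpf_printk macro) that the string is stored as a char array and its sizeof is passed as the second argument to the helper; the names hello_fmt and hello_fmt_size are hypothetical:

```c
#include <stddef.h>

/* Same format string that hello.bpf.c passes to bpf_printk(). */
static const char hello_fmt[] = "Hello World %d";

/* Size of the string in bytes, including the terminating NUL. */
static size_t hello_fmt_size(void)
{
    return sizeof(hello_fmt);
}
```

The string has 14 visible characters plus the terminating NUL byte, which gives 15.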

3.9 Detaching the Program

Detach the program from the network interface:

$ bpftool net detach xdp dev eth0

List the BPF programs attached to the network stack:

$ bpftool net list
xdp:

tc:

flow_dissector:

netfilter:

However, the program is still loaded in the kernel:

$ bpftool prog show name hello
395: xdp name hello tag 9d0e949f89f1a82c gpl
    loaded_at 2022-12-19T18:20:32+0000 uid 0
    xlated 48B jited 108B memlock 4096B map_ids 4

3.10 Unloading the Program

There is currently no inverse of bpftool prog load. The program can be removed from the kernel by deleting the pinned pseudofile:

$ rm /sys/fs/bpf/hello
$ bpftool prog show name hello

3.11 BPF to BPF Calls

The previous chapter showed tail calls in action. It is also possible to call functions from within an eBPF program.

Example: [hello-func.bpf.c]

static __attribute((noinline)) int get_opcode(struct bpf_raw_tracepoint_args *ctx) {
    return ctx->args[1];
}

__attribute((noinline)) ensures that the compiler does not inline the function.

The eBPF function that calls it looks like this:

SEC("raw_tp")
int hello(struct bpf_raw_tracepoint_args *ctx) {
    int opcode = get_opcode(ctx);
    bpf_printk("Syscall: %d", opcode);
    return 0;
}

After compiling this to an eBPF object file, bpftool can load it into the kernel and confirm that it is loaded:

$ bpftool prog load hello-func.bpf.o /sys/fs/bpf/hello
$ bpftool prog list name hello
893: raw_tracepoint name hello tag 3d9eb0c23d4ab186 gpl
    loaded_at 2023-01-05T18:57:31+0000 uid 0
    xlated 80B  jited 208B   memlock 4096B   map_ids 204
    btf_id 302

It is worth looking at the get_opcode() function in the eBPF bytecode:

$ bpftool prog dump xlated name hello
int hello(struct bpf_raw_tracepoint_args * ctx):
; int opcode = get_opcode(ctx);
   0: (85) call pc+7#bpf_prog_cbacc90865b1b9a5_get_opcode
; bpf_printk("Syscall: %d", opcode);
   1: (18) r1 = map[id:39][0]+0
   3: (b7) r2 = 12
   4: (bf) r3 = r0
   5: (85) call bpf_trace_printk#-82848
; return 0;
   6: (b7) r0 = 0
   7: (95) exit
int get_opcode(struct bpf_raw_tracepoint_args * ctx):
; return ctx->args[1];
   8: (79) r0 = *(u64 *)(r1 +8)
; return ctx->args[1];
   9: (95) exit

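In particular:

0: (85) call pc+7#bpf_prog_cbacc90865b1b9a5_get_opcode

The Unofficial eBPF spec shows that opcode 0x85 is a function call.

So instead of continuing with the next instruction (at offset 1), execution jumps seven instructions ahead of it (pc+7), which means the instruction at offset 8 runs next.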
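The target of such a call can be computed from the offset of the call instruction and its relative operand. A one-line sketch, using a hypothetical helper name:

```c
/* Offset of the instruction that runs after a "call pc+rel" located
 * at insn_index: the offset is relative to the following instruction. */
static int bpf_call_target(int insn_index, int rel)
{
    return insn_index + 1 + rel;
}
```

With the call at offset 0 and pc+7, bpf_call_target(0, 7) evaluates to 8, the first instruction of get_opcode().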

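The function call instruction has to put the current state on the eBPF virtual machine's stack so that execution can continue in the calling function when the called function exits. Because the stack size is limited to 512 bytes, BPF to BPF calls cannot be nested very deeply.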
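
A simplified model of this limit: the stack frames of all functions in a call chain have to fit into 512 bytes together. The sketch below also assumes a maximum of 8 nested frames, which is what I recall from the kernel source (MAX_CALL_FRAMES), and it ignores details such as the verifier rounding up frame sizes. call_chain_fits is a hypothetical name, not a kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BPF_STACK   512  /* total stack bytes, from the text */
#define SKETCH_MAX_CALL_FRAMES 8    /* assumption: kernel's nesting limit */

/* frame_bytes[i] is the stack used by the i-th function in the call
 * chain. Returns true if the whole chain fits within both limits. */
static bool call_chain_fits(const unsigned *frame_bytes, size_t depth)
{
    unsigned total = 0;

    if (depth > SKETCH_MAX_CALL_FRAMES)
        return false;
    for (size_t i = 0; i < depth; i++) {
        total += frame_bytes[i];
        if (total > SKETCH_MAX_BPF_STACK)
            return false;
    }
    return true;
}
```

Under this model two functions that each use 256 bytes of stack fit exactly, and adding a third function with any stack usage at all pushes the chain over the limit.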

3.12 Summary

JIT (just-in-time) compilation: bytecode is converted to native machine code once, when the program is loaded into the kernel.

3.13 Exercises

learning-ebpf-exercises: https://github.com/gaoyangu/learning-ebpf-exercises