===================
Classic BPF vs eBPF
===================

eBPF is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

Some core changes of the eBPF format from classic BPF:

- Number of registers increases from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to be 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass arguments to functions via registers, the
  number of args from an eBPF program to an in-kernel function is restricted
  to 5 and one register is used to accept the return value from an in-kernel
  function. Natively, x86_64 passes the first 6 arguments in registers,
  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
  callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee
  saved registers.

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and the eBPF calling convention maps directly to ABIs used by the
  kernel on 64-bit architectures.

  On 32-bit architectures a JIT may compile programs that use only 32-bit
  arithmetic and let more complex programs be interpreted.

  R0 - R5 are scratch registers and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions; it can only
  call predefined in-kernel functions.
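
  To make the five-argument limit concrete, a callable in-kernel function can
  be thought of as an ordinary C function with up to five u64-sized arguments
  and a u64 return value (the prototype below is purely hypothetical, for
  illustration only)::

      /* hypothetical helper, for illustration only:
       * arg1..arg5 arrive in R1..R5, the return value comes back in R0
       */
      u64 example_helper(u64 arg1, u64 arg2, u64 arg3, u64 arg4, u64 arg5);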

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit when they are written to (see the
  C sketch at the end of this item). That behavior maps directly to the
  x86_64 and arm64 subregister definitions, but makes other JITs more
  difficult.

  32-bit architectures run 64-bit eBPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit, because on 64-bit architectures pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
  so 32-bit eBPF registers would otherwise require defining a register-pair
  ABI; it would then not be possible to use a direct eBPF register to HW
  register mapping, and the JIT would need to do combine/split/move
  operations for every register in and out of the function, which is complex,
  bug prone and slow. Another reason is the use of atomic 64-bit counters.
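
  As a rough C sketch of the subregister rule above (illustration only, not
  kernel code): a 32-bit ALU operation writes the low 32 bits of the 64-bit
  destination register and zero-extends the result, while the 64-bit variant
  uses the full register::

      #include <stdint.h>

      typedef uint64_t ebpf_reg;   /* model of one 64-bit eBPF register */

      /* BPF_ADD | BPF_X | BPF_ALU: 32-bit add, result zero-extended */
      static void alu32_add(ebpf_reg *dst, ebpf_reg src)
      {
              *dst = (uint32_t)((uint32_t)*dst + (uint32_t)src);
      }

      /* BPF_ADD | BPF_X | BPF_ALU64: full 64-bit add */
      static void alu64_add(ebpf_reg *dst, ebpf_reg src)
      {
              *dst += src;
      }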

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are being replaced with alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.
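
  As a small illustration (not part of the original text), the same kind of
  conditional jump can be written with the uapi instruction structures; the
  register choice and jump offsets below are made up for the example::

      #include <linux/bpf.h>      /* struct bpf_insn, BPF_REG_*, BPF_JEQ */
      #include <linux/filter.h>   /* struct sock_filter, BPF_JUMP() */

      /* classic BPF: both targets are explicit in the instruction,
       * "if (A == 42) skip 1 insn (jt), else skip 0 insns (jf)"
       */
      struct sock_filter classic_jump =
              BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 42, 1, 0);

      /* eBPF: only the true target is encoded in 'off'; the false case
       * simply falls through to the next instruction
       */
      struct bpf_insn ebpf_jump = {
              .code    = BPF_JMP | BPF_JEQ | BPF_K,
              .dst_reg = BPF_REG_1,   /* compare R1 against the immediate */
              .off     = 1,           /* skip 1 insn when the condition holds */
              .imm     = 42,
      };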

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the eBPF program needs to
  place function arguments into R1 to R5 registers to satisfy the calling
  convention, then the interpreter will take them from registers and pass
  them to the in-kernel function. If R1 - R5 registers are mapped to CPU
  registers that are used for argument passing on the given architecture,
  the JIT compiler doesn't need to emit extra moves. Function arguments will
  be in the correct registers and the BPF_CALL instruction will be JITed as
  a single 'call' HW instruction. This calling convention was picked to cover
  common call situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  the return value of the function. Since R6 - R9 are callee saved, their
  state is preserved across the call.

  For example, consider three C functions::

      u64 f1() { return (*_f2)(1); }
      u64 f2(u64 a) { return f3(a + 1, a); }
      u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

      f1:
          movl $1, %edi
          movq _f2(%rip), %rax
          jmp *%rax
      f3:
          movq %rdi, %rax
          subq %rsi, %rax
          ret

  Function f2 in eBPF may look like::

      f2:
          bpf_mov R2, R1
          bpf_add R1, 1
          bpf_call f3
          bpf_exit

  If f2 is JITed and the pointer stored to ``_f2``, the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, the __bpf_prog_run() interpreter
  needs to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which
  is already placed into R1 (e.g. on __bpf_prog_run() startup) and the
  programs can call kernel functions with up to 5 arguments. Calls with 6 or
  more arguments are currently not supported, but these restrictions can be
  lifted if necessary in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, the x86_64 JIT compiler can map them as ...

    ::

        R0 - rax
        R1 - rdi
        R2 - rsi
        R3 - rdx
        R4 - rcx
        R5 - r8
        R6 - rbx
        R7 - r13
        R8 - r14
        R9 - r15
        R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument
  passing and rbx, r12 - r15 are callee saved.

  Then the following eBPF pseudo-program::

      bpf_mov R6, R1 /* save ctx */
      bpf_mov R2, 2
      bpf_mov R3, 3
      bpf_mov R4, 4
      bpf_mov R5, 5
      bpf_call foo
      bpf_mov R7, R0 /* save foo() return value */
      bpf_mov R1, R6 /* restore ctx for next call */
      bpf_mov R2, 6
      bpf_mov R3, 7
      bpf_mov R4, 8
      bpf_mov R5, 9
      bpf_call bar
      bpf_add R0, R7
      bpf_exit

  after JIT to x86_64 may look like::

      push %rbp
      mov %rsp,%rbp
      sub $0x228,%rsp
      mov %rbx,-0x228(%rbp)
      mov %r13,-0x220(%rbp)
      mov %rdi,%rbx
      mov $0x2,%esi
      mov $0x3,%edx
      mov $0x4,%ecx
      mov $0x5,%r8d
      callq foo
      mov %rax,%r13
      mov %rbx,%rdi
      mov $0x6,%esi
      mov $0x7,%edx
      mov $0x8,%ecx
      mov $0x9,%r8d
      callq bar
      add %r13,%rax
      mov -0x228(%rbp),%rbx
      mov -0x220(%rbp),%r13
      leaveq
      retq

  which is in this example equivalent in C to::

      u64 bpf_filter(u64 ctx)
      {
          return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
      }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in the proper
  registers and place their return value into ``%rax``, which is R0 in eBPF.
  Prologue and epilogue are emitted by the JIT and are implicit in the
  interpreter. R0 - R5 are scratch registers, so an eBPF program needs to
  save their values itself if they are needed across calls, as defined by the
  calling convention.

  For example the following program is invalid::

      bpf_mov R1, 1
      bpf_call foo
      bpf_mov R0, R1
      bpf_exit

  After the call the registers R1 - R5 contain junk values and cannot be
  read. An in-kernel verifier (see verifier.rst) is used to validate eBPF
  programs.
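
  A corrected variant keeps the value in a callee saved register instead. As
  a rough sketch in C using the uapi instruction encoding (with the real
  helper id BPF_FUNC_get_prandom_u32 standing in for the hypothetical foo(),
  since calls name in-kernel helpers by id)::

      #include <linux/bpf.h>   /* struct bpf_insn, BPF_REG_*, BPF_FUNC_* */

      struct bpf_insn fixed_prog[] = {
              /* R6 = 1: R6 is callee saved, so it survives the call */
              { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                .dst_reg = BPF_REG_6, .imm = 1 },
              /* call a helper; afterwards R1 - R5 are unreadable and R0
               * holds the return value
               */
              { .code = BPF_JMP | BPF_CALL,
                .imm = BPF_FUNC_get_prandom_u32 },
              /* R0 = R6: reading R6 here is fine, reading R1 would not be */
              { .code = BPF_ALU64 | BPF_MOV | BPF_X,
                .dst_reg = BPF_REG_0, .src_reg = BPF_REG_6 },
              { .code = BPF_JMP | BPF_EXIT },
      };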

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and eBPF are both two-operand instruction sets, which
helps to do a one-to-one mapping between an eBPF insn and an x86 insn during
JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by the specific use case. For seccomp, register R1
points to seccomp_data; for converted BPF filters, R1 points to a skb.
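
For reference, the seccomp context layout that R1 points to is the uapi
structure below (as defined in include/uapi/linux/seccomp.h)::

    struct seccomp_data {
            int   nr;                   /* system call number */
            __u32 arch;                 /* AUDIT_ARCH_* value */
            __u64 instruction_pointer;  /* CPU instruction pointer */
            __u64 args[6];              /* system call arguments */
    };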

A program that is translated internally consists of the following elements::

    op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
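
In C, these two layouts are the uapi structures ``struct sock_filter``
(classic BPF, from linux/filter.h) and ``struct bpf_insn`` (eBPF, from
linux/bpf.h), reproduced here for reference::

    struct sock_filter {        /* classic BPF instruction */
            __u16 code;         /* actual filter code */
            __u8  jt;           /* jump true offset */
            __u8  jf;           /* jump false offset */
            __u32 k;            /* generic multiuse field */
    };

    struct bpf_insn {           /* eBPF instruction */
            __u8  code;         /* opcode */
            __u8  dst_reg:4;    /* destination register */
            __u8  src_reg:4;    /* source register */
            __s16 off;          /* signed offset */
            __s32 imm;          /* signed immediate constant */
    };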

So far 87 eBPF instructions have been implemented. The 8-bit 'op' opcode
field has room for new instructions. Some of them may use 16/24/32 byte
encoding. New instructions must be a multiple of 8 bytes to preserve backward
compatibility.

eBPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to eBPF.
For example, socket filters are not using the ``exclusive add`` instruction,
but tracing filters may use it to maintain counters of events. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

eBPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as an assembler.
Tracing filters may use it as an assembler to generate code from within the
kernel. In-kernel usage may not be bound by security considerations, since
generated eBPF code may be optimizing an internal code path and is not
exposed to user space. Safety of eBPF can come from the verifier (see
verifier.rst). In such use cases as described, it may be used as a safe
instruction set.

Just like the original BPF, eBPF runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the
program can be determined in two steps: the first step does a depth-first
search to disallow loops and performs other CFG validation; the second step
starts from the first insn and descends all possible paths. It simulates
execution of every insn and observes the state change of registers and stack.

opcode encoding
===============

eBPF reuses most of the opcode encoding from classic BPF to simplify the
conversion of classic BPF to eBPF.

For arithmetic and jump instructions the 8-bit 'code' field is divided into
three parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

The 4th bit encodes the source operand ...

  ::

      BPF_K     0x00
      BPF_X     0x08

 * in classic BPF, this means::

      BPF_SRC(code) == BPF_X - use register X as source operand
      BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

 * in eBPF, this means::

      BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
      BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
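
As a small self-check of this encoding in C, the uapi helper macros
BPF_CLASS(), BPF_OP() and BPF_SRC() take such a code byte apart again::

    #include <assert.h>
    #include <linux/bpf.h>   /* opcode macros via linux/bpf_common.h */

    int main(void)
    {
            /* BPF_ADD | BPF_X | BPF_ALU == 0x00 | 0x08 | 0x04 == 0x0c */
            unsigned char code = BPF_ADD | BPF_X | BPF_ALU;

            assert(BPF_CLASS(code) == BPF_ALU);  /* 3 LSB: instruction class */
            assert(BPF_OP(code)    == BPF_ADD);  /* 4 MSB: operation */
            assert(BPF_SRC(code)   == BPF_X);    /* bit 3: register source */
            return 0;
    }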

Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
operands for the comparisons instead.

For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

  B  - 1 byte
  H  - 2 byte
  W  - 4 byte
  DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS     0x20
  BPF_IND     0x40
  BPF_MEM     0x60
  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
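
As with the arithmetic and jump classes, the uapi macros can take a
load/store opcode apart; a short C self-check (assuming the Linux uapi
headers)::

    #include <assert.h>
    #include <linux/bpf.h>   /* opcode macros via linux/bpf_common.h */

    int main(void)
    {
            /* "load a 4-byte word from memory into a register" */
            unsigned char code = BPF_LDX | BPF_MEM | BPF_W;

            assert(BPF_CLASS(code) == BPF_LDX);  /* 3 LSB: instruction class */
            assert(BPF_SIZE(code)  == BPF_W);    /* 2 bits: 4-byte access */
            assert(BPF_MODE(code)  == BPF_MEM);  /* 3 MSB: addressing mode */
            return 0;
    }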