Commit e63a023

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-12-30

The following pull-request contains BPF updates for your *net-next* tree.

We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).

The main changes are:

1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.

2) Beautify and de-verbose verifier logs, from Christy.

3) Composable verifier types, from Hao.

4) bpf_strncmp helper, from Hou.

5) bpf.h header dependency cleanup, from Jakub.

6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.

7) Sleepable local storage, from KP.

8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================

Signed-off-by: David S. Miller <[email protected]>
2 parents ce2b6eb + 9e6b19a commit e63a023

221 files changed, +3506 -1591 lines changed

Documentation/bpf/classic_vs_extended.rst

Lines changed: 376 additions & 0 deletions
@@ -0,0 +1,376 @@
1+
2+
===================
3+
Classic BPF vs eBPF
4+
===================
5+
6+
eBPF is designed to be JITed with one to one mapping, which can also open up
7+
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
8+
an eBPF backend that performs almost as fast as natively compiled code.
9+
10+
Some core changes of the eBPF format from classic BPF:
11+
12+
- Number of registers increases from 2 to 10:
13+
14+
The old format had two registers A and X, and a hidden frame pointer. The
15+
new layout extends this to be 10 internal registers and a read-only frame
16+
pointer. Since 64-bit CPUs are passing arguments to functions via registers
17+
the number of args from eBPF program to in-kernel function is restricted
18+
to 5 and one register is used to accept return value from an in-kernel
19+
function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
20+
sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
21+
registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
22+
23+
Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
24+
etc, and eBPF calling convention maps directly to ABIs used by the kernel on
25+
64-bit architectures.
26+
27+
On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
28+
and may leave more complex programs to be interpreted.
29+
30+
R0 - R5 are scratch registers and an eBPF program needs to spill/fill them if
31+
necessary across calls. Note that there is only one eBPF program (== one
32+
eBPF main routine) and it cannot call other eBPF functions; it can only
33+
call predefined in-kernel functions.
34+
35+
- Register width increases from 32-bit to 64-bit:
36+
37+
Still, the semantics of the original 32-bit ALU operations are preserved
38+
via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
39+
subregisters that zero-extend into 64-bit if they are being written to.
40+
That behavior maps directly to x86_64 and arm64 subregister definition, but
41+
makes other JITs more difficult.
42+
43+
32-bit architectures run 64-bit eBPF programs via interpreter.
44+
Their JITs may convert BPF programs that only use 32-bit subregisters into
45+
native instruction set and let the rest be interpreted.
46+
47+
Operation is 64-bit, because on 64-bit architectures, pointers are also
48+
64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
49+
so 32-bit eBPF registers would otherwise require defining a register-pair
50+
ABI. A direct eBPF register to HW register mapping would then be impossible,
51+
and the JIT would need to do combine/split/move operations for every
52+
register in and out of the function, which is complex, bug-prone and slow.
53+
Another reason is the use of atomic 64-bit counters.
54+
55+
- Conditional jt/jf targets replaced with jt/fall-through:
56+
57+
While the original design has constructs such as ``if (cond) jump_true;
58+
else jump_false;``, they are replaced with alternative constructs like
59+
``if (cond) jump_true; /* else fall-through */``.
60+
61+
- Introduces bpf_call insn and register passing convention for zero overhead
62+
calls from/to other kernel functions:
63+
64+
Before an in-kernel function call, the eBPF program needs to
65+
place function arguments into R1 to R5 registers to satisfy calling
66+
convention, then the interpreter will take them from registers and pass
67+
to in-kernel function. If R1 - R5 registers are mapped to CPU registers
68+
that are used for argument passing on given architecture, the JIT compiler
69+
doesn't need to emit extra moves. Function arguments will be in the correct
70+
registers and BPF_CALL instruction will be JITed as single 'call' HW
71+
instruction. This calling convention was picked to cover common call
72+
situations without performance penalty.
73+
74+
After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
75+
a return value of the function. Since R6 - R9 are callee saved, their state
76+
is preserved across the call.
77+
78+
For example, consider three C functions::
79+
80+
u64 f1() { return (*_f2)(1); }
81+
u64 f2(u64 a) { return f3(a + 1, a); }
82+
u64 f3(u64 a, u64 b) { return a - b; }
83+
84+
GCC can compile f1, f3 into x86_64::
85+
86+
f1:
87+
movl $1, %edi
88+
movq _f2(%rip), %rax
89+
jmp *%rax
90+
f3:
91+
movq %rdi, %rax
92+
subq %rsi, %rax
93+
ret
94+
95+
Function f2 in eBPF may look like::
96+
97+
f2:
98+
bpf_mov R2, R1
99+
bpf_add R1, 1
100+
bpf_call f3
101+
bpf_exit
102+
103+
If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3 and
104+
returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
105+
be used to call into f2.
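
As a rough illustration of the f1 -> f2 -> f3 chain above, the following self-contained C sketch models the JITed case in user space. The names ``f2_native`` and ``f2_ptr`` are hypothetical: ``f2_native`` stands in for the native image a JIT would produce for f2, and ``f2_ptr`` plays the role of ``_f2`` from the text. It is a model of the idea only, not kernel code.

```c
#include <stdint.h>

typedef uint64_t u64;

/* f3 as in the text: an ordinary native function. */
static u64 f3(u64 a, u64 b) { return a - b; }

/*
 * Hypothetical stand-in for the JITed image of the eBPF function f2:
 * the argument arrives where R1 is mapped, the result leaves where R0
 * is mapped, so it looks like any other native function.
 */
static u64 f2_native(u64 a) { return f3(a + 1, a); }

/* Plays the role of _f2: the pointer a JIT would store its output to. */
static u64 (*f2_ptr)(u64) = f2_native;

/* f1 as in the text: an indirect call through the stored pointer. */
static u64 f1(void) { return (*f2_ptr)(1); }
```

Calling ``f1()`` here evaluates ``f3(2, 1)``, so no glue code is needed between the three functions once the pointer holds native code.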
106+
107+
For practical reasons all eBPF programs have only one argument 'ctx' which is
108+
already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
109+
can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
110+
are currently not supported, but these restrictions can be lifted if necessary
111+
in the future.
112+
113+
On 64-bit architectures all registers map to HW registers one to one. For
114+
example, x86_64 JIT compiler can map them as ...
115+
116+
::
117+
118+
R0 - rax
119+
R1 - rdi
120+
R2 - rsi
121+
R3 - rdx
122+
R4 - rcx
123+
R5 - r8
124+
R6 - rbx
125+
R7 - r13
126+
R8 - r14
127+
R9 - r15
128+
R10 - rbp
129+
130+
... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
131+
and rbx, r12 - r15 are callee saved.
132+
133+
Then the following eBPF pseudo-program::
134+
135+
bpf_mov R6, R1 /* save ctx */
136+
bpf_mov R2, 2
137+
bpf_mov R3, 3
138+
bpf_mov R4, 4
139+
bpf_mov R5, 5
140+
bpf_call foo
141+
bpf_mov R7, R0 /* save foo() return value */
142+
bpf_mov R1, R6 /* restore ctx for next call */
143+
bpf_mov R2, 6
144+
bpf_mov R3, 7
145+
bpf_mov R4, 8
146+
bpf_mov R5, 9
147+
bpf_call bar
148+
bpf_add R0, R7
149+
bpf_exit
150+
151+
After JIT to x86_64, the result may look like::
152+
153+
push %rbp
154+
mov %rsp,%rbp
155+
sub $0x228,%rsp
156+
mov %rbx,-0x228(%rbp)
157+
mov %r13,-0x220(%rbp)
158+
mov %rdi,%rbx
159+
mov $0x2,%esi
160+
mov $0x3,%edx
161+
mov $0x4,%ecx
162+
mov $0x5,%r8d
163+
callq foo
164+
mov %rax,%r13
165+
mov %rbx,%rdi
166+
mov $0x6,%esi
167+
mov $0x7,%edx
168+
mov $0x8,%ecx
169+
mov $0x9,%r8d
170+
callq bar
171+
add %r13,%rax
172+
mov -0x228(%rbp),%rbx
173+
mov -0x220(%rbp),%r13
174+
leaveq
175+
retq
176+
177+
Which is in this example equivalent in C to::
178+
179+
u64 bpf_filter(u64 ctx)
180+
{
181+
return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
182+
}
183+
184+
In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
185+
arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
186+
registers and place their return value into ``%rax`` which is R0 in eBPF.
187+
Prologue and epilogue are emitted by JIT and are implicit in the
188+
interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
189+
them across the calls as defined by calling convention.
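
The calling convention described above can be modelled in plain C. The sketch below is a toy, not the kernel interpreter: ``toy_call``, ``run_filter`` and the ``POISON`` value are hypothetical names, and the bodies of ``foo`` and ``bar`` are arbitrary stand-ins for in-kernel functions. It passes R1 - R5 as arguments, puts the result in R0, and deliberately clobbers R1 - R5 after each call so that only values saved in R6 - R9 survive.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef u64 (*helper_fn)(u64, u64, u64, u64, u64);

/* Arbitrary stand-ins for the in-kernel functions foo() and bar(). */
static u64 foo(u64 a, u64 b, u64 c, u64 d, u64 e) { return a + b + c + d + e; }
static u64 bar(u64 a, u64 b, u64 c, u64 d, u64 e) { return a * b + c + d + e; }

#define POISON 0xdeadbeefdeadbeefULL

/* Toy bpf_call: arguments from R1-R5, result in R0, R1-R5 clobbered. */
static void toy_call(u64 regs[11], helper_fn fn)
{
    regs[0] = fn(regs[1], regs[2], regs[3], regs[4], regs[5]);
    for (int i = 1; i <= 5; i++)
        regs[i] = POISON;
}

/* The pseudo-program from the text, one statement per eBPF insn. */
static u64 run_filter(u64 ctx)
{
    u64 regs[11] = { 0 };

    regs[1] = ctx;              /* ctx is already in R1 on entry */
    regs[6] = regs[1];          /* bpf_mov R6, R1: save ctx */
    regs[2] = 2; regs[3] = 3; regs[4] = 4; regs[5] = 5;
    toy_call(regs, foo);        /* bpf_call foo */
    regs[7] = regs[0];          /* bpf_mov R7, R0: save foo() result */
    regs[1] = regs[6];          /* bpf_mov R1, R6: restore ctx */
    regs[2] = 6; regs[3] = 7; regs[4] = 8; regs[5] = 9;
    toy_call(regs, bar);        /* bpf_call bar */
    regs[0] += regs[7];         /* bpf_add R0, R7 */
    return regs[0];             /* bpf_exit */
}
```

Because R6 and R7 are never touched by ``toy_call``, ``run_filter(ctx)`` equals ``foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9)``, matching the C equivalent given in the text.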
190+
191+
For example the following program is invalid::
192+
193+
bpf_mov R1, 1
194+
bpf_call foo
195+
bpf_mov R0, R1
196+
bpf_exit
197+
198+
After the call the registers R1-R5 contain junk values and cannot be read.
199+
An in-kernel verifier.rst is used to validate eBPF programs.
200+
201+
Also in the new design, eBPF is limited to 4096 insns, which means that any
202+
program will terminate quickly and will only call a fixed number of kernel
203+
functions. Original BPF and eBPF are two operand instructions,
204+
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
205+
206+
The input context pointer for invoking the interpreter function is generic,
207+
its content is defined by a specific use case. For seccomp register R1 points
208+
to seccomp_data, for converted BPF filters R1 points to a skb.
209+
210+
A program that is translated internally consists of the following elements::
211+
212+
op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
213+
214+
So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
215+
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
216+
instructions must be a multiple of 8 bytes to preserve backward compatibility.
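
The field layout above can be written down as two C structs to make the sizes concrete. This is a sketch under assumptions: ``struct classic_insn``, ``struct ebpf_insn`` and ``mov64_reg`` are names chosen here for illustration, and the 4-bit fields rely on the compiler accepting ``uint8_t`` bit-fields, as GCC and Clang do. Both layouts occupy 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Classic BPF instruction: op:16, jt:8, jf:8, k:32 */
struct classic_insn {
    uint16_t op;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* eBPF instruction: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ebpf_insn {
    uint8_t  op;
    uint8_t  dst_reg:4;
    uint8_t  src_reg:4;
    int16_t  off;
    int32_t  imm;
};

/* Both encodings are one 8-byte slot. */
_Static_assert(sizeof(struct classic_insn) == 8, "classic insn is 8 bytes");
_Static_assert(sizeof(struct ebpf_insn) == 8, "eBPF insn is 8 bytes");

/* Illustrative helper: 64-bit register move, opcode 0xb0 | 0x08 | 0x07. */
static struct ebpf_insn mov64_reg(uint8_t dst, uint8_t src)
{
    struct ebpf_insn insn = {
        .op = 0xbf,
        .dst_reg = dst & 0xf,
        .src_reg = src & 0xf,
        .off = 0,
        .imm = 0,
    };
    return insn;
}
```

For instance, ``mov64_reg(6, 1)`` describes ``bpf_mov R6, R1`` from the earlier pseudo-program.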
217+
218+
eBPF is a general purpose RISC instruction set. Not every register and
219+
every instruction are used during translation from original BPF to eBPF.
220+
For example, socket filters are not using the ``exclusive add`` instruction, but
221+
tracing filters may use it to maintain counters of events. Register R9
222+
is not used by socket filters either, but more complex filters may be running
223+
out of registers and would have to resort to spill/fill to stack.
224+
225+
eBPF can be used as a generic assembler for last step performance
226+
optimizations; socket filters and seccomp are using it as an assembler. Tracing
227+
filters may use it as an assembler to generate code from the kernel. In-kernel usage
228+
may not be bounded by security considerations, since generated eBPF code
229+
may be optimizing an internal code path and not be exposed to user space.
230+
Safety of eBPF can come from the verifier.rst. In such use cases as
231+
described, it may be used as a safe instruction set.
232+
233+
Just like the original BPF, eBPF runs within a controlled environment,
234+
is deterministic and the kernel can easily prove that. The safety of the program
235+
can be determined in two steps: the first step does a depth-first search to disallow
236+
loops and performs other CFG validation; the second step starts from the first insn and
237+
descends all possible paths. It simulates execution of every insn and observes
238+
the state change of registers and stack.
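
The first of the two steps above, a depth-first search that rejects loops, might be sketched as follows. This is not the kernel's implementation: ``struct toy_insn``, ``cfg_is_rejected`` and the two sample programs are hypothetical, only control flow is modelled, and a ``jump_off`` of 0 is used here to mean "no jump". An edge back to an instruction that is still on the current DFS path is a back-edge, so the program is rejected; control flow leaving the program is rejected too.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INSNS 16

/* Hypothetical toy instruction: only its control flow is modelled. */
struct toy_insn {
    bool is_exit;        /* no successors */
    bool falls_through;  /* execution may continue at i + 1 */
    int  jump_off;       /* 0 = no jump, else target is i + 1 + jump_off */
};

enum visit_state { UNSEEN = 0, ON_PATH, DONE };

static bool dfs_rejects(const struct toy_insn *prog, int len, int i, int *state)
{
    if (i < 0 || i >= len)
        return true;                 /* control flow leaves the program */
    if (state[i] == ON_PATH)
        return true;                 /* back-edge: a loop */
    if (state[i] == DONE)
        return false;                /* already fully explored */

    state[i] = ON_PATH;
    if (!prog[i].is_exit) {
        if (prog[i].falls_through && dfs_rejects(prog, len, i + 1, state))
            return true;
        if (prog[i].jump_off != 0 &&
            dfs_rejects(prog, len, i + 1 + prog[i].jump_off, state))
            return true;
    }
    state[i] = DONE;
    return false;
}

static bool cfg_is_rejected(const struct toy_insn *prog, int len)
{
    int state[MAX_INSNS] = { UNSEEN };

    assert(len > 0 && len <= MAX_INSNS);
    return dfs_rejects(prog, len, 0, state);
}

/* Sample programs, three instructions each. */
static const struct toy_insn straight_line[] = {
    { false, true, 1 },    /* 0: conditional jump to insn 2 */
    { false, true, 0 },    /* 1: plain insn */
    { true, false, 0 },    /* 2: exit */
};
static const struct toy_insn with_back_edge[] = {
    { false, true, 0 },    /* 0: plain insn */
    { false, true, -2 },   /* 1: conditional jump back to insn 0 */
    { true, false, 0 },    /* 2: exit */
};
```

The second step, simulating every instruction along every path and tracking register and stack state, is considerably more involved and is not modelled here.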
239+
240+
opcode encoding
241+
===============
242+
243+
eBPF is reusing most of the opcode encoding from classic to simplify conversion
244+
of classic BPF to eBPF.
245+
246+
For arithmetic and jump instructions the 8-bit 'code' field is divided into three
247+
parts::
248+
249+
+----------------+--------+--------------------+
250+
| 4 bits | 1 bit | 3 bits |
251+
| operation code | source | instruction class |
252+
+----------------+--------+--------------------+
253+
(MSB) (LSB)
254+
255+
Three LSB bits store instruction class which is one of:
256+
257+
=================== ===============
258+
Classic BPF classes eBPF classes
259+
=================== ===============
260+
BPF_LD 0x00 BPF_LD 0x00
261+
BPF_LDX 0x01 BPF_LDX 0x01
262+
BPF_ST 0x02 BPF_ST 0x02
263+
BPF_STX 0x03 BPF_STX 0x03
264+
BPF_ALU 0x04 BPF_ALU 0x04
265+
BPF_JMP 0x05 BPF_JMP 0x05
266+
BPF_RET 0x06 BPF_JMP32 0x06
267+
BPF_MISC 0x07 BPF_ALU64 0x07
268+
=================== ===============
269+
270+
The 4th bit encodes the source operand ...
271+
272+
::
273+
274+
BPF_K 0x00
275+
BPF_X 0x08
276+
277+
* in classic BPF, this means::
278+
279+
BPF_SRC(code) == BPF_X - use register X as source operand
280+
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
281+
282+
* in eBPF, this means::
283+
284+
BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
285+
BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
286+
287+
... and four MSB bits store operation code.
288+
289+
If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
290+
291+
BPF_ADD 0x00
292+
BPF_SUB 0x10
293+
BPF_MUL 0x20
294+
BPF_DIV 0x30
295+
BPF_OR 0x40
296+
BPF_AND 0x50
297+
BPF_LSH 0x60
298+
BPF_RSH 0x70
299+
BPF_NEG 0x80
300+
BPF_MOD 0x90
301+
BPF_XOR 0xa0
302+
BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
303+
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
304+
BPF_END 0xd0 /* eBPF only: endianness conversion */
305+
306+
If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
307+
308+
BPF_JA 0x00 /* BPF_JMP only */
309+
BPF_JEQ 0x10
310+
BPF_JGT 0x20
311+
BPF_JGE 0x30
312+
BPF_JSET 0x40
313+
BPF_JNE 0x50 /* eBPF only: jump != */
314+
BPF_JSGT 0x60 /* eBPF only: signed '>' */
315+
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
316+
BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
317+
BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
318+
BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
319+
BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
320+
BPF_JSLT 0xc0 /* eBPF only: signed '<' */
321+
BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
322+
323+
So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
324+
and eBPF. There are only two registers in classic BPF, so it means A += X.
325+
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
326+
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
327+
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
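
To make the three-part split of the 'code' byte concrete, the sketch below composes and decodes a few arithmetic opcodes. The constant values are the ones listed in the tables above; they are redefined locally so the example stands alone, and ``make_code`` is a hypothetical helper introduced for illustration.

```c
#include <stdint.h>

/* Instruction classes (three LSBs), values from the table above. */
#define BPF_ALU    0x04
#define BPF_JMP    0x05
#define BPF_ALU64  0x07

/* Source operand (the 4th bit). */
#define BPF_K      0x00
#define BPF_X      0x08

/* Operation codes (four MSBs). */
#define BPF_ADD    0x00
#define BPF_XOR    0xa0
#define BPF_MOV    0xb0

/* Field extractors for the 8-bit 'code'. */
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_SRC(code)   ((code) & 0x08)
#define BPF_OP(code)    ((code) & 0xf0)

/* Hypothetical helper: OR the three fields into one opcode byte. */
static uint8_t make_code(uint8_t op, uint8_t src, uint8_t cls)
{
    return (uint8_t)(op | src | cls);
}
```

With these definitions, ``BPF_ADD | BPF_X | BPF_ALU`` is the byte 0x0c and ``BPF_XOR | BPF_K | BPF_ALU`` is 0xa4; applying the three extractors to a byte recovers the fields it was built from.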
328+
329+
Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
330+
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
331+
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
332+
exactly the same operations as BPF_ALU, but with 64-bit wide operands
333+
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
334+
dst_reg = dst_reg + src_reg
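
The difference between the two classes can be shown with a small C model. The function names ``alu32_add`` and ``alu64_add`` are hypothetical; the zero-extension of the 32-bit result into the full 64-bit register follows the subregister behaviour described earlier in this document.

```c
#include <stdint.h>

/*
 * BPF_ADD | BPF_X | BPF_ALU: add the lower 32-bit subregisters and
 * zero-extend the 32-bit result into the 64-bit destination.
 */
static uint64_t alu32_add(uint64_t dst_reg, uint64_t src_reg)
{
    return (uint32_t)((uint32_t)dst_reg + (uint32_t)src_reg);
}

/* BPF_ADD | BPF_X | BPF_ALU64: full 64-bit addition. */
static uint64_t alu64_add(uint64_t dst_reg, uint64_t src_reg)
{
    return dst_reg + src_reg;
}
```

With ``dst_reg = 0x1ffffffff`` and ``src_reg = 1``, the 32-bit form wraps to 0 and clears the upper half, while the 64-bit form carries into bit 33.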
335+
336+
Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
337+
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
338+
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
339+
in eBPF means function exit only. The eBPF program needs to store return
340+
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
341+
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
342+
operands for the comparisons instead.
343+
344+
For load and store instructions the 8-bit 'code' field is divided as::
345+
346+
+--------+--------+-------------------+
347+
| 3 bits | 2 bits | 3 bits |
348+
| mode | size | instruction class |
349+
+--------+--------+-------------------+
350+
(MSB) (LSB)
351+
352+
Size modifier is one of ...
353+
354+
::
355+
356+
BPF_W 0x00 /* word */
357+
BPF_H 0x08 /* half word */
358+
BPF_B 0x10 /* byte */
359+
BPF_DW 0x18 /* eBPF only, double word */
360+
361+
... which encodes size of load/store operation::
362+
363+
B - 1 byte
364+
H - 2 byte
365+
W - 4 byte
366+
DW - 8 byte (eBPF only)
367+
368+
Mode modifier is one of::
369+
370+
BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
371+
BPF_ABS 0x20
372+
BPF_IND 0x40
373+
BPF_MEM 0x60
374+
BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
375+
BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
376+
BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */
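
The load/store split of the 'code' byte can be decoded the same way as the arithmetic one. In the sketch below the constants repeat the values from the tables above so that it compiles on its own, and ``size_in_bytes`` is a hypothetical helper that maps the size modifier to the access width.

```c
#include <stdint.h>

/* Instruction classes used in this example (three LSBs). */
#define BPF_LDX   0x01
#define BPF_STX   0x03

/* Size modifiers (two bits). */
#define BPF_W     0x00
#define BPF_H     0x08
#define BPF_B     0x10
#define BPF_DW    0x18

/* Mode modifier used in this example (three MSBs). */
#define BPF_MEM   0x60

/* Field extractors for load and store instructions. */
#define BPF_SIZE(code) ((code) & 0x18)
#define BPF_MODE(code) ((code) & 0xe0)

/* Hypothetical helper: access width in bytes for a load/store opcode. */
static int size_in_bytes(uint8_t code)
{
    switch (BPF_SIZE(code)) {
    case BPF_B:  return 1;
    case BPF_H:  return 2;
    case BPF_W:  return 4;
    case BPF_DW: return 8;
    }
    return -1;  /* not reachable: all four 2-bit values are handled */
}
```

For example, ``BPF_LDX | BPF_MEM | BPF_DW`` is the byte 0x79, an 8-byte load in memory mode.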

Documentation/bpf/index.rst

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ that goes into great technical depth about the BPF Architecture.
21 21
helpers
22 22
programs
23 23
maps
24+
classic_vs_extended.rst
24 25
bpf_licensing
25 26
test_debug
26 27
other
