
Commit 5ab154f

kkdwivedi authored and Alexei Starovoitov committed
bpf: Introduce BPF standard streams
Add support for a stream API to the kernel and expose related kfuncs to BPF programs. Two streams are exposed, BPF_STDOUT and BPF_STDERR. These can be used for printing messages that can be consumed from user space, so the facility is similar in spirit to the existing trace_pipe interface. The kernel will use the BPF_STDERR stream to notify the program of any errors encountered at runtime. BPF programs themselves may use both streams for writing debug messages. BPF library-like code may use BPF_STDERR to print warnings or errors on misuse at runtime.

The implementation of a stream is as follows. Every time a message is emitted from the kernel (directly, or through a BPF program), a record is allocated by bump allocating from a per-cpu region backed by a page obtained using alloc_pages_nolock(). This ensures that we can allocate memory from any context. The eventual plan is to discard this scheme in favor of Alexei's kmalloc_nolock() [0]. This record is then locklessly inserted into a list (llist_add()) so that the printing side doesn't require holding any locks, and works in any context. Each stream has a maximum capacity of 100000 bytes (BPF_STREAM_MAX_CAPACITY) of text, and each printed message is accounted against this limit.

Messages from a program are emitted using the bpf_stream_vprintk kfunc, which takes a stream_id argument in addition to otherwise working like bpf_trace_vprintk. The bprintf buffer helpers are extracted out and reused to format the string into them before copying it into the stream, so that we can (within the defined max limit) format a string and know its true length before allocating the stream element.

For consuming elements from a stream, we expose a bpf(2) syscall command named BPF_PROG_STREAM_READ_BY_FD, which allows reading data from the stream of a given prog_fd into a user space buffer. The main logic is implemented in bpf_stream_read(). The log messages are queued in bpf_stream::log by the bpf_stream_vprintk kfunc, and then pulled and ordered correctly in the stream backlog. For this purpose, we hold a lock around bpf_stream_backlog_peek(), as llist_del_first() (if we maintained a second lockless list for the backlog) wouldn't be safe from multiple threads anyway. Then, if we fail to find something in the backlog log, we splice out everything from the lockless log, place it in the backlog log, and return the head of the backlog. Once the full length of an element is consumed, we pop it and free it.

The lockless list bpf_stream::log is a LIFO stack. Elements obtained using an llist_del_all() operation are in LIFO order, and thus would break the chronological ordering if printed directly. Hence, this batch of messages is first reversed. Then, it is stashed into a separate list in the stream, i.e. the backlog_log. The head of this list is the actual message that should always be returned to the caller. All of this is done in bpf_stream_backlog_fill().

From the kernel side, writing into the stream is a bit more involved than a typical printk. First, the kernel typically prints a collection of messages into the stream, and parallel writers into the stream may suffer from interleaving of messages. To ensure each group of messages is visible atomically, we build on the lockless list used for pushing in messages. To enable this, we add a bpf_stream_stage() macro, and require kernel users to use bpf_stream_printk() statements in the passed expression to write into the stream.
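[Editor's sketch] To make the ordering step above concrete, the following is a minimal sketch of the LIFO-to-FIFO reversal, built only from the kernel's existing llist primitives (llist_del_all() and llist_reverse_order()). It is not the actual bpf_stream_backlog_fill(); the splice_backlog() helper name and the O(n) tail walk are illustrative assumptions.

#include <linux/llist.h>

/*
 * Illustration only: drain the lockless log (which yields newest-first,
 * i.e. LIFO order), reverse it to restore chronological order, and append
 * it to a simple FIFO backlog tracked by head/tail pointers, mirroring the
 * backlog_head/backlog_tail fields this commit adds to struct bpf_stream.
 */
static void splice_backlog(struct llist_head *log,
			   struct llist_node **backlog_head,
			   struct llist_node **backlog_tail)
{
	struct llist_node *batch;

	/* Grab everything pushed so far; the result is in LIFO order. */
	batch = llist_del_all(log);
	if (!batch)
		return;

	/* Restore chronological (FIFO) order before exposing it to readers. */
	batch = llist_reverse_order(batch);

	/* Append the batch at the tail so older messages are read first. */
	if (!*backlog_head)
		*backlog_head = batch;
	else
		(*backlog_tail)->next = batch;
	while (batch->next)
		batch = batch->next;
	*backlog_tail = batch;
}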
Underneath the macro, we have a message staging API, where a bpf_stream_stage object on the stack accumulates the messages being printed into a local llist_head, and a commit operation then splices the whole batch into the stream's lockless log list. This is especially pertinent for rqspinlock deadlock messages printed to program streams. After this change, we see each deadlock invocation as a non-interleaved, contiguous message without any confusion on the reader's part, improving their user experience in debugging the fault.

While programs cannot benefit from this staged stream writing API, they could just as well hold an rqspinlock around their print statements to serialize messages, hence this is kept kernel-internal for now.

Overall, this infrastructure provides NMI-safe, any-context printing of messages to two dedicated streams. Later patches will add support for printing splats in case of BPF arena page faults, rqspinlock deadlocks, and cond_break timeouts, and integration of this facility into bpftool for dumping messages to user space.

[0]: https://lore.kernel.org/bpf/[email protected]

Reviewed-by: Eduard Zingerman <[email protected]>
Reviewed-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
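[Editor's sketch] As an illustration of the staging flow described above, a hypothetical kernel-side caller might group related lines as follows. The caller function, the fault text, and the use of a GCC statement expression for the expr argument are assumptions built against the bpf_stream_stage()/bpf_stream_printk() macros added to include/linux/bpf.h in this commit; this is not code from the commit itself.

#include <linux/bpf.h>
#include <linux/smp.h>

/* Hypothetical kernel-internal caller; the message text is illustrative only. */
static void report_example_fault(struct bpf_prog *prog)
{
	struct bpf_stream_stage ss;

	bpf_stream_stage(ss, prog, BPF_STDERR, ({
		/* Both lines first accumulate in the on-stack ss.log ... */
		bpf_stream_printk(ss, "ERROR: example fault in prog %s\n",
				  prog->aux->name);
		bpf_stream_printk(ss, "CPU: %d\n", raw_smp_processor_id());
		/* ... and the commit step splices them into BPF_STDERR as one batch. */
	}));
}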
1 parent 0426729 · commit 5ab154f

File tree: 9 files changed, +611 −1 lines

include/linux/bpf.h

Lines changed: 52 additions & 0 deletions
@@ -1538,6 +1538,37 @@ struct btf_mod_pair {
 
 struct bpf_kfunc_desc_tab;
 
+enum bpf_stream_id {
+	BPF_STDOUT = 1,
+	BPF_STDERR = 2,
+};
+
+struct bpf_stream_elem {
+	struct llist_node node;
+	int total_len;
+	int consumed_len;
+	char str[];
+};
+
+enum {
+	/* 100k bytes */
+	BPF_STREAM_MAX_CAPACITY = 100000ULL,
+};
+
+struct bpf_stream {
+	atomic_t capacity;
+	struct llist_head log; /* list of in-flight stream elements in LIFO order */
+
+	struct mutex lock; /* lock protecting backlog_{head,tail} */
+	struct llist_node *backlog_head; /* list of in-flight stream elements in FIFO order */
+	struct llist_node *backlog_tail; /* tail of the list above */
+};
+
+struct bpf_stream_stage {
+	struct llist_head log;
+	int len;
+};
+
 struct bpf_prog_aux {
 	atomic64_t refcnt;
 	u32 used_map_cnt;
@@ -1646,6 +1677,7 @@ struct bpf_prog_aux {
 		struct work_struct work;
 		struct rcu_head rcu;
 	};
+	struct bpf_stream stream[2];
 };
 
 struct bpf_prog {
@@ -2409,6 +2441,7 @@ int generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
+
 int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
 			unsigned long nr_pages, struct page **page_array);
 #ifdef CONFIG_MEMCG
@@ -3574,6 +3607,25 @@ void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
 int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs);
 void bpf_put_buffers(void);
 
+void bpf_prog_stream_init(struct bpf_prog *prog);
+void bpf_prog_stream_free(struct bpf_prog *prog);
+int bpf_prog_stream_read(struct bpf_prog *prog, enum bpf_stream_id stream_id, void __user *buf, int len);
+void bpf_stream_stage_init(struct bpf_stream_stage *ss);
+void bpf_stream_stage_free(struct bpf_stream_stage *ss);
+__printf(2, 3)
+int bpf_stream_stage_printk(struct bpf_stream_stage *ss, const char *fmt, ...);
+int bpf_stream_stage_commit(struct bpf_stream_stage *ss, struct bpf_prog *prog,
+			    enum bpf_stream_id stream_id);
+
+#define bpf_stream_printk(ss, ...) bpf_stream_stage_printk(&ss, __VA_ARGS__)
+
+#define bpf_stream_stage(ss, prog, stream_id, expr)		\
+	({							\
+		bpf_stream_stage_init(&ss);			\
+		(expr);						\
+		bpf_stream_stage_commit(&ss, prog, stream_id);	\
+		bpf_stream_stage_free(&ss);			\
+	})
 
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
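[Editor's sketch] The commit message says each printed message is accounted against the stream's capacity limit. The helper below is only a guess at what such accounting could look like, built solely from the atomic_t capacity field and BPF_STREAM_MAX_CAPACITY shown above; the field's exact semantics (used vs. remaining bytes) are not visible in this hunk, and the real logic lives in kernel/bpf/stream.c.

/* Hypothetical capacity charge, assuming 'capacity' counts bytes in use. */
static bool stream_try_charge(struct bpf_stream *stream, int len)
{
	if (atomic_add_return(len, &stream->capacity) > BPF_STREAM_MAX_CAPACITY) {
		/* Over the limit: undo the charge and reject the message. */
		atomic_sub(len, &stream->capacity);
		return false;
	}
	return true;
}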

include/uapi/linux/bpf.h

Lines changed: 24 additions & 0 deletions
@@ -906,6 +906,17 @@ union bpf_iter_link_info {
  *		A new file descriptor (a nonnegative integer), or -1 if an
  *		error occurred (in which case, *errno* is set appropriately).
  *
+ * BPF_PROG_STREAM_READ_BY_FD
+ *	Description
+ *		Read data of a program's BPF stream. The program is identified
+ *		by *prog_fd*, and the stream is identified by the *stream_id*.
+ *		The data is copied to a buffer pointed to by *stream_buf*, and
+ *		filled less than or equal to *stream_buf_len* bytes.
+ *
+ *	Return
+ *		Number of bytes read from the stream on success, or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -961,6 +972,7 @@ enum bpf_cmd {
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
 	BPF_TOKEN_CREATE,
+	BPF_PROG_STREAM_READ_BY_FD,
 	__MAX_BPF_CMD,
 };
 
@@ -1463,6 +1475,11 @@ struct bpf_stack_build_id {
 
 #define BPF_OBJ_NAME_LEN 16U
 
+enum {
+	BPF_STREAM_STDOUT = 1,
+	BPF_STREAM_STDERR = 2,
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32 map_type;	/* one of enum bpf_map_type */
@@ -1849,6 +1866,13 @@ union bpf_attr {
 		__u32 bpffs_fd;
 	} token_create;
 
+	struct {
+		__aligned_u64	stream_buf;
+		__u32		stream_buf_len;
+		__u32		stream_id;
+		__u32		prog_fd;
+	} prog_stream_read;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
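[Editor's sketch] Given the prog_stream_read fields above, user space could drive BPF_PROG_STREAM_READ_BY_FD roughly as follows. This uses the raw bpf(2) syscall (this commit adds no libbpf wrapper) and requires uapi headers from a kernel containing this change; the drain loop over BPF_STREAM_STDERR and the treatment of a 0 return as "stream empty" are assumptions about typical usage, not behavior stated by the commit.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Read and print everything currently queued on a program's stderr stream. */
static int dump_prog_stderr(int prog_fd)
{
	union bpf_attr attr;
	char buf[4096];
	long ret;

	for (;;) {
		memset(&attr, 0, sizeof(attr));
		attr.prog_stream_read.prog_fd = prog_fd;
		attr.prog_stream_read.stream_id = BPF_STREAM_STDERR;
		attr.prog_stream_read.stream_buf = (__u64)(unsigned long)buf;
		attr.prog_stream_read.stream_buf_len = sizeof(buf);

		ret = syscall(__NR_bpf, BPF_PROG_STREAM_READ_BY_FD, &attr, sizeof(attr));
		if (ret < 0)
			return -errno;	/* e.g. command unsupported on older kernels */
		if (ret == 0)
			break;		/* assumed: nothing left to read */
		fwrite(buf, 1, ret, stdout);
	}
	return 0;
}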

kernel/bpf/Makefile

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o stream.o
 ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
 obj-$(CONFIG_BPF_SYSCALL) += arena.o range_tree.o
 endif

kernel/bpf/core.c

Lines changed: 5 additions & 0 deletions
@@ -134,6 +134,10 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	mutex_init(&fp->aux->ext_mutex);
 	mutex_init(&fp->aux->dst_mutex);
 
+#ifdef CONFIG_BPF_SYSCALL
+	bpf_prog_stream_init(fp);
+#endif
+
 	return fp;
 }
 
@@ -2862,6 +2866,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	aux = container_of(work, struct bpf_prog_aux, work);
 #ifdef CONFIG_BPF_SYSCALL
 	bpf_free_kfunc_btf_tab(aux->kfunc_btf_tab);
+	bpf_prog_stream_free(aux->prog);
 #endif
 #ifdef CONFIG_CGROUP_BPF
 	if (aux->cgroup_atype != CGROUP_BPF_ATTACH_TYPE_INVALID)

kernel/bpf/helpers.c

Lines changed: 1 addition & 0 deletions
@@ -3825,6 +3825,7 @@ BTF_ID_FLAGS(func, bpf_strnstr);
 #if defined(CONFIG_BPF_LSM) && defined(CONFIG_CGROUPS)
 BTF_ID_FLAGS(func, bpf_cgroup_read_xattr, KF_RCU)
 #endif
+BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
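[Editor's sketch] Since bpf_stream_vprintk is registered above as a generally available kfunc, a BPF program could emit to its own streams roughly as follows. The commit message only says the kfunc "takes a stream_id argument in addition to working otherwise similar to bpf_trace_vprintk", so the extern prototype below (parameter names and all) is an assumption modeled on bpf_trace_vprintk, and the whole program is an illustrative sketch rather than the commit's actual API surface.

// SPDX-License-Identifier: GPL-2.0
/* Program-side sketch; the kfunc prototype is assumed, not taken from this commit. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_stream_vprintk(int stream_id, const char *fmt__str,
			      const void *args, __u32 args__sz,
			      void *aux__ign) __ksym;

char LICENSE[] SEC("license") = "GPL";

SEC("tracepoint/syscalls/sys_enter_nanosleep")
int log_to_stderr(void *ctx)
{
	static const char fmt[] = "nanosleep entered on cpu %d\n";
	unsigned long long args[] = { bpf_get_smp_processor_id() };

	/* 2 == BPF_STDERR per enum bpf_stream_id added in include/linux/bpf.h. */
	bpf_stream_vprintk(2, fmt, args, sizeof(args), NULL);
	return 0;
}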
