7308748925
Add benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond the current local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to the normal 'hits' count of all gets. The goal is to mimic a scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often.

While the "sequential get" benchmark does bpf_task_storage_get for maps 0, 1, ..., {9, 99, 999} in order, the "interleaved get" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when the important map is accessed far more frequently than the non-important maps.

A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. It is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold a tid -> data mapping for all tasks. The size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304); the number of these keys which are actually fetched as part of the benchmark is configurable.

Addition of this benchmark is inspired by a conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to the local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so the effect of future work on both is clear.

Regarding the benchmark results:
On a powerful system (Skylake, 20 cores, 256gb ram):

Hashmap Control
===============
num keys: 10
  hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
  hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
  hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
  hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
  hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
num_maps: 1
  local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
  local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
  local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
  local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
  local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
  local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
  local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
  local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
  local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
  local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
  local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
  local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
  local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
  local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
  local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
  local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
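(Note: the reported hits latency is simply the reciprocal of the reported hits throughput, e.g. 1 / 20.900 M ops/s ≈ 47.847 ns/op.)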
"sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it. For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though. To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results: Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits 
latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: hashmap_control_1k: ~53.8ns hashmap_control_10k: ~124.2ns hashmap_control_100k: ~206.5ns sequential_get_1: ~12.1ns sequential_get_10: ~16.0ns sequential_get_16: ~13.8ns sequential_get_17: ~16.8ns sequential_get_24: ~40.9ns sequential_get_32: ~65.2ns sequential_get_100: ~148.2ns sequential_get_1000: ~2204ns Clearly demonstrating a cliff. In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses. When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run. [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
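The file below is the userspace half of the benchmark; the BPF program it loads (progs/local_storage_bench.c) is not part of this listing. As a rough illustration of the access pattern described in the commit message, here is a minimal, hypothetical sketch of what the BPF side could look like. The map, program, and global names (array_of_local_storage_maps, get_local, num_maps, interleave, hits, important_hits) are taken from the userspace code below, but the attach point, inner-map template, and interleave arithmetic are assumptions for illustration only, not the actual program:

/* Hypothetical sketch of the BPF-side access loop; attach point and
 * interleave arithmetic are assumptions, not the real local_storage_bench.c.
 * Only the local_storage path is shown; the hashmap-control path is omitted.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_NR_MAPS 1000

/* inner map template: a single task local_storage map */
struct local_storage_map {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, int);
};

/* userspace creates nr_maps task storage maps and fills this array with
 * their fds (see __setup() in the userspace code below)
 */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, MAX_NR_MAPS);
	__type(key, int);
	__array(values, struct local_storage_map);
} array_of_local_storage_maps SEC(".maps");

const volatile unsigned int num_maps = 1000;
const volatile unsigned int interleave = 0;

long hits;
long important_hits;

static void do_lookup(unsigned int elem, struct task_struct *task)
{
	void *map, *data;
	int idx = elem;

	map = bpf_map_lookup_elem(&array_of_local_storage_maps, &idx);
	if (!map)
		return;

	data = bpf_task_storage_get(map, task, 0,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!data)
		return;

	__sync_add_and_fetch(&hits, 1);
	if (elem == 0)
		__sync_add_and_fetch(&important_hits, 1);
}

SEC("fentry/__x64_sys_getpgid") /* assumed x86-64 attach point; userspace triggers via getpgid() */
int get_local(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	unsigned int i;

	for (i = 0; i < num_maps && i < MAX_NR_MAPS; i++) {
		do_lookup(i, task);
		/* interleave extra gets of the 'important' map 0,
		 * roughly 4 per 10 sequential gets (assumed arithmetic)
		 */
		if (interleave && i % 3 == 0)
			do_lookup(0, task);
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";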
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */

#include <argp.h>
#include <linux/btf.h>

#include "local_storage_bench.skel.h"
#include "bench.h"

#include <test_btf.h>

static struct {
	__u32 nr_maps;
	__u32 hashmap_nr_keys_used;
} args = {
	.nr_maps = 1000,
	.hashmap_nr_keys_used = 1000,
};

enum {
	ARG_NR_MAPS = 6000,
	ARG_HASHMAP_NR_KEYS_USED = 6001,
};

static const struct argp_option opts[] = {
	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
		"Set number of local_storage maps"},
	{ "hashmap_nr_keys_used", ARG_HASHMAP_NR_KEYS_USED, "NR_KEYS",
		0, "When doing hashmap test, set number of hashmap keys test uses"},
	{},
};

static error_t parse_arg(int key, char *arg, struct argp_state *state)
{
	long ret;

	switch (key) {
	case ARG_NR_MAPS:
		ret = strtol(arg, NULL, 10);
		if (ret < 1 || ret > UINT_MAX) {
			fprintf(stderr, "invalid nr_maps");
			argp_usage(state);
		}
		args.nr_maps = ret;
		break;
	case ARG_HASHMAP_NR_KEYS_USED:
		ret = strtol(arg, NULL, 10);
		if (ret < 1 || ret > UINT_MAX) {
			fprintf(stderr, "invalid hashmap_nr_keys_used");
			argp_usage(state);
		}
		args.hashmap_nr_keys_used = ret;
		break;
	default:
		return ARGP_ERR_UNKNOWN;
	}

	return 0;
}

const struct argp bench_local_storage_argp = {
	.options = opts,
	.parser = parse_arg,
};

/* Keep in sync w/ array of maps in bpf */
#define MAX_NR_MAPS 1000
/* keep in sync w/ same define in bpf */
#define HASHMAP_SZ 4194304

static void validate(void)
{
	if (env.producer_cnt != 1) {
		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
		exit(1);
	}
	if (env.consumer_cnt != 1) {
		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
		exit(1);
	}

	if (args.nr_maps > MAX_NR_MAPS) {
		fprintf(stderr, "nr_maps must be <= 1000\n");
		exit(1);
	}

	if (args.hashmap_nr_keys_used > HASHMAP_SZ) {
		fprintf(stderr, "hashmap_nr_keys_used must be <= %u\n", HASHMAP_SZ);
		exit(1);
	}
}

static struct {
	struct local_storage_bench *skel;
	void *bpf_obj;
	struct bpf_map *array_of_maps;
} ctx;

static void prepopulate_hashmap(int fd)
{
	int i, key, val;

	/* local_storage gets will have BPF_LOCAL_STORAGE_GET_F_CREATE flag set, so
	 * populate the hashmap for a similar comparison
	 */
	for (i = 0; i < HASHMAP_SZ; i++) {
		key = val = i;
		if (bpf_map_update_elem(fd, &key, &val, 0)) {
			fprintf(stderr, "Error prepopulating hashmap (key %d)\n", key);
			exit(1);
		}
	}
}

static void __setup(struct bpf_program *prog, bool hashmap)
{
	struct bpf_map *inner_map;
	int i, fd, mim_fd, err;

	LIBBPF_OPTS(bpf_map_create_opts, create_opts);

	if (!hashmap)
		create_opts.map_flags = BPF_F_NO_PREALLOC;

	ctx.skel->rodata->num_maps = args.nr_maps;
	ctx.skel->rodata->hashmap_num_keys = args.hashmap_nr_keys_used;
	inner_map = bpf_map__inner_map(ctx.array_of_maps);
	create_opts.btf_key_type_id = bpf_map__btf_key_type_id(inner_map);
	create_opts.btf_value_type_id = bpf_map__btf_value_type_id(inner_map);

	err = local_storage_bench__load(ctx.skel);
	if (err) {
		fprintf(stderr, "Error loading skeleton\n");
		goto err_out;
	}

	create_opts.btf_fd = bpf_object__btf_fd(ctx.skel->obj);

	mim_fd = bpf_map__fd(ctx.array_of_maps);
	if (mim_fd < 0) {
		fprintf(stderr, "Error getting map_in_map fd\n");
		goto err_out;
	}

	for (i = 0; i < args.nr_maps; i++) {
		if (hashmap)
			fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int),
					    sizeof(int), HASHMAP_SZ, &create_opts);
		else
			fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int),
					    sizeof(int), 0, &create_opts);
		if (fd < 0) {
			fprintf(stderr, "Error creating map %d: %d\n", i, fd);
			goto err_out;
		}

		if (hashmap)
			prepopulate_hashmap(fd);

		err = bpf_map_update_elem(mim_fd, &i, &fd, 0);
		if (err) {
			fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i);
			goto err_out;
		}
	}

	if (!bpf_program__attach(prog)) {
		fprintf(stderr, "Error attaching bpf program\n");
		goto err_out;
	}

	return;
err_out:
	exit(1);
}

static void hashmap_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_hash_maps;
	skel->rodata->use_hashmap = 1;
	skel->rodata->interleave = 0;

	__setup(skel->progs.get_local, true);
}

static void local_storage_cache_get_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_local_storage_maps;
	skel->rodata->use_hashmap = 0;
	skel->rodata->interleave = 0;

	__setup(skel->progs.get_local, false);
}

static void local_storage_cache_get_interleaved_setup(void)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	ctx.skel = skel;
	ctx.array_of_maps = skel->maps.array_of_local_storage_maps;
	skel->rodata->use_hashmap = 0;
	skel->rodata->interleave = 1;

	__setup(skel->progs.get_local, false);
}

static void measure(struct bench_res *res)
{
	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
	res->important_hits = atomic_swap(&ctx.skel->bss->important_hits, 0);
}

static inline void trigger_bpf_program(void)
{
	syscall(__NR_getpgid);
}

static void *consumer(void *input)
{
	return NULL;
}

static void *producer(void *input)
{
	while (true)
		trigger_bpf_program();

	return NULL;
}

/* cache sequential and interleaved get benchs test local_storage get
 * performance, specifically they demonstrate performance cliff of
 * current list-plus-cache local_storage model.
 *
 * cache sequential get: call bpf_task_storage_get on n maps in order
 * cache interleaved get: like "sequential get", but interleave 4 calls to the
 *	'important' map (idx 0 in array_of_maps) for every 10 calls. Goal
 *	is to mimic environment where many progs are accessing their local_storage
 *	maps, with 'our' prog needing to access its map more often than others
 */
const struct bench bench_local_storage_cache_seq_get = {
	.name = "local-storage-cache-seq-get",
	.validate = validate,
	.setup = local_storage_cache_get_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};

const struct bench bench_local_storage_cache_interleaved_get = {
	.name = "local-storage-cache-int-get",
	.validate = validate,
	.setup = local_storage_cache_get_interleaved_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};

const struct bench bench_local_storage_cache_hashmap_control = {
	.name = "local-storage-cache-hashmap-control",
	.validate = validate,
	.setup = hashmap_setup,
	.producer_thread = producer,
	.consumer_thread = consumer,
	.measure = measure,
	.report_progress = local_storage_report_progress,
	.report_final = local_storage_report_final,
};
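The three struct bench definitions above are consumed by the selftests/bpf bench runner. As a rough sketch of how such benches are typically wired up (the exact contents of bench.c are an assumption here, shown only for orientation), the runner declares them as externs and lists them in its bench table, alongside hooking bench_local_storage_argp into its argp children so --nr_maps and --hashmap_nr_keys_used are parsed:

/* Hypothetical excerpt from the bench runner (bench.c); exact wiring is an
 * assumption, illustrating how the definitions above are consumed.
 */
extern const struct argp bench_local_storage_argp;

extern const struct bench bench_local_storage_cache_seq_get;
extern const struct bench bench_local_storage_cache_interleaved_get;
extern const struct bench bench_local_storage_cache_hashmap_control;

static const struct bench *benchs[] = {
	/* ... existing benches ... */
	&bench_local_storage_cache_seq_get,
	&bench_local_storage_cache_interleaved_get,
	&bench_local_storage_cache_hashmap_control,
};

With that in place, an invocation along the lines of "./bench --nr_maps 32 local-storage-cache-seq-get" (after bumping the open-files ulimit, as noted in the commit message) would exercise the sequential-get path; the bench name strings come from the .name fields above.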