Landlock Path Walk Inversion Experiment

A (failed) performance tuning experiment

Objective

Landlock supports multiple nested sandboxes, so when an operation on a filesystem path is attempted, Landlock has to check it against up to 16 nested policies (“layers”).

The way it currently does that is by constructing a matrix of the requested access rights per layer and checking off these matrix entries during the path walk. This is efficient, but it also increases code complexity, and I have long had a nagging doubt about whether it is worth the tradeoff.

After all, the mental model for using Landlock is that each layer gets checked independently of the others. A more “natural” way to implement this would be to:

  1. loop over the layers first, and
  2. then do the path walk inside (multiple times), as sketched in the toy model below.
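
To make the difference concrete, here is a toy user-space model of the two strategies. All names are invented for illustration and granted() is a made-up stand-in for looking up the per-layer rights attached to one path component; this is a sketch of the idea, not the kernel code:

/* Toy model: compare the "matrix" check with the "inverted" check. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_LAYERS 16

typedef unsigned int access_mask_t;

/* Hypothetical: rights granted by one layer on one path component.
 * Returning 0 models the worst case where nothing is granted. */
static access_mask_t granted(int layer, int depth)
{
    return 0;
}

/* Current approach: a single walk from the leaf towards the root,
 * with one "matrix row" of still-missing rights per layer that is
 * checked off along the way. */
static bool check_matrix(access_mask_t request, int nlayers, int depth)
{
    access_mask_t missing[MAX_LAYERS];

    for (int l = 0; l < nlayers; l++)
        missing[l] = request;
    for (int d = depth; d >= 0; d--) {  /* one path walk */
        bool done = true;

        for (int l = 0; l < nlayers; l++) {
            missing[l] &= ~granted(l, d);
            done = done && !missing[l];
        }
        if (done)
            return true;  /* all layers satisfied: stop early */
    }
    return false;
}

/* Inverted approach from this experiment: check each layer on its
 * own, which repeats the (expensive) path walk once per layer. */
static bool check_inverted(access_mask_t request, int nlayers, int depth)
{
    for (int l = 0; l < nlayers; l++) {  /* loop over layers first */
        access_mask_t missing = request;

        for (int d = depth; d >= 0 && missing; d--)  /* walk per layer */
            missing &= ~granted(l, d);
        if (missing)
            return false;  /* this layer denies the access */
    }
    return true;
}

int main(void)
{
    printf("matrix: %d, inverted: %d\n",
           check_matrix(1, 4, 10), check_inverted(1, 4, 10));
    return 0;
}

In the kernel, the walk itself (dentry traversal, locking, rule lookup) is the expensive part, so the matrix variant amortizes one walk across all layers, while the inverted variant pays for it once per layer.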

To be clear, my confidence that this would be acceptable performance-wise was always low, but then again, performance can sometimes be counterintuitive, so it seemed worth exploring.

The refactoring would be a bigger change, but luckily we have a very comprehensive suite of kernel selftests in Landlock – so at least for the sake of doing that experiment, it was feasible to vibe code the refactoring. (The code is absolutely not meant to be submitted like that, but it is good enough to get performance numbers for the general idea.)

The experiment failed – it worsened performance noticeably. I’m publishing it anyway, so that others don’t have to attempt the same.

I also found it amazing that I can play with the performance aspects of such a large refactoring without implementing it myself. ✨ LLMs are underused for such experiments.

Hypothesis

The (highly speculative) hypothesis of this experiment is:

By refactoring the code to loop over the layers first, and doing the path walk inside (multiple times),

  1. the code becomes simpler to reason about,
  2. and keeps reasonable performance.

Implementation

I gave the refactoring task to an LLM agent. This code is not production ready and entirely unreviewed, but it does pass the very comprehensive Landlock test suite, which remained unmodified, so I have good confidence that the relevant aspects are retained.

Measurements

Effects on Landlock code complexity

The measurement was done using:

wc -l security/landlock/*.[ch]

The lines of code in Landlock went down from 7245 to 6815 (about 6% less); in fs.c alone, from 2006 to 1726 (about 14% less).

[Bar chart: Lines of Code in security/landlock/*.[ch] (7245 before, 6815 after) and in security/landlock/fs.c (2006 before, 1726 after)]

The usual caveats about lines of code as a complexity metric apply, but it’s at least an indication that the refactoring does indeed simplify the code.

The full refactoring and benchmarking code are kept on a Git branch at https://github.com/gnoack/linux/tree/landlock-invert.

Effects on Performance

This is measured using an improved version of fs_bench.

fs_bench measures how long it takes to check Landlock policies during open(). It does so in an intentionally near-worst-case scenario: the opened file sits in a deeply nested directory, and the operation is ultimately not permitted, so Landlock has to walk the path all the way up to the root.

This scenario is chosen to amplify the effect and make it more measurable. In more practical scenarios, the number of nested Landlock domains is likely to be low (e.g., 1), and the number of nested directories is also low (e.g., fewer than 10).
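
As an illustration, here is a sketch of how such a worst-case setup can be constructed. This is not the actual fs_bench source; the DEPTH, NLAYERS, and ITERS knobs are made up for this sketch, and error handling is omitted:

/* Sketch: nest DEPTH directories, stack NLAYERS Landlock domains that
 * grant no rights, then time a denied open() with times(2). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/landlock.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/times.h>
#include <unistd.h>

#define DEPTH   100
#define NLAYERS 4
#define ITERS   100000

int main(void)
{
    char path[4096] = "/tmp/bench";
    struct tms start, stop;

    mkdir(path, 0700);
    for (int i = 0; i < DEPTH; i++) {  /* /tmp/bench/d/d/d/... */
        strcat(path, "/d");
        mkdir(path, 0700);
    }
    strcat(path, "/file");
    close(open(path, O_CREAT | O_WRONLY, 0600));

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    for (int i = 0; i < NLAYERS; i++) {
        struct landlock_ruleset_attr attr = {
            .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
        };
        int fd = syscall(SYS_landlock_create_ruleset, &attr,
                         sizeof(attr), 0);
        /* No rules added: each domain denies READ_FILE everywhere. */
        syscall(SYS_landlock_restrict_self, fd, 0);
        close(fd);
    }

    times(&start);
    for (int i = 0; i < ITERS; i++)
        open(path, O_RDONLY);  /* always fails with EACCES */
    times(&stop);
    printf("system clocks: %ld\n",
           (long)(stop.tms_stime - start.tms_stime));
    return 0;
}

Each open() is denied because no layer grants LANDLOCK_ACCESS_FS_READ_FILE, which forces the full walk up to the root that the benchmark is designed to measure.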

The resulting performance graphs show that, independent of the number of nested directories, adding more nested Landlock sandboxes scales poorly after the refactoring attempt.

[Graph: fs_bench, depth = 10 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 100 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 1000 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 10000 (system clocks vs. Landlock layers 1–8, before vs. after)]

Noteworthy results:

The refactoring attempt does not scale well when adding more Landlock sandboxes.

Even at a small, more realistic directory nesting depth of only 10, the cost of doing the path walk twice is clearly visible as soon as a second Landlock layer is added.

In the case of only one layer, performance improved slightly, presumably because the code is simpler now, but this only becomes properly measurable at a high number of nested directories. In the more realistic scenario with only 10 directories, the effect is not measurable at this clock resolution.

Directory nesting depth   before refactoring (clocks)   after refactoring (clocks)
10                        6                             6
100                       32                            29 (90%)
1000                      469                           384 (82%)
10000                     6087                          4676 (77%)

Conclusion

❌ The refactoring was a bad idea: it reduces performance.

The current path walk implementation with the layer matrix performs better, even at small directory depths and layer counts.

Full results

Values are “System” clock ticks from times(2) (smaller = faster).
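
For reference, these values come from the tms_stime field, and a tick converts to seconds via sysconf(_SC_CLK_TCK) (commonly 100 on Linux, i.e., 10 ms per tick). A minimal reading looks like this:

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    struct tms before, after;

    times(&before);
    /* ... workload under measurement ... */
    times(&after);

    /* tms_stime counts time spent in the kernel ("System"). */
    long ticks = (long)(after.tms_stime - before.tms_stime);
    printf("%ld system clock ticks = %.2f s\n",
           ticks, (double)ticks / sysconf(_SC_CLK_TCK));
    return 0;
}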

fs_bench

Scenario                              before (A)   after (B)   Δ (B − A)   B/A
fs_bench / depth=10000, baseline      6            6           0           1.000
fs_bench / depth=10000, 1 domain      6087         4676        -1411       0.768
fs_bench / depth=10000, 2 domains     6072         9502        3430        1.565
fs_bench / depth=10000, 4 domains     6281         19933       13652       3.174
fs_bench / depth=10000, 8 domains     6452         44484       38032       6.895
fs_bench / depth=10000, 16 domains    7007         99500       92493       14.200
fs_bench / depth=1000, baseline       4            4           0           1.000
fs_bench / depth=1000, 1 domain       469          384         -85         0.819
fs_bench / depth=1000, 2 domains      453          731         278         1.614
fs_bench / depth=1000, 4 domains      506          1867        1361        3.690
fs_bench / depth=1000, 8 domains      530          3826        3296        7.219
fs_bench / depth=1000, 16 domains     595          8684        8089        14.595
fs_bench / depth=100, baseline        3            4           1           1.333
fs_bench / depth=100, 1 domain        32           29          -3          0.906
fs_bench / depth=100, 2 domains       34           59          25          1.735
fs_bench / depth=100, 4 domains       37           123         86          3.324
fs_bench / depth=100, 8 domains       40           299         259         7.475
fs_bench / depth=100, 16 domains      47           678         631         14.426
fs_bench / depth=10, baseline         4            3           -1          0.750
fs_bench / depth=10, 1 domain         6            6           0           1.000
fs_bench / depth=10, 2 domains        6            9           3           1.500
fs_bench / depth=10, 4 domains        7            15          8           2.143
fs_bench / depth=10, 8 domains        7            30          23          4.286
fs_bench / depth=10, 16 domains       8            61          53          7.625

net_bench

Scenario                 before (A)   after (B)   Δ (B − A)   B/A
net_bench / baseline     49           49          0           1.000
net_bench / 2 domains    9            9           0           1.000
net_bench / 4 domains    8            9           1           1.125
net_bench / 8 domains    9            8           -1          0.889
net_bench / 16 domains   10           9           -1          0.900

scoped_bench

The “scoped_bench” ran as well, but the results were all zero. (I have not investigated this further; net_bench and scoped_bench were not expected to take a big performance hit from the refactoring.)
