Landlock Path Walk Inversion Experiment

A (failed) performance tuning experiment

Objective

Landlock supports multiple nested sandboxes, so when an operation on a filesystem path is attempted, Landlock has to check it against up to 16 nested policies (“layers”).

The way it currently does that is by constructing a matrix of the requested access rights per layer and checking off these matrix entries during the path walk. This is efficient, but it also increases code complexity, and I have long had a nagging doubt about whether it is worth the tradeoff.

After all, the mental model for using Landlock is that each layer gets checked independently of the others. A more “natural” way to implement this would be to:

  1. loop over the layers first, and
  2. then do the path walk inside (multiple times), as sketched in the toy model below.
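
To make the difference concrete, here is a toy user-space model of the two strategies. All names are invented for illustration and granted() is a made-up stand-in for looking up the per-layer rights attached to one path component; this is a sketch of the idea, not the kernel code:

/* Toy model: compare the "matrix" check with the "inverted" check. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_LAYERS 16

typedef unsigned int access_mask_t;

/* Hypothetical: rights granted by one layer on one path component.
 * Returning 0 models the worst case where nothing is granted. */
static access_mask_t granted(int layer, int depth)
{
    return 0;
}

/* Current approach: a single walk from the leaf towards the root,
 * with one "matrix row" of still-missing rights per layer that is
 * checked off along the way. */
static bool check_matrix(access_mask_t request, int nlayers, int depth)
{
    access_mask_t missing[MAX_LAYERS];

    for (int l = 0; l < nlayers; l++)
        missing[l] = request;
    for (int d = depth; d >= 0; d--) {  /* one path walk */
        bool done = true;

        for (int l = 0; l < nlayers; l++) {
            missing[l] &= ~granted(l, d);
            done = done && !missing[l];
        }
        if (done)
            return true;  /* all layers satisfied: stop early */
    }
    return false;
}

/* Inverted approach from this experiment: check each layer on its
 * own, which repeats the (expensive) path walk once per layer. */
static bool check_inverted(access_mask_t request, int nlayers, int depth)
{
    for (int l = 0; l < nlayers; l++) {  /* loop over layers first */
        access_mask_t missing = request;

        for (int d = depth; d >= 0 && missing; d--)  /* walk per layer */
            missing &= ~granted(l, d);
        if (missing)
            return false;  /* this layer denies the access */
    }
    return true;
}

int main(void)
{
    printf("matrix: %d, inverted: %d\n",
           check_matrix(1, 4, 10), check_inverted(1, 4, 10));
    return 0;
}

In the kernel, the walk itself (dentry traversal, locking, rule lookup) is the expensive part, so the matrix variant amortizes one walk across all layers, while the inverted variant pays for it once per layer.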

To be clear, my confidence that this would be acceptable performance-wise was always low, but then again, performance can sometimes be counterintuitive, so it seemed worth exploring.

The refactoring would be a bigger change, but luckily we have a very comprehensive suite of kernel selftests in Landlock – so at least for the sake of doing that experiment, it was feasible to vibe code the refactoring. (The code is absolutely not meant to be submitted like that, but it is good enough to get performance numbers for the general idea.)

The experiment failed – it worsened performance noticeably. I’m publishing it anyway, so that others don’t have to attempt the same.

I also found it amazing that I can play with the performance aspects of such a large refactoring without implementing it myself. ✨ LLMs are underused for such experiments.

Hypothesis

The (highly speculative) hypothesis of this experiment is:

By refactoring the code to loop over the layers first, and doing the path walk inside (multiple times),

  1. the code becomes simpler to reason about,
  2. and keeps reasonable performance.

Implementation

I gave the refactoring task to an LLM agent. This code is not production ready and entirely unreviewed, but it does pass the very comprehensive Landlock test suite, which remained unmodified, so I have good confidence that the relevant aspects are retained.

Measurements

Effects on Landlock code complexity

The measurement was done using:

wc -l security/landlock/*.[ch]

The lines of code in Landlock went down from 7245 to 6815 (about 6% less); in fs.c alone, from 2006 to 1726 (about 14% less).

[Bar chart: Lines of Code in security/landlock/*.[ch] (7245 before, 6815 after) and in security/landlock/fs.c (2006 before, 1726 after)]

The usual caveats about lines of code as a complexity metric apply, but it’s at least an indication that the refactoring does indeed simplify the code.

The full refactoring and benchmarking code are kept on a Git branch at https://github.com/gnoack/linux/tree/landlock-invert.

Effects on Performance

This is measured using an improved version of fs_bench.

fs_bench measures how long it takes to check Landlock policies during open(). It does so in an intentionally near-worst-case scenario: the opened file sits in a deeply nested directory, and the operation is ultimately not permitted, so Landlock has to walk the path all the way up to the root.

This scenario is chosen to amplify the effect and make it more measurable. In more practical scenarios, the number of nested Landlock domains is likely to be low (e.g., 1), and the number of nested directories is also low (e.g., fewer than 10).
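
As an illustration, here is a sketch of how such a worst-case setup can be constructed. This is not the actual fs_bench source; the DEPTH, NLAYERS, and ITERS knobs are made up for this sketch, and error handling is omitted:

/* Sketch: nest DEPTH directories, stack NLAYERS Landlock domains that
 * grant no rights, then time a denied open() with times(2). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/landlock.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/times.h>
#include <unistd.h>

#define DEPTH   100
#define NLAYERS 4
#define ITERS   100000

int main(void)
{
    char path[4096] = "/tmp/bench";
    struct tms start, stop;

    mkdir(path, 0700);
    for (int i = 0; i < DEPTH; i++) {  /* /tmp/bench/d/d/d/... */
        strcat(path, "/d");
        mkdir(path, 0700);
    }
    strcat(path, "/file");
    close(open(path, O_CREAT | O_WRONLY, 0600));

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    for (int i = 0; i < NLAYERS; i++) {
        struct landlock_ruleset_attr attr = {
            .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
        };
        int fd = syscall(SYS_landlock_create_ruleset, &attr,
                         sizeof(attr), 0);
        /* No rules added: each domain denies READ_FILE everywhere. */
        syscall(SYS_landlock_restrict_self, fd, 0);
        close(fd);
    }

    times(&start);
    for (int i = 0; i < ITERS; i++)
        open(path, O_RDONLY);  /* always fails with EACCES */
    times(&stop);
    printf("system clocks: %ld\n",
           (long)(stop.tms_stime - start.tms_stime));
    return 0;
}

Each open() is denied because no layer grants LANDLOCK_ACCESS_FS_READ_FILE, which forces the full walk up to the root that the benchmark is designed to measure.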

The resulting performance graphs show that, independent of the number of nested directories, adding more nested Landlock sandboxes scales poorly after the refactoring attempt.

[Graph: fs_bench, depth = 10 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 100 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 1000 (system clocks vs. Landlock layers 1–8, before vs. after)]
[Graph: fs_bench, depth = 10000 (system clocks vs. Landlock layers 1–8, before vs. after)]

Noteworthy results:

The refactoring attempt does not scale well when adding more Landlock sandboxes.

Even at a small, more realistic directory nesting depth of only 10, the cost of doing the path walk twice is clearly visible as soon as a second Landlock layer is added.

In the case of only one layer, performance improved slightly, presumably because the code is simpler now, but this only becomes properly measurable at a high number of nested directories. In the more realistic scenario with only 10 directories, the effect is not measurable at this clock resolution.

Directory nesting depth   before refactoring (clocks)   after refactoring (clocks)
10                        6                             6
100                       32                            29 (90%)
1000                      469                           384 (82%)
10000                     6087                          4676 (77%)

Conclusion

❌ The refactoring was a bad idea: it reduces performance.

The current path walk implementation with the layer matrix performs better, even at small directory depths and layer counts.

Full results

Values are “System” clock ticks from times(2) (smaller = faster).
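
For reference, these values come from the tms_stime field, and a tick converts to seconds via sysconf(_SC_CLK_TCK) (commonly 100 on Linux, i.e., 10 ms per tick). A minimal reading looks like this:

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    struct tms before, after;

    times(&before);
    /* ... workload under measurement ... */
    times(&after);

    /* tms_stime counts time spent in the kernel ("System"). */
    long ticks = (long)(after.tms_stime - before.tms_stime);
    printf("%ld system clock ticks = %.2f s\n",
           ticks, (double)ticks / sysconf(_SC_CLK_TCK));
    return 0;
}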

fs_bench

Scenario                              before (A)   after (B)   Δ (B − A)   B/A
fs_bench / depth=10000, baseline      6            6           0           1.000
fs_bench / depth=10000, 1 domain      6087         4676        -1411       0.768
fs_bench / depth=10000, 2 domains     6072         9502        3430        1.565
fs_bench / depth=10000, 4 domains     6281         19933       13652       3.174
fs_bench / depth=10000, 8 domains     6452         44484       38032       6.895
fs_bench / depth=10000, 16 domains    7007         99500       92493       14.200
fs_bench / depth=1000, baseline       4            4           0           1.000
fs_bench / depth=1000, 1 domain       469          384         -85         0.819
fs_bench / depth=1000, 2 domains      453          731         278         1.614
fs_bench / depth=1000, 4 domains      506          1867        1361        3.690
fs_bench / depth=1000, 8 domains      530          3826        3296        7.219
fs_bench / depth=1000, 16 domains     595          8684        8089        14.595
fs_bench / depth=100, baseline        3            4           1           1.333
fs_bench / depth=100, 1 domain        32           29          -3          0.906
fs_bench / depth=100, 2 domains       34           59          25          1.735
fs_bench / depth=100, 4 domains       37           123         86          3.324
fs_bench / depth=100, 8 domains       40           299         259         7.475
fs_bench / depth=100, 16 domains      47           678         631         14.426
fs_bench / depth=10, baseline         4            3           -1          0.750
fs_bench / depth=10, 1 domain         6            6           0           1.000
fs_bench / depth=10, 2 domains        6            9           3           1.500
fs_bench / depth=10, 4 domains        7            15          8           2.143
fs_bench / depth=10, 8 domains        7            30          23          4.286
fs_bench / depth=10, 16 domains       8            61          53          7.625

net_bench

Scenario                 before (A)   after (B)   Δ (B − A)   B/A
net_bench / baseline     49           49          0           1.000
net_bench / 2 domains    9            9           0           1.000
net_bench / 4 domains    8            9           1           1.125
net_bench / 8 domains    9            8           -1          0.889
net_bench / 16 domains   10           9           -1          0.900

scoped_bench

The “scoped_bench” ran as well, but the results were all zero. (I have not investigated this further; net_bench and scoped_bench were not expected to take a big performance hit from the refactoring.)
