When I first saw that Jest was running sl so many times, my first thought was to ask my colleague whether sl was a valid command on his Mac, and of course it was not. After all, which serious engineer would stuff their machine full of silly commands like sl, gti, cowsay, or toilet? The next thing I tried was to rename sl to something else, and sure enough all my problems disappeared: yarn test started working perfectly.
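A minimal sketch of that experiment, assuming the locomotive lives at /usr/games/sl as it does on Debian-flavored systems:

sudo mv /usr/games/sl /usr/games/sl.disabled   # hide the train from PATH
yarn test                                      # suddenly passes
sudo mv /usr/games/sl.disabled /usr/games/sl   # put the train back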
So what does Jest have to do with Steam Locomotives?
Nothing, that’s what. The whole affair is an unfortunate naming clash between sl the Steam Locomotive and sl the Sapling CLI. Jest wanted sl the source control system, but ended up getting steam-rolled by sl the Steam Locomotive.
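When two commands fight over one name, PATH order decides which one answers. A quick way to see who owns sl on a given machine (the paths below are illustrative, not from the original investigation):

type -a sl
# sl is /usr/games/sl        <- the 1993 Steam Locomotive
# sl is /usr/local/bin/sl    <- Sapling’s CLI, if installed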
Fortunately the devs took it in good humor, and made a (still unreleased) fix. Check out the train memes!
At this point the main story has ended. However, there are still some unresolved nagging questions, like…
How did the crash arrive at the magic number of a relatively even 27 seconds?
I don’t know. Actually, I’m not sure a forked child executing sl even has a terminal anymore, but the travel time of the train does depend on the terminal width. The wider it is, the longer it takes:
🌈 ~ tput cols
425
🌈 ~ time sl
sl 0.19s user 0.06s system 1% cpu 20.629 total
🌈 ~ tput cols
58
🌈 ~ time sl
sl 0.03s user 0.01s system 0% cpu 5.695 total
So the first thing I tried was to run yarn test in a ridiculously narrow terminal and see what happens:
Determin
ing test
 suites 
to run..
.

 ● Test
 suite f
ailed to
 run

thrown: 
[Error]

error Co
mmand fa
iled wit
h exit c
ode 1. 
info Vis
it https
://yarnp
kg.com/e
n/docs/c
li/run f
or docum
entation
 about t
his comm
and. 
yarn tes
t 1.92s
 user 0.
67s syst
em 9% cp
u 27.088
 total 
🌈 back
stage [m
aster] t
put cols

8
Alas, the terminal width doesn’t affect Jest at all. Jest calls sl via execa, so let’s mock that up locally:
🌈 choochoo cat runSl.mjs 
import {execa} from 'execa';
const { stdout } = await execa('tput', ['cols']);
console.log('terminal colwidth:', stdout);
await execa('sl', ['root']);
🌈 choochoo time node runSl.mjs
terminal colwidth: 80
node runSl.mjs 0.21s user 0.06s system 4% cpu 6.730 total
So execa uses the default terminal width of 80, which takes the train 6.7 seconds to cross. And 27 seconds divided by 6.7 is awfully close to 4. So is Jest running sl 4 times? Let’s do a poor man’s bpftrace by hooking into sl like so:
#!/bin/bash
# Logging wrapper installed in place of sl: record start and end times,
# then hand off to the real binary.

uniqid=$RANDOM
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started" >> /home/yew/executed.log
/usr/games/sl.actual "$@"
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid ended" >> /home/yew/executed.log
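The installation step is implicit above; one way to wire it up, as a sketch (hook.sh stands in for the wrapper script):

sudo mv /usr/games/sl /usr/games/sl.actual   # keep the real locomotive around
sudo install -m 755 hook.sh /usr/games/sl    # the wrapper takes its place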
And if we check executed.log, sl is indeed executed in 4 waves, albeit by 5 workers simultaneously in each wave:
#wave1
2025-03-20 13:23:57.125482563 21049 started
2025-03-20 13:23:57.127526987 21666 started
2025-03-20 13:23:57.131099388 4897 started
2025-03-20 13:23:57.134237754 102 started
2025-03-20 13:23:57.137091737 15733 started
#wave1 ends, wave2 starts
2025-03-20 13:24:03.704588580 21666 ended
2025-03-20 13:24:03.704621737 21049 ended
2025-03-20 13:24:03.707780748 4897 ended
2025-03-20 13:24:03.712086346 15733 ended
2025-03-20 13:24:03.711953000 102 ended
2025-03-20 13:24:03.714831149 18018 started
2025-03-20 13:24:03.721293279 23293 started
2025-03-20 13:24:03.724600164 27918 started
2025-03-20 13:24:03.729763900 15091 started
2025-03-20 13:24:03.733176122 18473 started
#wave2 ends, wave3 starts
2025-03-20 13:24:10.294286746 18018 ended
2025-03-20 13:24:10.297261754 23293 ended
2025-03-20 13:24:10.300925031 27918 ended
2025-03-20 13:24:10.300950334 15091 ended
2025-03-20 13:24:10.303498710 24873 started
2025-03-20 13:24:10.303980494 18473 ended
2025-03-20 13:24:10.308560194 31825 started
2025-03-20 13:24:10.310595182 18452 started
2025-03-20 13:24:10.314222848 16121 started
2025-03-20 13:24:10.317875812 30892 started
#wave3 ends, wave4 starts
2025-03-20 13:24:16.883609316 24873 ended
2025-03-20 13:24:16.886708598 18452 ended
2025-03-20 13:24:16.886867725 31825 ended
2025-03-20 13:24:16.890735338 16121 ended
2025-03-20 13:24:16.893661911 21975 started
2025-03-20 13:24:16.898525968 30892 ended
#crash imminent! wave4 ending, wave5 starting...
2025-03-20 13:24:23.474925807 21975 ended
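You can read the span straight off the timestamps, or let the shell do it; a rough sanity check over the same log (field 2 is the time-of-day column):

first=$(head -n1 /home/yew/executed.log | awk '{print $2}')   # 13:23:57.125482563
last=$(tail -n1 /home/yew/executed.log | awk '{print $2}')    # 13:24:23.474925807
echo "$first -> $last"   # ~26.35 seconds apart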
The logs span about 26.35 seconds, which is close to 27. It probably crashed just as wave 4 was reporting back. And each wave lasts about 6.7 seconds, right on the money with the manual measurement.
So why is Jest running sl in 4 waves? Why did it crash at the start of the 5th wave?
Let’s again modify the poor man’s bpftrace to also log the args and working directory:
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started: $@ at $PWD" >> /home/yew/executed.log
From the results we can see that the 5 workers are busy executing sl root, which corresponds to the getRoot() function in jest-changed-files/sl.ts:
2025-03-21 05:50:22.663263304 started: root at /home/yew/cloudflare/repos/backstage/packages/app/src
2025-03-21 05:50:22.665550470 started: root at /home/yew/cloudflare/repos/backstage/packages/backend/src
2025-03-21 05:50:22.667988509 started: root at /home/yew/cloudflare/repos/backstage/plugins/access/src
2025-03-21 05:50:22.671781519 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-components/src
2025-03-21 05:50:22.673690514 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-entities/src
2025-03-21 05:50:29.247573899 started: root at /home/yew/cloudflare/repos/backstage/plugins/catalog-types-common/src
2025-03-21 05:50:29.251173536 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects/src
2025-03-21 05:50:29.255263605 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects-backend/src
2025-03-21 05:50:29.257293780 started: root at /home/yew/cloudflare/repos/backstage/plugins/pingboard-backend/src
2025-03-21 05:50:29.260285783 started: root at /home/yew/cloudflare/repos/backstage/plugins/resource-insights/src
2025-03-21 05:50:35.823374079 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-gaia/src
2025-03-21 05:50:35.825418386 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-r2/src
2025-03-21 05:50:35.829963172 started: root at /home/yew/cloudflare/repos/backstage/plugins/security-scorecard-dash/src
2025-03-21 05:50:35.832597778 started: root at /home/yew/cloudflare/repos/backstage/plugins/slo-directory/src
2025-03-21 05:50:35.834631869 started: root at /home/yew/cloudflare/repos/backstage/plugins/software-excellence-dashboard/src
2025-03-21 05:50:42.404063080 started: root at /home/yew/cloudflare/repos/backstage/plugins/teamcity/src
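A quick count over the same log (assuming a fresh log for this run) confirms there are exactly 16 of them:

grep -c 'started: root' /home/yew/executed.log   # 16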
The 16 entries here correspond neatly to the 16 rootDirs configured in Jest for Cloudflare’s backstage. We have 5 trains, and we want to visit 16 stations, so let’s do some simple math: 16 / 5 = 3.2, which means our trains need to go back and forth at least 4 times to cover them all.
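The same arithmetic as a ceiling division, for the skeptical:

echo $(( (16 + 5 - 1) / 5 ))   # ceil(16/5) = 4 waves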
Final mystery: Why did it crash?
Let’s go back to the very start of our journey. The original [Error] thrown was actually from here, and after modifying node_modules/jest-changed-files/index.js, I found that the error is shortMessage: 'Command failed with ENAMETOOLONG: sl status...'. The reason why became clear when I interrogated Jest about what it thinks the repos are. While the git repo is what you’d expect, the sl “repo” looks amazingly like a train wreck in motion:
got repos.git as Set(1) { '/home/yew/cloudflare/repos/backstage' }\ngot repos.sl as Set(1) {\n '\\x1B[?1049h\\x1B[1;24r\\x1B[m\\x1B(B\\x1B[4l\\x1B[?7h\\x1B[?25l\\x1B[H\\x1B[2J\\x1B[15;80H_\\x1B[15;79H_\\x1B[16d|\\x1B[9;80H_\\x1B[12;80H|\\x1B[13;80H|\\x1B[14;80H|\\x1B[15;78H__/\\x1B[16;79H|/\\x1B[17;80H\\\\\\x1B[9;\n 79H_D\\x1B[10;80H|\\x1B[11;80H/\\x1B[12;79H|\\x1B[K\\x1B[13d\\b|\\x1B[K\\x1B[14d\\b|/\\x1B[15;1H\\x1B[1P\\x1B[16;78H|/-\\x1B[17;79H\\\\_\\x1B[9;1H\\x1B[1P\\x1B[10;79H|(\\x1B[11;79H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\x1B[14;1H\\x1B[1P\\x1B[15;76H__/ =\\x1B[16;77H|/-=\\x1B[17;78H\\\\_/\\x1B[9;77H_D _\\x1B[10;78H|(_\\x1B[11;78H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b| _\\x1B[14;77H"https://blog.cloudflare.com/"\\x1B[15;75H__/\n =|\\x1B[16;76H|/-=|\\x1B[17;1H\\x1B[1P\\x1B[8;80H=\\x1B[9;76H_D _|\\x1B[10;77H|(_)\\x1B[11;77H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\r\\x1B[14d\\x1B[1P\\x1B[15d\\x1B[1P\\x1B[16;75H|/-=|_\\x1B[17;1H\\x1B[1P\\x1B[8;79H=\\r\\x1B[9d\\x1B[1P\\x1B[10;76H|(_)-\\x1B[11;76H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b| _\\r\\x1B[14d\\x1B[1P\\x1B[15;73H__/ =|\n o\\x1B[16;74H|/-=|_\\r\\x1B[17d\\x1B[1P\\x1B[8;78H=\\r\\x1B[9d\\x1B[1P\\x1B[10;75H|(_)-\\x1B[11;75H/\\x1B[K\\x1B[12d\\b\\b|\\x1B[K\\x1B[13d\\b|\n _\\r\\x1B[14d\\x1B[1P\\x1B[15d\\x1B[1P\\x1B[16;73H|/-=|_\\r\\x1B[17d\\x1B[1P\\x1B[8;77H=\\x1B[9;73H_D _| |\\x1B[10;74H|(_)-\\x1B[11;74H/ |\\x1B[12;73H| |\\x1B[13;73H| _\\x1B[14;73H"https://blog.cloudflare.com/" |\\x1B[15;71H__/\n =| o |\\x1B[16;72H|/-=|___|\\x1B[17;1H\\x1B[1P\\x 1B[5;79H(@\\x1B[7;77H(\\r\\x1B[8d\\x1B[1P\\x1B[9;72H_D _| |_\\x1B[10;1H\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;72H| _\\x1B[14;72H"https://blog.cloudflare.com/" |-\\x1B[15;70H__/\n =| o |=\\x1B[16;71H|/-=|___|=\\x1B[17;1H\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;71H_D _| |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;71H| _\\x1B[14;71H"https://blog.cloudflare.com/" |-\\x1B[15;69H__/ =| o\n |=-\\x1B[16;70H|/-=|___|=O\\x1B[17;71H\\\\_/ \\\\\\x1B[8;1H\\x1B[1P\\x1B[9;70H_D _| |_\\x1B[10;71H|(_)--- |\\x1B[11;71H/ | |\\x1B[12;70H| | |\\x1B[13;70H| _\\x1B[80G|\\x1B[14;70H"https://blog.cloudflare.com/"\n |-\\x1B[15;68H__/ =| o |=-~\\x1B[16;69H|/-=|___|=\\x1B[K\\x1B[17;70H\\\\_/ \\\\O\\x1B[8;1H\\x1B[1P\\x1B[9;69H_D _| |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;69H| _\\x1B[79G|_\\x1B[14;69H"https://blog.cloudflare.com/"\n |-\\x1B[15;67H__/ =| o |=-~\\r\\x1B[16d\\x1B[1P\\x1B[17;69H\\\\_/ \\\\_\\x1B[4d\\b\\b(@@\\x1B[5;75H( )\\x1B[7;73H(@@@)\\r\\x1B[8d\\x1B[1P\\x1B[9;68H_D _|\n |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;68H| _\\x1B[78G|_\\x1B[14;68H"https://blog.cloudflare.com/" |-\\x1B[15;66H__/ =| o |=-~~\\\\\\x1B[16;67H|/-=|___|= O\\x1B[17;68H\\\\_/ \\\\__/\\x1B[8;1H\\x1B[1P\\x1B[9;67H_D _|\n |_\\r\\x1B[10d\\x1B[1P\\x1B[11d\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;67H| _\\x1B[77G|_\\x1B[14;67H"https://blog.cloudflare.com/" |-\\x1B[15;65H__/ =| o |=-~O==\\x1B[16;66H|/-=|___|= |\\x1B[17;1H\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;66H_D _|\n |_\\x1B[10;67H|(_)--- | H\\x1B[11;67H/ | | H\\x1B[12;66H| | | H\\x1B[13;66H| _\\x1B[76G|___H\\x1B[14;66H"https://blog.cloudflare.com/" |-\\x1B[15;64H__/ =| o |=-O==\\x1B[16;65H|/-=|___|=\n |\\r\\x1B[17d\\x1B[1P\\x1B[8d\\x1B[1P\\x1B[9;65H_D _| |_\\x1B[80G/\\x1B[10;66H|(_)--- | H\\\\\\x1B[11;1H\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;65H| _\\x1B[75G|___H_\\x1B[14;65H"https://blog.cloudflare.com/" |-\\x1B[15;63H__/ =| o |=-~~\\\\\n /\\x1B[16;64H|/-=|___|=O=====O\\x1B[17;65H\\\\_/ \\\\__/ 
\\\\\\x1B[1;4r\\x1B[4;1H\\n' + '\\x1B[1;24r\\x1B[4;74H( )\\x1B[5;71H(@@@@)\\x1B[K\\x1B[7;69H( )\\x1B[K\\x1B[8;68H====\n \\x1B[80G_\\x1B[9;1H\\x1B[1P\\x1B[10;65H|(_)--- | H\\\\_\\x1B[11;1H\\x1B[1P\\x1B[12d\\x1B[1P\\x1B[13;64H| _\\x1B[74G|___H_\\x1B[14;64H"https://blog.cloudflare.com/" |-\\x1B[15;62H__/ =| o |=-~~\\\\ /~\\x1B[16;63H|/-=|___|=\n ||\\x1B[K\\x1B[17;64H\\\\_/ \\\\O=====O\\x1B[8;67H==== \\x1B[79G_\\r\\x1B[9d\\x1B[1P\\x1B[10;64H|(_)--- | H\\\\_\\x1B[11;64H/ | | H |\\x1B[12;63H| | | H |\\x1B[13;63H|\n _\\x1B[73G|___H__/\\x1B[14;63H"https://blog.cloudflare.com/" |-\\x1B[15;61H__/ =| o |=-~~\\\\ /~\\r\\x1B[16d\\x1B[1P\\x1B[17;63H\\\\_/ \\\\_\\x1B[8;66H==== \\x1B[78G_\\r\\x1B[9d\\x1B[1P\\x1B[10;63H|(_)--- |\n H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;62H| | | H |_\\x1B[13;62H| _\\x1B[72G|___H__/_\\x1B[14;62H"https://blog.cloudflare.com/" |-\\x1B[15;60H__/ =| o |=-~~\\\\ /~~\\\\\\x1B[16;61H|/-=|___|= O=====O\\x1B[17;62H\\\\_/ \\\\__/\n \\\\__/\\x1B[8;65H==== \\x1B[77G_\\r\\x1B[9d\\x1B[1P\\x1B[10;62H|(_)--- | H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;61H| | | H |_\\x1B[13;61H| _\\x1B[71G|___H__/_\\x1B[14;61H"https://blog.cloudflare.com/" |-\\x1B[80GI\\x1B[15;59H__/ =|\n o |=-~O=====O==\\x1B[16;60H|/-=|___|= || |\\x1B[17;1H\\x1B[1P\\x1B[2;79H(@\\x1B[3;74H( )\\x1B[K\\x1B[4;70H(@@@@)\\x1B[K\\x1B[5;67H( )\\x1B[K\\x1B[7;65H(@@@)\\x1B[K\\x1B[8;64H====\n \\x1B[76G_\\r\\x1B[9d\\x1B[1P\\x1B[10;61H|(_)--- | H\\\\_\\x1B[11;61H/ | | H | |\\x1B[12;60H| | | H |__-\\x1B[13;60H| _\\x1B[70G|___H__/__|\\x1B[14;60H"https://blog.cloudflare.com/" |-\\x1B[79GI_\\x1B[15;58H__/ =| o\n |=-O=====O==\\x1B[16;59H|/-=|___|= || |\\r\\x1B[17d\\x1B[1P\\x1B[8;63H==== \\x1B[75G_\\r\\x1B[9d\\x1B[1P\\x1B[10;60H|(_)--- | H\\\\_\\r\\x1B[11d\\x1B[1P\\x1B[12;59H| | | H |__-\\x1B[13;59H|\n _\\x1B[69G|___H__/__|_\\x1B[14;59H"https://blog.cloudflare.com/" |-\\x1B[78GI_\\x1B[15;57H__/ =| o |=-~~\\\\ /~~\\\\ /\\x1B[16;58H|/-=|___|=O=====O=====O\\x1B[17;59H\\\\_/ \\\\__/ \\\\__/ \\\\\\x1B[8;62H====\n \\x1B[74G_\\r\\x1B[9d\\x1B[1P\\x1B[10;59H|(_)--- | H\\\\_\\r\\x1B | | H |__-\\x1B[13;58H| _\\x1B[68G|___H__/__|_\\x1B[14;58H"https://blog.cloudflare.com/" |-\\x1B[77GI_\\x1B[15;56H__/ =| o |=-~~\\\\ /~~\\\\ /~\\x1B[16;57H|/-=|___|=\n || ||\\x1B[K\\x1B[17;58H\\\\_/ \\\\O=====O=====O\\x1B[8;61H==== \\x1B[73G_\\r\\x1B[9d\\x1B[1P\\x1B[10;58H|(_)--- _\\x1B[67G|___H__/__|_\\x1B[14;57H"https://blog.cloudflare.com/" |-\\x1B[76GI_\\x1B[15;55H__/ =| o |=-~~\\\\ /~~\\\\\n /~\\r\\x1B[16d\\x1B[1P\\x1B[17;57H\\\\_/ \\\\_\\x1B[2;75H( ) (\\x1B[3;70H(@@@)\\x1B[K\\x1B[4;66H()\\x1B[K\\x1B[5;63H(@@@@)\\x1B[
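That wall of escape codes is the locomotive animation itself: sl root wrote its ASCII art to stdout, and Jest dutifully recorded it as the repository root. Anything that later treats a multi-kilobyte string like that as a path dies at the kernel boundary. A minimal way to provoke the same errno on Linux (where a single path component is capped at NAME_MAX, 255 bytes), separate from Jest’s actual code path:

ls "$(printf 'x%.0s' {1..5000})"
# ls: cannot access 'xxxx…': File name too long   <- ENAMETOOLONG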
Acknowledgements
Thank you to my colleagues Mengnan Gong and Shuhao Zhang, whose ideas and perspectives helped narrow down the root causes of this mystery.
If you enjoy troubleshooting weird and tricky production issues, our engineering teams are hiring.
-
Enhanced safety and user trust: Proactively protect users from harmful or inappropriate interactions.
-
Flexibility and control over allowed content: Specify which categories to monitor and choose between flagging or outright blocking
-
Auditing and compliance capabilities: Stay ahead of evolving regulatory requirements with logs of user prompts, model responses, and enforced guardrails.
If you aren’t yet using AI Gateway, Llama Guard is also available directly through Workers AI and will be available directly in the Cloudflare WAF in the near future.
Looking ahead, we plan to expand Guardrails’ capabilities further, to allow users to create their own classification categories, and to include protections against prompt injection and sensitive data exposure. To begin using Guardrails, check out our developer documentation. If you have any questions, please reach out in our Discord community.
“],”published_at”:[0,”2025-02-26T14:00+00:00″],”updated_at”:[0,”2025-02-26T14:00:02.501Z”],”feature_image”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2tWD4aWiLcLAePzIx2I8RO/d669f328fe5b6fcf0c883ee4deb3f72a/image3.png”],”tags”:[1,[[0,{“id”:[0,”6Foe3R8of95cWVnQwe5Toi”],”name”:[0,”AI”],”slug”:[0,”ai”]}],[0,{“id”:[0,”4HIPcb68qM0e26fIxyfzwQ”],”name”:[0,”Developers”],”slug”:[0,”developers”]}],[0,{“id”:[0,”3JAY3z7p7An94s6ScuSQPf”],”name”:[0,”Developer Platform”],”slug”:[0,”developer-platform”]}],[0,{“id”:[0,”1GyUhE8o287lrdNSpdRUIe”],”name”:[0,”AI Gateway”],”slug”:[0,”ai-gateway”]}],[0,{“id”:[0,”6Mp7ouACN2rT3YjL1xaXJx”],”name”:[0,”Security”],”slug”:[0,”security”]}]]],”relatedTags”:[0],”authors”:[1,[[0,{“name”:[0,”Kathy Liao”],”slug”:[0,”kathy”],”bio”:[0,null],”profile_image”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XeJHmfHmhCUmRwC7aeCWR/fb2194fd1e4bed0667242d081354f5f2/kathy.png”],”location”:[0,”Seattle”],”website”:[0,null],”twitter”:[0,”@kathyyliao”],”facebook”:[0,null]}]]],”meta_description”:[0,”Deploy AI safely with built-in Guardrails in AI Gateway. Flag and block harmful or inappropriate content, protect personal data, and ensure compliance in real-time — keeping AI interactions secure and risk-free.”],”primary_author”:[0,{}],”localeList”:[0,{“name”:[0,”blog-english-only”],”enUS”:[0,”English for Locale”],”zhCN”:[0,”No Page for Locale”],”zhHansCN”:[0,”No Page for Locale”],”zhTW”:[0,”No Page for Locale”],”frFR”:[0,”No Page for Locale”],”deDE”:[0,”No Page for Locale”],”itIT”:[0,”No Page for Locale”],”jaJP”:[0,”No Page for Locale”],”koKR”:[0,”No Page for Locale”],”ptBR”:[0,”No Page for Locale”],”esLA”:[0,”No Page for Locale”],”esES”:[0,”No Page for Locale”],”enAU”:[0,”No Page for Locale”],”enCA”:[0,”No Page for Locale”],”enIN”:[0,”No Page for Locale”],”enGB”:[0,”No Page for Locale”],”idID”:[0,”No Page for Locale”],”ruRU”:[0,”No Page for Locale”],”svSE”:[0,”No Page for Locale”],”viVN”:[0,”No Page for Locale”],”plPL”:[0,”No Page for Locale”],”arAR”:[0,”No Page for Locale”],”nlNL”:[0,”No Page for Locale”],”thTH”:[0,”No Page for Locale”],”trTR”:[0,”No Page for Locale”],”heIL”:[0,”No Page for Locale”],”lvLV”:[0,”No Page for Locale”],”etEE”:[0,”No Page for Locale”],”ltLT”:[0,”No Page for Locale”]}],”url”:[0,”https://blog.cloudflare.com/guardrails-in-ai-gateway”],”metadata”:[0,{“title”:[0,”Keep AI interactions secure and risk-free with Guardrails in AI Gateway”],”description”:[0,”Deploy AI safely with built-in Guardrails in AI Gateway. Flag and block harmful or inappropriate content, protect personal data, and ensure compliance in real-time — keeping AI interactions secure and risk-free.”],”imgPreview”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZmeZpagSFXAEORoCIFjcS/31b6582fb84bb597ea2090b7405b7ff7/Keep_AI_interactions_secure_and_risk-free_with_Guardrails_in_AI_Gateway-OG.png”]}]}],[0,{“id”:[0,”3UHNgpNPKn2IAwDUzD4m3a”],”title”:[0,”Searching for the cause of hung tasks in the Linux kernel”],”slug”:[0,”searching-for-the-cause-of-hung-tasks-in-the-linux-kernel”],”excerpt”:[0,”The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state.”],”featured”:[0,false],”html”:[0,”
Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the kernel process is stuck in the uninterruptable state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.
\n
INFO: task XXX:1495882 blocked for more than YYY seconds.
\n \n \n \n
\n
The hung task message in the kernel log looks like this:
\n
INFO: task XXX:1495882 blocked for more than YYY seconds.\n Tainted: G O 6.6.39-cloudflare-2024.7.3 #1\n"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.\ntask:XXX state:D stack:0 pid:1495882 ppid:1 flags:0x00004002\n. . .
\n
Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the TASK_RUNNING
state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in a TASK_INTERRUPTIBLE
state and can spend an arbitrary length of time in this state until being woken up by a signal. The most important thing about these states is that they still can receive signals, and be terminated by a signal. In contrast, a process in the TASK_UNINTERRUPTIBLE
state is waiting only for certain special classes of events to wake them up, and can’t be interrupted by a signal. The signals are not delivered until the process emerges from this state and only a system reboot can clear the process. It’s marked with the letter D
in the log shown above.
What if this wake up event doesn’t happen or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this dependent process holds some lock and prevents other processes from acquiring it? Or if we see many processes in the D state? Then it might tell us that some of the system resources are overwhelmed or are not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It might be useful if part of the data is written to disk and another part is still in the process memory — we don’t want inconsistent data on a disk. Or maybe we want a snapshot of the process memory when the bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: TASK_KILLABLE
— it still protects a process, but allows termination with a fatal signal.
\n
How Linux identifies the hung process
\n \n \n \n
\n
The Linux kernel has a special thread called khungtaskd
. It runs regularly depending on the settings, iterating over all processes in the D
state. If a process is in this state for more than YYY seconds, we’ll see a message in the kernel log. There are settings for this daemon that can be changed according to your wishes:
\n
$ sudo sysctl -a --pattern hung\nkernel.hung_task_all_cpu_backtrace = 0\nkernel.hung_task_check_count = 4194304\nkernel.hung_task_check_interval_secs = 0\nkernel.hung_task_panic = 0\nkernel.hung_task_timeout_secs = 10\nkernel.hung_task_warnings = 200
\n
At Cloudflare, we changed the notification threshold kernel.hung_task_timeout_secs
from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on configuration and how critical this delay is for you. If the process spends more than hung_task_timeout_secs
seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is kernel.hung_task_warnings
— the total number of messages that will be sent to the log. We limit it to 200 messages and reset it every 15 minutes. It allows us not to be overwhelmed by the same issue, and at the same time doesn’t stop our monitoring for too long. You can make it unlimited by setting the value to “-1”.
To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples.
\n
Example #1 or XFS
\n \n \n \n
\n
Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:
\n
INFO: task kworker/13:0:834409 blocked for more than 11 seconds.\n \tTainted: G \tO \t6.6.39-cloudflare-2024.7.3 #1\n"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.\ntask:kworker/13:0\tstate:D stack:0 \tpid:834409 ppid:2 flags:0x00004000\nWorkqueue: xfs-sync/dm-6 xfs_log_worker
\n
In this log, kworker
is the kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under kworker
, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the kworker
is accompanied by the Workqueue
line. Workqueue
is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the kworker
in the order they were added to the queue. The Workqueue
name xfs-sync
and the function which it points to, xfs_log_worker
, might give a good clue where to look. Here we can make an assumption that the XFS is under pressure and check the relevant metrics. It helped us to discover that due to some configuration changes, we forgot no_read_workqueue
/ no_write_workqueue
flags that were introduced some time ago to speed up Linux disk encryption.
Summary: In this case, nothing critical happened to the system, but the hung tasks warnings gave us an alert that our file system had slowed down.
\n
Example #2 or Coredump
\n \n \n \n
\n
Let’s take a look at the next hung task log and its decoded stack trace:
\n
INFO: task test:964 blocked for more than 5 seconds.\n Not tainted 6.6.72-cloudflare-2025.1.7 #1\n"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.\ntask:test state:D stack:0 pid:964 ppid:916 flags:0x00004000\nCall Trace:\n<TASK>\n__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) \nschedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) \n[do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) \n? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) \ndo_group_exit (linux/kernel/exit.c:1005) \nget_signal (linux/kernel/signal.c:2869) \n? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) \n? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) \narch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) \n? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) \n? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) \nexit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) \nsyscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) \n? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) \ndo_syscall_64 (linux/arch/x86/entry/common.c:88) \nentry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121) \n</TASK>
\n
The stack trace says that the process or application test
was blocked for more than 5 seconds
. We might recognise this user space application by the name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))
. The source code points to the coredump_task_exit
function. Additionally, checking the process metrics revealed that the application crashed during the time when the warning message appeared in the log. When a process is terminated based on some set of signals (abnormally), the Linux kernel can provide a core dump file, if enabled. The mechanism — when a process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through the socket to another handler — can be systemd-coredump or your custom one. When it happens, the kernel moves the process to the D
state to preserve its memory and early termination. The higher the process memory usage, the longer it takes to get a core dump file, and the higher the chance of getting a hung task warning.
Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.
Coredump settings:
\n
$ sudo sysctl -a --pattern kernel.core\nkernel.core_pattern = core\nkernel.core_pipe_limit = 16\nkernel.core_uses_pid = 1
\n
You can make changes with sysctl:
\n
$ sudo sysctl -w kernel.core_uses_pid=1
\n
Hung task settings:
\n
$ sudo sysctl -a --pattern hung\nkernel.hung_task_all_cpu_backtrace = 0\nkernel.hung_task_check_count = 4194304\nkernel.hung_task_check_interval_secs = 0\nkernel.hung_task_panic = 0\nkernel.hung_task_timeout_secs = 1\nkernel.hung_task_warnings = -1
\n
Go program:
\n
$ cat main.go\npackage main\n\nimport (\n\t"os"\n\t"time"\n)\n\nfunc main() {\n\t_, err := os.ReadFile("test.file")\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\ttime.Sleep(8 * time.Minute) \n}
\n
This program reads a 10 GB file into process memory. Let’s create the file:
\n
$ yes this is 10GB file | head -c 10GB > test.file
\n
The last step is to build the Go program, crash it, and watch our kernel log:
\n
$ go mod init test\n$ go build .\n$ GOTRACEBACK=crash ./test\n$ (Ctrl+\\)
\n
Hooray! We can see our hung task warning:
\n
$ sudo dmesg -T | tail -n 31\nINFO: task test:8734 blocked for more than 22 seconds.\n Not tainted 6.6.72-cloudflare-2025.1.7 #1\n Blocked by coredump.\n"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.\ntask:test state:D stack:0 pid:8734 ppid:8406 task_flags:0x400448 flags:0x00004000
\n
By the way, have you noticed the Blocked by coredump.
line in the log? It was recently added to the upstream code to improve visibility and remove the blame from the process itself. The patch also added the task_flags
information, as Blocked by coredump
is detected via the flag PF_POSTCOREDUMP
, and knowing all the task flags is useful for further root-cause analysis.
Summary: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, coredump
.
\n
Example #3 or rtnl_mutex
\n \n \n \n
\n
This one was tricky to debug. Usually, the alerts are limited by one or two different processes, meaning only a certain application or subsystem experiences an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvements over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:
\n
rtnetlink_rcv_msg+0x9/0x3c0\ndev_ethtool+0xc6/0x2db0 \nbonding_show_bonds+0x20/0xb0
\n
Further investigation showed that all of these functions were waiting for rtnl_lock
to be acquired. It looked like some application acquired the rtnl_mutex
and didn’t release it. All other processes were in the D
state waiting for this lock.
The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global mutex lock, although upstream efforts are being made for splitting up RTNL per network namespace (netns).
From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged BPF
via a bpftrace
script, as this allows us to inspect the running kernel state. The kernel’s mutex implementation has a struct member called owner
. It contains a pointer to the task_struct
from the mutex-owning process, except it is encoded as type atomic_long_t
. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this task_struct
pointer, we must first mask off the lower bits (0x7).
Our bpftrace
script to determine who holds the mutex is as follows:
\n
#!/usr/bin/env bpftrace\ninterval:s:10 {\n $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");\n $owner = (struct task_struct *) ($rtnl_mutex->owner.counter & ~0x07);\n if ($owner != 0) {\n printf("rtnl_mutex->owner = %u %s\\n", $owner->pid, $owner->comm);\n }\n}
\n
In this script, the rtnl_mutex
lock is a global lock whose address can be exposed via /proc/kallsyms
– using bpftrace
helper function kaddr()
, we can access the struct mutex pointer from the kallsyms
. Thus, we can periodically (via interval:s:10
) check if someone is holding this lock.
In the output we had this:
\n
rtnl_mutex->owner = 3895365 calico-node
\n
This allowed us to quickly identify calico-node
as the process holding the RTNL lock for too long. To quickly observe where this process itself is stalled, the call stack is available via /proc/3895365/stack
. This showed us that the root cause was a Wireguard config change, with function wg_set_device()
holding the RTNL lock, and peer_remove_after_dead()
waiting too long for a napi_disable()
call. We continued debugging via a tool called drgn
, which is a programmable debugger that can debug a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause for the Wireguard issue and have asked the upstream for help, but that is another story.
Summary: The hung task messages were the only ones which we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.
\n \n
Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what the meaning is behind them. We tried to provide some navigation guidance for the debugging process as well:
-
analyzing the stack trace might be a good starting point for debugging it, even if all the messages look unrelated, like we saw in example #3
-
keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3
-
if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code
Good luck with your debugging, and hopefully this material will help you on this journey!
“],”published_at”:[0,”2025-02-14T14:00+00:00″],”updated_at”:[0,”2025-03-03T07:09:45.578Z”],”feature_image”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4iHBIzAkLrDyr0Gc4NAtK3/6c26fa93870ea112c62a791a30bcf705/image1.png”],”tags”:[1,[[0,{“id”:[0,”2UVIYusJwlvsmPYl2AvSuR”],”name”:[0,”Deep Dive”],”slug”:[0,”deep-dive”]}],[0,{“id”:[0,”383iv0UQ6Lp0GZwOAxGq2p”],”name”:[0,”Linux”],”slug”:[0,”linux”]}],[0,{“id”:[0,”73alK6sbtKLS6uB7ZrYrjj”],”name”:[0,”Kernel”],”slug”:[0,”kernel”]}],[0,{“id”:[0,”3VJOfQ8TnNJwqu1GIGwPuA”],”name”:[0,”Monitoring”],”slug”:[0,”monitoring”]}]]],”relatedTags”:[0],”authors”:[1,[[0,{“name”:[0,”Oxana Kharitonova”],”slug”:[0,”oxana”],”bio”:[0,null],”profile_image”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3VMs5mLnM2JGDuB1x0sRSE/957ab30efef528d9fa8ccf73f1c20242/oxana.png”],”location”:[0,”London”],”website”:[0,null],”twitter”:[0,null],”facebook”:[0,null]}],[0,{“name”:[0,”Jesper Brouer”],”slug”:[0,”jesper-brouer”],”bio”:[0],”profile_image”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/QAY4H7YQSvJvLIPB5Ff0p/6f6aa104cc897511714dddc9ea4ebe1a/Jesper_Brouer.jpg”],”location”:[0],”website”:[0],”twitter”:[0],”facebook”:[0]}]]],”meta_description”:[0,”The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the warning works.”],”primary_author”:[0,{}],”localeList”:[0,{“name”:[0,”LOC: Searching for the cause of hung tasks in the Linux kernel”],”enUS”:[0,”English for Locale”],”zhCN”:[0,”Translated for Locale”],”zhHansCN”:[0,”No Page for Locale”],”zhTW”:[0,”Translated for Locale”],”frFR”:[0,”No Page for Locale”],”deDE”:[0,”No Page for Locale”],”itIT”:[0,”No Page for Locale”],”jaJP”:[0,”No Page for Locale”],”koKR”:[0,”No Page for Locale”],”ptBR”:[0,”No Page for Locale”],”esLA”:[0,”No Page for Locale”],”esES”:[0,”No Page for Locale”],”enAU”:[0,”No Page for Locale”],”enCA”:[0,”No Page for Locale”],”enIN”:[0,”No Page for Locale”],”enGB”:[0,”No Page for Locale”],”idID”:[0,”No Page for Locale”],”ruRU”:[0,”No Page for Locale”],”svSE”:[0,”No Page for Locale”],”viVN”:[0,”No Page for Locale”],”plPL”:[0,”No Page for Locale”],”arAR”:[0,”No Page for Locale”],”nlNL”:[0,”No Page for Locale”],”thTH”:[0,”No Page for Locale”],”trTR”:[0,”No Page for Locale”],”heIL”:[0,”No Page for Locale”],”lvLV”:[0,”No Page for Locale”],”etEE”:[0,”No Page for Locale”],”ltLT”:[0,”No Page for Locale”]}],”url”:[0,”https://blog.cloudflare.com/searching-for-the-cause-of-hung-tasks-in-the-linux-kernel”],”metadata”:[0,{“title”:[0,”Searching for the cause of hung tasks in the Linux kernel”],”description”:[0,”The Linux kernel can produce a hung task warning. Searching the Internet and the kernel docs, you can find a brief explanation that the process is stuck in the uninterruptible state. That explains the warning’s meaning, but doesn’t provide the reason it occurred. 
In this blog post we’re going to explore how the warning works.”],”imgPreview”:[0,”https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2dasu8nuHwyI37n0WhScLi/4f710b8e0eeee6ba46efa49c4fb6ae49/Searching_for_the_cause_of_hung_tasks_in_the_Linux_kernel-OG.png”]}]}]]],”locale”:[0,”en-us”],”translations”:[0,{“posts.by”:[0,”By”],”footer.gdpr”:[0,”GDPR”],”lang_blurb1″:[0,”This post is also available in {lang1}.”],”lang_blurb2″:[0,”This post is also available in {lang1} and {lang2}.”],”lang_blurb3″:[0,”This post is also available in {lang1}, {lang2} and {lang3}.”],”footer.press”:[0,”Press”],”header.title”:[0,”The Cloudflare Blog”],”search.clear”:[0,”Clear”],”search.filter”:[0,”Filter”],”search.source”:[0,”Source”],”footer.careers”:[0,”Careers”],”footer.company”:[0,”Company”],”footer.support”:[0,”Support”],”footer.the_net”:[0,”theNet”],”search.filters”:[0,”Filters”],”footer.our_team”:[0,”Our team”],”footer.webinars”:[0,”Webinars”],”page.more_posts”:[0,”More posts”],”posts.time_read”:[0,”{time} min read”],”search.language”:[0,”Language”],”footer.community”:[0,”Community”],”footer.resources”:[0,”Resources”],”footer.solutions”:[0,”Solutions”],”footer.trademark”:[0,”Trademark”],”header.subscribe”:[0,”Subscribe”],”footer.compliance”:[0,”Compliance”],”footer.free_plans”:[0,”Free plans”],”footer.impact_ESG”:[0,”Impact/ESG”],”posts.follow_on_X”:[0,”Follow on X”],”footer.help_center”:[0,”Help center”],”footer.network_map”:[0,”Network Map”],”header.please_wait”:[0,”Please Wait”],”page.related_posts”:[0,”Related posts”],”search.result_stat”:[0,”Results {search_range} of {search_total} for {search_keyword}“],”footer.case_studies”:[0,”Case Studies”],”footer.connect_2024″:[0,”Connect 2024″],”footer.terms_of_use”:[0,”Terms of Use”],”footer.white_papers”:[0,”White Papers”],”footer.cloudflare_tv”:[0,”Cloudflare TV”],”footer.community_hub”:[0,”Community Hub”],”footer.compare_plans”:[0,”Compare plans”],”footer.contact_sales”:[0,”Contact Sales”],”header.contact_sales”:[0,”Contact Sales”],”header.email_address”:[0,”Email Address”],”page.error.not_found”:[0,”Page not found”],”footer.developer_docs”:[0,”Developer docs”],”footer.privacy_policy”:[0,”Privacy Policy”],”footer.request_a_demo”:[0,”Request a demo”],”page.continue_reading”:[0,”Continue reading”],”footer.analysts_report”:[0,”Analyst reports”],”footer.for_enterprises”:[0,”For enterprises”],”footer.getting_started”:[0,”Getting Started”],”footer.learning_center”:[0,”Learning Center”],”footer.project_galileo”:[0,”Project Galileo”],”pagination.newer_posts”:[0,”Newer Posts”],”pagination.older_posts”:[0,”Older Posts”],”posts.social_buttons.x”:[0,”Discuss on X”],”search.icon_aria_label”:[0,”Search”],”search.source_location”:[0,”Source/Location”],”footer.about_cloudflare”:[0,”About Cloudflare”],”footer.athenian_project”:[0,”Athenian Project”],”footer.become_a_partner”:[0,”Become a partner”],”footer.cloudflare_radar”:[0,”Cloudflare Radar”],”footer.network_services”:[0,”Network services”],”footer.trust_and_safety”:[0,”Trust & Safety”],”header.get_started_free”:[0,”Get Started Free”],”page.search.placeholder”:[0,”Search Cloudflare”],”footer.cloudflare_status”:[0,”Cloudflare Status”],”footer.cookie_preference”:[0,”Cookie Preferences”],”header.valid_email_error”:[0,”Must be valid email.”],”search.result_stat_empty”:[0,”Results {search_range} of {search_total}“],”footer.connectivity_cloud”:[0,”Connectivity cloud”],”footer.developer_services”:[0,”Developer services”],”footer.investor_relations”:[0,”Investor relations”],”page.not_found.error_code”:[0,”Error Code: 
404″],”search.autocomplete_title”:[0,”Insert a query. Press enter to send”],”footer.logos_and_press_kit”:[0,”Logos & press kit”],”footer.application_services”:[0,”Application services”],”footer.get_a_recommendation”:[0,”Get a recommendation”],”posts.social_buttons.reddit”:[0,”Discuss on Reddit”],”footer.sse_and_sase_services”:[0,”SSE and SASE services”],”page.not_found.outdated_link”:[0,”You may have used an outdated link, or you may have typed the address incorrectly.”],”footer.report_security_issues”:[0,”Report Security Issues”],”page.error.error_message_page”:[0,”Sorry, we can’t find the page you are looking for.”],”header.subscribe_notifications”:[0,”Subscribe to receive notifications of new posts:”],”footer.cloudflare_for_campaigns”:[0,”Cloudflare for Campaigns”],”header.subscription_confimation”:[0,”Subscription confirmed. Thank you for subscribing!”],”posts.social_buttons.hackernews”:[0,”Discuss on Hacker News”],”footer.diversity_equity_inclusion”:[0,”Diversity, equity & inclusion”],”footer.critical_infrastructure_defense_project”:[0,”Critical Infrastructure Defense Project”]}],”localesAvailable”:[1,[]],”footerBlurb”:[0,”Cloudflare’s connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.
Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.
To learn more about our mission to help build a better Internet, start here. If you’re looking for a new career direction, check out our open positions.”]}” ssr client=”load” opts=”{“name”:”Post”,”value”:true}” await-children>
2025-04-02
7 min read
So the story begins with a pair programming session I had with my colleague, which I desperately needed because my node skill tree is still at level 1, and I needed to get started with React because I’ll be working on our internal backstage instance.
We worked together on a small feature, tested it locally, and it worked. Great. Now it’s time to make My Very First React Commit. So I ran the usual git add and git commit, which hooked into yarn test to automatically run unit tests for backstage, and that’s when everything got derailed. For all the React tutorials I have followed, I had never actually run yarn test on my machine. And the first time I tried it, it hung, and after a long time, the command eventually failed:
Determining test suites to run...
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈 backstage ⚡
I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to write a proper toString() or whatever, and thus we’re stuck with the monumentally unhelpful [Error]. Searching the web yielded an entire ocean of false positives due to how vague the error message is. What a train wreck!
Fine, let’s put on our troubleshooting hats. My memory is not perfect, but thankfully shell history is. Let’s see all the (ultimately useless) things that were tried (with commentary):
2025-03-19 14:18 yarn test --help
2025-03-19 14:20 yarn test --verbose
2025-03-19 14:21 git diff --staged
2025-03-19 14:25 vim README.md # Did I miss some setup?
2025-03-19 14:28 i3lock -c 336699 # "I need a drink"
2025-03-19 14:34 yarn test --debug # Debug, verbose, what's the diff
2025-03-19 14:35 yarn backstage-cli repo test # Maybe if I invoke it directly ...
2025-03-19 14:36 yarn backstage-cli --version # Nope, same as mengnan's
2025-03-19 14:36 yarn backstage-cli repo --help
2025-03-19 14:36 yarn backstage-cli repo test --since HEAD~1 # Minimal changes?
2025-03-19 14:36 yarn backstage-cli repo test --since HEAD # Uhh idk no changes???
2025-03-19 14:38 yarn backstage-cli repo test plugins # The first breakthrough. More on this later
2025-03-19 14:39 n all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Pres
filter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Ent
rigger a test run all tests.\n › Press f to run only failed tests.\n › Press o to only run tests related to changed files.\n › Press
lter by a filename regex pattern.\n › Press t to filter by a test name regex pattern.\n › Press q to quit watch mode.\n › Press Enter
gger a test ru # Got too excited and pasted rubbish
2025-03-19 14:44 ls -a | fgrep log
2025-03-19 14:44 find | fgrep log # Maybe it leaves a log file?
2025-03-19 14:46 yarn backstage-cli repo test --verbose --debug --no-cache plugins # "clear cache"
2025-03-19 14:52 yarn backstage-cli repo test --no-cache --runInBand . # No parallel
2025-03-19 15:00 yarn backstage-cli repo test --jest-help
2025-03-19 15:03 yarn backstage-cli repo test --resetMocks --resetModules plugins # I have no idea what I'm resetting
The first real breakthrough was test plugins, which runs only tests matching “plugins”. This effectively bypassed the “determining test suites to run…” logic, which was the thing that was hanging. So I was now able to get tests to run. However, these too eventually crashed with the same cryptic [Error]:
PASS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/TeamMembersListCard/TeamMembersListCard.test.tsx (6.787 s)
PASS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/ClusterDependencyCard/ClusterDependencyCard.test.tsx
PASS @internal/plugin-software-excellence-dashboard plugins/software-excellence-dashboard/src/components/AppDetail/AppDetail.test.tsx
PASS @cloudflare/backstage-entities plugins/backstage-entities/src/AccessLinkPolicy.test.ts
● Test suite failed to run
thrown: [Error]
Re-running it or matching different tests will give slightly different run logs, but they always end with the same error.
By now, I’ve figured out that yarn test is actually backed by Jest, a JavaScript testing framework, so my next strategy is simply trying different Jest flags to see what sticks, but invariably, none do:
2025-03-19 15:16 time yarn test --detectOpenHandles plugins
2025-03-19 15:18 time yarn test --runInBand .
2025-03-19 15:19 time yarn test --detectLeaks .
2025-03-19 15:20 yarn test --debug aetsnuheosnuhoe
2025-03-19 15:21 yarn test --debug --no-watchman nonexisis
2025-03-19 15:21 yarn test --jest-help
2025-03-19 15:22 yarn test --debug --no-watch ooooooo > ~/jest.config
A pattern finally emerges
Eventually, after re-running it so many times, I started to notice a pattern. By default, after a test run Jest drops you into an interactive menu where you can (Q)uit, run (A)ll tests, and so on, and I realized that Jest would eventually crash even if it was just idling in that menu. I started timing the runs, which led me to the second breakthrough:
› Press q to quit watch mode.
› Press Enter to trigger a test run.
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test . 109.96s user 14.21s system 459% cpu 27.030 total
RUNS @cloudflare/backstage-components plugins/backstage-components/src/components/Cards/TeamRoles/CustomerSuccessCard.test.tsx
RUNS @cloudflare/backstage-app packages/app/src/components/catalog/EntityFipsPicker/EntityFipsPicker.test.tsx
Test Suites: 2 failed, 23 passed, 25 of 65 total
Tests: 217 passed, 217 total
Snapshots: 0 total
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test . 110.85s user 14.04s system 463% cpu 26.974 total
No matter what Jest was doing, it always crashed after almost exactly 27 wall-clock seconds. It literally didn’t matter which tests I selected or re-ran. Even the original problem, a bare yarn test (no tests selected, it just hangs), would crash after 27 seconds:
Determining test suites to run...
● Test suite failed to run
thrown: [Error]
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn test 2.05s user 0.71s system 10% cpu 27.094 total
Obviously, some sort of timeout. 27 seconds is kind of a weird number (unlike, say, 5 seconds or 60 seconds), but let’s try:
2025-03-19 15:09 find | fgrep 27
2025-03-19 15:09 git grep '\b27\b'
No decent hits.
How about something like 20+7 or even 20+5+2? Nope.
Googling/GPT-4oing for “jest timeout 27 seconds” again yielded nothing useful. Far more people were having problems with testing asynchronously, or getting their tests to time out, than with Jest proper.
At this time, my colleague came back from his call, and with his help we determined some other things:
-
his system (macOS) is not affected at all versus mine (Linux)
-
nvm use v20 didn’t fix it
-
I can reproduce it on a clean clone of github.com/backstage/backstage. The tests seem to progress further, about 50+ seconds. This lends credence to a running theory that the filesystem crawler/watcher is the one crashing, and backstage/backstage is a bigger repo than the internal Cloudflare instance, so it takes longer.
I next went on a little detour to grab another colleague who I know has been working on a Next.js project. He’s one of the few other people nearby who knows anything about Node.js. In my experience with troubleshooting it’s helpful to get multiple perspectives, so we can cover each other’s blind spots and avoid tunnel vision.
I then tried invoking many yarn tests in parallel, and I did manage to get the crash time to stretch out to 28 or 29 seconds when the system was under heavy load. So this tells me that it might not be a hard timeout, but rather something processing-driven. A series of sleeps chugging along, perhaps?
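Reconstructing that experiment after the fact, it was along these lines (the run count and log paths are illustrative, not from my history):
for i in 1 2 3 4; do
  # each run's test output goes to a file; the `time` report still
  # prints to the terminal, so the crash times are easy to compare
  (time yarn test . >/tmp/jest-run-$i.log 2>&1) &
done
wait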
By now, there is a veritable crowd of curious onlookers gathered in front of my terminal marveling at the consistent 27-second crash and trading theories. At some point, someone asked if I had tried rebooting yet, and I had to sheepishly reply that I hadn’t, but that “I’m absolutely sure it wouldn’t help whatsoever”.
And the astute reader can already guess that rebooting did nothing at all, or else this wouldn’t even be a story worth telling. Besides, haven’t I teased in the clickbaity title about some crazy Steam Locomotive from 1993?
My colleague then put us back on track and suggested strace, and I decided to trace the simpler case of the idling menu (rather than trace running tests, which generated far more syscalls).
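The exact flags are lost to my scrollback, but the invocation was to this effect (the output path and -tt timestamps are my reconstruction; timestamps help when you are hunting a 27-second pattern):
strace -tt -o /tmp/jest.strace yarn test
Then I left the Jest menu idling and watched the trace roll in.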
Watch Usage
› Press a to run all tests.
› Press f to run only failed tests.
› Press o to only run tests related to changed files.
› Press p to filter by a filename regex pattern.
› Press t to filter by a test name regex pattern.
› Press q to quit watch mode.
› Press Enter to trigger a test run.
[], 1024, 1000) = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21) = 0
epoll_wait(13, [], 1024, 0) = 0
epoll_wait(13, [], 1024, 999) = 0
openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 21
read(21, "42375 (node) R 42372 42372 11692"..., 1023) = 301
close(21) = 0
epoll_wait(13, [], 1024, 0) = 0
epoll_wait(13,
It basically epoll_waits until 27 seconds are up and then, right when the crash happens:
● Test suite failed to run
thrown: [Error]
0x7ffd7137d5e0, 1024, 1000) = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=42578, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
read(4, "*", 1) = 1
write(15, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 16) = 16
write(5, "*", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 1024, 101) = 1
read(14, "\210\352!\5\0\0\0\0\21\0\0\0\0\0\0\0", 512) = 16
wait4(42578, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 42578
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
read(4, "*", 1) = 1
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x79e91e045330}, NULL, 8) = 0
write(5, "*", 1) = 1
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
mmap(0x34ecad880000, 1495040, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x34ecad880000
madvise(0x34ecad880000, 1495040, MADV_DONTFORK) = 0
munmap(0x34ecad9ae000, 258048) = 0
mprotect(0x34ecad880000, 1236992, PROT_READ|PROT_WRITE) = 0
I don’t know about you, but sometimes I look at straces and wonder “Do people actually read this gibberish?” Fortunately, in the modern generative AI era, we can count on GPT-4o to gently chide: the process was interrupted (EINTR) by its child (SIGCHLD), which means you forgot about the children, silly human. Is the problem with one of the cars rather than the engine?
Following this train of thought, I now re-ran with strace --follow-forks, which revealed a giant flurry of activity that promptly overflowed my terminal buffer. The investigation is really gaining steam now. The original trace weighs in at a hefty 500,000 lines, but here is a smaller equivalent version derived from a clean instance of backstage: trace.log.gz. I have uploaded this trace here because the by-now overhyped Steam Locomotive is finally making its grand appearance and I know there’ll be people who’d love nothing more than to crawl through a haystack of system calls looking for a train-sized needle. Consider yourself lucky, I had to do it without even knowing what I was looking for, much less that it was a whole Steam Locomotive.
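For the record, the command was essentially the following (the output file name is illustrative). With --follow-forks and a single -o file, strace prefixes every traced line with its [pid NNNNN], which is exactly what you will see in the snippets below:
strace --follow-forks -o trace.log yarn test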
This section is left intentionally blank to allow locomotive enthusiasts who want to find the train on their own to do so first.
Remember my comment about straces being gibberish? Actually, I was kidding. So there are a few ways to make it more manageable, and with experience you’ll learn which system calls to pay attention to, such as execve, chdir, open, read, fork, and signals, and which ones to skim over, such as mprotect, mmap, and futex.
Since I’m writing this account after the fact, let’s cheat a little and assume I was super smart and zeroed in on execve correctly on the first try:
🌈 ~ zgrep execve trace.log.gz | head
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/yarn", ["yarn", "test", "steam-regulator"], 0x7ffdff573148 /* 72 vars */) = 0
execve("/home/yew/.pyenv/shims/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.pyenv/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/repos/secrets/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = -1 ENOENT (No such file or directory)
execve("/home/yew/.nvm/versions/node/v18.20.6/bin/node", ["node", "/home/yew/.nvm/versions/node/v18"..., "test", "steam-regulator"], 0x7ffd64f878c8 /* 72 vars */) = 0
[pid 49307] execve("/bin/sh", ["/bin/sh", "-c", "backstage-cli repo test resource"...], 0x3d17d6d0 /* 156 vars */
[pid 49307] <... execve resumed>) = 0
[pid 49308] execve("/home/yew/cloudflare/repos/backstage/node_modules/.bin/backstage-cli", ["backstage-cli", "repo", "test", "steam-regulator"], 0x5e7ef80051d8 /* 156 vars */
[pid 49308] <... execve resumed>) = 0
[pid 49308] execve("/tmp/yarn--1742459197616-0.9027914591640542/node", ["node", "/home/yew/cloudflare/repos/backs"..., "repo", "test", "steam-regulator"], 0x7ffcc18af270 /* 156 vars */) = 0
🌈 ~ zgrep execve trace.log.gz | wc -l
2254
Phew, 2,000 is a lot of execves. Let’s get the unique ones, plus their counts:
🌈 ~ zgrep -oP '(?<=execve\(")[^"]+' trace.log.gz | xargs -L1 basename | sort | uniq -c | sort -nr
576 watchman
576 hg
368 sl
358 git
16 sl.actual
14 node
2 sh
1 yarn
1 backstage-cli
Have you spotted the Steam Locomotive yet? I spotted it immediately because this is My Own System and Surely This Means I Am Perfectly Aware Of Everything That Is Installed Unlike, er, node_modules.
sl is actually a fun little joke program from 1993 that plays on users’ tendencies to make a typo on ls. When sl runs, it clears your terminal to make way for an animated steam locomotive to come chugging through.
                      (  ) (@@) ( )  (@)  ()    @@    O     @     O     @      O
                 (@@@)
             (    )
          (@@@@)
       (   )

      ====        ________                ___________
  _D _|  |_______/        \__I_I_____===__|_________|
   |(_)---  |   H\________/ |   |        =|___ ___|      _________________
   /     |  |   H  |  |     |   |         ||_| |_||     _|                \_____A
  |      |  |   H  |__--------------------| [___] |   =|                        |
  | ________|___H__/__|_____/[][]~\_______|       |   -|                        |
  |/ |   |-----------I_____I [][] []  D   |=======|____|________________________|_
__/ =| o |=-~~\  /~~\  /~~\  /~~\ ____Y___________|__|__________________________|_
 |/-=|___|=O=====O=====O=====O   |_____/~\___/          |_D__D__D_|  |_D__D__D_|
  \_/      \__/  \__/  \__/  \__/      \_/               \_/   \_/    \_/   \_/
When I first saw that Jest was running sl
so many times, my first thought was to ask my colleague if sl
is a valid command on his Mac, and of course it is not. After all, which serious engineer would stuff their machine full of silly commands like sl
, gti
, cowsay
, or toilet
? The next thing I tried was to rename sl
to something else, and sure enough all my problems disappeared: yarn test
started working perfectly.
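For anyone who wants to replicate the workaround, it amounted to something like this (assuming the Debian location of /usr/games/sl; the .actual name is the same one the wrapper script further down relies on):
sudo mv /usr/games/sl /usr/games/sl.actual
hash -r    # make the shell forget its cached lookup of `sl`
yarn test  # chugs along happily now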
So what does Jest have to do with Steam Locomotives?
Nothing, that’s what. The whole affair is an unfortunate naming clash between sl
the Steam Locomotive and sl
the Sapling CLI. Jest wanted sl
the source control system, but ended up getting steam-rolled by sl
the Steam Locomotive.
Fortunately the devs took it in good humor, and made a (still unreleased) fix. Check out the train memes!
At this point the main story has ended. However, there are still some unresolved nagging questions, like…
How did the crash arrive at the magic number of a relatively even 27 seconds?
I don’t know. Actually I’m not sure if a forked child executing sl
still has a terminal anymore, but the travel time of the train does depend on the terminal width. The wider it is, the longer it takes:
🌈 ~ tput cols
425
🌈 ~ time sl
sl 0.19s user 0.06s system 1% cpu 20.629 total
🌈 ~ tput cols
58
🌈 ~ time sl
sl 0.03s user 0.01s system 0% cpu 5.695 total
So the first thing I tried was to run yarn test in a ridiculously narrow terminal and see what happens:
Determin
ing test
suites
to run..
.
● Test
suite f
ailed to
run
thrown:
[Error]
error Co
mmand fa
iled wit
h exit c
ode 1.
info Vis
it https
://yarnp
kg.com/e
n/docs/c
li/run f
or docum
entation
about t
his comm
and.
yarn tes
t 1.92s
user 0.
67s syst
em 9% cp
u 27.088
total
🌈 back
stage [m
aster] t
put cols
8
Alas, the terminal width doesn’t affect jest at all. Jest calls sl via execa
so let’s mock that up locally:
🌈 choochoo cat runSl.mjs
import {execa} from 'execa';
const { stdout } = await execa('tput', ['cols']);
console.log('terminal colwidth:', stdout);
await execa('sl', ['root']);
🌈 choochoo time node runSl.mjs
terminal colwidth: 80
node runSl.mjs 0.21s user 0.06s system 4% cpu 6.730 total
So execa
uses the default terminal width of 80, which takes the train 6.7 seconds to cross. And 27 seconds divided by 6.7 is awfully close to 4. So is Jest running sl
4 times? Let’s do a poor man’s bpftrace by hooking into sl
like so:
#!/bin/bash
uniqid=$RANDOM
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started" >> /home/yew/executed.log
/usr/games/sl.actual "$@"
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid ended" >> /home/yew/executed.log
And if we check executed.log
, sl
is indeed executed in 4 waves, albeit by 5 workers simultaneously in each wave:
#wave1
2025-03-20 13:23:57.125482563 21049 started
2025-03-20 13:23:57.127526987 21666 started
2025-03-20 13:23:57.131099388 4897 started
2025-03-20 13:23:57.134237754 102 started
2025-03-20 13:23:57.137091737 15733 started
#wave1 ends, wave2 starts
2025-03-20 13:24:03.704588580 21666 ended
2025-03-20 13:24:03.704621737 21049 ended
2025-03-20 13:24:03.707780748 4897 ended
2025-03-20 13:24:03.712086346 15733 ended
2025-03-20 13:24:03.711953000 102 ended
2025-03-20 13:24:03.714831149 18018 started
2025-03-20 13:24:03.721293279 23293 started
2025-03-20 13:24:03.724600164 27918 started
2025-03-20 13:24:03.729763900 15091 started
2025-03-20 13:24:03.733176122 18473 started
#wave2 ends, wave3 starts
2025-03-20 13:24:10.294286746 18018 ended
2025-03-20 13:24:10.297261754 23293 ended
2025-03-20 13:24:10.300925031 27918 ended
2025-03-20 13:24:10.300950334 15091 ended
2025-03-20 13:24:10.303498710 24873 started
2025-03-20 13:24:10.303980494 18473 ended
2025-03-20 13:24:10.308560194 31825 started
2025-03-20 13:24:10.310595182 18452 started
2025-03-20 13:24:10.314222848 16121 started
2025-03-20 13:24:10.317875812 30892 started
#wave3 ends, wave4 starts
2025-03-20 13:24:16.883609316 24873 ended
2025-03-20 13:24:16.886708598 18452 ended
2025-03-20 13:24:16.886867725 31825 ended
2025-03-20 13:24:16.890735338 16121 ended
2025-03-20 13:24:16.893661911 21975 started
2025-03-20 13:24:16.898525968 30892 ended
#crash imminent! wave4 ending, wave5 starting...
2025-03-20 13:24:23.474925807 21975 ended
The logs were emitted over a span of about 26.35 seconds, which is close to 27. It probably crashed just as wave4 was reporting back. And each wave lasted about 6.7 seconds: four waves at 6.7 seconds each comes to roughly 26.8 seconds, right on the money with the manual measurement.
So why is Jest running sl in 4 waves? Why did it crash at the start of the 5th wave?
Let’s again modify the poor man’s bpftrace to also log the args and working directory:
echo "$(date --utc +"%Y-%m-%d %H:%M:%S.%N") $uniqid started: $@ at $PWD" >> /home/yew/executed.log
From the results we can see that the 5 workers are busy executing sl root, which corresponds to the getRoot() function in jest-changed-files/sl.ts:
2025-03-21 05:50:22.663263304 started: root at /home/yew/cloudflare/repos/backstage/packages/app/src
2025-03-21 05:50:22.665550470 started: root at /home/yew/cloudflare/repos/backstage/packages/backend/src
2025-03-21 05:50:22.667988509 started: root at /home/yew/cloudflare/repos/backstage/plugins/access/src
2025-03-21 05:50:22.671781519 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-components/src
2025-03-21 05:50:22.673690514 started: root at /home/yew/cloudflare/repos/backstage/plugins/backstage-entities/src
2025-03-21 05:50:29.247573899 started: root at /home/yew/cloudflare/repos/backstage/plugins/catalog-types-common/src
2025-03-21 05:50:29.251173536 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects/src
2025-03-21 05:50:29.255263605 started: root at /home/yew/cloudflare/repos/backstage/plugins/cross-connects-backend/src
2025-03-21 05:50:29.257293780 started: root at /home/yew/cloudflare/repos/backstage/plugins/pingboard-backend/src
2025-03-21 05:50:29.260285783 started: root at /home/yew/cloudflare/repos/backstage/plugins/resource-insights/src
2025-03-21 05:50:35.823374079 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-gaia/src
2025-03-21 05:50:35.825418386 started: root at /home/yew/cloudflare/repos/backstage/plugins/scaffolder-backend-module-r2/src
2025-03-21 05:50:35.829963172 started: root at /home/yew/cloudflare/repos/backstage/plugins/security-scorecard-dash/src
2025-03-21 05:50:35.832597778 started: root at /home/yew/cloudflare/repos/backstage/plugins/slo-directory/src
2025-03-21 05:50:35.834631869 started: root at /home/yew/cloudflare/repos/backstage/plugins/software-excellence-dashboard/src
2025-03-21 05:50:42.404063080 started: root at /home/yew/cloudflare/repos/backstage/plugins/teamcity/src
The 16 entries here correspond neatly to the 16 rootDirs configured in Jest for Cloudflare’s backstage. We have 5 trains, and we want to visit 16 stations, so let’s do some simple math: 16/5.0 = 3.2, which means our trains need to go back and forth at least 4 times to cover them all.
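Or, in shell arithmetic (ceiling division, because a leftover station still costs a full trip):
echo $(( (16 + 5 - 1) / 5 ))   # => 4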
Final mystery: Why did it crash?
Let’s go back to the very start of our journey. The original [Error] thrown was actually from here, and after modifying node_modules/jest-changed-files/index.js, I found that the error is shortMessage: 'Command failed with ENAMETOOLONG: sl status...', and the reason why became clear when I interrogated Jest about what it thinks the repos are.
While the git repo is what you’d expect, the sl “repo” looks amazingly like a train wreck in motion:
got repos.git as Set(1) { '/home/yew/cloudflare/repos/backstage' }
got repos.sl as Set(1) {
'\x1B[?1049h\x1B[1;24r\x1B[m\x1B(B\x1B[4l\x1B[?7h\x1B[?25l\x1B[H\x1B[2J\x1B[15;80H_\x1B[15;79H_\x1B[16d|\x1B[9;80H_\x1B[12;80H|\x1B[13;80H|\x1B[14;80H|\x1B[15;78H__/\x1B[16;79H|/\x1B[17;80H\\\x1B[9;
79H_D\x1B[10;80H|\x1B[11;80H/\x1B[12;79H|\x1B[K\x1B[13d\b|\x1B[K\x1B[14d\b|/\x1B[15;1H\x1B[1P\x1B[16;78H|/-\x1B[17;79H\\_\x1B[9;1H\x1B[1P\x1B[10;79H|(\x1B[11;79H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
_\x1B[14;1H\x1B[1P\x1B[15;76H__/ =\x1B[16;77H|/-=\x1B[17;78H\\_/\x1B[9;77H_D _\x1B[10;78H|(_\x1B[11;78H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\x1B[14;77H"https://blog.cloudflare.com/"\x1B[15;75H__/
=|\x1B[16;76H|/-=|\x1B[17;1H\x1B[1P\x1B[8;80H=\x1B[9;76H_D _|\x1B[10;77H|(_)\x1B[11;77H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
_\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;75H|/-=|_\x1B[17;1H\x1B[1P\x1B[8;79H=\r\x1B[9d\x1B[1P\x1B[10;76H|(_)-\x1B[11;76H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b| _\r\x1B[14d\x1B[1P\x1B[15;73H__/ =|
o\x1B[16;74H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;78H=\r\x1B[9d\x1B[1P\x1B[10;75H|(_)-\x1B[11;75H/\x1B[K\x1B[12d\b\b|\x1B[K\x1B[13d\b|
_\r\x1B[14d\x1B[1P\x1B[15d\x1B[1P\x1B[16;73H|/-=|_\r\x1B[17d\x1B[1P\x1B[8;77H=\x1B[9;73H_D _| |\x1B[10;74H|(_)-\x1B[11;74H/ |\x1B[12;73H| |\x1B[13;73H| _\x1B[14;73H"https://blog.cloudflare.com/" |\x1B[15;71H__/
=| o |\x1B[16;72H|/-=|___|\x1B[17;1H\x1B[1P\x 1B[5;79H(@\x1B[7;77H(\r\x1B[8d\x1B[1P\x1B[9;72H_D _| |_\x1B[10;1H\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;72H| _\x1B[14;72H"https://blog.cloudflare.com/" |-\x1B[15;70H__/
=| o |=\x1B[16;71H|/-=|___|=\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;71H_D _| |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;71H| _\x1B[14;71H"https://blog.cloudflare.com/" |-\x1B[15;69H__/ =| o
|=-\x1B[16;70H|/-=|___|=O\x1B[17;71H\\_/ \\\x1B[8;1H\x1B[1P\x1B[9;70H_D _| |_\x1B[10;71H|(_)--- |\x1B[11;71H/ | |\x1B[12;70H| | |\x1B[13;70H| _\x1B[80G|\x1B[14;70H"https://blog.cloudflare.com/"
|-\x1B[15;68H__/ =| o |=-~\x1B[16;69H|/-=|___|=\x1B[K\x1B[17;70H\\_/ \\O\x1B[8;1H\x1B[1P\x1B[9;69H_D _| |_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;69H| _\x1B[79G|_\x1B[14;69H"https://blog.cloudflare.com/"
|-\x1B[15;67H__/ =| o |=-~\r\x1B[16d\x1B[1P\x1B[17;69H\\_/ \\_\x1B[4d\b\b(@@\x1B[5;75H( )\x1B[7;73H(@@@)\r\x1B[8d\x1B[1P\x1B[9;68H_D _|
|_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;68H| _\x1B[78G|_\x1B[14;68H"https://blog.cloudflare.com/" |-\x1B[15;66H__/ =| o |=-~~\\\x1B[16;67H|/-=|___|= O\x1B[17;68H\\_/ \\__/\x1B[8;1H\x1B[1P\x1B[9;67H_D _|
|_\r\x1B[10d\x1B[1P\x1B[11d\x1B[1P\x1B[12d\x1B[1P\x1B[13;67H| _\x1B[77G|_\x1B[14;67H"https://blog.cloudflare.com/" |-\x1B[15;65H__/ =| o |=-~O==\x1B[16;66H|/-=|___|= |\x1B[17;1H\x1B[1P\x1B[8d\x1B[1P\x1B[9;66H_D _|
|_\x1B[10;67H|(_)--- | H\x1B[11;67H/ | | H\x1B[12;66H| | | H\x1B[13;66H| _\x1B[76G|___H\x1B[14;66H"https://blog.cloudflare.com/" |-\x1B[15;64H__/ =| o |=-O==\x1B[16;65H|/-=|___|=
|\r\x1B[17d\x1B[1P\x1B[8d\x1B[1P\x1B[9;65H_D _| |_\x1B[80G/\x1B[10;66H|(_)--- | H\\\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;65H| _\x1B[75G|___H_\x1B[14;65H"https://blog.cloudflare.com/" |-\x1B[15;63H__/ =| o |=-~~\\
/\x1B[16;64H|/-=|___|=O=====O\x1B[17;65H\\_/ \\__/ \\\x1B[1;4r\x1B[4;1H\n' + '\x1B[1;24r\x1B[4;74H( )\x1B[5;71H(@@@@)\x1B[K\x1B[7;69H( )\x1B[K\x1B[8;68H====
\x1B[80G_\x1B[9;1H\x1B[1P\x1B[10;65H|(_)--- | H\\_\x1B[11;1H\x1B[1P\x1B[12d\x1B[1P\x1B[13;64H| _\x1B[74G|___H_\x1B[14;64H"https://blog.cloudflare.com/" |-\x1B[15;62H__/ =| o |=-~~\\ /~\x1B[16;63H|/-=|___|=
||\x1B[K\x1B[17;64H\\_/ \\O=====O\x1B[8;67H==== \x1B[79G_\r\x1B[9d\x1B[1P\x1B[10;64H|(_)--- | H\\_\x1B[11;64H/ | | H |\x1B[12;63H| | | H |\x1B[13;63H|
_\x1B[73G|___H__/\x1B[14;63H"https://blog.cloudflare.com/" |-\x1B[15;61H__/ =| o |=-~~\\ /~\r\x1B[16d\x1B[1P\x1B[17;63H\\_/ \\_\x1B[8;66H==== \x1B[78G_\r\x1B[9d\x1B[1P\x1B[10;63H|(_)--- |
H\\_\r\x1B[11d\x1B[1P\x1B[12;62H| | | H |_\x1B[13;62H| _\x1B[72G|___H__/_\x1B[14;62H"https://blog.cloudflare.com/" |-\x1B[15;60H__/ =| o |=-~~\\ /~~\\\x1B[16;61H|/-=|___|= O=====O\x1B[17;62H\\_/ \\__/
\\__/\x1B[8;65H==== \x1B[77G_\r\x1B[9d\x1B[1P\x1B[10;62H|(_)--- | H\\_\r\x1B[11d\x1B[1P\x1B[12;61H| | | H |_\x1B[13;61H| _\x1B[71G|___H__/_\x1B[14;61H"https://blog.cloudflare.com/" |-\x1B[80GI\x1B[15;59H__/ =|
o |=-~O=====O==\x1B[16;60H|/-=|___|= || |\x1B[17;1H\x1B[1P\x1B[2;79H(@\x1B[3;74H( )\x1B[K\x1B[4;70H(@@@@)\x1B[K\x1B[5;67H( )\x1B[K\x1B[7;65H(@@@)\x1B[K\x1B[8;64H====
\x1B[76G_\r\x1B[9d\x1B[1P\x1B[10;61H|(_)--- | H\\_\x1B[11;61H/ | | H | |\x1B[12;60H| | | H |__-\x1B[13;60H| _\x1B[70G|___H__/__|\x1B[14;60H"https://blog.cloudflare.com/" |-\x1B[79GI_\x1B[15;58H__/ =| o
|=-O=====O==\x1B[16;59H|/-=|___|= || |\r\x1B[17d\x1B[1P\x1B[8;63H==== \x1B[75G_\r\x1B[9d\x1B[1P\x1B[10;60H|(_)--- | H\\_\r\x1B[11d\x1B[1P\x1B[12;59H| | | H |__-\x1B[13;59H|
_\x1B[69G|___H__/__|_\x1B[14;59H"https://blog.cloudflare.com/" |-\x1B[78GI_\x1B[15;57H__/ =| o |=-~~\\ /~~\\ /\x1B[16;58H|/-=|___|=O=====O=====O\x1B[17;59H\\_/ \\__/ \\__/ \\\x1B[8;62H====
\x1B[74G_\r\x1B[9d\x1B[1P\x1B[10;59H|(_)--- | H\\_\r\x1B | | H |__-\x1B[13;58H| _\x1B[68G|___H__/__|_\x1B[14;58H"https://blog.cloudflare.com/" |-\x1B[77GI_\x1B[15;56H__/ =| o |=-~~\\ /~~\\ /~\x1B[16;57H|/-=|___|=
|| ||\x1B[K\x1B[17;58H\\_/ \\O=====O=====O\x1B[8;61H==== \x1B[73G_\r\x1B[9d\x1B[1P\x1B[10;58H|(_)--- _\x1B[67G|___H__/__|_\x1B[14;57H"https://blog.cloudflare.com/" |-\x1B[76GI_\x1B[15;55H__/ =| o |=-~~\\ /~~\\
/~\r\x1B[16d\x1B[1P\x1B[17;57H\\_/ \\_\x1B[2;75H( ) (\x1B[3;70H(@@@)\x1B[K\x1B[4;66H()\x1B[K\x1B[5;63H(@@@@)\x1B[
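How did the animation become the “repo” in the first place? Here is a minimal sketch of the failure mode, assuming the general jest-changed-files pattern of running sl root and then sl status inside whatever it returns; getSlRoot is my name for illustration, not Jest’s:

// wreck.mjs: minimal repro sketch, not jest-changed-files' actual code
import {execa} from 'execa';

async function getSlRoot(cwd) {
  // With Sapling, stdout is the repo path. With the Steam Locomotive,
  // stdout is the entire ANSI animation, captured verbatim.
  const {stdout} = await execa('sl', ['root'], {cwd});
  return stdout;
}

const root = await getSlRoot('/home/yew/cloudflare/repos/backstage');
try {
  // Passing kilobytes of escape codes as a working directory is how
  // sl status dies with ENAMETOOLONG.
  await execa('sl', ['status'], {cwd: root});
} catch (err) {
  console.log(err.shortMessage); // Command failed with ENAMETOOLONG: sl status ...
}

With real Sapling, the first call returns a single path. With the Steam Locomotive answering instead, the “root” is several kilobytes of escape codes, and the OS refuses it as a path.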
Thank you to my colleagues Mengnan Gong and Shuhao Zhang, whose ideas and perspectives helped narrow down the root causes of this mystery.
If you enjoy troubleshooting weird and tricky production issues, our engineering teams are hiring.