WIP: broker: convert to interthread overlay channel #7112

garlick · 2025-10-04T23:26:27Z

Problem: the broker moves message to and from the overlay code using functions and callbacks, which won't work if the overlay code moves to a built-in module.

Change the interface to a message channel consisting of back-to-back interthread:// handles which can work when the ends of the channel are in different threads. Messages sent by the broker to this channel go out on the overlay. Messages received on the overlay are selectively sent to the broker on this channel.

This change also moves some overlay routing decisions from the broker to the overlay code. Having a more clearly demarcated interface may improve maintainability of both components.

Adapt overlay unit test.

Marking as a WIP pending possible test enhancements.

garlick · 2025-10-09T23:38:19Z

Rebased on current master. I Got two test failures:

bookworm/gcc-12,distcheck - sqlite got a permissions error at open. I assume this is just noise.
inception - it's a bit hard to tell

       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒDNZ1Q1 runner   sh          F      1      1    32.7s e6cfb7bd6b61

2025-10-09T22:22:50.3872618Z �[0;0mOct 09 22:22:50.386904 UTC 2025 broker.err[0]: rc2.0: ./test-inception.sh Exited (rc=137) 1108.2s

That failing job ID seems to be associated with t0012-content-sqlite, based on the fact that it appears in some test output:

2025-10-09T22:08:27.5650168Z 2025-10-09T22:08:27.227541Z content-sqlite.debug[0]: /tmp/flux-XRxulJ/jobtmp-0-ƒDNZ1Q1/trash-directory.t0012-content-sqlite/content.sqlite (0 objects) journal_mode=WAL synchronous=NORMAL

The last output from that test is:

2025-10-09T22:08:28.4656336Z �[36mok 56 - run flux with statedir and verify config�(B�[m
2025-10-09T22:08:28.4656710Z 
2025-10-09T22:08:28.4659266Z run flux with statedir and verify config
2025-10-09T22:08:28.4659580Z 
2025-10-09T22:08:28.4659780Z 
2025-10-09T22:08:28.4659787Z 
2025-10-09T22:08:28.4660706Z �[36mexpecting success: 
2025-10-09T22:08:28.4660930Z 
2025-10-09T22:08:28.4661306Z 	flux start -Sbroker.rc1_path= -Sbroker.rc3_path= \
2025-10-09T22:08:28.4661660Z 
2025-10-09T22:08:28.4661795Z 	    "flux module load content; \
2025-10-09T22:08:28.4662041Z 
2025-10-09T22:08:28.4662235Z 	    flux module load content-sqlite journal_mode=FOO; \
2025-10-09T22:08:28.4662588Z 
2025-10-09T22:08:28.4662840Z 	    flux module remove content" > invalid1.out 2> invalid1.err &&
2025-10-09T22:08:28.4663227Z 
2025-10-09T22:08:28.4663606Z 	grep "content-sqlite: Invalid argument" invalid1.err

which is not the last test in the script. So maybe the test hung and there was a timeout?
But I don't see in the inception script that any timeout is imposed.

Puzzling.

Interesting that both failures are content-sqlite related.

grondo · 2025-10-10T00:12:42Z

You can try running in the same mode as the inception test, i.e. run each test as a job in a Flux instance, and see if anything reproduces?

       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒDNZ1Q1 runner   sh          F      1      1    32.7s e6cfb7bd6b61

That output is supposed to helpfully show the failing test name, not just sh. 🤦 Will propose a fix at some point.

The log for the failing jobs is also output to a <name>.log file, which are supposed to be redirected to output in the annotation phase, but that didn't happen here for some reason.

Problem: sharness tests occasionally hang during shutdown and are killed by the flux-start exit timeout. Failures were chiefly observed during development of flux-framework#7112, a PR which refactors the overlay code and likely changes some timing. A particular sharness test, t0035-content-sqlite-checkpoint.t, which uses "minimal" personality (size=2) and thus progresses rapidly through the final states on broker 0, began to fail a lot. However, I think we have seen this failure mode in CI intermittently on a variety of different tests, so this change is proposed as a general fix. When the hang occurs, it is observed that the final handshake with broker rank 1 is not completed: - broker 1 sends a goodbye request upon entring GOODBYE state - broker 0 receives the goodbye request - broker 0 sends a goodbye response - broker 1 never receives the response - broker 1 remains in GOODBYE state - broker 1 is killed by flux-start's exit timeout Since broker 0 rapidly finalizes after handing the response over to zeromq, most likely the zeromq teardown is somehow aborting the in-flight message. However, there is protection for this in the ZMQ_LINGER socket option, which we set to 5 on the downstream socket. This option causes zmq_close() to block for up to the specified number of milliseconds to allow in-flight messages to be delivered. The time is a little short, but increasing it to a large value such as 2000 experimentally did not impact the test failure rate. There is also a call to zmq_unbind(3) that occurs immediately after the response is sent, in the state machine action callback for FINALIZE (rc3) state. I did not think this call was disruptive to in-flight messages, but removing it or adding a brief sleep() call right before it resolved the issue. Note: the failures were easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2. The zmq_unbind(3) call was added in 2022 by flux-framework#4277 in an attempt to prevent follower brokers from reconnecting to the leader broker of a system instance in FINALIZE (rc3) state. The primary issue was log noise. However, in 2024 during El Capitan integration, it was found that clients reconnecting during CLEANUP state were also a nuisance and and a shutdown flag was added in flux-framework#5883 that causes overlay.hello requests to be quietly rejected in CLEANUP state and beyond. The more recent measure, along with other measures added during El Capitan integration, such as an increased follower respawn rate, should be sufficient to address the log noise issue without zmq_unbind(3). Therefore, drop the zmq_unbind(3) call and avoid the race.

Problem: sharness tests occasionally hang during shutdown and are killed by the flux-start exit timeout. Failures were chiefly observed during development of flux-framework#7112, a PR which refactors the overlay code and likely changes some timing. A particular sharness test, t0035-content-sqlite-checkpoint.t, which uses "minimal" personality (size=2) and thus progresses rapidly through the final states on broker 0, began to fail a lot. However, I think we have seen this failure mode in CI intermittently on a variety of different tests, so this change is proposed as a general fix. When the hang occurs, it is observed that the final handshake with broker rank 1 is not completed: - broker 1 sends a goodbye request upon entring GOODBYE state - broker 0 receives the goodbye request - broker 0 sends a goodbye response - broker 1 never receives the response - broker 1 remains in GOODBYE state - broker 1 is killed by flux-start's exit timeout Since broker 0 rapidly finalizes after handing the response over to zeromq, most likely the zeromq teardown is somehow aborting the in-flight message. However, there is protection for this in the ZMQ_LINGER socket option, which we set to 5 on the downstream socket. This option causes zmq_close() to block for up to the specified number of milliseconds to allow in-flight messages to be delivered. The time is a little short, but increasing it to a large value such as 2000 experimentally did not impact the test failure rate. There is also a call to zmq_unbind(3) that occurs immediately after the response is sent, in the state machine action callback for FINALIZE (rc3) state. I did not think this call was disruptive to in-flight messages, but removing it or adding a brief sleep() call right before it resolved the issue. Note: the failures were easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2. The zmq_unbind(3) call was added in 2022 by flux-framework#4277 in an attempt to prevent follower brokers from reconnecting to the leader broker of a system instance in FINALIZE (rc3) state. The primary issue was log noise. However, in 2024 during El Capitan integration, it was found that clients reconnecting during CLEANUP state were also a nuisance and and a shutdown flag was added in flux-framework#5883 that causes overlay.hello requests to be quietly rejected in CLEANUP state and beyond. The more recent measure, along with other measures added during El Capitan integration, such as a reduced follower respawn rate, should be sufficient to address the log noise issue without zmq_unbind(3). Therefore, drop the zmq_unbind(3) call and avoid the race.

Problem: sharness tests occasionally hang during shutdown and are killed by the flux-start exit timeout. Failures were chiefly observed during development of flux-framework#7112, a PR which refactors the overlay code and likely changes some timing. A particular sharness test, t0035-content-sqlite-checkpoint.t, which uses "minimal" personality (size=2) and thus progresses rapidly through the final states on broker 0, began to fail a lot. However, I think we have seen this failure mode in CI intermittently on a variety of different tests, so this change is proposed as a general fix. When the hang occurs, it is observed that the final handshake with broker rank 1 is not completed: - broker 1 sends a goodbye request upon entering GOODBYE state - broker 0 receives the goodbye request - broker 0 sends a goodbye response - broker 1 never receives the response - broker 1 remains in GOODBYE state - broker 1 is killed by flux-start's exit timeout Since broker 0 rapidly finalizes after handing the response over to zeromq, most likely the zeromq teardown is somehow aborting the in-flight message. However, there is protection for this in the ZMQ_LINGER socket option, which we set to 5 on the downstream socket. This option causes zmq_close() to block for up to the specified number of milliseconds to allow in-flight messages to be delivered. The time is a little short, but increasing it to a large value such as 2000 experimentally did not impact the test failure rate. There is also a call to zmq_unbind(3) that occurs immediately after the response is sent, in the state machine action callback for FINALIZE (rc3) state. I did not think this call was disruptive to in-flight messages, but removing it or adding a brief sleep() call right before it resolved the issue. Note: the failures were easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2. The zmq_unbind(3) call was added in 2022 by flux-framework#4277 in an attempt to prevent follower brokers from reconnecting to the leader broker of a system instance in FINALIZE (rc3) state. The primary issue was log noise. However, in 2024 during El Capitan integration, it was found that clients reconnecting during CLEANUP state were also a nuisance and and a shutdown flag was added in flux-framework#5883 that causes overlay.hello requests to be quietly rejected in CLEANUP state and beyond. The more recent measure, along with other measures added during El Capitan integration, such as a reduced follower respawn rate, should be sufficient to address the log noise issue without zmq_unbind(3). Therefore, drop the zmq_unbind(3) call and avoid the race.

Problem: the broker moves message to and from the overlay code using functions and callbacks, which won't work if the overlay code moves to a built-in module. Change the interface to a message channel consisting of back-to-back interthread:// handles which can work when the ends of the channel are in different threads. Messages sent by the broker to this channel go out on the overlay. Messages received on the overlay are selectively sent to the broker on this channel. This change also moves some overlay routing decisions from the broker to the overlay code. Having a more clearly demarcated interface may improve maintainability of both components. Adapt overlay unit test.

Problem: it may be interesting to observe the amount of message backlog in the overlay's new interthread message channel when a broker is under load. Add send and receive queue counts to 'flux module stats overlay' output.

codecov · 2025-10-20T22:42:39Z

Codecov Report

❌ Patch coverage is 81.13208% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.83%. Comparing base (807e874) to head (89eb9d2).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/broker/overlay.c	78.94%	16 Missing ⚠️
src/broker/broker.c	86.20%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7112      +/-   ##
==========================================
- Coverage   83.85%   83.83%   -0.02%     
==========================================
  Files         551      551              
  Lines       93397    93428      +31     
==========================================
+ Hits        78320    78330      +10     
- Misses      15077    15098      +21

Files with missing lines	Coverage Δ
src/broker/modhash.c	`77.14% <100.00%> (ø)`
src/broker/broker.c	`76.30% <86.20%> (-0.20%)`	⬇️
src/broker/overlay.c	`82.12% <78.94%> (+0.34%)`	⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

garlick force-pushed the overlay_msgchan branch 2 times, most recently from dafc4e3 to 37f7d12 Compare October 9, 2025 22:00

garlick force-pushed the overlay_msgchan branch from 37f7d12 to fbc4f24 Compare October 12, 2025 18:54

garlick mentioned this pull request Oct 13, 2025

ci: brokers are occasionally SIGKILLed #7144

Closed

garlick force-pushed the overlay_msgchan branch from fbc4f24 to 4dfddcc Compare October 14, 2025 01:47

garlick mentioned this pull request Oct 15, 2025

broker: fix potential zmq_unbind() race #7150

Merged

garlick force-pushed the overlay_msgchan branch from 4dfddcc to f1fc696 Compare October 15, 2025 00:27

garlick force-pushed the overlay_msgchan branch 2 times, most recently from f646dd9 to 94d4c24 Compare October 15, 2025 03:14

garlick added 2 commits October 20, 2025 15:01

broker: add interthread queue lengths to stats

89eb9d2

Problem: it may be interesting to observe the amount of message backlog in the overlay's new interthread message channel when a broker is under load. Add send and receive queue counts to 'flux module stats overlay' output.

garlick force-pushed the overlay_msgchan branch from 94d4c24 to 89eb9d2 Compare October 20, 2025 22:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: broker: convert to interthread overlay channel #7112

WIP: broker: convert to interthread overlay channel #7112

Uh oh!

garlick commented Oct 4, 2025

Uh oh!

garlick commented Oct 9, 2025

Uh oh!

grondo commented Oct 10, 2025

Uh oh!

codecov bot commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WIP: broker: convert to interthread overlay channel #7112

Are you sure you want to change the base?

WIP: broker: convert to interthread overlay channel #7112

Uh oh!

Conversation

garlick commented Oct 4, 2025

Uh oh!

garlick commented Oct 9, 2025

Uh oh!

grondo commented Oct 10, 2025

Uh oh!

codecov bot commented Oct 20, 2025

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants