Skip to content

Conversation

@garlick
Copy link
Member

@garlick garlick commented Oct 4, 2025

Problem: the broker moves message to and from the overlay code using functions and callbacks, which won't work if the overlay code moves to a built-in module.

Change the interface to a message channel consisting of back-to-back interthread:// handles which can work when the ends of the channel are in different threads. Messages sent by the broker to this channel go out on the overlay. Messages received on the overlay are selectively sent to the broker on this channel.

This change also moves some overlay routing decisions from the broker to the overlay code. Having a more clearly demarcated interface may improve maintainability of both components.

Adapt overlay unit test.

Marking as a WIP pending possible test enhancements.

@garlick garlick force-pushed the overlay_msgchan branch 2 times, most recently from dafc4e3 to 37f7d12 Compare October 9, 2025 22:00
@garlick
Copy link
Member Author

garlick commented Oct 9, 2025

Rebased on current master. I Got two test failures:

  • bookworm/gcc-12,distcheck - sqlite got a permissions error at open. I assume this is just noise.
  • inception - it's a bit hard to tell
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒDNZ1Q1 runner   sh          F      1      1    32.7s e6cfb7bd6b61

2025-10-09T22:22:50.3872618Z �[0;0mOct 09 22:22:50.386904 UTC 2025 broker.err[0]: rc2.0: ./test-inception.sh Exited (rc=137) 1108.2s

That failing job ID seems to be associated with t0012-content-sqlite, based on the fact that it appears in some test output:

2025-10-09T22:08:27.5650168Z 2025-10-09T22:08:27.227541Z content-sqlite.debug[0]: /tmp/flux-XRxulJ/jobtmp-0-ƒDNZ1Q1/trash-directory.t0012-content-sqlite/content.sqlite (0 objects) journal_mode=WAL synchronous=NORMAL

The last output from that test is:

2025-10-09T22:08:28.4656336Z �[36mok 56 - run flux with statedir and verify config�(B�[m
2025-10-09T22:08:28.4656710Z 
2025-10-09T22:08:28.4659266Z run flux with statedir and verify config
2025-10-09T22:08:28.4659580Z 
2025-10-09T22:08:28.4659780Z 
2025-10-09T22:08:28.4659787Z 
2025-10-09T22:08:28.4660706Z �[36mexpecting success: 
2025-10-09T22:08:28.4660930Z 
2025-10-09T22:08:28.4661306Z 	flux start -Sbroker.rc1_path= -Sbroker.rc3_path= \
2025-10-09T22:08:28.4661660Z 
2025-10-09T22:08:28.4661795Z 	    "flux module load content; \
2025-10-09T22:08:28.4662041Z 
2025-10-09T22:08:28.4662235Z 	    flux module load content-sqlite journal_mode=FOO; \
2025-10-09T22:08:28.4662588Z 
2025-10-09T22:08:28.4662840Z 	    flux module remove content" > invalid1.out 2> invalid1.err &&
2025-10-09T22:08:28.4663227Z 
2025-10-09T22:08:28.4663606Z 	grep "content-sqlite: Invalid argument" invalid1.err

which is not the last test in the script. So maybe the test hung and there was a timeout?
But I don't see in the inception script that any timeout is imposed.

Puzzling.

Interesting that both failures are content-sqlite related.

@grondo
Copy link
Contributor

grondo commented Oct 10, 2025

You can try running in the same mode as the inception test, i.e. run each test as a job in a Flux instance, and see if anything reproduces?

       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒDNZ1Q1 runner   sh          F      1      1    32.7s e6cfb7bd6b61

That output is supposed to helpfully show the failing test name, not just sh. 🤦 Will propose a fix at some point.

The log for the failing jobs is also output to a <name>.log file, which are supposed to be redirected to output in the annotation phase, but that didn't happen here for some reason.

garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entring GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as an increased follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entring GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as a reduced follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entring GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as a reduced follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entering GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as a reduced follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entering GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as a reduced follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
garlick added a commit to garlick/flux-core that referenced this pull request Oct 15, 2025
Problem: sharness tests occasionally hang during shutdown and
are killed by the flux-start exit timeout.

Failures were chiefly observed during development of
flux-framework#7112, a PR which refactors the overlay
code and likely changes some timing.  A particular sharness test,
t0035-content-sqlite-checkpoint.t, which uses "minimal"
personality (size=2) and thus progresses rapidly through the
final states on broker 0, began to fail a lot.  However, I think
we have seen this failure mode in CI intermittently on a variety
of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake
with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entering GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over
to zeromq, most likely the zeromq teardown is somehow aborting the
in-flight message.  However, there is protection for this in the
ZMQ_LINGER socket option, which we set to 5 on the downstream socket.
This option causes zmq_close() to block for up to the specified number
of milliseconds to allow in-flight messages to be delivered.  The time
is a little short, but increasing it to a large value such as 2000
experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately
after the response is sent, in the state machine action callback
for FINALIZE (rc3) state.  I did not think this call was disruptive
to in-flight messages, but removing it or adding a brief sleep()
call right before it resolved the issue.  Note: the failures were
easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277
in an attempt to prevent follower brokers from reconnecting to the leader
broker of a system instance in FINALIZE (rc3) state.  The primary issue
was log noise.  However, in 2024 during El Capitan integration, it was
found that clients reconnecting during CLEANUP state were also a nuisance
and and a shutdown flag was added in flux-framework#5883 that
causes overlay.hello requests to be quietly rejected in CLEANUP state
and beyond.

The more recent measure, along with other measures added during El Capitan
integration, such as a reduced follower respawn rate, should be
sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
@garlick garlick force-pushed the overlay_msgchan branch 2 times, most recently from f646dd9 to 94d4c24 Compare October 15, 2025 03:14
Problem: the broker moves message to and from the overlay
code using functions and callbacks, which won't work if the
overlay code moves to a built-in module.

Change the interface to a message channel consisting of
back-to-back interthread:// handles which can work when
the ends of the channel are in different threads.
Messages sent by the broker to this channel go out on
the overlay.  Messages received on the overlay are
selectively sent to the broker on this channel.

This change also moves some overlay routing decisions
from the broker to the overlay code.  Having a more clearly
demarcated interface may improve maintainability of
both components.

Adapt overlay unit test.
Problem: it may be interesting to observe the amount of message
backlog in the overlay's new interthread message channel when
a broker is under load.

Add send and receive queue counts to 'flux module stats overlay'
output.
@codecov
Copy link

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 81.13208% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.83%. Comparing base (807e874) to head (89eb9d2).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/broker/overlay.c 78.94% 16 Missing ⚠️
src/broker/broker.c 86.20% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7112      +/-   ##
==========================================
- Coverage   83.85%   83.83%   -0.02%     
==========================================
  Files         551      551              
  Lines       93397    93428      +31     
==========================================
+ Hits        78320    78330      +10     
- Misses      15077    15098      +21     
Files with missing lines Coverage Δ
src/broker/modhash.c 77.14% <100.00%> (ø)
src/broker/broker.c 76.30% <86.20%> (-0.20%) ⬇️
src/broker/overlay.c 82.12% <78.94%> (+0.34%) ⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants