WIP: broker: convert to interthread overlay channel #7112
base: master
dafc4e3 to 37f7d12
Rebased on current master. I got two test failures:
That failing job ID seems to be associated with  The last output from that test is: which is not the last test in the script. So maybe the test hung and there was a timeout? Puzzling. Interesting that both failures are content-sqlite related.

You can try running in the same mode as the inception test, i.e. run each test as a job in a Flux instance, and see if anything reproduces? That output is supposed to helpfully show the failing test name, not just  The log for the failing jobs is also output to a 
37f7d12 to fbc4f24
fbc4f24 to 4dfddcc
Problem: sharness tests occasionally hang during shutdown and are killed by the flux-start exit timeout.

Failures were chiefly observed during development of flux-framework#7112, a PR which refactors the overlay code and likely changes some timing. A particular sharness test, t0035-content-sqlite-checkpoint.t, which uses "minimal" personality (size=2) and thus progresses rapidly through the final states on broker 0, began to fail frequently. However, I think we have seen this failure mode in CI intermittently on a variety of different tests, so this change is proposed as a general fix.

When the hang occurs, it is observed that the final handshake with broker rank 1 is not completed:
- broker 1 sends a goodbye request upon entering GOODBYE state
- broker 0 receives the goodbye request
- broker 0 sends a goodbye response
- broker 1 never receives the response
- broker 1 remains in GOODBYE state
- broker 1 is killed by flux-start's exit timeout

Since broker 0 rapidly finalizes after handing the response over to zeromq, most likely the zeromq teardown is somehow aborting the in-flight message. However, there is protection for this in the ZMQ_LINGER socket option, which we set to 5 on the downstream socket. This option causes zmq_close() to block for up to the specified number of milliseconds to allow in-flight messages to be delivered. The time is a little short, but increasing it to a large value such as 2000 experimentally did not impact the test failure rate.

There is also a call to zmq_unbind(3) that occurs immediately after the response is sent, in the state machine action callback for FINALIZE (rc3) state. I did not think this call was disruptive to in-flight messages, but either removing it or adding a brief sleep() call right before it resolved the issue.

Note: the failures were easily recreated on Ubuntu 22.04 LTS with libzmq3-dev-4.3.4-2.

The zmq_unbind(3) call was added in 2022 by flux-framework#4277 in an attempt to prevent follower brokers from reconnecting to the leader broker of a system instance in FINALIZE (rc3) state. The primary issue was log noise. However, in 2024 during El Capitan integration, it was found that clients reconnecting during CLEANUP state were also a nuisance, and a shutdown flag was added in flux-framework#5883 that causes overlay.hello requests to be quietly rejected in CLEANUP state and beyond. The more recent measure, along with other measures added during El Capitan integration, such as a reduced follower respawn rate, should be sufficient to address the log noise issue without zmq_unbind(3).

Therefore, drop the zmq_unbind(3) call and avoid the race.
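
The ordering problem described in this commit message can be illustrated with a toy, single-process model. This is only a sketch, not Flux or ZeroMQ code: `struct link`, `link_send()`, `link_deliver()`, `link_unbind()`, and `shutdown_handshake()` are hypothetical names, and the rule that unbinding discards undelivered messages is the hypothesis from the commit message, not documented libzmq behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the leader's downstream endpoint (hypothetical, not libzmq). */
struct link {
    bool bound;     /* endpoint is still bound */
    int in_flight;  /* messages handed to the transport, not yet delivered */
    int delivered;  /* messages the follower actually received */
};

static void link_init (struct link *l)
{
    l->bound = true;
    l->in_flight = 0;
    l->delivered = 0;
}

/* Leader hands one message (e.g. the goodbye response) to the transport. */
static void link_send (struct link *l)
{
    l->in_flight++;
}

/* The transport gets a chance to run: in-flight messages reach the follower,
 * but only while the endpoint is still bound. */
static void link_deliver (struct link *l)
{
    assert (l->in_flight >= 0);
    if (l->bound) {
        l->delivered += l->in_flight;
        l->in_flight = 0;
    }
}

/* ASSUMPTION: unbinding drops whatever is still in flight. */
static void link_unbind (struct link *l)
{
    l->bound = false;
    l->in_flight = 0;
}

/* Returns true if the follower receives the goodbye response. */
static bool shutdown_handshake (bool unbind_in_finalize,
                                bool delay_before_unbind)
{
    struct link l;

    link_init (&l);
    link_send (&l);              /* broker 0 sends the goodbye response */
    if (unbind_in_finalize) {
        if (delay_before_unbind)
            link_deliver (&l);   /* a brief sleep lets the transport run */
        link_unbind (&l);        /* FINALIZE action callback */
    }
    link_deliver (&l);           /* transport runs after the state change */
    return l.delivered == 1;
}
```

In this model, `shutdown_handshake (true, false)` loses the response, while dropping the unbind or delaying it lets the response through, which mirrors the two experiments reported above.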
4dfddcc to f1fc696
f646dd9 to 94d4c24
Problem: the broker moves messages to and from the overlay code using functions and callbacks, which won't work if the overlay code moves to a built-in module.

Change the interface to a message channel consisting of back-to-back interthread:// handles which can work when the ends of the channel are in different threads. Messages sent by the broker to this channel go out on the overlay. Messages received on the overlay are selectively sent to the broker on this channel.

This change also moves some overlay routing decisions from the broker to the overlay code. Having a more clearly demarcated interface may improve maintainability of both components.

Adapt overlay unit test.

Problem: it may be interesting to observe the amount of message backlog in the overlay's new interthread message channel when a broker is under load.

Add send and receive queue counts to 'flux module stats overlay' output.
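
As a rough illustration of the idea in these two commit messages, the sketch below models a two-way channel as a pair of mutex-protected queues and reports the backlog in each direction. It does not use the Flux interthread:// connector or any Flux API: `struct channel`, `broker_send()`, `overlay_recv()`, `overlay_send()`, `broker_recv()`, and `channel_stats()` are hypothetical names, plain integers stand in for messages, and thread creation is omitted to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 64

/* One direction of the channel: a bounded FIFO guarded by a mutex. */
struct queue {
    pthread_mutex_t lock;
    int items[QCAP];
    int head;
    int count;
};

/* Hypothetical two-way channel with one queue per direction. */
struct channel {
    struct queue to_overlay;  /* broker -> overlay */
    struct queue to_broker;   /* overlay -> broker */
};

static void queue_init (struct queue *q)
{
    pthread_mutex_init (&q->lock, NULL);
    q->head = 0;
    q->count = 0;
}

static bool queue_push (struct queue *q, int msg)
{
    bool ok = false;
    pthread_mutex_lock (&q->lock);
    if (q->count < QCAP) {
        q->items[(q->head + q->count) % QCAP] = msg;
        q->count++;
        ok = true;
    }
    pthread_mutex_unlock (&q->lock);
    return ok;
}

static bool queue_pop (struct queue *q, int *msg)
{
    bool ok = false;
    pthread_mutex_lock (&q->lock);
    if (q->count > 0) {
        *msg = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        ok = true;
    }
    pthread_mutex_unlock (&q->lock);
    return ok;
}

static int queue_depth (struct queue *q)
{
    pthread_mutex_lock (&q->lock);
    int n = q->count;
    pthread_mutex_unlock (&q->lock);
    return n;
}

static void channel_init (struct channel *ch)
{
    queue_init (&ch->to_overlay);
    queue_init (&ch->to_broker);
}

/* Broker end of the channel. */
static bool broker_send (struct channel *ch, int msg)
{
    return queue_push (&ch->to_overlay, msg);
}
static bool broker_recv (struct channel *ch, int *msg)
{
    return queue_pop (&ch->to_broker, msg);
}

/* Overlay end of the channel. */
static bool overlay_send (struct channel *ch, int msg)
{
    return queue_push (&ch->to_broker, msg);
}
static bool overlay_recv (struct channel *ch, int *msg)
{
    return queue_pop (&ch->to_overlay, msg);
}

/* Backlog seen from the broker end: sent but not yet taken by the overlay,
 * and queued for the broker but not yet received. */
static void channel_stats (struct channel *ch, int *sendq, int *recvq)
{
    *sendq = queue_depth (&ch->to_overlay);
    *recvq = queue_depth (&ch->to_broker);
}
```

Because each end only touches the queues through the locked push and pop helpers, the broker and overlay ends could be driven from different threads; the two depth counters are the kind of backlog figure a stats query could report.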
89eb9d2 was pushed over 94d4c24:
94d4c24 to 89eb9d2
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7112      +/-   ##
==========================================
- Coverage   83.85%   83.83%   -0.02%     
==========================================
  Files         551      551              
  Lines       93397    93428      +31     
==========================================
+ Hits        78320    78330      +10     
- Misses      15077    15098      +21     
Problem: the broker moves messages to and from the overlay code using functions and callbacks, which won't work if the overlay code moves to a built-in module.
Change the interface to a message channel consisting of back-to-back interthread:// handles which can work when the ends of the channel are in different threads. Messages sent by the broker to this channel go out on the overlay. Messages received on the overlay are selectively sent to the broker on this channel.
This change also moves some overlay routing decisions from the broker to the overlay code. Having a more clearly demarcated interface may improve maintainability of both components.
Adapt overlay unit test.
Marking as a WIP pending possible test enhancements.