Conversation

@JulianVentura JulianVentura (Contributor) commented Nov 4, 2024

Aggregator fetch missed batches

Description

This PR fixes the aggregator so it can recover and process missed batches when it receives an operator response for a task it does not know about.

Type of change

  • Bug fix

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators on both versions and does not change batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires operators to already be updated and compatible

Basic flow

On recovery, if the aggregator has missed some new batches, the operators will process them and start sending signed responses to the aggregator. The main idea of this PR is that, once this happens, the aggregator should first check whether the received task is known to it (via its internal map) and, if it isn't, try to fetch it from Ethereum logs. This should be done with retries, since the network may be congested and events may take longer to arrive at certain RPC nodes.

Retry logic

After a response is received, the aggregator checks for the corresponding task in its internal map, with up to 3 retries waiting 1 sec, 2 sec, and 4 sec respectively.
If the task is not found in the internal maps, it will try to fetch the logs. While doing so, several calls are made to Ethereum, each with its own retry logic:

  • Get the current block number: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Filter batch: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Get batch state: Retry times (3 retries): 12 sec (1 block), 24 sec (2 blocks), 48 sec (4 blocks)

How to test

  1. Start all services, including a local explorer and 3 operators.
  2. Shut down the aggregator.
  3. Start sending proofs with make batcher_send_infinite_sp1.
  4. Wait a few minutes and watch the batches accumulate on the explorer as Pending; the longer you wait, the better.
  5. Start the aggregator.
  6. You should see how some of the missed batches get verified, as well as the new ones.

You should bump the operator's MaxRetries in rpc_client.go so the operators keep retrying to send responses to the aggregator.

You may also modify the pending_batch_fetch_block_range config value in config-files/config-aggregator.yaml to test different scenarios. In the same file, you may also bump garbage_collector_period or garbage_collector_tasks_age so batches are not removed before they get verified.

We should stress test this PR to make sure there are no concurrency bugs. To do that, try the following:

  • Use as many operators as possible.
  • While the aggregator is down, wait as long as possible so that many batches (~50 or more) accumulate.
  • Modify the operators' retry variables so they retry indefinitely and very rapidly (try wait times below 1 second).
  • Set the pending_batch_fetch_block_range variable to a large number so all batches are fetched.

Closes #1350

@JulianVentura JulianVentura self-assigned this Nov 4, 2024
Member

@MarcosNicolau MarcosNicolau left a comment


I tested on macOS and Linux and everything worked fine. I left a few minor comments.

for i := 0; i < waitForEventRetries; i++ {
// Lock
agg.taskMutex.Lock()
agg.AggregatorConfig.BaseConfig.Logger.Info("- Locked Resources: Starting processing of Response")
Member

This log is repeated in ProcessOperatorSignedTaskResponseV2.

Contributor Author

Yes, I can change it if that's clearer.

Contributor Author

Done

} else {
break
}
if !agg.waitForTask(signedTaskResponse) {
Member

I find waitForTask a bit weird, primarily because it mutates the map and that isn't very clear. Personally, I would prefer to consolidate everything into a single function (getTask, for example), or at least rename it to something more descriptive, such as verifyTaskInMap or fetchTask.

Contributor Author

What do you mean by consolidating everything into one function? Do you mean deleting waitForTask and including its contents inside ProcessOperatorSignedTaskResponseV2? I think I prefer to have an auxiliary function.
I also find waitForTask not very descriptive, but the same happens with other names. I'll keep thinking of an alternative; tryToFetchTask is my first candidate.

Contributor Author

Done

@Oppen
Contributor

Oppen commented Nov 5, 2024

Looks good, need to test later.

@MauroToscano MauroToscano changed the title fix: aggregator recover missed batches fix: fetch task on task response if not cached Nov 7, 2024
@JulianVentura JulianVentura marked this pull request as draft November 7, 2024 17:46
@JulianVentura JulianVentura marked this pull request as ready for review November 12, 2024 14:07
@JulianVentura JulianVentura changed the title fix: fetch task on task response if not cached fix: (WIP) fetch task on task response if not cached Nov 12, 2024
@JulianVentura JulianVentura force-pushed the fix/aggregator-recover-lost-batches branch from fa1b39f to c8cb8e7 Compare November 13, 2024 14:44
@JulianVentura JulianVentura changed the title fix: (WIP) fetch task on task response if not cached fix: fetch task on task response if not cached Nov 13, 2024
@JulianVentura JulianVentura changed the title fix: fetch task on task response if not cached fix(aggregator): fetch task on task response if not cached Nov 13, 2024
@MauroToscano
Contributor

Need one more test and a review

@JulianVentura JulianVentura force-pushed the fix/aggregator-recover-lost-batches branch from a0cd361 to 99c6c88 Compare November 13, 2024 17:55
Contributor

@avilagaston9 avilagaston9 left a comment

Batches are not being verified after the merge:
[screenshot omitted]

@MarcosNicolau MarcosNicolau self-requested a review November 14, 2024 13:12
Contributor

@avilagaston9 avilagaston9 left a comment

Tested with 3 operators!

Member

@MarcosNicolau MarcosNicolau left a comment

It isn't working for me. I tested it with 4 operators, waited 15 min, and restarted the aggregator. Only a few tasks were verified, not all of them. Then I tried sending new tasks with the network working normally, and the aggregator wouldn't send the aggregated responses.

The task retrieval part does work, though.

@JulianVentura JulianVentura changed the title fix(aggregator): fetch task on task response if not cached fix(aggregator): (WIP) fetch task on task response if not cached Nov 19, 2024
@PatStiles PatStiles force-pushed the fix/aggregator-recover-lost-batches branch from 25c084a to 9392336 Compare November 21, 2024 18:34
Contributor

@uri-99 uri-99 left a comment

Still WIP.
Also missing: avoid fetching logs for the same merkle_root concurrently. Otherwise, you will fetch the log from the blockchain once for every operator, even though they all refer to the same merkle_root.
