Conversation

@JulianVentura JulianVentura (Contributor) commented Nov 4, 2024

Aggregator fetch missed batches

Description

This PR fixes the aggregator so it can recover and process missed batches when it receives an operator response for a task it does not know about.

Type of change

  • Bug fix

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators on both versions and does not change batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires operators to already be updated and compatible

Basic flow

On recovery, if the aggregator has missed some new batches, the operators will process them and start sending signed responses to the aggregator. The main idea of this PR is that, once this happens, the aggregator should first check whether the received task is known to it (via its internal map) and, if it isn't, try to fetch it from Ethereum logs. This should be done with retries, since the network may be congested and events may take longer to arrive at certain RPC nodes.

Retry logic

After a response is received, the aggregator checks for the corresponding task in its internal map, with up to 3 retries waiting 1 sec, 2 sec, and 4 sec respectively.
If the task is not found in the internal maps, it will try to fetch the logs. While doing so, several calls are made to Ethereum, each with its own retry logic:

  • Get the current block number: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Filter batch: Retry times (3 retries): 1 sec, 2 sec, 4 sec.
  • Get batch state: Retry times (3 retries): 12 sec (1 block), 24 sec (2 blocks), 48 sec (4 blocks)

How to test

  1. Start all services, including a local explorer and 3 operators.
  2. Shut down the aggregator.
  3. Start sending proofs with make batcher_send_infinite_sp1.
  4. Wait a few minutes and watch the batches accumulate on the explorer as Pending; the longer you wait, the better.
  5. Start the aggregator.
  6. You should see how some of the missed batches get verified, as well as the new ones.

You should bump the operator's MaxRetries in rpc_client.go so the operators keep retrying to send responses to the aggregator.

You may also modify the pending_batch_fetch_block_range config value in config-files/config-aggregator.yaml to test different scenarios. In the same file, you may also bump garbage_collector_period or garbage_collector_tasks_age so batches are not removed before they get verified.

We should stress test this PR to make sure there are no concurrency bugs. To do that, try the following:

  • Use as many operators as possible.
  • While the aggregator is down, wait as long as possible so that many batches (~50 or more) accumulate.
  • Modify the operators' retry variables so they retry indefinitely and very rapidly (try wait times below 1 second).
  • Set the pending_batch_fetch_block_range variable to a large number so all batches are fetched.

Closes #1350

@JulianVentura JulianVentura self-assigned this Nov 4, 2024
Member

@MarcosNicolau MarcosNicolau left a comment


I tested on macOS and Linux and everything worked fine. I left a few minor comments.

for i := 0; i < waitForEventRetries; i++ {
// Lock
agg.taskMutex.Lock()
agg.AggregatorConfig.BaseConfig.Logger.Info("- Locked Resources: Starting processing of Response")
Member

This log is repeated in ProcessOperatorSignedTaskResponseV2.

Contributor Author

Yes, I can change it if that's clearer.

Contributor Author

Done

} else {
break
}
if !agg.waitForTask(signedTaskResponse) {
Member

I find waitForTask a bit weird, primarily because it mutates the map and that isn't very clear. Personally, I would prefer to consolidate everything into a single function (getTask, for example), or at least rename it to something more descriptive, such as verifyTaskInMap or fetchTask.

Contributor Author

What do you mean by consolidating everything into one function? Do you mean deleting waitForTask and including its contents inside ProcessOperatorSignedTaskResponseV2? I think I prefer to have an auxiliary function.
I also find waitForTask not very descriptive, but the same happens with other names. I'll keep thinking of an alternative; tryToFetchTask is my first candidate.

Contributor Author

Done

@Oppen
Contributor

Oppen commented Nov 5, 2024

Looks good, need to test later.

@MauroToscano MauroToscano changed the title fix: aggregator recover missed batches fix: fetch task on task response if not cached Nov 7, 2024
@JulianVentura JulianVentura marked this pull request as draft November 7, 2024 17:46
@JulianVentura JulianVentura marked this pull request as ready for review November 12, 2024 14:07
@JulianVentura JulianVentura changed the title fix: fetch task on task response if not cached fix: (WIP) fetch task on task response if not cached Nov 12, 2024
@JulianVentura JulianVentura force-pushed the fix/aggregator-recover-lost-batches branch from fa1b39f to c8cb8e7 Compare November 13, 2024 14:44
@JulianVentura JulianVentura changed the title fix: (WIP) fetch task on task response if not cached fix: fetch task on task response if not cached Nov 13, 2024
@JulianVentura JulianVentura changed the title fix: fetch task on task response if not cached fix(aggregator): fetch task on task response if not cached Nov 13, 2024
@MauroToscano
Contributor

Need one more test and a review

@JulianVentura JulianVentura force-pushed the fix/aggregator-recover-lost-batches branch from a0cd361 to 99c6c88 Compare November 13, 2024 17:55
Contributor

@avilagaston9 avilagaston9 left a comment

Batches are not being verified after the merge:
[screenshot omitted]

@MarcosNicolau MarcosNicolau self-requested a review November 14, 2024 13:12
Contributor

@avilagaston9 avilagaston9 left a comment

Tested with 3 operators!

Member

@MarcosNicolau MarcosNicolau left a comment

It isn't working for me. I tested it with 4 operators, waited 15 min, and restarted the aggregator. Only a few tasks were verified, not all of them. Then I tried sending new tasks with the network working normally, and the aggregator wouldn't send the aggregated responses.

The task retrieval part does work, though.

@JulianVentura JulianVentura changed the title fix(aggregator): fetch task on task response if not cached fix(aggregator): (WIP) fetch task on task response if not cached Nov 19, 2024
@PatStiles PatStiles force-pushed the fix/aggregator-recover-lost-batches branch from 25c084a to 9392336 Compare November 21, 2024 18:34
Contributor

@uri-99 uri-99 left a comment

Still WIP.
Also missing: avoid fetching logs for the same merkle_root concurrently. Otherwise, you will fetch the log from the blockchain once for every operator, even though they all refer to the same merkle_root.
