Conversation

@colll78
Contributor

@colll78 colll78 commented May 1, 2025

Summary

This CIP proposes introducing a memory-bound, Least Recently Used (LRU) in-memory caching mechanism within cardano-node to store deserialized Plutus scripts. The goal is to avoid redundant deserialization of scripts during transaction and block validation, particularly for frequently reused scripts (e.g., in DeFi or DAO contracts).

By caching deserialized PlutusScript objects keyed by their script hash, and reusing them when present, nodes can significantly reduce CPU overhead without changing any ledger rules or validation outcomes. Scripts not found in cache will be deserialized on-demand (cold reload), preserving correctness.

This is a runtime-only optimization that applies at the cardano-node level. It does not require any changes to the cardano-ledger repository or protocol parameters, and is fully backwards compatible and deterministic.

The proposed cache would be:

  • Bounded in memory (e.g., 256 MB)
  • Eviction-based (LRU policy)
  • Configurable via node settings

Importantly, if you consider the ledger history on mainnet over the past few months, the presence of such a caching mechanism with a bound of ~256 MB would have reduced redundant deserializations to near zero. It also vastly increases the cost of deserialization-related spam attacks: an attacker would need to execute unique scripts that are not in the cache, which requires either submitting each unique script as a reference script in an output on-chain (expensive, large transactions) or including each unique script in the witness set (which increases the transaction size and fee).
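To make the intended behaviour concrete, here is a minimal, self-contained Haskell sketch of a memory-bounded LRU cache keyed by script hash. All names here (`ScriptHash`, `DeserializedScript`, the functions) are illustrative placeholders, not actual cardano-node APIs.

```haskell
import           Data.List       (minimumBy)
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)
import           Data.Ord        (comparing)

type ScriptHash = String                                     -- placeholder for the real script hash
newtype DeserializedScript = DeserializedScript { scriptSizeBytes :: Int }  -- placeholder

data LruCache = LruCache
  { capacityBytes :: Int                                      -- e.g. 256 * 1024 * 1024
  , usedBytes     :: Int
  , clock         :: Int                                      -- logical clock for recency
  , entries       :: Map ScriptHash (Int, DeserializedScript) -- (last-used tick, script)
  }

emptyCache :: Int -> LruCache
emptyCache cap = LruCache cap 0 0 Map.empty

-- On a hit, return the cached script and refresh its recency.
-- On a miss, the caller deserializes on demand ("cold reload") and inserts the result.
lookupScript :: ScriptHash -> LruCache -> (Maybe DeserializedScript, LruCache)
lookupScript h c = case Map.lookup h (entries c) of
  Nothing     -> (Nothing, c)
  Just (_, s) -> ( Just s
                 , c { clock   = clock c + 1
                     , entries = Map.insert h (clock c + 1, s) (entries c) } )

-- Insert a freshly deserialized script (assumed not already present),
-- then evict least-recently-used entries until the memory bound holds.
insertScript :: ScriptHash -> DeserializedScript -> LruCache -> LruCache
insertScript h s c = evict c { clock     = clock c + 1
                             , usedBytes = usedBytes c + scriptSizeBytes s
                             , entries   = Map.insert h (clock c + 1, s) (entries c) }
  where
    evict cache
      | usedBytes cache <= capacityBytes cache || Map.null (entries cache) = cache
      | otherwise =
          let (victim, (_, v)) = minimumBy (comparing (fst . snd)) (Map.toList (entries cache))
          in evict cache { usedBytes = usedBytes cache - scriptSizeBytes v
                         , entries   = Map.delete victim (entries cache) }
```

Under a scheme like this, a hit costs a map lookup plus a recency update, while a miss falls back to the existing deserialization path, so validation results are identical either way.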

Link to rendered proposal

@stevenj
Contributor

stevenj commented May 2, 2025

This looks like a very good idea. Not sure if it should be in the CIP, but I would suggest there should always be some kind of metrics associated with LRU caches, tracking at least:

  1. Actual space used. (The number of cached entries will vary with script size, and if the cache is allocated 1 GB but only ever uses 275 MB, then it's probably too big.)
  2. Number of contracts cached (scripts come in different sizes, so this can't be inferred from the cache size alone).
  3. Number of evictions (due to memory exhaustion of the cache; a big contract might need to evict multiple smaller ones to make space for itself).
  4. Number of misses (a miss doesn't imply an eviction if there is memory available).

So one can determine:

  1. Is it effective?
  2. Is it big enough?

Both of these may change over time.

I would also suggest a pre-load strategy when the node is first run, so the cache is already warm. But again, it may not be necessary depending on how the cache behaves, so metrics would help one work out if pre-loading the cache is necessary or desirable.
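As a rough illustration (not an existing cardano-node metrics interface), the counters suggested above could be grouped into something like the following Haskell record; all field and function names here are hypothetical.

```haskell
data CacheMetrics = CacheMetrics
  { metricBytesUsed :: Int   -- actual space used, to compare against the configured bound
  , metricEntries   :: Int   -- number of scripts currently cached (sizes vary)
  , metricHits      :: Int
  , metricMisses    :: Int   -- a miss triggers a cold deserialization, not necessarily an eviction
  , metricEvictions :: Int   -- entries dropped to make room for incoming scripts
  } deriving Show

-- Derived signal for "is it effective?"; "is it big enough?" can be judged
-- from bytes used and eviction counts relative to the configured bound.
hitRate :: CacheMetrics -> Double
hitRate m
  | total == 0 = 0
  | otherwise  = fromIntegral (metricHits m) / fromIntegral total
  where
    total = metricHits m + metricMisses m
```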

@rphair rphair changed the title from "CIP-??? | Plutus Script Caching" to "CIP-???? | Plutus Script Caching" on May 2, 2025
@rphair rphair added the Category: Plutus label (Proposals belonging to the 'Plutus' category) and the State: Triage label (Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction) on May 2, 2025
Collaborator

@rphair rphair left a comment


Looking forward to introducing this at the next CIP meeting (https://hackmd.io/@cip-editors/111) and hopefully the Plutus contingent will be there.

rphair and others added 2 commits May 2, 2025 15:54
@nau

nau commented May 3, 2025

A quick analysis of the evaluation of 1200 blocks of epoch 543 shows that 504 unique scripts of all versions were evaluated, with a total (flat-serialized) size of 1,288,321 bytes.
Guesstimating 32 bytes per flat byte, we'd need a cache of roughly 40 MB for these scripts.
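Spelled out (the 32x expansion factor is the commenter's guess, not a measured figure):

```haskell
-- Back-of-envelope check of the estimate above:
flatBytes, expansionFactor, cacheBytesNeeded :: Int
flatBytes        = 1288321        -- total flat-serialized size of the 504 unique scripts
expansionFactor  = 32             -- assumed bytes of in-memory representation per flat byte
cacheBytesNeeded = flatBytes * expansionFactor   -- 41,226,272 bytes, i.e. roughly 40 MB
```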

This cache is such an obvious optimization to do that I'm shocked it's never been implemented.

Contributor

@ch1bo ch1bo left a comment


I love the idea, but not sure whether this is a CIP?

It's not changing how Cardano works. While we could demand such caching for all block-producing node implementations because it is a CIP, other performance criteria are not covered by CIPs, and some alignment needs to happen anyway in a multi-node-implementation world.


## Rationale: how does this CIP achieve its goals?

This proposal follows the same spirit as performance optimizations in other smart contract chains (e.g. Solana's JIT caching model). It does not modify on-chain structures or Plutus semantics — only the **execution engine of node software**.
Contributor


If it is not changing anything on-chain, is it even a CIP?

@colll78
Contributor Author

colll78 commented May 5, 2025

I love the idea, but not sure whether this is a CIP?

It's not changing how Cardano works. While we could demand such caching for all block-producing node implementations because it is a CIP, other performance criteria are not covered by CIPs, and some alignment needs to happen anyway in a multi-node-implementation world.

I created a CIP because I want to socialize this change and its impact to the Cardano developer community, and for it to become a hard requirement for all node implementations.

The goal of this CIP is to be a first step towards getting rid of the non-linear scaling of fees based on the size of reference script bytes. Indeed, I could have proposed reverting that fee change directly in this proposal; however, given how risk-averse core developers tend to be, I figured it would be better to propose this as a standalone first step and, if it is accepted and integrated, to then make a second proposal that reverts the reference script byte fee scaling change, with the justification that this caching mechanism addresses the concerns the fee scaling change was introduced for.

@rphair
Collaborator

rphair commented May 6, 2025

When we created the Network and Consensus categories, each time there was a representative from that effort who had just submitted a CIP suited for that category... so we had the advantage of knowing that further CIPs coming in those categories could count on cooperation for review and to consider implementing them. cc @bwbush @jpraynaud

If as @colll78 suggests (and I support the idea) this change should be presented for community endorsement, this could be given a category of Node (since according to my understanding it doesn't affect Plutus itself)... although like @ch1bo says we wouldn't yet have confirmation of "Node" cooperation like in the other two categories: further complicated by the fact that there will be multiple "nodes" in the future (though I'd propose any Node CIP category cover all of them). cc @Ryun1

If there is any evidence that Haskell and/or Amaru node architects would generally be interested in what we used to require as enlisting in the CIP process, it would be helpful to see it here... and also to hear what more of them think about this particular proposal.

@rphair
Collaborator

rphair commented May 13, 2025

@colll78 the CIP meeting today decided to leave this Unconfirmed until you can address these points...

We recognised @ch1bo's point in #1031 (review) that particular software enhancements don't work well in the CIP process: generally because there's no opportunity for the community itself to confirm or support a "standard", and because such changes are more properly requested, documented & progressed as enhancements for that software project;

I mentioned that there could be grounds for including it as a CIP anyway given that Ouroboros improvements are also being documented as CIPs... but @Crypto2099 correctly pointed out that taking this approach on one node implementation wouldn't solve, or even be done the same way, on another (e.g. Rust vs. Haskell).

So it appears the "social" value of having this documented in a CIP is problematic since we wouldn't know how to make it into a usable standard. Currently we can't even give it a category (not properly Plutus since it has nothing to do with the Plutus language itself) since Node seems a long way from being useful given the proliferation of node implementations.

@rphair rphair added the State: Unconfirmed label (Triaged at meeting but not confirmed or assigned a CIP number yet) and removed the Category: Plutus and State: Triage (Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction) labels on May 13, 2025
@Quantumplation
Contributor

Quantumplation commented May 20, 2025

IMO the CIP process likely needs an overhaul. We discussed this at the node diversity workshop, observing that Ethereum has:

EIP - improvements to the protocol itself that all nodes must agree on and adopt (if approved)
ERC - documenting community standards/patterns that are widespread or that others might want to follow, but that don't necessarily impact the core node

(There might have been a third one, but I can't remember the details!)

So perhaps we need:

CPS - generic problem statements that might have a variety of possible solutions
CIP - specific changes to the protocol that all nodes would need to agree on and adopt (if approved)
CRC - community standards, best practices, etc.

The way I would expect Phil's goal to progress through these then is:

  • a CPS, describing why the ref script bytes is highly detrimental and needs a solution to remove
  • A CIP that outlines a protocol level performance benchmark that we propose all nodes must meet (regardless of how they do so); meeting that protocol level performance benchmark would allow us to make changes to remove the ref script bytes
  • One or more CRCs with suggested architectural design patterns that nodes could use to meet the performance benchmark and make it safe to adopt the CIP

In particular, the division between these artifacts is around "level of social consensus needed", which is the division you want when designing a process that is meant to reach and enforce a social consensus.

@rphair
Collaborator

rphair commented May 20, 2025

thanks @Quantumplation ... let's continue that suggestion here:

@colll78
Contributor Author

colll78 commented May 21, 2025

After discussion with @Quantumplation I'm on board with his approach. I will edit this CIP to remove any concrete suggestions regarding node architecture or implementation details, and instead provide concrete constraints on the amount of work nodes do for large adversarial payloads targeting reference script deserialization, ensuring that the work done by nodes to process such payloads is negligible (can be absorbed by the network) and thus that we can justify the removal of the fee.

Likewise, I will introduce another CIP for the Ledger category that proposes reverting the fee-per-reference-script-byte change once a node conforms with the first CIP.

I already maintain so many CIPs, so if possible, could someone else take on the responsibility of drafting the CPS describing why the reference script bytes fee is highly detrimental and needs a solution to remove it?

@WhatisRT
Contributor

The issue with getting rid of the reference script fee is that you can still have attacks that maximize cache misses, in which SPOs would do a lot of work but aren't rewarded for it. I'd really like to get rid of this fee (especially that hideous recursive exponential), but the work in the worst-case scenario needs to be paid for somehow.

My suggestion to avoid this fee back then was to ensure that the operation required to get the script into a runnable state would be really cheap. This is relatively easy to do if you have a data structure without pointers because you can just read it into memory from wherever you store it, but Plutus scripts aren't such a data structure. I suggested a GHC compact region which sounds like a fantastic tool for exactly this purpose but apparently it doesn't quite work for Plutus scripts. I also don't know whether this problem is easy to solve in other languages like Rust (I guess you could just have some binary encoding with 'local pointers' and when reading it in all you do is allocate memory for every indirection & substitute in the pointer to that allocated memory).
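For readers unfamiliar with compact regions, here is a toy sketch of the idea using the ghc-compact package (module GHC.Compact); the `Script` type is a stand-in, and as noted above this approach reportedly does not carry over cleanly to real deserialized Plutus scripts.

```haskell
import GHC.Compact (compact, compactSize, getCompact)

-- Toy stand-in for a deserialized script: no functions, no mutable fields,
-- so it can live in a compact region.
data Script = Script { scriptName :: String, scriptBody :: [Int] }

main :: IO ()
main = do
  -- Fully evaluate and copy the value into a single contiguous region.
  region <- compact (Script "toy" [1 .. 1000])
  bytes  <- compactSize region
  putStrLn ("Compact region size in bytes: " ++ show bytes)
  -- The region has no outgoing pointers, so it can in principle be written
  -- out and read back wholesale, which is what makes (re)loading cheap.
  print (length (scriptBody (getCompact region)))
```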

So I think it is possible in principle to avoid this fee, but you really need to have fast loading of all scripts and not just a cache. At least if you want SPOs to be sufficiently paid for their work in the presence of attacks.

@colll78
Contributor Author

colll78 commented May 21, 2025

Firstly, I agree with you 100% that we should work to reduce the cost of deserialization / setup time for scripts in the first place; that will benefit us regardless.

However, I think we can get rid of the reference script bytes fee with a cache + a high one-time fee for reference script deployment. In Solana for instance, to deploy the equivalent of what would be a 100kb reference script on Cardano, the transaction fee incurred would be 89 SOL or roughly $14,906.

On Cardano, instead of charging a lot for reference script deployment, the transaction fee to deploy a ~16 kB reference script is roughly 1 ada, and we prevent bloat by requiring a large min-ada deposit to compensate for the low fee: the min-ada required for a UTxO with a ~16 kB reference script is roughly ~68 ada. The issue is that users can simply reclaim this ada at any time by sending the reference script UTxO back to themselves (for a very low tx fee). Instead, we should make the transaction fee for deploying such large reference scripts non-trivial. Developers would unilaterally prefer to pay a very high one-time fee to deploy a contract in order to reduce the fee per transaction that uses their contract (the reference script byte fee).

With the above, it would cost hundreds of thousands of dollars for a malicious attacker to construct malicious payloads to attempt to create cache misses.

Another future solution is to allow reference scripts to be submitted as blob transactions and to cost those transactions accordingly. Blobs will likely be released in the next year or two and will allow publishing large amounts of data directly to the chain outside of the UTxO set, while still letting that data be referenced in the execution layer and accessed by the ledger.

@WhatisRT
Contributor

Sadly this doesn't quite work, because an attacker doesn't need to deploy scripts themselves to cause cache misses. As long as there are enough reference scripts around to exceed the size of the cache plus what you need to fill up one transaction, the attacker can just cycle through all of them. And since the attack doesn't require succeeding scripts, they can just fail them all really quickly by providing 0 execution units, leading to low costs. An attacker would prefer larger scripts because these lead to lower costs, but it's not necessary.

Sorry for always being the guy saying that something doesn't work 😅 But I really like these suggestions and if you have more thoughts I'd like to hear them!

@zliu41
Contributor

zliu41 commented May 28, 2025

There are so many cache replacement policies that it's not obvious why LRU is the best.

If your goal is reducing fees, then I'd suggest a random replacement policy, or at least one with some degree of randomness. Like @WhatisRT said, any deterministic algorithm would be subject to certain access patterns that make it perform poorly.
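A minimal sketch of the randomized-eviction idea (illustrative only; it uses the containers and random packages, and the types are placeholders):

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)
import           System.Random   (randomRIO)

type ScriptHash   = String
type CachedScript = Int   -- placeholder for a deserialized script

-- Instead of always evicting the least recently used entry, evict a victim
-- chosen uniformly at random, so an attacker cannot predict which scripts
-- remain cached by replaying a particular access pattern.
evictRandom :: Map ScriptHash CachedScript -> IO (Map ScriptHash CachedScript)
evictRandom cache
  | Map.null cache = pure cache
  | otherwise = do
      i <- randomRIO (0, Map.size cache - 1)
      let (victim, _) = Map.elemAt i cache
      pure (Map.delete victim cache)
```

In practice one might blend recency with randomness (e.g. sample a few candidates and evict the least recently used among them) rather than use pure random replacement.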

@colll78
Contributor Author

colll78 commented May 30, 2025

Sadly this doesn't quite work, because an attacker doesn't need to deploy scripts themselves to cause cache misses. As long as there are enough reference scripts around to exceed the size of the cache plus what you need to fill up one transaction, the attacker can just cycle through all of them. And since the attack doesn't require succeeding scripts, they can just fail them all really quickly by providing 0 execution units, leading to low costs. An attacker would prefer larger scripts because these lead to lower costs, but it's not necessary.

Sorry for always being the guy saying that something doesn't work 😅 But I really like these suggestions and if you have more thoughts I'd like to hear them!

I disagree. Why would the cache store failing scripts? A transaction with a script that fails in phase 2 consumes collateral, which can of course be costed higher than a legitimate transaction (and should be, because there is no reason any honest actor would ever submit such a transaction).

I think to state that this doesn't work, we need more concrete proof, because this is how Solana addresses the exact same problem and allows them to not cost deserialization. They use an LRU cache with a high degree of randomness baked in. If this doesn't work, then the Solana mainnet is vulnerable to the same exact deserialization attack.

@WhatisRT
Contributor

WhatisRT commented Jun 2, 2025

I disagree. Why would the cache store failing scripts? A transaction with a script that fails in phase 2 consumes collateral, which can of course be costed higher than a legitimate transaction (and should be, because there is no reason any honest actor would ever submit such a transaction).

The cache has to store failing scripts because the attack gets even easier if it doesn't. Remember that cache misses are the issue here, and so if I know my scripts aren't going to be cached I can just do many transactions with exactly the same scripts.

And yes, you could not cache the scripts and charge a higher collateral that makes up for this difference. That means that you're now moving the complexity from the fee calculation to the collateral calculation. This is a bit nicer for the users because the actual fees are lower, but it doesn't actually make things easier.

Another issue is that you can also do this attack with succeeding scripts. I don't know how viable it is, but I'm sure you could write some tooling that finds scripts on mainnet that you can use for your attack. So if this attack is viable and there are enough honest scripts that you can abuse, then none of these optimizations will save you.

I think to state that this doesn't work, we need more concrete proof, because this is how Solana addresses the exact same problem and allows them to not cost deserialization. They use an LRU cache with a high degree of randomness baked in. If this doesn't work, then the Solana mainnet is vulnerable to the same exact deserialization attack.

Do you have a source for this? I wonder if Solana just has a massive cache size, which they can do because they require beefy machines. Randomness also helps a lot here because you need a much larger pool of scripts to choose from to get a large number of cache misses. And if you only have a couple of cache misses, then there's no attack.

However, all of this requires benchmarking & simulation. Our standard of proof shouldn't be 'Solana does it so it must be fine'. They have different hardware requirements, different consensus and lots of other differences, so what works for them might not work for us.

@colll78
Contributor Author

colll78 commented Jun 2, 2025

The cache has to store failing scripts because the attack gets even easier if it doesn't. Remember that cache misses are the issue here, and so if I know my scripts aren't going to be cached, I can just do many transactions with exactly the same scripts.

And yes, you could not cache the scripts and charge a higher collateral that makes up for this difference. That means that you're now moving the complexity from the fee calculation to the collateral calculation. This is a bit nicer for the users because the actual fees are lower, but it doesn't actually make things easier.

Introducing very high fees for transactions that fail phase 2 validation is in and of itself a good change that encourages the intended behavior of the network. It is impossible for an honest party to submit a phase 2 invalid transaction unintentionally. There are close to zero such transactions throughout the history of the Cardano mainnet, and all of them have been submitted deliberately, for testing or otherwise. The "complexity" of collateral calculation should be simple relative to the complexity of fee calculation, because the goal of collateral calculation is just to make it very high, as we know that an honest user should never have their collateral consumed.

However, all of this requires benchmarking & simulation. Our standard of proof shouldn't be 'Solana does it so it must be fine'. They have different hardware requirements, different consensus and lots of other differences, so what works for them might not work for us.

I agree that this shouldn't be taken as proof that the solution will work for us; my point was that, given this is the case, we cannot immediately conclude that it will not work for us either. It provides an argument for why we should explore this as a potential solution on Cardano.

@Ryun1
Collaborator

Ryun1 commented Jul 8, 2025

hello!

Cool proposal
but I'm not sure how well it fits into the current CIP processes, as mentioned a couple of times above.

This proposal firmly seems to be more of an implementation-specific detail,
rather than a standard that is needed for all core node implementations.

I see this in a similar vein to LSM Tree / UTxO-HD, where it is an optimisation that a node implementer may or may not implement.
There are no required changes to any part of Cardano's core categories which would require standardisation.

Although I feel this proposal doesn't exactly fit into the current CIP landscape,
pending discussions on CIP process improvements via #1040,
a place for this sort of proposal could be made within CIPs.

While discussions of CIP process evolution continue,
I feel it's best to leave this proposal as is and allow discussions to continue,
but without a number.

@Crypto2099
Collaborator

On the topic of whether or not this CIP is "on topic": I would like to say that I've loved the conversation that has already occurred within this thread, but it raises an important point, that being:

  • How do we socialize potential ideas (maybe we should cache scripts) and then move into some amount of effort expended (testing, simulation) to justify the further expense of effort to get to implementation, once an idea is shown to have merit?

Also, I do like the idea of creating a standard for Node implementers, because it may be that we have identified this caching/serialization issue as problematic for block producers, and therefore all must be able to handle a transaction with 27 different reference scripts within 500 ms (whatever, making up numbers here); that then becomes a benchmarking test for whether or not a block-producing node is compliant (something for Cardano Blueprint maybe, @ch1bo?).
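To illustrate what such a conformance benchmark could look like, here is a hedged Haskell sketch using the criterion library; `deserializeScript`, the payload sizes, and the 27-script figure are placeholders built from the made-up numbers above, and criterion is just one possible harness.

```haskell
import           Criterion.Main  (bench, defaultMain, nf)
import qualified Data.ByteString as BS

-- Hypothetical stand-in for the real flat/CBOR deserialization of a Plutus script.
deserializeScript :: BS.ByteString -> Int
deserializeScript = BS.length   -- placeholder work only

main :: IO ()
main = defaultMain
  [ -- Target (numbers invented, as in the comment above): a block producer should
    -- handle a transaction carrying 27 distinct reference scripts well within 500 ms.
    bench "deserialize 27 reference scripts" $
      nf (map deserializeScript) (replicate 27 (BS.replicate 16384 0))
  ]
```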

I'm not satisfied with @colll78's argument that "Solana does it this way so we should too!" (I know that's reductive and not exactly what he said, don't kill me), but I'm also not convinced by @WhatisRT's argument that this one solution may lead to other issues so it's not worth considering. We've already rather poorly traded one issue (i.e. a DoS vulnerability from the deserialization of reference scripts) for another issue (extremely high transaction costs for transactions that have a legitimate need for multiple reference scripts).

Finding solutions that can bring the "per tx" fees for end users down is admirable and should be supported, so I would love to see how we move this into a "next phase" of testing and simulation to see whether Phil's performance assumptions are true or false.

@WhatisRT
Contributor

Ah, sorry, I didn't want to give the impression that some things aren't worth considering. We just need to make sure that we don't allow for attacks of this kind, which requires careful analysis.

Let me also suggest a hybrid approach: if the attack becomes harder to execute, we could lower the fees. Maybe a randomized cache combined with some other strategies already makes this attack expensive for the attacker, and then the fees could be lowered to compensate for this. I don't have the time to look into this, but it might be worth a shot if anyone is interested.
