
Conversation

wez (Contributor) commented Sep 22, 2025

This PR iteratively (across its constituent commits) replaces the encoding crate (#38) with the encoding_rs crate.

It does this by introducing an Encoding struct to replace the use of encoding names; the struct has a `decode` method that calls through to encoding. This ensures that name-to-encoding resolution is handled centrally.

Then the underlying implementation is switched over to encoding_rs.

Some cleanup around handling of aliases is part of this sequence of changes.
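
As a rough sketch of the shape this takes (the names and signatures here are assumptions for illustration, not the PR's actual definitions), the wrapper, once backed by encoding_rs, might look like:

```rust
use encoding_rs::Encoding as RsEncoding;

/// Hypothetical wrapper type; the PR's real definition may differ.
pub struct Encoding {
    inner: &'static RsEncoding,
}

impl Encoding {
    /// Central name/alias -> encoding resolution.
    pub fn by_name(label: &str) -> Option<Encoding> {
        RsEncoding::for_label(label.as_bytes()).map(|inner| Encoding { inner })
    }

    /// Decode with replacement of malformed sequences; also reports
    /// whether any replacements occurred.
    pub fn decode(&self, bytes: &[u8]) -> (String, bool) {
        let (text, _, had_errors) = self.inner.decode(bytes);
        (text.into_owned(), had_errors)
    }
}
```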

wez added 18 commits September 22, 2025 14:48
We can use a `bytes::Regex` to match directly on the raw bytes, without
decoding as ASCII first.

This helps nudge closer to resolving
nickspring#38
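
A minimal illustration of the technique (the pattern and input here are invented for the example, not taken from the PR):

```rust
use regex::bytes::Regex;

fn main() {
    // (?-u) permits matching on arbitrary bytes; no ASCII decode step needed.
    let re = Regex::new(r"(?i-u)charset=([a-z0-9_\-]+)").unwrap();
    let raw: &[u8] = b"Content-Type: text/html; charset=UTF-8";
    if let Some(caps) = re.captures(raw) {
        assert_eq!(&caps[1], b"UTF-8"); // capture is &[u8]; nothing was decoded
    }
}
```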
This was intended to be essentially a refactoring commit that moves
away from operating on encoding names and free functions that
call into the unmaintained encoding crate, towards a
type that can later introduce calls to the successor encoding_rs
crate.

I'd intended that no functional changes occur as part of this
change, but some of the largesets tests started to fail: on closer
inspection, a number of utf-8 files had been placed
in the ascii directory. They were misclassified, and the previous
logic considered them to be valid ascii when they were actually
utf-8.

I don't see precisely where this change in behavior originates, but
I suspect it has to do with name canonicalization. My
resolution here is to move those test files to the utf-8 directory, as
they are now correctly classified.

The material changes in this commit are the definitions of the various
Encoding instances and the corresponding alias maps.

The rest of the changes are largely a mechanical consequence of moving
from strings to &Encoding.

refs: nickspring#38
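
The alias handling might be sketched like this (a simplified illustration; the PR's actual tables and names will differ):

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Illustrative alias entries; real tables follow the WHATWG labels,
// where e.g. "latin1" is a label for windows-1252.
static ALIASES: LazyLock<HashMap<&'static str, &'static str>> = LazyLock::new(|| {
    HashMap::from([
        ("latin1", "windows-1252"),
        ("l1", "windows-1252"),
        ("unicode-1-1-utf-8", "utf-8"),
    ])
});

/// Canonicalize a label before resolving it to an Encoding instance.
fn canonical_name(label: &str) -> Option<&'static str> {
    let lower = label.trim().to_ascii_lowercase();
    ALIASES.get(lower.as_str()).copied()
}
```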
Adds encoding_rs, but does not remove encoding.  Both implementations
exist in the code and the results of both can be compared by changing an
`if false` to `if true`.

When I do that, a single windows-1255 test case has a discrepancy,
which is due to encoding_rs's implementation accepting the input
file.

I think that this is acceptable and reasonable.

I'll remove encoding in a follow up commit.

refs: nickspring#38
There are a couple of additional functions that still need
to be migrated.

refs: nickspring#38
This completes the migration away from the unmaintained encoding crate.

closes: nickspring#38
This removes the following warnings from cargo audit:

```
$ cargo audit
Crate:     atty
Version:   0.2.14
Warning:   unmaintained
Title:     `atty` is unmaintained
Date:      2024-09-25
ID:        RUSTSEC-2024-0375
URL:       https://rustsec.org/advisories/RUSTSEC-2024-0375
Dependency tree:
atty 0.2.14
└── criterion 0.3.6
    └── charset-normalizer-rs 1.1.0

Crate:     serde_cbor
Version:   0.11.2
Warning:   unmaintained
Title:     serde_cbor is unmaintained
Date:      2021-08-15
ID:        RUSTSEC-2021-0127
URL:       https://rustsec.org/advisories/RUSTSEC-2021-0127
Dependency tree:
serde_cbor 0.11.2
└── criterion 0.3.6
    └── charset-normalizer-rs 1.1.0

Crate:     atty
Version:   0.2.14
Warning:   unsound
Title:     Potential unaligned read
Date:      2021-07-04
ID:        RUSTSEC-2021-0145
URL:       https://rustsec.org/advisories/RUSTSEC-2021-0145

warning: 3 allowed warnings found
```
Previously, the mess detector used an unbounded cache.

An unbounded cache is undesirable because it presents a potential attack
surface when the detector is operating on data coming in from the network:
an attacker could generate poisoned content consisting of all possible
unicode codepoints, spread across a series of otherwise innocuous messages,
to gradually consume the memory on the system.

This commit switches it to a fixed-size cache whose capacity equals the
initial reserved capacity that was in use prior to this change.
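
A direct-mapped, fixed-capacity cache is one simple way to get this bound (a sketch under assumed types; the detector's actual cache will differ):

```rust
/// Fixed-capacity, direct-mapped cache from char -> bool.
/// Collisions simply overwrite, so memory stays bounded at CAP entries
/// regardless of how many distinct codepoints the input contains.
struct BoundedCache<const CAP: usize> {
    slots: Vec<Option<(char, bool)>>,
}

impl<const CAP: usize> BoundedCache<CAP> {
    fn new() -> Self {
        Self { slots: vec![None; CAP] }
    }

    fn get_or_insert_with(&mut self, key: char, f: impl FnOnce(char) -> bool) -> bool {
        let idx = key as usize % CAP;
        match self.slots[idx] {
            Some((k, v)) if k == key => v, // cache hit
            _ => {
                let v = f(key);
                self.slots[idx] = Some((key, v)); // evict whatever was here
                v
            }
        }
    }
}
```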
Since the data is ordered, a binary search shaves off some small
overhead for repeated occurrences of higher-valued code points.
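
For example, membership in a sorted table of codepoint ranges can be tested with `binary_search_by` (the ranges here are illustrative, not the crate's tables):

```rust
use std::cmp::Ordering;

// Sorted, non-overlapping (start, end) codepoint ranges; values are illustrative.
const RANGES: &[(u32, u32)] = &[(0x0370, 0x03FF), (0x0400, 0x04FF), (0x4E00, 0x9FFF)];

fn in_ranges(cp: u32) -> bool {
    RANGES
        .binary_search_by(|&(start, end)| {
            if end < cp {
                Ordering::Less // this range lies entirely below cp
            } else if start > cp {
                Ordering::Greater // this range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .is_ok()
}
```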
Does not appear to be needed any more
Doesn't materially change the overall performance profile, but removes
the pair of hash set constructions that occurred in the inner loop of
alphabet_languages: the table is refactored to build the alphabet set
once, and the second set is avoided by recognizing that we only need
the size of the intersection rather than the materialized intersection
itself.
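
Counting the intersection lazily avoids the extra allocation; a small sketch with placeholder data:

```rust
use std::collections::HashSet;

fn main() {
    // `alphabet` is built once per language table (placeholder data here).
    let alphabet: HashSet<char> = "абвгде".chars().collect();
    let observed: HashSet<char> = "агяz".chars().collect();

    // The intersection iterator is lazy: we count matches without
    // allocating a new set to hold them.
    let shared = alphabet.intersection(&observed).count();
    assert_eq!(shared, 2);
}
```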
Move that functionality into the Encoding type to keep
the association tighter.
This minor change mostly just improves the typing of the various
parameters and results.
This means that we don't need to allocate a copy of the payload.
We don't reference the payload content but only its length when
comparing matches, so we keep the length instead.
Previously, we'd always materialize the decoded string, even when we knew
that we didn't need the result.

This commit switches to the more complex streaming decode API from
encoding_rs so that we can set an upper bound on memory usage when
operating in validation-only mode.

FWIW, this change has no measurable impact on the results of
running the performance tests.
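
The streaming API makes it possible to validate with a small, reusable buffer; a minimal sketch (assuming a 4 KiB scratch buffer and a standalone helper, neither of which is the PR's exact code):

```rust
use encoding_rs::{CoderResult, Encoding};

/// Validate that `bytes` decode cleanly in `encoding`, reusing a fixed
/// scratch buffer instead of materializing the full decoded string.
fn validates(encoding: &'static Encoding, bytes: &[u8]) -> bool {
    let mut decoder = encoding.new_decoder();
    let mut scratch = [0u8; 4096];
    let mut input = bytes;
    loop {
        // `last = true` is fine here because we hold the entire payload;
        // on OutputFull we simply call again with the remaining input.
        let (result, read, _written, had_errors) =
            decoder.decode_to_utf8(input, &mut scratch, true);
        if had_errors {
            return false; // a malformed sequence was replaced
        }
        input = &input[read..];
        match result {
            CoderResult::InputEmpty => return true,
            CoderResult::OutputFull => continue, // reuse the scratch buffer
        }
    }
}
```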