
Conversation

wez (Contributor) commented Sep 22, 2025

This PR iteratively (across its constituent commits) replaces the encoding crate (#38) with the encoding_rs crate.

It does this by introducing an Encoding struct to replace the use of encoding names; the struct has a `decode` method that calls through to encoding. This ensures that name-to-encoding resolution is handled centrally.

Then the underlying implementation is switched over to encoding_rs.

Some cleanup around handling of aliases is part of this sequence of changes.
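
As a rough sketch of the shape this takes (the names and signatures here are assumptions for illustration, not the PR's actual definitions), the wrapper, once backed by encoding_rs, might look like:

```rust
use encoding_rs::Encoding as RsEncoding;

/// Hypothetical wrapper type; the PR's real definition may differ.
pub struct Encoding {
    inner: &'static RsEncoding,
}

impl Encoding {
    /// Central name/alias -> encoding resolution.
    pub fn by_name(label: &str) -> Option<Encoding> {
        RsEncoding::for_label(label.as_bytes()).map(|inner| Encoding { inner })
    }

    /// Decode with replacement of malformed sequences; also reports
    /// whether any replacements occurred.
    pub fn decode(&self, bytes: &[u8]) -> (String, bool) {
        let (text, _, had_errors) = self.inner.decode(bytes);
        (text.into_owned(), had_errors)
    }
}
```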

wez added 18 commits September 22, 2025 14:48
We can use a `bytes::Regex` to match directly on the raw bytes, without
decoding as ASCII first.

This helps nudge closer to resolving
nickspring#38
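
A minimal illustration of the technique (the pattern and input here are invented for the example, not taken from the PR):

```rust
use regex::bytes::Regex;

fn main() {
    // (?-u) permits matching on arbitrary bytes; no ASCII decode step needed.
    let re = Regex::new(r"(?i-u)charset=([a-z0-9_\-]+)").unwrap();
    let raw: &[u8] = b"Content-Type: text/html; charset=UTF-8";
    if let Some(caps) = re.captures(raw) {
        assert_eq!(&caps[1], b"UTF-8"); // capture is &[u8]; nothing was decoded
    }
}
```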
This was intended to be essentially a refactoring commit that moves
away from operating on encoding names and free functions that
call into the unmaintained encoding crate, towards a
type that can later introduce calls to the successor encoding_rs
crate.

I'd intended that no functional changes occur as part of this
change, but some of the largesets tests started to fail: on closer
inspection, a number of utf-8 files had been placed
in the ascii directory. They were misclassified, and the previous
logic considered them to be valid ascii when they were actually
utf-8.

I don't see precisely where this change in behavior originates, but
I suspect it has to do with name canonicalization. My
resolution here is to move those test files to the utf-8 directory, as
they are now correctly classified.

The material changes in this commit are the definitions of the various
Encoding instances and the corresponding alias maps.

The rest of the changes are largely a mechanical consequence of moving
from strings to &Encoding.

refs: nickspring#38
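
The alias handling might be sketched like this (a simplified illustration; the PR's actual tables and names will differ):

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Illustrative alias entries; real tables follow the WHATWG labels,
// where e.g. "latin1" is a label for windows-1252.
static ALIASES: LazyLock<HashMap<&'static str, &'static str>> = LazyLock::new(|| {
    HashMap::from([
        ("latin1", "windows-1252"),
        ("l1", "windows-1252"),
        ("unicode-1-1-utf-8", "utf-8"),
    ])
});

/// Canonicalize a label before resolving it to an Encoding instance.
fn canonical_name(label: &str) -> Option<&'static str> {
    let lower = label.trim().to_ascii_lowercase();
    ALIASES.get(lower.as_str()).copied()
}
```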
Adds encoding_rs, but does not remove encoding.  Both implementations
exist in the code and the results of both can be compared by changing an
`if false` to `if true`.

When I do that, a single windows-1255 test case has a discrepancy,
which is due to encoding_rs's implementation accepting the input
file.

I think that this is acceptable and reasonable.

I'll remove encoding in a follow up commit.

refs: nickspring#38
There are a couple of additional functions that still need
to be migrated.

refs: nickspring#38
This completes the migration away from the unmaintained encoding crate.

closes: nickspring#38
This removes the following warnings from cargo audit:

```
$ cargo audit
Crate:     atty
Version:   0.2.14
Warning:   unmaintained
Title:     `atty` is unmaintained
Date:      2024-09-25
ID:        RUSTSEC-2024-0375
URL:       https://rustsec.org/advisories/RUSTSEC-2024-0375
Dependency tree:
atty 0.2.14
└── criterion 0.3.6
    └── charset-normalizer-rs 1.1.0

Crate:     serde_cbor
Version:   0.11.2
Warning:   unmaintained
Title:     serde_cbor is unmaintained
Date:      2021-08-15
ID:        RUSTSEC-2021-0127
URL:       https://rustsec.org/advisories/RUSTSEC-2021-0127
Dependency tree:
serde_cbor 0.11.2
└── criterion 0.3.6
    └── charset-normalizer-rs 1.1.0

Crate:     atty
Version:   0.2.14
Warning:   unsound
Title:     Potential unaligned read
Date:      2021-07-04
ID:        RUSTSEC-2021-0145
URL:       https://rustsec.org/advisories/RUSTSEC-2021-0145

warning: 3 allowed warnings found
```
Previously, the mess detector used an unbounded cache.

An unbounded cache is undesirable because it presents a potential attack
surface when the detector is operating on data coming in from the network:
an attacker could generate poisoned content consisting of all possible
unicode codepoints, spread across a series of otherwise innocuous messages,
to gradually consume the memory on the system.

This commit switches it to a fixed-size cache whose capacity equals the
initial reserved capacity that was in use prior to this change.
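
A direct-mapped, fixed-capacity cache is one simple way to get this bound (a sketch under assumed types; the detector's actual cache will differ):

```rust
/// Fixed-capacity, direct-mapped cache from char -> bool.
/// Collisions simply overwrite, so memory stays bounded at CAP entries
/// regardless of how many distinct codepoints the input contains.
struct BoundedCache<const CAP: usize> {
    slots: Vec<Option<(char, bool)>>,
}

impl<const CAP: usize> BoundedCache<CAP> {
    fn new() -> Self {
        Self { slots: vec![None; CAP] }
    }

    fn get_or_insert_with(&mut self, key: char, f: impl FnOnce(char) -> bool) -> bool {
        let idx = key as usize % CAP;
        match self.slots[idx] {
            Some((k, v)) if k == key => v, // cache hit
            _ => {
                let v = f(key);
                self.slots[idx] = Some((key, v)); // evict whatever was here
                v
            }
        }
    }
}
```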
Since the data is ordered, a binary search shaves off some small
overhead for repeated occurrences of higher-valued code points.
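
For example, membership in a sorted table of codepoint ranges can be tested with `binary_search_by` (the ranges here are illustrative, not the crate's tables):

```rust
use std::cmp::Ordering;

// Sorted, non-overlapping (start, end) codepoint ranges; values are illustrative.
const RANGES: &[(u32, u32)] = &[(0x0370, 0x03FF), (0x0400, 0x04FF), (0x4E00, 0x9FFF)];

fn in_ranges(cp: u32) -> bool {
    RANGES
        .binary_search_by(|&(start, end)| {
            if end < cp {
                Ordering::Less // this range lies entirely below cp
            } else if start > cp {
                Ordering::Greater // this range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .is_ok()
}
```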
Does not appear to be needed any more
Doesn't materially change the overall performance profile, but removes
the pair of hash set constructions that occurred in the inner loop of
alphabet_languages: the table is refactored to build the alphabet set
once, and the second set is avoided by recognizing that we only need
the size of the intersection rather than the materialized intersection
itself.
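
Counting the intersection lazily avoids the extra allocation; a small sketch with placeholder data:

```rust
use std::collections::HashSet;

fn main() {
    // `alphabet` is built once per language table (placeholder data here).
    let alphabet: HashSet<char> = "абвгде".chars().collect();
    let observed: HashSet<char> = "агяz".chars().collect();

    // The intersection iterator is lazy: we count matches without
    // allocating a new set to hold them.
    let shared = alphabet.intersection(&observed).count();
    assert_eq!(shared, 2);
}
```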
Move that functionality into the Encoding type to keep
the association tighter.
This minor change mostly just improves the typing of the various
parameters and results.
This means that we don't need to allocate a copy of the payload.
We don't reference the payload content but only its length when
comparing matches, so we keep the length instead.
Previously, we'd always materialize the decoded string, even when we knew
that we didn't need the result.

This commit switches to the more complex streaming decode API from
encoding_rs so that we can set an upper bound on memory usage when
operating in validation-only mode.

FWIW, this change has no measurable impact on the results of
running the performance tests.
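
The streaming API makes it possible to validate with a small, reusable buffer; a minimal sketch (assuming a 4 KiB scratch buffer and a standalone helper, neither of which is the PR's exact code):

```rust
use encoding_rs::{CoderResult, Encoding};

/// Validate that `bytes` decode cleanly in `encoding`, reusing a fixed
/// scratch buffer instead of materializing the full decoded string.
fn validates(encoding: &'static Encoding, bytes: &[u8]) -> bool {
    let mut decoder = encoding.new_decoder();
    let mut scratch = [0u8; 4096];
    let mut input = bytes;
    loop {
        // `last = true` is fine here because we hold the entire payload;
        // on OutputFull we simply call again with the remaining input.
        let (result, read, _written, had_errors) =
            decoder.decode_to_utf8(input, &mut scratch, true);
        if had_errors {
            return false; // a malformed sequence was replaced
        }
        input = &input[read..];
        match result {
            CoderResult::InputEmpty => return true,
            CoderResult::OutputFull => continue, // reuse the scratch buffer
        }
    }
}
```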