@montvid montvid commented Oct 28, 2025

After running image OCR with --no_fitz_preprocess, the generated JSONL file reports image dimensions that differ from the original image, because the parser.py script applies smart_resize. According to https://arxiv.org/abs/2307.06304, NaViT can ingest images of any dimension, so smart_resize should not be needed.

delete smart_resize
@ygfrancois
Collaborator

smart_resize is used here so that the reported size reflects the real input size seen by the model. The model server applies the same smart resize online to keep the input dimensions divisible by the vision patch size.
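
To illustrate the divisibility constraint, here is a minimal sketch. It is not the actual smart_resize from parser.py: the function name `round_to_multiple` and the default factor of 28 are assumptions made only for this example, and the real function may also apply pixel-count limits that are not shown here.

```python
def round_to_multiple(height: int, width: int, factor: int = 28) -> tuple[int, int]:
    """Snap each side to the nearest multiple of `factor`.

    Hypothetical helper: `factor` stands in for whatever size the vision
    encoder requires each side to be divisible by (an assumed value here).
    """
    if height <= 0 or width <= 0 or factor <= 0:
        raise ValueError("height, width and factor must be positive")
    # Round to the nearest multiple, but never go below a single factor.
    new_height = max(factor, round(height / factor) * factor)
    new_width = max(factor, round(width / factor) * factor)
    return new_height, new_width


if __name__ == "__main__":
    # A 1000x750 image no longer has its original size after snapping.
    print(round_to_multiple(1000, 750))
```

Under these assumptions the snapped size generally differs from the original by a few pixels per side, which would explain why the size recorded in the output is not the original one.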

@ygfrancois ygfrancois closed this Oct 31, 2025
@montvid
Author

montvid commented Oct 31, 2025

Thanks for the clarification! I now understand that NaViT-style encoders accept arbitrary aspect ratios and resolutions, but the deployed implementation still enforces patch-size divisibility so that the image splits into valid patch tokens efficiently.
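
For anyone who needs results in the original image's coordinate system, a possible workaround is to rescale on the consumer side. The sketch below assumes that bounding boxes in the output are expressed in the resized image's pixel coordinates and that both sizes are known as (width, height); the function name `scale_bbox_to_original` is hypothetical and not part of the repository.

```python
def scale_bbox_to_original(
    bbox: tuple[float, float, float, float],
    resized_size: tuple[int, int],
    original_size: tuple[int, int],
) -> tuple[float, float, float, float]:
    """Map an (x1, y1, x2, y2) box from resized to original pixel coordinates.

    Hypothetical helper: sizes are (width, height), and the box is assumed
    to be expressed relative to the resized image.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    if resized_w <= 0 or resized_h <= 0:
        raise ValueError("resized_size must be positive")
    x1, y1, x2, y2 = bbox
    # Scale x by the width ratio and y by the height ratio.
    return (
        x1 * original_w / resized_w,
        y1 * original_h / resized_h,
        x2 * original_w / resized_w,
        y2 * original_h / resized_h,
    )


if __name__ == "__main__":
    # Box covering the left half of a 756x1008 resized image,
    # mapped back to the 750x1000 original.
    print(scale_bbox_to_original((0, 0, 378, 1008), (756, 1008), (750, 1000)))
```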
