NeoBERT

Important

This is a fork of the original chandar-lab/NeoBERT, refactored to support experimentation. ⚠️ WIP / active development ⚠️



Description

NeoBERT is a next-generation encoder model for English text representation, pre-trained from scratch on the RefinedWeb dataset. NeoBERT integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

Get started

Note

See the comprehensive quickstart doc in docs/

Ensure you have the following dependencies installed:

pip install transformers torch  # Core dependencies
# For GPU optimization (build from source for latest GPUs):
# pip install flash-attn --no-build-isolation
# pip install -v --no-build-isolation git+https://github.com/facebookresearch/xformers.git@main

If you would like to use sequence packing (un-padding), you will also need to install flash-attention:

pip install flash-attn --no-build-isolation
# Optionally, xformers as well (build from source for the latest GPUs):
# pip install -v --no-build-isolation git+https://github.com/facebookresearch/xformers.git@main
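
To confirm the core libraries are importable (and whether flash-attn is available), a quick Python check can help; this is just a convenience snippet, not part of the package:

import torch
import transformers

print("transformers", transformers.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # only needed for sequence packing / un-padding
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; sequence packing will be unavailable")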

It is safest to install the core dependencies first, as shown above. Then clone this repo and install it in editable mode:

git clone https://github.com/pszemraj/NeoBERT.git
cd NeoBERT
pip install -e .

This will install the neobert package and all remaining dependencies¹.

Note

If you want to install the development dependencies (for testing, linting, etc.), use pip install -e .[dev].
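
After the editable install, you can sanity-check that the neobert package resolves to your local clone (a minimal check, assuming no further setup is needed):

import neobert

# Should point at your local clone, e.g. .../NeoBERT/src/neobert/__init__.py
print(neobert.__file__)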

How to use

Load the official model using Hugging Face Transformers:

For Text Embeddings

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # CLS token embedding
print(embedding.shape)
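
Building on the snippet above, here is a minimal illustrative sketch of comparing two sentences via cosine similarity of their CLS embeddings (the sentences are placeholders):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = [
    "NeoBERT is the most efficient model of its kind!",
    "NeoBERT is a compact, efficient encoder.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLS embeddings, unit-normalized so the dot product is cosine similarity
emb = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0, :], dim=-1)
print(float(emb[0] @ emb[1]))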

For Masked Language Modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Fill in masked tokens
text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
outputs = model(**inputs)
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
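
As a small extension of the example above, you can also inspect the top few candidates for the masked position (same variables as above; the choice of k = 5 is arbitrary):

# Top-5 candidate tokens for the masked position
top_k = outputs.logits[0, mask_token_index].topk(5, dim=-1).indices[0].tolist()
print([tokenizer.decode([tok]) for tok in top_k])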

Documentation

For detailed guides and additional documentation, see the docs/ directory.

Features

Feature                  NeoBERT
Depth-to-width           28 × 768
Parameter count          250M
Activation               SwiGLU
Positional embeddings    RoPE
Normalization            Pre-RMSNorm
Data Source              RefinedWeb
Data Size                2.8 TB
Tokenizer                google/bert
Context length           4,096
MLM Masking Rate         20%
Optimizer                AdamW
Scheduler                CosineDecay
Training Tokens          2.1 T
Efficiency               FlashAttention
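
To sanity-check a few of these numbers against the released checkpoint (parameter count, plus whatever architecture fields the remote code exposes on the config), a short sketch:

from transformers import AutoModel

model = AutoModel.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected ~250M per the table above
print(model.config)  # depth/width, context length, etc. (exact field names depend on the remote code)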

License

Model weights and code repository are licensed under the permissive MIT license.

Citation

If you use this model in your research, please cite:

@misc{breton2025neobertnextgenerationbert,
      title={NeoBERT: A Next-Generation BERT},
      author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
      year={2025},
      eprint={2502.19587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19587},
}

Training and Development

This repository includes the complete training and evaluation codebase for NeoBERT, featuring:

Configuration System

  • Hierarchical YAML configs with command-line overrides
  • Task-specific configurations for pretraining, GLUE, contrastive learning, and MTEB evaluation
  • CPU-friendly test configs for development and validation
# Basic training with config file
python scripts/pretraining/pretrain.py --config configs/pretrain_neobert.yaml

# Override specific parameters
python scripts/pretraining/pretrain.py \
    --config configs/pretrain_neobert.yaml \
    --trainer.per_device_train_batch_size 32 \
    --optimizer.lr 2e-4

Repository Structure

  • configs/ - YAML configuration files (README)
  • scripts/ - Training and evaluation scripts (README)
  • jobs/ - Shell scripts for running experiments (README)
  • tests/ - Comprehensive test suite (README)
  • src/neobert/ - Core model and training code

Quick Start for Training

  1. Install dependencies:

    pip install -e .
  2. Run tests to validate setup:

    python tests/run_tests.py
  3. Start with a small test run:

    python scripts/pretraining/pretrain.py --config tests/configs/pretraining/test_tiny_pretrain.yaml
  4. Scale up to full training:

    python scripts/pretraining/pretrain.py --config configs/pretrain_neobert.yaml

Testing

The repository includes a comprehensive test suite that verifies:

  • Configuration system functionality
  • Model architecture and forward passes
  • Training pipeline integration
  • CPU-only compatibility (no GPU required for tests)

Exporting Models to HuggingFace

Use the export workflow documented in /docs/export.md for checkpoint conversion and validation. Script-level usage lives in /scripts/export-hf/README.md.


Footnotes

  1. Technically, this command installs everything, but package order/resolution is not guaranteed, so it is better to install the core dependencies first.
