Results on BRIGHT not matching #3268

Description

@Samoed

I ran the ReasonIR-8B model on the BRIGHT benchmark using the following code:

```python
import torch
import mteb

prompts_dict = {
    "BrightRetrieval": "Given a Post, retrieve relevant passages that help answer the post",
}

tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="results",
    encode_kwargs={"batch_size": 1},
)
```

The results are as follows:

| Model | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReasonIR | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
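
For reference, the per-subset nDCG@10 can be collected from the JSON files that `evaluation.run` writes under `output_folder`. The snippet below is a minimal sketch that assumes the usual MTEB results layout (a `BrightRetrieval.json` per model revision, with a `scores` mapping of split name to per-subset entries carrying `hf_subset` and `ndcg_at_10`); the exact file layout and field names may differ between mteb versions.

```python
import json
from pathlib import Path

# Sketch: collect per-subset nDCG@10 from the results folder written by
# evaluation.run(..., output_folder="results"). The "scores" / "hf_subset" /
# "ndcg_at_10" fields follow the usual MTEB results format and may differ
# between mteb versions.
for path in Path("results").rglob("BrightRetrieval.json"):
    data = json.loads(path.read_text())
    per_subset = {}
    for split_scores in data["scores"].values():
        for entry in split_scores:
            # MTEB stores scores in [0, 1]; scale to match the table above.
            per_subset[entry["hf_subset"]] = entry["ndcg_at_10"] * 100
    avg = sum(per_subset.values()) / len(per_subset)
    print(path)
    for subset, score in sorted(per_subset.items()):
        print(f"  {subset}: {score:.2f}")
    print(f"  average: {avg:.2f}")
```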

For comparison, the results reported in the ReasonIR paper:

[image: BRIGHT results table from the ReasonIR paper]

Originally posted by @whybe-choi in #3221 (comment)

A possible solution would be to create a separate task per subset.
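
As a rough illustration of that idea only (not a supported mteb API): a single-subset task could be derived from the existing BrightRetrieval task by narrowing its metadata. The subset key `"biology"`, the `model_copy` call, and the assumption that restricting `eval_langs` is enough to restrict evaluation are all unverified guesses about mteb internals; a proper fix would define such tasks inside mteb itself.

```python
import mteb

# Hypothetical sketch of "one task per subset": derive a biology-only task
# from the existing BrightRetrieval task by restricting its eval_langs.
# Whether narrowing eval_langs alone limits evaluation to that subset is an
# assumption about mteb internals and may not hold across versions.
BrightRetrieval = type(mteb.get_tasks(tasks=["BrightRetrieval"])[0])

class BrightBiologyRetrieval(BrightRetrieval):
    metadata = BrightRetrieval.metadata.model_copy(
        update={
            "name": "BrightBiologyRetrieval",
            "eval_langs": {"biology": ["eng-Latn"]},
        }
    )

evaluation = mteb.MTEB(tasks=[BrightBiologyRetrieval()])
```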

    Labels

    bug (Something isn't working), repro (question and issues related to reproducibility)
