Results on BRIGHT not matching #3268

Description

@Samoed

I ran the ReasonIR-8B model on the BRIGHT benchmark using the following code:

```python
import torch
import mteb

prompts_dict = {
    "BrightRetrieval": "Given a Post, retrieve relevant passages that help answer the post",
}

tasks = mteb.get_tasks(tasks=["BrightRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="results",
    encode_kwargs={"batch_size": 1},
)
```

The results are as follows:

| Model | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReasonIR | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
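
For reference, the per-subset nDCG@10 can be collected from the JSON files that `evaluation.run` writes under `output_folder`. The snippet below is a minimal sketch that assumes the usual MTEB results layout (a `BrightRetrieval.json` per model revision, with a `scores` mapping of split name to per-subset entries carrying `hf_subset` and `ndcg_at_10`); the exact file layout and field names may differ between mteb versions.

```python
import json
from pathlib import Path

# Sketch: collect per-subset nDCG@10 from the results folder written by
# evaluation.run(..., output_folder="results"). The "scores" / "hf_subset" /
# "ndcg_at_10" fields follow the usual MTEB results format and may differ
# between mteb versions.
for path in Path("results").rglob("BrightRetrieval.json"):
    data = json.loads(path.read_text())
    per_subset = {}
    for split_scores in data["scores"].values():
        for entry in split_scores:
            # MTEB stores scores in [0, 1]; scale to match the table above.
            per_subset[entry["hf_subset"]] = entry["ndcg_at_10"] * 100
    avg = sum(per_subset.values()) / len(per_subset)
    print(path)
    for subset, score in sorted(per_subset.items()):
        print(f"  {subset}: {score:.2f}")
    print(f"  average: {avg:.2f}")
```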

For comparison, the results reported in the ReasonIR paper:

[image: BRIGHT results table from the ReasonIR paper]

Originally posted by @whybe-choi in #3221 (comment)

A possible solution would be to create a separate task per subset.
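
As a rough illustration of that idea only (not a supported mteb API): a single-subset task could be derived from the existing BrightRetrieval task by narrowing its metadata. The subset key `"biology"`, the `model_copy` call, and the assumption that restricting `eval_langs` is enough to restrict evaluation are all unverified guesses about mteb internals; a proper fix would define such tasks inside mteb itself.

```python
import mteb

# Hypothetical sketch of "one task per subset": derive a biology-only task
# from the existing BrightRetrieval task by restricting its eval_langs.
# Whether narrowing eval_langs alone limits evaluation to that subset is an
# assumption about mteb internals and may not hold across versions.
BrightRetrieval = type(mteb.get_tasks(tasks=["BrightRetrieval"])[0])

class BrightBiologyRetrieval(BrightRetrieval):
    metadata = BrightRetrieval.metadata.model_copy(
        update={
            "name": "BrightBiologyRetrieval",
            "eval_langs": {"biology": ["eng-Latn"]},
        }
    )

evaluation = mteb.MTEB(tasks=[BrightBiologyRetrieval()])
```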

    Labels

    bug (Something isn't working), repro (question and issues related to reproducibility)
