Reminder
- I have read the above rules and searched the existing issues.
System Info
(llama-factory) root@ZOE-XJY:/home/zysoft/Leo/LLaMA-Factory# llamafactory-cli env
[2025-08-11 11:48:44,275] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-11 11:48:45,286] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-6.6.87.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.12.9
- PyTorch version: 2.7.0+cu128 (GPU)
- Transformers version: 4.52.4
- Datasets version: 3.4.1
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 5060 Ti
- GPU number: 1
- GPU memory: 15.93GB
- DeepSpeed version: 0.17.2
- Bitsandbytes version: 0.46.0
Reproduction
Error details:
0%| | 0/6 [00:00<?, ?it/s]
File "/root/anaconda3/envs/llama-factory/bin/llamafactory-cli", line 8, in
sys.exit(main())
^^^^^^
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/cli.py", line 151, in main
COMMAND_MAP[command]()
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 110, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 74, in _training_function
run_rm(model_args, data_args, training_args, finetuning_args, callbacks)
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/train/rm/workflow.py", line 65, in run_rm
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/trainer.py", line 2240, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/trainer.py", line 2509, in _inner_training_loop
batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, args.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/trainer.py", line 5263, in get_batch_samples
batch_samples.append(next(epoch_iterator))
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/accelerate/data_loader.py", line 566, in iter
current_batch = next(dataloader_iter)
^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 733, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
return self._process_data(data, worker_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
data.reraise()
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/_utils.py", line 750, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/llama-factory/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/data/collator.py", line 277, in call
return super().call(concatenated_features)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zysoft/Leo/LLaMA-Factory-main/src/llamafactory/data/collator.py", line 216, in call
raise ValueError("Qwen2-VL/Qwen2.5-Omni model requires 3D position ids for mrope.")
ValueError: Qwen2-VL/Qwen2.5-Omni model requires 3D position ids for mrope.
0%| | 0/6 [00:02<?, ?it/s]
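For readers hitting the same wall: the final ValueError is raised by LLaMA-Factory's multimodal collator when the batch's position_ids are not in the mrope layout. Qwen2-VL/Qwen2.5-Omni's mrope assigns every token three position indices (temporal, height, width), so the collated position_ids tensor must be 3-D. Below is a minimal sketch of that kind of guard; it is illustrative only, not the actual collator source, and the function name is made up:

```python
import torch

def validate_mrope_position_ids(position_ids: torch.Tensor) -> None:
    # mrope gives every token three position indices (temporal, height,
    # width), so a collated batch should be (3, batch_size, seq_len), i.e. 3-D.
    if position_ids.dim() != 3:
        raise ValueError("Qwen2-VL/Qwen2.5-Omni model requires 3D position ids for mrope.")

mrope_ids = torch.zeros(3, 2, 8, dtype=torch.long)   # well-formed mrope batch
validate_mrope_position_ids(mrope_ids)               # passes silently
plain_ids = torch.zeros(2, 8, dtype=torch.long)      # ordinary (batch, seq_len) ids
# validate_mrope_position_ids(plain_ids)             # would raise the error above
```

If that reading is correct, the rm-stage pairwise collator is being handed features whose position_ids were never built in the mrope layout for the qwen2_vl template, which seems worth checking against the dataset preprocessing step.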
### eval
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
Others
Config file:
### model
model_name_or_path: output/0804_zoe_UItars_train_0804_2102
image_max_pixels: 1053696
image_min_pixels: 200704
# min_pixels = 256 * 28 * 28  # 200,704
# max_pixels = 1280 * 28 * 28  # 1,003,520
# 2116800
# (these budgets are sanity-checked in the sketch after this config)
trust_remote_code: true
flash_attn: fa2
enable_liger_kernel: true
use_unsloth_gc: false
use_unsloth: false
### method
stage: rm
do_train: true
finetuning_type: lora
lora_rank: 16
lora_target: all
freeze_vision_tower: false
### dataset
dataset: 0811_UItars_button_test_1
template: qwen2_vl
cutoff_len: 8192
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 4
media_dir: data/mix_0808_zoe_noise
### output
output_dir: saves/0811_UItars_button_test_1
logging_steps: 10
save_steps: 500
save_total_limit: 0
save_strategy: steps
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: swanlab # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 6.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
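One aside on the pixel comments in the model section above: Qwen2-VL-family pixel budgets are conventionally written as n * 28 * 28 (14 px patches with 2x2 merging give a 28 px grid), so each quoted number should divide evenly by 784. A quick check, with the values copied from this config (the script itself is just arithmetic):

```python
# Verify that each pixel budget quoted above is a multiple of 28 * 28 = 784.
PATCH_AREA = 28 * 28

budgets = {
    "image_min_pixels": 200_704,    # config value; 256 * 28 * 28
    "image_max_pixels": 1_053_696,  # config value; 1344 * 28 * 28
    "max_pixels": 1_003_520,        # commented value; 1280 * 28 * 28
    "stray comment": 2_116_800,     # commented value; 2700 * 28 * 28
}
for name, value in budgets.items():
    n, rem = divmod(value, PATCH_AREA)
    print(f"{name}: {value:,} = {n} * 28 * 28" + (f" + {rem}" if rem else ""))
```

For completeness, a config like this is typically launched with `llamafactory-cli train <config>.yaml`, which matches the `llamafactory-cli` entry point visible at the top of the traceback.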