This repository contains the PyTorch implementations for the paper:
- Speech enhancement based on cascaded two flows [1]
Presentation video in English, presentation video in Korean, post about the work, paper link
This repository builds upon previous great works:
- [FlowSE] https://github.com/seongq/flowmse
- [SGMSE] https://github.com/sp-uhh/sgmse
- [SGMSE-CRP] https://github.com/sp-uhh/sgmse_crp
- [BBED] https://github.com/sp-uhh/sgmse-bbed
- [StoRM] https://github.com/sp-uhh/storm
- Create a new virtual environment with Python 3.10 (we have not tested other Python versions, but they may work).
- Install the package dependencies via
pip install -r requirements.txt
- W&B is required.
Training is done by executing train.py. A minimal running example with default settings (as in our paper [1]) can be run with
python train.py --base_dir <your_dataset_dir>
where your_dataset_dir should be a directory containing the subdirectories train/ and valid/ (optionally test/ as well).
Trained models are saved in a directory named "logs".
Each subdirectory must itself have two subdirectories clean/ and noisy/, with the same filenames present in both. We currently only support training with .wav files.
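The layout described above can be checked before a training run with a small script. This is a minimal sketch, not part of the repository: the function name check_dataset_layout and its messages are hypothetical, and it only inspects filenames, not audio content.

```python
from pathlib import Path


def check_dataset_layout(base_dir, splits=("train", "valid")):
    """Return a list of problems found under base_dir (hypothetical helper).

    An empty list means every split has clean/ and noisy/ subdirectories
    that hold the same set of .wav filenames.
    """
    problems = []
    base = Path(base_dir)
    for split in splits:
        clean_dir = base / split / "clean"
        noisy_dir = base / split / "noisy"

        # Both subdirectories must exist before filenames can be compared.
        missing = [d for d in (clean_dir, noisy_dir) if not d.is_dir()]
        if missing:
            problems.extend(f"missing directory: {d}" for d in missing)
            continue

        clean = {p.name for p in clean_dir.glob("*.wav")}
        noisy = {p.name for p in noisy_dir.glob("*.wav")}
        if not clean:
            problems.append(f"no .wav files in {clean_dir}")

        # Every clean file needs a noisy counterpart with the same name,
        # and vice versa.
        for name in sorted(clean - noisy):
            problems.append(f"{split}: {name} is in clean/ but not in noisy/")
        for name in sorted(noisy - clean):
            problems.append(f"{split}: {name} is in noisy/ but not in clean/")
    return problems
```

Calling check_dataset_layout("<your_dataset_dir>") and printing the returned list should surface missing folders or unpaired files; passing splits=("train", "valid", "test") also covers the optional test/ directory.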
To obtain the WSJ0+CHiME3 (H), WSJ0+CHiME3 (L) and WSJ0+Reverb training sets, please refer to https://github.com/sp-uhh/sgmse and https://github.com/sp-uhh/storm.
To see all available training options, run python train.py --help.
We provide pretrained checkpoints for the models trained on WSJ0+CHiME3 (H), WSJ0+CHiME3 (L), WSJ0+Reverb and Voicebank/DEMAND (VB-DMD). All checkpoints can be downloaded here.
To evaluate on a test set, run
python evaluate_cascading.py --test_dir <your_test_dataset_dir> --folder_destination <your_enh_result_save_dir> --ckpt <path_to_model_checkpoint> --N_second <num_of_time_steps_for_the_second_flow>
N_second is the number of evaluations (time steps) used in the numerical integration of the second flow. For the first flow, we set the number of evaluations to 1.
your_test_dataset_dir should contain a subfolder test/, which in turn contains the subdirectories clean/ and noisy/. Both clean/ and noisy/ should contain .wav files.
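To illustrate what a step count such as N_second controls, the sketch below integrates a toy ODE with a fixed number of uniform Euler steps and counts how often the velocity function is evaluated. It rests on assumptions: the README does not state which solver evaluate_cascading.py uses, the Euler scheme is chosen only for simplicity, and CountingVelocity is a hypothetical stand-in for the trained network.

```python
def euler_integrate(velocity, x0, n_steps, t_start=0.0, t_end=1.0):
    """Integrate dx/dt = velocity(x, t) with n_steps uniform Euler steps."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    dt = (t_end - t_start) / n_steps
    x, t = x0, t_start
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)  # exactly one velocity evaluation per step
        t += dt
    return x


class CountingVelocity:
    """Toy velocity field dx/dt = -x that records how often it is called.

    Hypothetical stand-in for a learned model; the exact solution of this
    ODE at t = 1 is x0 * exp(-1).
    """

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return -x


for n in (1, 5, 50):
    v = CountingVelocity()
    x1 = euler_integrate(v, 1.0, n)
    print(f"n_steps={n:3d}  evaluations={v.calls:3d}  x(1)={x1:.4f}")
```

In this toy setting, more steps cost proportionally more evaluations and bring the result closer to the exact solution, which is the trade-off a step count like N_second typically exposes.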
[1] Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim and Jong Won Shin, “Speech Enhancement based on cascaded two flows,” in Proceedings of Interspeech, Aug. 2025.
