A powerful tool for generating high-quality text-to-speech (TTS) datasets from audio or video files. This utility automates the process of segmenting media files into optimal chunks and transcribing them accurately, saving hours of manual work.
This tool helps you:
- Segment audio or video files into optimal chunks for TTS training
- Transcribe those segments automatically using OpenAI's Whisper model
- Generate a properly formatted dataset ready for TTS model training
Perfect for creating custom voice datasets, podcast transcription, lecture processing, and more.
- Flexible audio segmentation based on natural speech boundaries
- High-quality transcription using OpenAI's Whisper models
- Handles both video and audio files as input
- Configurable parameters for optimal segmentation
- GPU acceleration support for faster processing
- Multi-language support for transcription
- Python 3.7+
- ffmpeg (required for audio processing)
Requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
# Clone this repository
git clone https://github.com/gokhaneraslan/tts-dataset-generator.git
cd tts-dataset-generator
# Install dependencies
pip install -r requirements.txt
The main dependencies include:
- pydub (for audio processing)
- openai-whisper (for transcription)
- torch (for GPU acceleration, optional)
- moviepy (for video handling)
- natsort (for natural sorting, optional)
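If you want GPU acceleration, it is worth confirming that PyTorch actually sees a CUDA device before starting long jobs. A minimal check, assuming only that torch is installed:

```python
# Confirm that PyTorch can see a CUDA-capable GPU (optional, for acceleration).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```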
To choose the best model for your system, see the OpenAI Whisper repository for the list of available models and their required VRAM.
If your hardware cannot run the "large" Whisper model, we recommend using Google Colab.
python main.py --file your_audio_or_video_file.mp4 --model large --language tr
# Or
python main.py -f your_audio_or_video_file.mp4 -m small -l tr
This will:
- Process your file and segment it into chunks
- Save the audio segments to MyTTSDataset/wavs/
- Transcribe each segment
- Generate a metadata file at MyTTSDataset/metadata.csv
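After a run finishes, a quick sanity check of the output can save debugging time later. A small sketch, assuming the default MyTTSDataset/ layout described above and the optional natsort dependency:

```python
# Quick post-run check, assuming the default MyTTSDataset/ output layout.
from pathlib import Path
from natsort import natsorted  # optional dependency; sorts segment_2 before segment_10

dataset = Path("MyTTSDataset")
wavs = natsorted(dataset.glob("wavs/*.wav"), key=str)
print(f"{len(wavs)} audio segments in {dataset / 'wavs'}")
print("metadata.csv present:", (dataset / "metadata.csv").exists())
```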
python main.py \
  --file your_file.mp4 \
  --min-duration 3.0 \
  --max-duration 8.0 \
  --silence-threshold -35 \
  --min-silence-len 500 \
  --keep-silence 100 \
  --model medium \
  --language tr \
  --ljspeech True \
  --sample_rate 22050
| Parameter | Description | Default | 
|---|---|---|
| --file or -f | Input audio or video file path | (Required) | 
| --min-duration | Minimum segment duration in seconds | 3.0 | 
| --max-duration | Maximum segment duration in seconds | 10.0 | 
| --silence-threshold | Audio level (dBFS) below which audio is considered silence | -40 | 
| --min-silence-len | Minimum silence duration (ms) to mark a split point | 250 | 
| --keep-silence | Padding silence (ms) to keep at segment boundaries | 150 | 
| --model or -m | Whisper model size (tiny/base/small/medium/large) | large | 
| --language or -l | Language code for transcription and number conversion | en | 
| --ljspeech | Output metadata in the LJSpeech format used by the coqui-ai/TTS ljspeech formatter | True | 
| --sample_rate | Sample rate (Hz); must match the sample rate of the audio in the dataset | 22050 | 
| --log-level | Set logging level (DEBUG/INFO/WARNING/ERROR/CRITICAL) | INFO | 
The tool supports multiple languages through the Whisper model. Some commonly used language codes:
- `tr` - Turkish
- `en` - English
- `fr` - French
- `de` - German
- `es` - Spanish
- `it` - Italian
- `ja` - Japanese
- `zh` - Chinese
- `ru` - Russian
- `pt` - Portuguese
Run the tool to see a complete list of supported language codes.
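If you prefer to inspect the codes Whisper itself understands, recent versions of the openai-whisper package expose their language table; a minimal sketch (assuming the package is installed):

```python
# Print the language codes known to openai-whisper (e.g. tr -> turkish).
from whisper.tokenizer import LANGUAGES

for code, name in sorted(LANGUAGES.items()):
    print(f"{code}\t{name}")
```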
The tool uses a sophisticated audio segmentation approach optimized for TTS dataset creation:
- Silence Detection:
  - Uses pydub's `split_on_silence` function, which analyzes the audio waveform to identify natural pauses in speech
  - The `silence_thresh_dbfs` parameter (-40 dBFS by default) determines what volume level is considered silence
  - The `min_silence_len_ms` parameter (250 ms by default) sets how long a silence must be to trigger a split
- Duration Constraints:
  - Filters segments to ensure they're between `min_duration_s` and `max_duration_s` seconds
  - This is crucial for TTS training, as too-short segments lack context while too-long segments can cause training instability
- Natural Boundaries:
  - Preserves a small amount of silence (`keep_silence_ms`) at the beginning/end of each segment
  - This creates natural-sounding pauses that improve TTS model prosody
- Padding:
  - Adds consistent silence padding (250 ms) to the end of each segment
  - This gives TTS models consistent end-of-utterance patterns to learn from
For optimal results, tune these parameters (a code sketch follows this list):
- For clean studio recordings, use a lower (more negative) `silence_thresh_dbfs`, around -45 to -50
- For recordings with background noise, use a higher (less negative) value, around -30 to -35
- For fast speech, reduce `min_silence_len_ms` to catch shorter pauses
- For slow, deliberate speech, increase `min_silence_len_ms`
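Here is a simplified sketch of how these parameters map onto pydub's `split_on_silence`, using the defaults from the table above; it is an illustration, not the tool's exact implementation:

```python
# Simplified segmentation sketch with pydub, mirroring the default parameters.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("your_audio.wav")

chunks = split_on_silence(
    audio,
    min_silence_len=250,   # --min-silence-len (ms)
    silence_thresh=-40,    # --silence-threshold (dBFS)
    keep_silence=150,      # --keep-silence (ms)
)

min_ms, max_ms = 3.0 * 1000, 10.0 * 1000     # --min-duration / --max-duration
padding = AudioSegment.silent(duration=250)  # consistent end-of-utterance padding

out_dir = Path("MyTTSDataset/wavs")
out_dir.mkdir(parents=True, exist_ok=True)

# Keep only segments within the duration constraints, then pad and export.
kept = [c + padding for c in chunks if min_ms <= len(c) <= max_ms]
for i, segment in enumerate(kept, start=1):
    segment.export(str(out_dir / f"segment_{i}.wav"), format="wav")
```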
The transcription uses OpenAI's Whisper, a state-of-the-art speech recognition model:
- Supports multiple languages and accents
- Performs well even with background noise and varying audio quality
- Different model sizes balance accuracy vs. speed/resource requirements
- Automatically handles punctuation and capitalization
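At its core, transcribing one segment with openai-whisper looks roughly like this (a minimal sketch, not the tool's exact code):

```python
# Minimal openai-whisper usage: load the model once, transcribe each segment.
import whisper

model = whisper.load_model("medium")  # tiny / base / small / medium / large

result = model.transcribe("MyTTSDataset/wavs/segment_1.wav", language="tr")
print(result["text"].strip())
```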
This tool generates datasets compatible with most modern TTS frameworks. Here's how it fits into typical TTS training workflows:
- Input: Raw audio/video recordings of a voice talent
- Process: Segmentation → Transcription → Formatting
- Output: Segmented audio files + aligned transcripts
- Load the generated dataset into your TTS system
- Most systems require:
- Text normalization (lowercase, num2words, basic cleaning; see the sketch after this list)
- Audio feature extraction (mel spectrograms, etc.)
- Text tokenization
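As a concrete example of the normalization step, digit expansion can be handled with the num2words package mentioned above; a small sketch (the `normalize` helper is illustrative, not part of this tool):

```python
# Example normalization: lowercase the text and expand digit sequences.
import re
from num2words import num2words

def normalize(text: str, lang: str = "en") -> str:
    text = text.lower()
    # Replace each digit run with its spoken form, e.g. "42" -> "forty-two".
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(normalize("Chapter 3 begins on page 42."))
# -> chapter three begins on page forty-two.
```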
 
The generated dataset works seamlessly with:
- Traditional TTS Models:
  - Tacotron 2
  - FastSpeech/FastSpeech 2
  - Tacotron-DDC
  - Transformer-TTS
- Modern Neural TTS Frameworks:
  - VITS/VITS2: Fully compatible with VITS training requirements
  - XTTS/Tortoise: The output format is ready for fine-tuning
  - Coqui TTS: The dataset works directly with Coqui's training scripts
  - ESPnet-TTS: Compatible with minimal preprocessing
  - Mozilla TTS: Ready for training without additional formatting
The generated metadata.csv follows the LJSpeech format widely used in TTS training:
With `--ljspeech True`:
segment_1|The quick brown fox jumps over the lazy dog.|The quick brown fox jumps over the lazy dog.
segment_2|She sells seashells by the seashore.|She sells seashells by the seashore.
With `--ljspeech False`:
wavs/segment_1.wav|The quick brown fox jumps over the lazy dog.
wavs/segment_2.wav|She sells seashells by the seashore.
This format is supported by most TTS frameworks out-of-the-box or with minimal adaptation.
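If you are wiring the dataset into a custom pipeline rather than one of the frameworks below, the metadata is simple to parse by hand. A minimal sketch that handles both output formats, assuming the default MyTTSDataset layout:

```python
# Parse metadata.csv into (wav_path, text) pairs for either output format.
from pathlib import Path

dataset = Path("MyTTSDataset")
samples = []
with open(dataset / "metadata.csv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        if len(parts) == 3:        # --ljspeech True: id|text|normalized_text
            wav_path = dataset / "wavs" / f"{parts[0]}.wav"
        else:                      # --ljspeech False: wavs/segment_N.wav|text
            wav_path = dataset / parts[0]
        samples.append((wav_path, parts[1]))

print(f"Loaded {len(samples)} samples")
print(samples[0])
```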
For Coqui TTS:
# Train a Tacotron2 model using our generated dataset
tts train --config_path config.json --coqpit.datasets.0.path ./MyTTSDataset
For VITS:
# Modify VITS config to point to our dataset
sed -i 's|"training_files":.*|"training_files":"./MyTTSDataset/metadata.csv",|g' configs/vits.json
# Start training
python train.py -c configs/vits.json -m vits
Let's walk through a complete example:
- Prepare a video file of someone speaking clearly
- Run the tool to process the file:
python main.py --file interview.mp4 --model medium --language en
- Examine the output:
  - Audio segments are saved in MyTTSDataset/wavs/
  - Transcriptions are in MyTTSDataset/metadata.csv
- Format of metadata.csv with `--ljspeech True`: `wav_filename|text|normalized_text` (the normalized text may simply repeat the text)
segment_1|Hello and welcome to our tutorial on text to speech.|Hello and welcome to our tutorial on text to speech.
segment_2|Today we'll learn how to create high quality voice datasets.|Today we'll learn how to create high quality voice datasets.
segment_3|The first step is to record clear audio samples.|The first step is to record clear audio samples.
- Format of metadata.csv with `--ljspeech False`: `wav_filename|text`
wavs/segment_1.wav|Hello and welcome to our tutorial on text to speech.
wavs/segment_2.wav|Today we'll learn how to create high quality voice datasets.
wavs/segment_3.wav|The first step is to record clear audio samples.
- Use the dataset to train TTS models such as Tacotron, VITS, F5-TTS, Piper, or other custom voice systems
- For faster transcription, use a machine with a CUDA-compatible GPU
- For better transcription accuracy, use the `large` model (requires more RAM/VRAM)
- For faster processing but less accuracy, use the `small` or `base` models
- Adjust the `--silence-threshold` parameter if your audio has background noise
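Putting the model-size and GPU tips together, a hedged sketch of how you might pick a Whisper model and device up front (illustrative, not the tool's internal logic):

```python
# Pick a Whisper model size and device based on the available hardware.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model_size = "large" if device == "cuda" else "small"  # trade accuracy for speed on CPU

model = whisper.load_model(model_size, device=device)
print(f"Loaded Whisper '{model_size}' on {device}")
```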
For optimal TTS training results:
- Recording Quality:
  - Use a high-quality microphone in a quiet environment
  - Maintain consistent audio levels and microphone distance
  - A sample rate of 22050 Hz or 44100 Hz is recommended
- Speech Characteristics:
  - Clear, consistent speaking style
  - Natural but not overly dramatic prosody
  - Consistent speaking pace
- Dataset Composition:
  - Aim for 30+ minutes of clean audio for basic voice cloning
  - 1-2 hours for high-quality results
  - 5+ hours for production-grade TTS systems
  - Include diverse phonetic coverage (pangrams, digit sequences, etc.)
- OpenAI Whisper for the transcription model
- pydub for audio processing
Contributions are welcome! Please feel free to submit a Pull Request.