Utilities and documentation for creating content for the Active Inference Journal: https://github.com/ActiveInferenceInstitute/ActiveInferenceJournal
This repository provides a local transcription pipeline using WhisperX with SurrealDB for storage and retrieval.
- Install uv, a fast Python package installer:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Ensure CUDA 12.8 is installed for GPU support (optional but recommended)
# Clone the repository
git clone https://github.com/ActiveInferenceInstitute/Journal-Utilities.git
cd Journal-Utilities
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install all dependencies including PyTorch with CUDA support
uv pip install -e .
# For CUDA 12.8 support (required for GPU acceleration)
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Install cuDNN 8 (required for pyannote speaker embeddings)
sudo apt install libcudnn8 libcudnn8-dev -y
sudo ldconfig
# For development
uv pip install -e ".[dev]"
Note: After installation, you'll need to apply compatibility patches to WhisperX for pyannote.audio 4.0+. Run:
python scripts/patch_whisperx.py
Download a static FFmpeg build (from the yt-dlp FFmpeg-Builds releases) and extract it:
wget -O - -q https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc | tar -x
- Generate a Hugging Face token and accept the user agreements for the following models:
  - Segmentation
  - Speaker-Diarization-3.1
  - Speaker-Diarization-Community-1 (for speaker embeddings)
- Get a YouTube Data API v3 key from https://console.developers.google.com/apis/
- Get your Coda API token: go to https://coda.io/account, scroll to "API settings," and generate an API token.
- Configure environment variables:
cp .env.sample .env
Update the following values in .env:
- HUGGINGFACE_TOKEN: Your Hugging Face token
- API_KEY: Your YouTube Data API v3 key
- WAV_DIRECTORY: Directory for WAV file storage
- OUTPUT_DIR: Output directory for processed files
- JOURNAL_REPO_DIR: Path to the Active Inference Journal repository
- CODA_API_TOKEN: Your Coda API token (for fetching session data)
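For reference, a minimal sketch of loading and validating these values, assuming the python-dotenv package; the pipeline's own loading code may differ:

```python
# Load .env and fail fast on missing required keys (illustrative)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

REQUIRED = ["HUGGINGFACE_TOKEN", "API_KEY", "WAV_DIRECTORY",
            "OUTPUT_DIR", "JOURNAL_REPO_DIR", "CODA_API_TOKEN"]
missing = [key for key in REQUIRED if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing required .env values: {', '.join(missing)}")
```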
The typical workflow consists of these steps:
# 1. Start the database
make db-start
# 2. Fetch latest data from Coda API
make fetch-coda
# 3. Import sessions into SurrealDB (with audit trail)
make import-sessions
# 4. Fetch metadata from YouTube API
make fetch-metadata
# 5. Run WhisperX transcription
make transcribe
# 6. Copy processed files to journal repository
make copy-to-journal
make fetch-coda
Downloads the latest session data from the Coda API. The JSON file can be formatted in VS Code with Format Document for better readability.
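For orientation, a hedged sketch of the underlying Coda API call using requests. The doc and table IDs below are placeholders, and the output filename is an assumption; make fetch-coda handles the real IDs, pagination, and file naming for you:

```python
# Illustrative Coda API fetch; DOC_ID and TABLE_ID are placeholders
import json
import os

import requests

CODA_API_TOKEN = os.environ["CODA_API_TOKEN"]
DOC_ID = "your-doc-id"      # placeholder
TABLE_ID = "your-table-id"  # placeholder

resp = requests.get(
    f"https://coda.io/apis/v1/docs/{DOC_ID}/tables/{TABLE_ID}/rows",
    headers={"Authorization": f"Bearer {CODA_API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
with open("data/input/coda_sessions.json", "w") as f:  # filename is illustrative
    json.dump(resp.json(), f, indent=2)
```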
make import-sessions
# Or with custom JSON file:
python src/ingest_db_create_wav.py --step import --json /path/to/file.json
Imports sessions with full audit-trail tracking; rollback functions are available if an import needs to be undone.
make fetch-metadata
Any "private video" failures should be added to src/private_videos.json
to skip youtube metadata fetching and transcription.
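Assuming private_videos.json holds a JSON list of video IDs (check the file for the actual schema), the skip logic amounts to:

```python
# Illustrative skip check; assumes private_videos.json is a JSON list of IDs
import json

with open("src/private_videos.json") as f:
    private_ids = set(json.load(f))

def should_process(video_id: str) -> bool:
    """Return False for videos known to be private (no metadata or transcript)."""
    return video_id not in private_ids
```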
make transcribe
This script (a condensed sketch follows the list):
- Loads WAV files from the database
- Performs transcription using WhisperX
- Applies speaker diarization and alignment
- Stores results back in SurrealDB
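For orientation, here is a condensed sketch of the WhisperX calls involved, following the upstream WhisperX README; src/transcribe.py layers database I/O, batching, and error handling on top of this:

```python
# Condensed WhisperX flow: transcribe -> align -> diarize (illustrative)
import os

import whisperx

device = "cuda"
audio = whisperx.load_audio("episode.wav")  # placeholder filename

# 1. Transcribe with a batched Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align words to precise timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels (needs the Hugging Face token).
#    In newer WhisperX releases this class lives at whisperx.diarize.DiarizationPipeline.
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=os.environ["HUGGINGFACE_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```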
make copy-to-journal
Organizes transcripts by category/series/episode in the journal repository.
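Conceptually, each transcript's destination is built from session metadata; a minimal sketch (the field names here are illustrative, not the script's actual variables):

```python
# Illustrative category/series/episode layout under the journal repository
import os
from pathlib import Path

journal_repo = Path(os.environ["JOURNAL_REPO_DIR"])

def transcript_destination(category: str, series: str, episode: str) -> Path:
    """Build the directory a transcript is copied into,
    i.e. <JOURNAL_REPO_DIR>/<category>/<series>/<episode>."""
    return journal_repo / category / series / episode
```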
Connect to the database with the SurrealDB CLI:
surreal sql --endpoint http://localhost:8080 --username root --password root --namespace actinf --database actinf
Example queries:
-- View all sessions
SELECT * FROM session;
-- View transcribed sessions
SELECT * FROM session WHERE transcribed = true;
-- View sessions pending transcription
SELECT * FROM session WHERE transcribed = false AND is_private != true;
-- View specific session by name
SELECT * FROM session WHERE session_name = 'video_id';
-- View import audit trail
SELECT * FROM import_audit ORDER BY timestamp DESC LIMIT 10;
-- View recent import summary
SELECT * FROM import_audit WHERE operation = 'import_summary' ORDER BY timestamp DESC;
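You can also run these queries from Python, assuming the official surrealdb SDK (connection and method names vary between SDK versions, so treat this as a sketch):

```python
# Query sessions pending transcription via the surrealdb Python SDK (sketch)
from surrealdb import Surreal

with Surreal("ws://localhost:8080/rpc") as db:
    db.signin({"username": "root", "password": "root"})
    db.use("actinf", "actinf")
    pending = db.query(
        "SELECT * FROM session WHERE transcribed = false AND is_private != true;"
    )
    print(pending)
```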
# Upgrade SurrealDB
sudo surreal upgrade
# Fix database after upgrade
surreal fix rocksdb://database
Run unit tests:
python -m unittest tests.test_output_final_artifacts
python -m unittest tests.test_transcript
Journal-Utilities/
├── src/                          # Main transcription pipeline
│   ├── ingest_db_create_wav.py   # Multi-step ingestion with CLI
│   ├── transcribe.py             # WhisperX transcription
│   ├── output_final_artifacts.py # Process final outputs
│   └── private_videos.json       # List of private video IDs
├── tests/                        # Unit tests
├── data/                         # Database and output files
│   ├── database/                 # SurrealDB storage
│   ├── input/                    # Input data files (Coda JSON)
│   └── output/                   # Processed outputs
├── Archive/                      # Archived AssemblyAI tools
│   ├── 1_youtube_to_audio/
│   ├── 2_audio_to_markdown/
│   ├── 5_markdown_to_final/
│   └── ...
├── Makefile                      # Workflow automation
├── CLAUDE.md                     # Documentation for Claude Code
├── README.md                     # This file
├── .env.sample                   # Environment configuration template
└── pyproject.toml                # Python package configuration
The AssemblyAI-based transcription tools have been moved to the Archive/
directory. These legacy tools provided cloud-based transcription with features like custom vocabulary boosting, spell checking, and document conversion. They remain available for historical reference but are no longer actively maintained.
- WhisperX transcription pipeline and SurrealDB integration contributed by Holly Grimm @hollygrimm, 2024
- Initial AssemblyAI scripts and documentation contributed by Dave Douglass, November 2022