Welcome to Extract Kit - a powerful, modern solution for document data extraction and conversion, powered by AI vision models! Transform your PDFs and Office documents into structured, validated data with ease.
Extract Kit is a monorepo of production-ready TypeScript tools built for the Bun runtime. They use AI vision models and document parsing to extract structured data from PDFs and to convert Office documents (DOCX) to Markdown.
This repository contains the following open-source packages:
PDF Processor: a service that uses AI vision models (Scaleway, Mistral, Ollama) to extract structured data from PDF documents. It is suited to processing invoices, receipts, tables, or any custom document type.
- NPM Package: @aidalinfo/pdf-processor
- Documentation: Read the PDF Processor README
Office to Markdown: a TypeScript library for converting Office documents (DOCX) to Markdown. It is optimized for the Bun ecosystem and supports mathematical equations (OMML) and tables.
- NPM Package: @aidalinfo/office-to-markdown
- Documentation: Read the Office to Markdown README
- 🤖 AI-Powered Extraction: Harness the power of Scaleway Pixtral, Mistral AI, Ollama LLaVA, and custom AI providers for PDF processing.
- 🔄 Advanced DOCX Conversion: Convert .docx files to clean Markdown, preserving structure, styles, and even complex mathematical equations.
- 📄 Smart PDF Processing: Automatic PDF to optimized image conversion using Sharp for better AI recognition.
- ⚡ Lightning Fast: Built on the Bun runtime, with parallel worker processing for maximum performance.
- 🔒 Type-Safe: Full TypeScript implementation with Zod schema validation for reliable data extraction.
- 📦 Dual Mode: Use as a standalone REST API microservice or as a library in your existing Node.js/Bun projects.
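To illustrate the type-safe idea from the feature list, here is a minimal sketch of validating untrusted model output before it is used. It is a dependency-free stand-in for what a Zod schema does and is not the pdf-processor API: the `Invoice` shape, `ParseResult`, and `parseInvoice` are hypothetical names chosen for this example only.

```typescript
// Hypothetical shape for an extracted invoice; not taken from @aidalinfo/pdf-processor.
interface Invoice {
  invoiceNumber: string;
  total: number;
  lineItems: { description: string; amount: number }[];
}

// Result type loosely modelled on the success/failure style used by schema validators.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; errors: string[] };

// Narrows an unknown value to a plain object (arrays and null are rejected).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Checks raw model output against the Invoice shape and collects every problem found.
function parseInvoice(raw: unknown): ParseResult<Invoice> {
  if (!isRecord(raw)) {
    return { success: false, errors: ["expected an object"] };
  }

  const errors: string[] = [];

  if (typeof raw.invoiceNumber !== "string" || raw.invoiceNumber.length === 0) {
    errors.push("invoiceNumber: expected a non-empty string");
  }
  if (typeof raw.total !== "number" || !Number.isFinite(raw.total)) {
    errors.push("total: expected a finite number");
  }
  if (!Array.isArray(raw.lineItems)) {
    errors.push("lineItems: expected an array");
  } else {
    raw.lineItems.forEach((item: unknown, i: number) => {
      if (
        !isRecord(item) ||
        typeof item.description !== "string" ||
        typeof item.amount !== "number"
      ) {
        errors.push(`lineItems[${i}]: expected { description: string, amount: number }`);
      }
    });
  }

  if (errors.length > 0) {
    return { success: false, errors };
  }
  // Every field was checked above, so the cast is safe here.
  return { success: true, data: raw as unknown as Invoice };
}
```

Calling `parseInvoice` on the JSON returned by a vision model yields either typed `data` or a list of `errors`, so malformed output is rejected at the boundary instead of propagating. In a real project, a Zod schema with `safeParse` would likely replace the hand-written checks.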
For detailed instructions, please refer to the README file of the specific package you want to use:
- For PDF Extraction: packages/pdf-processor/README.md
- For DOCX Conversion: packages/office-to-markdown/README.md
We welcome contributions! Contribution guidelines are coming soon. For now, you can:
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Bun - The fast JavaScript runtime.
- PDF Processor is powered by Scaleway AI, Mistral AI, and Ollama.
- Office to Markdown is inspired by Microsoft's MarkItDown.
Happy Extracting! 🎉