Extract Kit 🚀

Welcome to Extract Kit, a toolkit for document data extraction and conversion powered by AI vision models. It turns your PDFs and Office documents into structured, validated data.

🎯 What is Extract Kit?

Extract Kit is a monorepo of production-ready TypeScript tools built on Bun. They combine AI vision models with document parsing to extract and convert data from several document formats.

🛠️ Available Packages

This repository contains the following open-source packages:

1. PDF Processor

A service that uses AI vision models (Scaleway, Mistral, Ollama) to extract structured data from PDF documents. It is suited to invoices, receipts, tables, or any custom document type.

NPM Package: @aidalinfo/pdf-processor
Documentation: Read the PDF Processor README

2. Office to Markdown

A modern TypeScript library for converting Office documents (DOCX) to Markdown format. It's optimized for the Bun ecosystem and includes advanced support for mathematical equations (OMML) and tables.

NPM Package: @aidalinfo/office-to-markdown
Documentation: Read the Office to Markdown README

✨ Key Features

🤖 AI-Powered Extraction: Harness the power of Scaleway Pixtral, Mistral AI, Ollama LLaVA, and custom AI providers for PDF processing.
🔄 Advanced DOCX Conversion: Convert .docx files to clean Markdown, preserving structure, styles, and even complex mathematical equations.
📄 Smart PDF Processing: Automatically converts PDFs to optimized images with Sharp to improve AI recognition.
⚡ Lightning Fast: Built on the Bun runtime with parallel worker processing for maximum performance.
🔒 Type-Safe: Full TypeScript implementation with Zod schema validation for reliable data extraction.
📦 Dual Mode: Use as a standalone REST API microservice or as a library in your existing Node.js/Bun projects.
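
To illustrate what validated extraction means in practice, here is a minimal sketch in plain TypeScript. It is not the package's API: the packages use Zod for this step, whereas the example below swaps in a hand-written check so that it runs without dependencies. The `Invoice` shape, the `parseInvoice` function, and the sample JSON are all hypothetical and exist only for illustration.

```typescript
// Hypothetical shape of an extracted invoice (an assumption, not from the package).
interface Invoice {
  invoiceNumber: string;
  total: number;
  currency: string;
}

type ParseResult =
  | { ok: true; data: Invoice }
  | { ok: false; errors: string[] };

// Check untrusted JSON (for example, a vision model's output) before using it.
// Collects every field error instead of stopping at the first one.
function parseInvoice(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["expected an object"] };
  }
  const r = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof r.invoiceNumber !== "string" || r.invoiceNumber.length === 0) {
    errors.push("invoiceNumber: expected a non-empty string");
  }
  if (typeof r.total !== "number" || !Number.isFinite(r.total)) {
    errors.push("total: expected a finite number");
  }
  if (typeof r.currency !== "string" || r.currency.length !== 3) {
    errors.push("currency: expected a 3-letter code");
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }

  return {
    ok: true,
    data: {
      invoiceNumber: r.invoiceNumber as string,
      total: r.total as number,
      currency: r.currency as string,
    },
  };
}

// Sample inputs standing in for model output (made up for this sketch).
const good = parseInvoice(
  JSON.parse('{"invoiceNumber":"INV-001","total":129.5,"currency":"EUR"}'),
);
const bad = parseInvoice(JSON.parse('{"invoiceNumber":"","total":"129.5"}'));

console.log(good.ok, bad.ok); // prints: true false
```

Returning a result object rather than throwing lets the caller decide whether to retry the extraction, fall back to another provider, or report the field errors. A schema library such as Zod gives the same guarantee with far less hand-written code.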

🚀 Quick Start

For detailed instructions, please refer to the README file of the specific package you want to use:

For PDF Extraction: packages/pdf-processor/README.md
For DOCX Conversion: packages/office-to-markdown/README.md

🤝 Contributing

We welcome contributions! Contribution guidelines are coming soon. In the meantime, you can:

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Bun - The fast JavaScript runtime.
PDF Processor is powered by Scaleway AI, Mistral AI, and Ollama.
Office to Markdown is inspired by Microsoft's MarkItDown.

Happy Extracting! 🎉
