A Django web application that lets you upload PDF, Word, or scanned-image documents and ask questions about them using OCR, sentence embeddings, and semantic search.
- 📤 Upload PDF, Word, or Image files
- 🔍 Extract text using OCR (pytesseract for image/PDF scans)
- 🧠 Semantic search using HuggingFace Sentence Embeddings
- ⚡ FAISS vector similarity search
- ⚙️ Background processing with Celery + Redis
- 🎨 Clean UI using Bootstrap
- 🔎 (Optional) LangChain-based RAG for improved Q&A
| Feature | Tool |
|---|---|
| Backend | Django |
| AI Embeddings | HuggingFace Transformers (e.g., `sentence-transformers/all-MiniLM-L6-v2`) |
| OCR | pytesseract |
| Vector Search | FAISS |
| Async Tasks | Celery + Redis |
| Frontend | Bootstrap |
| Bonus | LangChain (for RAG-style QA) |
DOCSEARCH/
│
├── docsearch/ # Django core project
│   ├── __init__.py
│   ├── asgi.py
│   ├── celery.py # Celery app definition
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
├── search/ # Core app for document handling
│   ├── models.py
│   ├── views.py
│   ├── tasks.py # Celery background jobs
│   ├── utils.py # OCR, embedding logic
│   └── ...
│
├── templates/
│   └── search/
│       └── base.html # HTML UI
│
├── static/ # Bootstrap or custom CSS/JS
├── faiss_index/ # Stores FAISS vector indexes
├── media/ # Uploaded user documents
├── .env # Environment variables
├── db.sqlite3 # Local development DB
├── manage.py
├── README.md
└── requirements.txt
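The tree above places OCR and embedding logic in `search/utils.py`. As a minimal sketch of how the extraction side might choose a path for each upload, assuming simple extension-based routing: the function name `pick_extractor`, the extension sets, and the `has_text_layer` flag are all hypothetical and not taken from the project.

```python
from pathlib import Path

# Hypothetical extension groups; adjust to whatever the upload form accepts.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
WORD_EXTS = {".docx"}
PDF_EXTS = {".pdf"}


def pick_extractor(filename: str, has_text_layer: bool = True) -> str:
    """Return a label for the extraction path a file should take.

    `has_text_layer` only matters for PDFs: a scanned PDF has no embedded
    text, so it falls back to OCR the same way an image does.
    """
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTS:
        return "ocr"
    if ext in PDF_EXTS:
        return "pdf_text" if has_text_layer else "ocr"
    if ext in WORD_EXTS:
        return "word_text"
    raise ValueError(f"Unsupported file type: {ext or '(none)'}")
```

A Celery task in `search/tasks.py` could call a helper like this first and then hand the file to pytesseract only when the result is `"ocr"`, which keeps the slow OCR step off files that already contain text.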
- User Uploads Document
  - Upload a PDF, Word, or image file via the web form.
- OCR + Text Extraction
  - If the document is an image or scanned PDF, text is extracted using `pytesseract`.
- Embedding Generation
  - Extracted text is split into chunks and converted into vector embeddings using HuggingFace's `all-MiniLM-L6-v2` model.
- FAISS Indexing
  - Embeddings are stored in a FAISS index for fast similarity search.
- User Asks a Question
  - The user types a natural-language question on the frontend.
- Semantic Search
  - The system searches the FAISS index to find the most relevant document chunks.
- Answer Generation
  - The best matching content is returned as a response.
  - (Optional) Use LangChain or a GPT-based model to generate more refined answers (RAG-style).
- Result Display
  - The answer is shown on the web interface with a reference to the source document.
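The chunk, embed, index, and search steps above can be sketched without any third-party packages. This is an illustration of the flow only, not the project's code: `embed` here is a toy word-count vector standing in for the `all-MiniLM-L6-v2` embeddings, the brute-force cosine scan in `search` stands in for the FAISS index, and every function name and default (chunk size, overlap, `k`) is an assumption.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: dict[str, str], size: int = 40, overlap: int = 10) -> list[dict]:
    """Chunk every document and keep each chunk with its vector and source name."""
    index = []
    for source, text in documents.items():
        for chunk in chunk_text(text, size, overlap):
            index.append({"source": source, "text": chunk, "vector": embed(chunk)})
    return index


def search(question: str, index: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the question, best first."""
    query = embed(question)
    scored = [{**entry, "score": cosine(query, entry["vector"])} for entry in index]
    return sorted(scored, key=lambda e: e["score"], reverse=True)[:k]


docs = {
    "invoice.pdf": "Payment is due within thirty days of the invoice date.",
    "manual.docx": "Hold the power button for five seconds to reset the device.",
}
best = search("How do I reset the device?", build_index(docs), k=1)[0]
print(best["source"], "->", best["text"])
# prints: manual.docx -> Hold the power button for five seconds to reset the device.
```

Keeping the source filename next to each chunk is what allows the result page to cite the originating document. With the real stack, the word-count vectors would be replaced by dense model embeddings and the linear scan by a FAISS index lookup; the surrounding flow stays the same.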
This project wouldn't be possible without these open-source tools and libraries:
- 🤗 HuggingFace Transformers – for sentence embeddings and NLP models
- 🧠 FAISS – for efficient similarity search over large collections of vectors
- 🔍 pytesseract – Python wrapper for Google's Tesseract OCR
- 🔄 Celery – for background task processing
- 💽 Redis – message broker for Celery
- 🧱 Django – web framework used to build the core backend
- 🧰 Bootstrap – for clean, responsive frontend UI
- 🧠 LangChain (optional) – for retrieval-augmented generation (RAG) using LLMs
Author: Dharmendra Yadav
📧 Email: dkydevops@gmail.com
🌐 Website: https://www.dydevops.com
📱 WhatsApp: +91 9452428546