📄 Smart Document Search (Django + HuggingFace + FAISS)

An intelligent web application that lets you upload PDF, Word, or scanned image documents, and ask questions using semantic search + OCR + AI embeddings.

🚀 Features

📤 Upload PDF, Word, or Image files
🔍 Extract text using OCR (pytesseract for image/PDF scans)
🧠 Semantic search using HuggingFace Sentence Embeddings
⚡ FAISS vector similarity search
⚙️ Background processing with Celery + Redis
🎨 Clean UI using Bootstrap
🔎 (Optional) LangChain-based RAG for improved Q&A

🧠 Tech Stack

Feature	Tool
Backend	Django
AI Embeddings	HuggingFace Transformers (e.g., `sentence-transformers/all-MiniLM-L6-v2`)
OCR	pytesseract
Vector Search	FAISS
Async Tasks	Celery + Redis
Frontend	Bootstrap
Bonus	LangChain (for RAG-style QA)

📂 Folder Structure

DOCSEARCH/
│
├── docsearch/                  # Django core project
│   ├── __init__.py
│   ├── asgi.py
│   ├── celery.py               # Celery app definition
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
├── search/                     # Core app for document handling
│   ├── models.py
│   ├── views.py
│   ├── tasks.py                # Celery background jobs
│   ├── utils.py                # OCR, embedding logic
│   └── ...
│
├── templates/
│   └── search/
│       └── base.html           # HTML UI
│
├── static/                     # Bootstrap or custom CSS/JS
├── faiss_index/                # Stores FAISS vector indexes
├── media/                      # Uploaded user documents
├── .env                        # Environment variables
├── db.sqlite3                  # Local development DB
├── manage.py
├── README.md
└── requirements.txt

🔁 Example Flow

User Uploads Document
- Upload PDF, Word, or Image file via web form.
OCR + Text Extraction
- If the document is an image or scanned PDF, text is extracted using pytesseract.
Embedding Generation
- Extracted text is split into chunks and converted into vector embeddings using HuggingFace's all-MiniLM-L6-v2 model.
FAISS Indexing
- Embeddings are stored in a FAISS index for fast similarity search.
User Asks a Question
- The user types a natural-language question on the frontend.
Semantic Search
- The system searches the FAISS index to find the most relevant document chunks.
Answer Generation
- The best matching content is returned as a response.
- (Optional) Use LangChain or GPT-based model to generate more refined answers (RAG-style).
Result Display
- The answer is shown on the web interface with a reference to the source document.

🙌 Credits

This project wouldn't be possible without the amazing open-source tools and libraries:

🤗 HuggingFace Transformers – for sentence embeddings and NLP models
🧠 FAISS – for efficient similarity search on large vectors
🔍 pytesseract – Python wrapper for Google's Tesseract OCR
🔄 Celery – for background task processing
💽 Redis – message broker for Celery
🧱 Django – web framework used to build the core backend
🧰 Bootstrap – for clean, responsive frontend UI
🧠 LangChain (optional) – for retrieval-augmented generation (RAG) using LLMs

📬 Contact

Author: Dharmendra Yadav
📧 Email: dkydevops@gmail.com
🌐 Website: https://www.dydevops.com
📱 WhatsApp: +91 9452428546

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

📄 Smart Document Search (Django + HuggingFace + FAISS)

🚀 Features

🧠 Tech Stack

📂 Folder Structure

🔁 Example Flow

🙌 Credits

📬 Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
docsearch		docsearch
faiss_index		faiss_index
media/documents		media/documents
search		search
templates		templates
.env		.env
README.md		README.md
db.sqlite3		db.sqlite3
manage.py		manage.py
requirements.txt		requirements.txt

dydevops/Smart-Document-Search

Folders and files

Latest commit

History

Repository files navigation

📄 Smart Document Search (Django + HuggingFace + FAISS)

🚀 Features

🧠 Tech Stack

📂 Folder Structure

🔁 Example Flow

🙌 Credits

📬 Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages