This project is a full-featured NLP pipeline designed to analyze the mysterious research documents left behind by Dr. X, a fictional scientist who vanished under mysterious circumstances. The goal is to extract, summarize, understand, and translate his research using local, offline NLP tools — no internet or cloud APIs required.
- ✅ Multi-format file ingestion (.pdf,.docx,.csv,.xlsx,.xls,.xlsm)
- ✅ Token-based chunking with metadata (filename, page, chunk number)
- ✅ Local vector search using ChromaDB
- ✅ RAG Q&A system powered by local LLaMA (via Ollama)
- ✅ Automatic translation of English answers to Arabic
- ✅ Local summarization of full documents
- ✅ ROUGE metric evaluation
- ✅ Performance logging (tokens/sec for all major components)
- ✅ Fully modular & offline
├── file_reader.py      # Extracts text & tables from all formats 
├── chunker.py          # Tokenizes and chunks text with cl100k_base 
├── embedding_pipeline.py # Embeds chunks and stores in ChromaDB 
├── rag_qa_system.py # Runs Q&A retrieval + local LLaMA generation 
├── translation_utils.py # Translates answers to Arabic (offline) 
├── summarizer.py # Summarizes files + evaluates with ROUGE 
└── requirements.txt # All dependencies
📁 files/ 
└── All input files (.pdf, .docx, .csv, etc.)
| Component | Tool/Library | 
|---|---|
| LLM (local) | Ollama(e.g.llama2,tinyllama) | 
| Embedding | sentence-transformers(MiniLM) | 
| Vector DB | ChromaDB (PersistentClient) | 
| Translation | argos-translate(EN ➝ AR) | 
| Summarization | Falconsai/text_summarization | 
| Metrics | tiktoken,rouge-score,time | 
- Extract text + tables from PDFs, Word, and Excel files.
- Chunk the text based on tokens (cl100k_base).
- Embed chunks using MiniLM and store in a local ChromaDB.
- Ask Questions via a CLI — the system retrieves relevant chunks and generates an answer using LLaMA.
- Translate the answer into Arabic.
- Summarize full documents and measure summary quality with ROUGE.
❓ Ask a question about Dr. X's documents:
> What was his last known research?
💬 English Answer:
Dr. X’s final study focused on zero-point energy manipulation using ancient resonance systems.
🗣️ Arabic Translation:
ركزت الدراسة الأخيرة للدكتور إكس على التلاعب بطاقة النقطة الصفرية باستخدام أنظمة الرنين القديمة.| Task | Tokens | Time | TPS | 
|---|---|---|---|
| Embedding | 1,200 | 1.8 sec | ~666 TPS | 
| RAG Generation | 620 | 1.2 sec | ~516 TPS | 
| Summarization | 1,500 | 3.0 sec | ~500 TPS | 
- ✅ PDF (.pdf) \
- ✅ Word (.docx) \
- ✅ Excel (.xlsx, .xls, .xlsm) \
- ✅ CSV (.csv) \
- ✅ Multi-sheet support with pandas \
pip install -r requirements.txt- install Ollama: https://ollama.com/download
ollama pull tinyllamapython embedding_pipeline.pypython rag_qa_system.pypython summarizer.py- ✅ Executes correctly across all modules
- ✅ Efficient + logs tokens/sec
- ✅ Translates and summarizes with high fluency
- ✅ Handles all required file formats
- ✅ Uses appropriate local LLMs and vector DB
- ✅ Clean code, modular design, creative solution