An intelligent academic paper summarization system that combines computer vision and natural language processing to extract, analyze, and summarize research papers from PDF format.
This project provides an end-to-end solution for automatically summarizing academic papers by:
- Converting PDF pages to images
- Using object detection to identify different document elements (text, titles, figures, tables)
- Extracting text using OCR (Optical Character Recognition)
- Analyzing figures with vision-language models
- Generating comprehensive summaries using large language models
The system consists of two main components:
A Jupyter notebook that handles:
- PDF processing and page extraction
- Object detection using Detectron2 with Faster R-CNN
- Text extraction using EasyOCR
- Figure extraction and processing
- Communication with the summarization server
A Flask-based API server that provides:
- Figure analysis using LLaVA (Large Language and Vision Assistant)
- Text summarization using Ollama with Llama 3.2
- Page-wise and full-paper summarization
- RESTful API endpoint for processing requests
- Multi-modal Analysis: Processes both text and visual elements (figures, tables)
- Intelligent Object Detection: Identifies different document components with 70% confidence threshold
- OCR Text Extraction: Supports English text recognition
- Figure Understanding: Uses vision-language models to describe and summarize figures
- Hierarchical Summarization: Creates page-level summaries before generating final paper summary
- PDF Output: Converts final summary to PDF format using Pandoc
cv2
(OpenCV)easyocr
detectron2
pdf2image
requests
numpy
flask
ollama
transformers
(for LLaVA model)tqdm
- Pandoc: For converting Markdown summary to PDF
- Faster R-CNN Model: Pre-trained model file (
faster-rcnn.pth
) - Ollama: For running Llama 3.2 model locally
- GPU: Recommended for faster processing (CPU fallback available)
- Memory: Minimum 8GB RAM (16GB+ recommended for large papers)
- Storage: Space for model files and temporary image processing
-
Clone the repository
git clone <repository-url> cd paper-summarizer
-
Install Python dependencies
pip install opencv-python easyocr detectron2 pdf2image requests numpy flask transformers tqdm
-
Install Ollama Follow instructions at ollama.ai to install Ollama
-
Install Pandoc
# macOS brew install pandoc # Ubuntu/Debian sudo apt-get install pandoc
-
Download required models
- Place your Faster R-CNN model file as
faster-rcnn.pth
in the project directory - The LLaVA model will be downloaded automatically on first use
- Place your Faster R-CNN model file as
python server.py
The server will start on http://localhost:8000
- Place your PDF file in the project directory as
paper.pdf
- Open and run the
main.ipynb
notebook - The system will:
- Process each page of the PDF
- Extract text and figures
- Send data to the summarization server
- Generate a comprehensive summary
- Save the summary as both
summary.md
andsummary.pdf
- Host:
0.0.0.0
(accepts external connections) - Port:
8000
- Models: LLaVA 1.5 7B, Llama 3.2
- Score Threshold: 0.7 (70% confidence)
- Classes: text, title, list, table, figure
- Model: Faster R-CNN with ResNet-50 FPN backbone
Processes paper content and returns a summary.
Request:
- Files:
figures
- Array of figure images (JPEG format) - Form Data:
full_text
: JSON string containing extracted text from all pagesfigure_pages
: JSON string containing figure count per page
Response:
{
"summary": "Generated paper summary in Markdown format"
}
The generated summary includes:
- Title of the Paper
- Problem Statement
- Methodology
- Key Findings & Results
- Conclusion & Implications
- Processing time depends on paper length and complexity
- Figure analysis requires significant computational resources
- First run may be slower due to model downloads
- Consider using GPU acceleration for better performance
- Currently supports English text only
- Requires manual placement of PDF as
paper.pdf
- Server URL is hardcoded in the notebook
- Limited to specific object detection classes
To contribute to this project:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
- Detectron2: For object detection capabilities
- EasyOCR: For text extraction
- LLaVA: For vision-language understanding
- Ollama: For local LLM inference
- Transformers: For model integration