This project involves developing a comprehensive system for web scraping, data chunking, vector database creation, retrieval, re-ranking, and question answering using advanced AI techniques. Below are the detailed tasks and components of the system:
```
Retrieval-Augmented-Generation/
├── volumes/
│   ├── data/
│   └── logs/
├── docker-compose.yml
├── my_spider.py
├── process_data.py
├── README.md
├── requirements.txt
├── setup.sh
└── streamlit_app.py
```
```bash
git clone https://github.com/Aktharnvdv/Retrieval-Augmented-Generation.git
cd Retrieval-Augmented-Generation
./setup.sh
```
The web crawler (my_spider.py) was developed using Scrapy and by default scrapes the Nvidia documentation website. It extracts text from the main page as well as its sublinks, following links up to a depth of 5.
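As an illustration, a minimal Scrapy spider along these lines might look like the following; the start URL, allowed domain, and CSS selectors are assumptions, not the exact contents of my_spider.py.

```python
# Minimal sketch of a Scrapy spider with a depth limit of 5; the domain,
# start page, and selectors are assumptions based on the description above.
import scrapy
from scrapy.crawler import CrawlerProcess


class DocsSpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["docs.nvidia.com"]        # assumed domain
    start_urls = ["https://docs.nvidia.com/"]    # assumed start page
    custom_settings = {"DEPTH_LIMIT": 5}         # stop following links past depth 5

    def parse(self, response):
        # Extract visible text from the page and yield it with its URL.
        texts = response.css("p::text, li::text").getall()
        yield {"url": response.url, "text": " ".join(t.strip() for t in texts)}

        # Follow sublinks; Scrapy's DepthMiddleware enforces DEPTH_LIMIT.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess(settings={"FEEDS": {"scraped.json": {"format": "json"}}})
    process.crawl(DocsSpider)
    process.start()
```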
Milvus is used as the vector database, and similarity-based chunking is applied to the scraped data:
- Converted the texts into chunks using a similarity-based method (see the sketch after this list).
- Converted the chunks into embedding vectors using BERT models.
- Created a vector database using Milvus.
- Stored the embedding vectors using FLAT and IVF (Inverted File) indexing methods.
- Included metadata in the database such as the web link of the extracted chunk, its embedding, and its text.
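A minimal sketch of the similarity-based chunking and BERT embedding steps is shown below; the bert-base-uncased model, mean pooling, and the 0.6 similarity threshold are assumptions rather than the exact choices made in process_data.py.

```python
# Hedged sketch of similarity-based chunking: sentences are embedded with a BERT
# model and a new chunk starts when consecutive sentences become dissimilar.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
model = AutoModel.from_pretrained("bert-base-uncased")


def embed(texts):
    # Mean-pool the last hidden states to get one L2-normalised vector per text.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    vecs = (out * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(vecs, dim=1).numpy()


def similarity_chunks(sentences, threshold=0.6):
    # Group consecutive sentences into a chunk while they stay similar enough.
    vecs = embed(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(vecs[i - 1], vecs[i])) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks
```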
- Process and chunk the scraped data (process_data.py)
- Configure and create the Milvus vector database (pymilvus)
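The collection setup in pymilvus could look roughly like this; the collection name, field names, 768-dimensional vectors, and IVF_FLAT parameters are assumptions based on the description above.

```python
# Sketch of creating a Milvus collection that stores the embedding, the chunk
# text, and the source link as metadata; schema details are assumptions.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")  # default Milvus port

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),  # BERT dim
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=2048),     # source link
]
schema = CollectionSchema(fields, description="Scraped documentation chunks")
collection = Collection(name="nvidia_docs", schema=schema)  # assumed name

# IVF_FLAT clusters vectors into nlist buckets; FLAT would be exact brute force.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
)
collection.load()
```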
- Employed query expansion techniques to enhance retrieval.
- Used hybrid retrieval, combining BM25 with a BERT-based bi-encoder (DPR), to retrieve relevant data from the vector database.
- Re-ranked retrieved data based on relevance and similarity to the query.
- Implemented the retrieval and re-ranking strategies (BM25, DPR); a sketch follows this list.
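A sketch of the hybrid retrieval and re-ranking step is shown below; the rank_bm25 library, the 50/50 score fusion, and the assumption of L2-normalised embeddings are illustrative choices, and query expansion would be applied to the query text before this step.

```python
# Hedged sketch of hybrid retrieval + re-ranking: BM25 scores are fused with
# dense cosine similarities and candidates are re-ranked by the combined score.
import numpy as np
from rank_bm25 import BM25Okapi


def hybrid_rerank(query, query_vec, chunks, chunk_vecs, alpha=0.5, top_k=5):
    """Return the top_k chunks ranked by a BM25 + dense-similarity fusion score."""
    # Sparse scores: BM25 over whitespace-tokenised chunks.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() + 1e-9)          # normalise to [0, 1]

    # Dense scores: cosine similarity, assuming L2-normalised embeddings.
    dense = np.asarray(chunk_vecs) @ np.asarray(query_vec)

    # Fuse the two signals and re-rank by the weighted combination.
    combined = alpha * sparse + (1 - alpha) * dense
    order = np.argsort(combined)[::-1][:top_k]
    return [(chunks[i], float(combined[i])) for i in order]
```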
- Utilized a language model (Google Gemini) for question answering.
- Generated accurate answers based on the retrieved and re-ranked data.
- Integrated an LLM for question answering (transformers, google-ai-generativelanguage / Gemini); a sketch follows this list.
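A sketch of grounding the answer on the retrieved chunks is shown below; it uses the google-generativeai client and the gemini-pro model name as assumptions for how the Gemini call is made.

```python
# Hedged sketch of answering a question from the retrieved, re-ranked chunks
# with Gemini; client library and model name are assumptions.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")  # assumed model name


def answer(query, retrieved_chunks):
    # Concatenate the re-ranked chunks into a context block and ask the model
    # to answer strictly from that context.
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return model.generate_content(prompt).text
```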
Developed a user interface with Streamlit (streamlit_app.py), allowing users to enter queries and view the retrieved answers in a user-friendly way.
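A minimal Streamlit sketch of this flow might look as follows; get_answer is a hypothetical placeholder standing in for the retrieval, re-ranking, and Gemini steps above.

```python
# Minimal sketch of the query/answer flow in a Streamlit app; the actual
# streamlit_app.py wires in the retrieval pipeline instead of the placeholder.
import streamlit as st


def get_answer(query: str) -> str:
    # Placeholder for retrieval + re-ranking + Gemini answering (see sketches above).
    return f"(answer for: {query})"


st.title("Retrieval-Augmented Generation")
query = st.text_input("Ask a question about the scraped documentation")

if st.button("Search") and query:
    with st.spinner("Retrieving and generating..."):
        st.subheader("Answer")
        st.write(get_answer(query))
```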