Knowledge Graph Construction with LLMs

Build a Computer Science knowledge graph from PDFs using LLMs, store it in Neo4j, validate topics against the Computer Science Ontology (CSO), and explore via a Streamlit chatbot. The project also includes LLM Apriori-like mining for associations and optional LDA/LSA topic modeling pipelines whose outputs are mapped back to CSO topics.

Prerequisites

  • Python 3.10+
  • Neo4j (local or cloud) running and reachable
  • Google Gemini API key (from Google AI Studio); the free tier is used here (1,000,000 tokens per minute)
  • PDF files placed under llm-knowledge-graph/data/pdfs

Optional (only if you run native Cypher/GDS Apriori variants):

  • APOC & GDS plugins in Neo4j

Setup & Installation

# 1) Create virtual environment
python -m venv venv

# 2) Activate venv
# Windows (Command Prompt):
venv\Scripts\activate

# Windows (PowerShell):
.\venv\Scripts\Activate.ps1

# macOS/Linux:
source venv/bin/activate

# 3) Install dependencies
pip install -r requirements.txt

# 4) Move into the working directory (all scripts are run from here)
cd llm-knowledge-graph

Environment Variables

Create a .env in the project root:

# Gemini
GEMINI_API_KEY=your_api_key_from_google_ai_studio

# Neo4j
NEO4J_URI=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password

Put your .env file here:

knowledge-graph-llm-rag/
├─ venv/
├─ .env
├─ .gitignore
├─ llm-knowledge-graph/
└─ requirements.txt

Tip: Ensure your Neo4j instance is running and accepts Bolt connections at the URI you set.
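
If you want to sanity-check the connection before running any step, a small script along these lines works (a minimal sketch, assuming python-dotenv and the official neo4j driver are installed via requirements.txt; it is not part of the repo):

# check_neo4j.py - optional sanity check for the .env values (not part of the repo)
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()  # reads NEO4J_* (and GEMINI_API_KEY) from .env

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)
driver.verify_connectivity()  # raises if Bolt is unreachable or credentials are wrong
print("Neo4j connection OK")
driver.close()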


Data

Put your CS papers (PDF) here:

llm-knowledge-graph/
└─ data/
   └─ pdfs/
      ├─ paper1.pdf
      ├─ paper2.pdf
      └─ ...

How to Run

1) Create Topics from CSO (RDF)

This parses the Computer Science Ontology (RDF) and loads (:Topic) nodes (and their hierarchy) into Neo4j.

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Run this command

python create_topic_from_cso.py

Expected Result

(screenshot)

This process can take a while the first time it runs, while data is being retrieved from the RDF file.
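
For orientation, the core of this step boils down to the sketch below: parse the CSO RDF, walk the superTopicOf hierarchy, and MERGE (:Topic) nodes. This is a simplified sketch only; the real script uses the project's GraphService wrapper, and the CSO file path, predicate, and property/relationship names shown here are assumptions.

# Simplified sketch of the CSO import (illustrative; names and paths are assumptions)
import os
from dotenv import load_dotenv
from rdflib import Graph, URIRef
from neo4j import GraphDatabase

load_dotenv()
CSO_SUPER_TOPIC = URIRef("http://cso.kmi.open.ac.uk/schema/cso#superTopicOf")

g = Graph()
g.parse("data/cso.ttl")  # hypothetical location of the CSO RDF dump

driver = GraphDatabase.driver(os.environ["NEO4J_URI"],
                              auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))
with driver.session() as session:
    for parent, _, child in g.triples((None, CSO_SUPER_TOPIC, None)):
        session.run(
            "MERGE (p:Topic {uri: $parent}) "
            "MERGE (c:Topic {uri: $child}) "
            "MERGE (p)-[:SUPER_TOPIC_OF]->(c)",
            parent=str(parent), child=str(child),
        )
driver.close()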


2) Create Paper

Creates (:Paper) nodes from PDFs.

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Make sure there is at least one PDF file in data/pdfs

  • Please note: you can only process one PDF file at a time, because the project runs on the free Gemini tier with limited tokens.
  • You can select an UNPROCESSED PDF file from the list to turn into a node.
  • To avoid hitting the token rate limit, pause for at least one minute after processing each PDF file.

Run this command

python create_paper.py

Expected Result

(screenshot)

Graph: result_create_paper_graph (screenshot)
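
Conceptually, this step reads the PDF text, asks Gemini for the paper metadata, and MERGEs a (:Paper) node. Below is a stripped-down sketch of that flow; the prompt, model name, and node properties are assumptions, and the real script does far more parsing and validation.

# Stripped-down sketch of Paper creation (illustrative; not the repo's exact logic)
import os
from dotenv import load_dotenv
from pypdf import PdfReader
import google.generativeai as genai
from neo4j import GraphDatabase

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

pdf = "data/pdfs/paper1.pdf"
text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
title = model.generate_content("Return only the paper title:\n" + text[:8000]).text.strip()

driver = GraphDatabase.driver(os.environ["NEO4J_URI"],
                              auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))
with driver.session() as session:
    session.run("MERGE (p:Paper {title: $title}) SET p.source = $src", title=title, src=pdf)
driver.close()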


3) Topic Mapping using LLMs

Links papers to topics via the LLM (validated against CSO) → (:Paper)-[:HAS_TOPIC]->(:Topic).

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Make sure there are Paper and Topic nodes

  • Please note: you can ONLY process ONE selected Paper at a time, because the project runs on the free Gemini tier with limited tokens.
  • You can select an UNPROCESSED Paper from the list to generate its topic mapping.
  • The list only shows PDFs that have already been turned into Paper nodes; repeat step 2 to add a new paper.

Run this command

python create_mapping_topic.py

Expected Result

(screenshot: result_topic_mapping)

Graph: result_topic_mapping (screenshot)
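
Once the LLM proposes a topic and it passes CSO validation, persisting the link is a single MERGE. A minimal sketch of that write (illustrative; the label property matches the (:Topic {label}) mentioned in the Neo4j notes below, the rest is assumed):

# Minimal sketch of persisting one validated (:Paper)-[:HAS_TOPIC]->(:Topic) link
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()
driver = GraphDatabase.driver(os.environ["NEO4J_URI"],
                              auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))
with driver.session() as session:
    session.run(
        """
        MATCH (p:Paper {title: $title})
        MATCH (t:Topic) WHERE toLower(t.label) = toLower($topic)
        MERGE (p)-[:HAS_TOPIC]->(t)
        """,
        title="paper1", topic="neural networks",  # example values
    )
driver.close()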


4) Topic Modeling (LDA or LSA) using LLMs

Runs LDA- or LSA-style topic modeling on your PDFs via the LLM and prints the results.

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Run this command

python run_llm_topic_modeling.py

Expected Result

(screenshot)
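
The LLM variant essentially sends the document text to Gemini and asks for LDA/LSA-style topic and term lists. A rough sketch of that idea follows; the prompt and model name are assumptions, and the project's LLMTopicModelingService adds chunking, JSON parsing, and CSO mapping on top.

# Rough sketch of LLM-driven topic extraction (illustrative)
import os
from dotenv import load_dotenv
from pypdf import PdfReader
import google.generativeai as genai

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

text = "\n".join(p.extract_text() or "" for p in PdfReader("data/pdfs/paper1.pdf").pages)
prompt = ("Act like an LDA topic model. Return 5 topics, each with its top terms, "
          "as a JSON list, for the following document:\n" + text[:15000])
print(model.generate_content(prompt).text)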

5) Topic Modeling (LDA or LSA)

Runs classical LDA or LSA topic modeling (scikit-learn) on your PDFs and prints the results.

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Run this command

python run_topic_modeling.py

Expected Result

(screenshot)
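
For reference, the classical pipeline is standard scikit-learn. A minimal version looks roughly like this (the repo's script differs in preprocessing, topic counts, and how the PDFs are read):

# Minimal classical LDA/LSA sketch (illustrative; not the repo's exact pipeline)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

docs = [
    "neural networks for image classification",
    "graph databases and cypher queries",
    "topic modeling of computer science papers",
]

# LDA on term counts
vec = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.fit_transform(docs))
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    print(f"LDA topic {i}:", [terms[j] for j in comp.argsort()[-4:][::-1]])

# LSA (truncated SVD) on TF-IDF
tfidf = TfidfVectorizer(stop_words="english")
lsa = TruncatedSVD(n_components=2, random_state=0).fit(tfidf.fit_transform(docs))
print("LSA explained variance:", lsa.explained_variance_ratio_)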

6) Topic Combinations using LLMs

Generates topic combinations per paper → creates (:TopicCombination) nodes and (:Paper)-[:HAS_TOPIC_COMBINATION]->(:TopicCombination) relationships:

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Make sure there are Paper and Topic nodes

  • Please note: you can ONLY process ONE selected Paper at a time, because the project runs on the free Gemini tier with limited tokens.
  • Make sure all required PDFs have already been turned into Paper nodes.

Run this command

python create_combination.py

Expected Result

(screenshot: result_combination)

Graph: result_combination_graph (screenshot)
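
The combinations themselves are just the subsets (pairs and larger) of each paper's topic set. A toy sketch with itertools shows the idea; the real script reads topics from HAS_TOPIC relationships and persists each combination as a (:TopicCombination) node with an items array.

# Toy sketch of per-paper topic combinations (illustrative)
from itertools import combinations

paper_topics = {"machine learning", "neural network", "computer vision"}

combos = [sorted(c) for k in range(2, len(paper_topics) + 1)
          for c in combinations(sorted(paper_topics), k)]
for items in combos:
    print(items)  # each list would become the items array on a (:TopicCombination)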


7) Apriori-like mining via LLM

Runs Apriori-like mining via the LLM → creates (:FrequentTopicSet) nodes and (:LeftTopicSet)-[:RULES]->(:RightTopicSet) relationships with support & confidence:

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Make sure there are Paper and Topic nodes

Make sure all required PDFs have already been turned into Paper nodes

Run this command

python run_llm_apriori.py

You can also switch between a fully LLM-based method and a hybrid one (using Python's itertools) by changing APRIORI_MODE in run_llm_apriori.py:

APRIORI_MODE = "hybrid" # "hybrid" or "full"

The Apriori logic (frequent itemsets, rules) is driven by LLM; Cypher is used only to persist the results into Neo4j.
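
In hybrid mode, itemset enumeration can happen in plain Python while the LLM reasons over the results; the support/confidence math is the standard Apriori one. A toy, self-contained sketch of that computation (illustrative only; the real transactions come from HAS_TOPIC relationships in Neo4j):

# Toy sketch of the "hybrid" idea: enumerate itemsets with itertools and compute
# support/confidence in Python; only the final rules are written to Neo4j
from itertools import combinations

# paper -> set of topics (toy transactions)
transactions = [
    {"machine learning", "neural network"},
    {"machine learning", "neural network", "computer vision"},
    {"graph database", "machine learning"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.5
items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for k in (1, 2)
            for c in combinations(items, k) if support(set(c)) >= min_support]

for itemset in (s for s in frequent if len(s) == 2):
    for lhs in itemset:
        rhs = itemset - {lhs}
        conf = support(itemset) / support({lhs})
        print(f"{{{lhs}}} -> {set(rhs)}  support={support(itemset):.2f} confidence={conf:.2f}")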

Expected Result

(screenshot: result_llm_apriori)

Graph: result_llm_apriori_graph (screenshot)


8) Recommendations (LLM Apriori-like)

Recommends papers based on the learned co-occurrence patterns:

Prerequisites

Ensure you're in the llm-knowledge-graph directory and the venv is active.

Make sure there are Paper and Topic nodes

  • The list only shows PDFs that have already been turned into Paper nodes; repeat step 2 to add a new paper.
  • You can select more than one paper as a sample, e.g. Select Paper: 1, 2, 3

Run this command

python run_recommendation.py

Expected Result

(screenshot: result_recommendation)
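
Conceptually, recommendation takes the topics of the selected papers, finds rules whose left-hand side those topics cover, and suggests papers carrying the right-hand-side topics. A toy sketch of that idea (illustrative; the rule and paper data shapes below are assumptions, and the repo reads both from Neo4j):

# Toy sketch of rule-based recommendation (illustrative; not the repo's logic)
selected_topics = {"machine learning", "neural network"}

rules = [  # (lhs, rhs, confidence): assumed shape of the mined rules
    ({"machine learning"}, {"neural network"}, 0.67),
    ({"neural network"}, {"computer vision"}, 0.50),
]
papers = {  # paper title -> topics, e.g. from (:Paper)-[:HAS_TOPIC]->(:Topic)
    "paper3": {"computer vision", "neural network"},
    "paper4": {"graph database"},
}

suggested = {t for lhs, rhs, conf in rules if lhs <= selected_topics for t in rhs}
recommendations = [p for p, topics in papers.items() if topics & (suggested - selected_topics)]
print(recommendations)  # -> ['paper3']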

9) Chatbot (Streamlit)

Run the embedding step first:

python services/embedding_service.py
(screenshot)

Graph + vector: (screenshot)

To avoid hitting the token rate limit, use ONLY ONE PDF file during the embedding process.

Browse & query your graph via a simple UI:

cd chatbot
streamlit run main.py

Open your browser at http://localhost:8501.

Expected Result

(screenshot: chatbot)
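
The chatbot itself follows the standard Streamlit chat pattern on top of the graph and the embeddings. A bare-bones version of that pattern is sketched below; chatbot/main.py additionally retrieves graph/vector context from Neo4j before answering, and the model name here is an assumption.

# Bare-bones Streamlit chat pattern (illustrative; not the repo's chatbot/main.py)
import os
import streamlit as st
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

st.title("Knowledge Graph Chatbot")
if "history" not in st.session_state:
    st.session_state.history = []

for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if question := st.chat_input("Ask about your papers"):
    st.chat_message("user").write(question)
    answer = model.generate_content(question).text
    st.chat_message("assistant").write(answer)
    st.session_state.history += [("user", question), ("assistant", answer)]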


Script Index

Script                       Purpose
create_topic_from_cso.py     Imports CSO topics + hierarchy into Neo4j.
create_paper.py              Creates (:Paper) nodes from PDFs.
create_mapping_topic.py      Links topics (HAS_TOPIC) using LLM.
run_topic_modeling.py        Runs LDA or LSA (sklearn) and prints topics/terms.
run_llm_topic_modeling.py    Runs LDA or LSA using LLMs and prints topics/terms.
create_combination.py        Generates all topic combinations per paper and persists (:TopicCombination).
run_llm_apriori.py           Runs LLM Apriori-like mining to create (:FrequentTopicSet) and association rules.
run_recommendation.py        Recommends papers using the LLM Apriori-like outputs.
chatbot/main.py              Streamlit chatbot for querying and recommending papers.

Neo4j Notes

  • This project uses the official Bolt driver through a small GraphService wrapper.
  • If you later want to run pure Cypher/GDS Apriori (instead of LLM Apriori-like), you’ll need APOC and GDS plugins enabled and configured.
  • For full-text topic lookups, the code creates a FULLTEXT INDEX on (:Topic {label}) automatically (if not present); see the lookup sketch after this list.
  • Array property queries: if a node property is an array, compare using array syntax, e.g.
    MATCH (c:TopicCombination) 
    WHERE c.items = ['neural network'] 
    RETURN c
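
As referenced above, the full-text lookup can also be reproduced manually through the Bolt driver. A small sketch, assuming an index name of topicLabelIndex (the actual name is whatever the code creates):

# Sketch of a manual full-text topic lookup (illustrative; index name is an assumption)
import os
from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()
driver = GraphDatabase.driver(os.environ["NEO4J_URI"],
                              auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]))
with driver.session() as session:
    result = session.run(
        "CALL db.index.fulltext.queryNodes('topicLabelIndex', $q) "
        "YIELD node, score RETURN node.label AS label, score LIMIT 5",
        q="neural network",
    )
    for record in result:
        print(record["label"], record["score"])
driver.close()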

Troubleshooting

  • No topics created / “No topics found”
    Ensure you ran create_topic_from_cso.py successfully and your Neo4j credentials are correct.

  • LLM prompt errors (missing template variables / dict has no attribute …)
    These happen when the LLM returns malformed JSON or the prompt placeholders aren’t escaped. The code includes guards and parsers; if it still happens, check console logs for the printed raw snippet.

  • Only a few topics get mapped
    That’s expected: mapping uses strict, CSO-guarded matching and confidence thresholds to avoid wrong links. Tune:

    • min_confidence (e.g., 0.85)
    • top_k_map_each (terms per model)
    • max_topics_in_prompt (candidate pool size)
  • Long PDFs
    The LLM pipelines read full document content. If you hit model context limits, set a safety cap (e.g., max_context_chars) in LLMTopicModelingService initialization.

