Build a Computer Science knowledge graph from PDFs using LLMs, store it in Neo4j, validate topics against the Computer Science Ontology (CSO), and explore via a Streamlit chatbot. The project also includes LLM Apriori-like mining for associations and optional LDA/LSA topic modeling pipelines whose outputs are mapped back to CSO topics.
- Prerequisites
- Setup & Installation
- Environment Variables
- Data
- How to Run
- Script Index
- Neo4j Notes
- Troubleshooting
- Acknowledgments
- Python 3.10+
- Neo4j (local or cloud) running and reachable
- Google Gemini API Key (from Google AI Studio). The free tier works; its limit is roughly 1,000,000 tokens per minute.
- PDF files placed under `llm-knowledge-graph/data/pdfs`
Optional (only if you run native Cypher/GDS Apriori variants):
- APOC & GDS plugins in Neo4j
```bash
# 1) Create virtual environment
python -m venv venv

# 2) Activate venv
# Windows (Command Prompt):
venv\Scripts\activate
# Windows (PowerShell):
.\venv\Scripts\Activate.ps1
# macOS/Linux:
source venv/bin/activate

# 3) Install dependencies
pip install -r requirements.txt

# 4) Go to project root
cd llm-knowledge-graph
```
Create a `.env` in the project root:
```env
# Gemini
GEMINI_API_KEY=your_api_key_from_google_ai_studio

# Neo4j
NEO4J_URI=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
```
Put your `.env` file here:
knowledge-graph-llm-rag/
├─ venv/
├─ .env
├─ .gitignore
├─ llm-knowledge-graph/
└─ requirements.txt
Tip: Ensure your Neo4j instance is running and accepts Bolt connections at the URI you set.
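To confirm the instance is reachable before running the pipeline, a quick check like the following can help. This is a sketch, not part of the project: `load_neo4j_config` and `check_connection` are illustrative names, and the connectivity part assumes the official `neo4j` driver package is installed.

```python
import os

# Illustrative helper: read the Neo4j settings defined in .env
# (assumes the variables are already exported or loaded via python-dotenv).
def load_neo4j_config():
    return {
        "uri": os.environ.get("NEO4J_URI", "neo4j://localhost:7687"),
        "user": os.environ.get("NEO4J_USERNAME", "neo4j"),
        "password": os.environ.get("NEO4J_PASSWORD", ""),
    }

def check_connection(cfg):
    # Import lazily so load_neo4j_config works without the driver installed.
    from neo4j import GraphDatabase
    with GraphDatabase.driver(cfg["uri"], auth=(cfg["user"], cfg["password"])) as driver:
        driver.verify_connectivity()  # raises if Bolt is unreachable
```

If `check_connection` raises, fix `NEO4J_URI` or the credentials before running any of the scripts below.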
Put your CS papers (PDF) here:
llm-knowledge-graph/
└─ data/
└─ pdfs/
├─ paper1.pdf
├─ paper2.pdf
└─ ...
This parses the Computer Science Ontology (RDF) and loads (:Topic) nodes (and their hierarchy) into Neo4j.
Ensure you're in the correct directory and venv is active:
python create_topic_from_cso.py
The first run can take a while, since the full CSO RDF file must be parsed.
Creates (:Paper) nodes from PDFs.
Ensure you're in the correct directory and venv is active:
- Please note: you can only process ONE PDF file at a time, because this project uses the rate-limited free Gemini tier.
- Select an UNPROCESSED PDF file from the list to generate it into a node.
- To avoid hitting the token-rate limit, pause for at least 1 minute after each PDF file finishes processing.
python create_paper.py
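The one-file-at-a-time pacing described above could be automated with a small helper; this is a sketch under assumptions (the `process_one` callback and the 60-second pause are illustrative, not the script's actual code):

```python
import time

def process_pdfs(paths, process_one, pause_seconds=60, sleep=time.sleep):
    """Process PDFs sequentially, pausing between files to respect
    the free Gemini tier's token-per-minute limit."""
    results = []
    for i, path in enumerate(paths):
        results.append(process_one(path))
        if i < len(paths) - 1:
            sleep(pause_seconds)  # cool down before the next LLM call
    return results
```

Injecting `sleep` as a parameter keeps the helper testable without waiting a minute per file.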
Links topics via LLM (validated against CSO) → (:Paper)-[:HAS_TOPIC]->(:Topic).
Ensure you're in the correct directory and venv is active:
- Please note: you can only process ONE selected paper at a time, because this project uses the rate-limited free Gemini tier.
- Select an UNPROCESSED paper from the list to generate its topic mapping.
- The list only shows PDFs that have already been generated as nodes; repeat step 2 first if the paper you want is missing.
python create_mapping_topic.py
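Under assumed names (the script's internals may differ), persisting a CSO-validated topic suggestion boils down to a MERGE of the relationship described above:

```python
# Hypothetical sketch of persisting validated topic links as
# (:Paper)-[:HAS_TOPIC]->(:Topic) relationships.
LINK_QUERY = """
MATCH (p:Paper {title: $title})
MATCH (t:Topic {label: $label})
MERGE (p)-[:HAS_TOPIC]->(t)
"""

def link_topics(session, title, labels):
    # session: a neo4j.Session-like object; one query per validated label
    for label in labels:
        session.run(LINK_QUERY, title=title, label=label)
    return len(labels)
```

Using MERGE on the relationship makes the mapping step idempotent: re-running it does not create duplicate HAS_TOPIC edges.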
Runs LLM-driven topic modeling (LDA- or LSA-like) on your PDFs and prints the results.
Ensure you're in the correct directory and venv is active:
python run_llm_topic_modeling.py
Runs classical LDA or LSA topic modeling (scikit-learn) on your PDFs and prints the results.
Ensure you're in the correct directory and venv is active:
python run_topic_modeling.py
Generate topic combinations per paper → creates (:TopicCombination) and (:Paper)-[:HAS_TOPIC_COMBINATION]->(:TopicCombination):
Ensure you're in the correct directory and venv is active:
- Please note: you can only process ONE selected paper at a time, because this project uses the rate-limited free Gemini tier.
- Make sure all required PDF files have been generated into nodes.
python create_combination.py
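The combination step can be pictured as enumerating every non-empty subset of a paper's topics up to a size cap; a sketch (the `max_size` cap is an assumption, not a documented parameter):

```python
from itertools import combinations

def topic_combinations(topics, max_size=3):
    """Every non-empty subset of a paper's topics, up to max_size items,
    as candidate TopicCombination values."""
    topics = sorted(topics)  # stable order so combinations are deterministic
    out = []
    for k in range(1, min(max_size, len(topics)) + 1):
        out.extend(combinations(topics, k))
    return out
```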
Run Apriori-like mining via LLM → creates (:FrequentTopicSet), (:LeftTopicSet)-[:RULES]->(:RightTopicSet) with support & confidence:
Ensure you're in the correct directory and venv is active:
- Make sure all required PDF files have been generated into nodes.
python run_llm_apriori.py

You can also change the mining method in `run_llm_apriori.py`: use the full LLM or a hybrid (LLM plus Python's `itertools`):

APRIORI_MODE = "hybrid"  # "hybrid" or "full"

The Apriori logic (frequent itemsets, rules) is driven by the LLM; Cypher is used only to persist the results into Neo4j.
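Support and confidence carry their standard Apriori meaning here. Over per-paper topic sets they can be computed as follows; this is a sketch of the Python side a hybrid mode might use, not the script's actual code:

```python
def support(itemset, transactions):
    """Fraction of papers whose topic set contains every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def confidence(left, right, transactions):
    """P(right | left): support of the union divided by support of left."""
    denom = support(left, transactions)
    if denom == 0:
        return 0.0
    return support(tuple(left) + tuple(right), transactions) / denom
```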
Recommends papers based on the learned co-occurrence patterns:
Ensure you're in the correct directory and venv is active:
- The list only shows PDFs that have already been generated as nodes; repeat step 2 first if the paper you want is missing.
- You can select more than one paper as a sample, for example:
  Select Paper: 1, 2, 3
python run_recommendation.py
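A plausible shape for the rule-based scoring (names and structure are illustrative; the script's implementation may differ): if the selected papers' topics match a rule's left-hand side, papers covering the right-hand side are ranked by rule confidence.

```python
def recommend(selected_topics, rules, paper_topics):
    """rules: list of (left_itemset, right_itemset, confidence);
    paper_topics: {paper_title: set_of_topics}."""
    scores = {}
    for left, right, conf in rules:
        if set(left) <= set(selected_topics):           # rule fires
            for paper, topics in paper_topics.items():
                if set(right) <= set(topics):           # paper covers consequent
                    scores[paper] = max(scores.get(paper, 0.0), conf)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```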
Run the embedding service first:

python services\embedding_service.py

To avoid hitting the token-rate limit, embed only ONE PDF file per run.
Browse & query your graph via a simple UI:
cd chatbot
streamlit run main.py

Then open your browser at http://localhost:8501.
| Script | Purpose |
|---|---|
| `create_topic_from_cso.py` | Imports CSO topics and hierarchy into Neo4j. |
| `create_paper.py` | Creates `(:Paper)` nodes from PDFs. |
| `create_mapping_topic.py` | Links topics (`HAS_TOPIC`) using the LLM. |
| `run_topic_modeling.py` | Runs LDA or LSA (scikit-learn) and prints topics/terms. |
| `run_llm_topic_modeling.py` | Runs LDA/LSA-like topic modeling via the LLM and prints topics/terms. |
| `create_combination.py` | Generates topic combinations per paper and persists `(:TopicCombination)`. |
| `run_llm_apriori.py` | Runs LLM Apriori-like mining to create `(:FrequentTopicSet)` and association rules. |
| `run_recommendation.py` | Recommends papers using the LLM Apriori-like outputs. |
| `chatbot/main.py` | Streamlit chatbot for querying and recommending papers. |
- This project uses the official Bolt driver through a small `GraphService` wrapper.
- If you later want to run pure Cypher/GDS Apriori (instead of the LLM Apriori-like approach), you'll need the APOC and GDS plugins enabled and configured.
- For full-text topic lookups, the code automatically creates a FULLTEXT INDEX on `(:Topic {label})` if one is not present.
- Array property queries: if a node property is an array, compare using array syntax, e.g.
  `MATCH (c:TopicCombination) WHERE c.items = ['neural network'] RETURN c`
- No topics created / "No topics found": ensure you ran `create_topic_from_cso.py` successfully and that your Neo4j credentials are correct.
- LLM prompt errors (missing template variables / dict has no attribute ...): these happen when the LLM returns malformed JSON or the prompt placeholders aren't escaped. The code includes guards and parsers; if it still happens, check the console logs for the printed raw snippet.
- Only a few topics get mapped: that's expected; mapping uses strict, CSO-guarded matching and confidence thresholds to avoid wrong links. Tune `min_confidence` (e.g., `0.85`), `top_k_map_each` (terms per model), and `max_topics_in_prompt` (candidate pool size).
- Long PDFs: the LLM pipelines read the full document content. If you hit model context limits, set a safety cap (e.g., `max_context_chars`) in the `LLMTopicModelingService` initialization.
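Such a safety cap could be as simple as character truncation; a sketch (the default size and word-boundary behavior are assumptions, not the service's actual parameter handling):

```python
def cap_context(text, max_context_chars=20000):
    """Truncate extracted PDF text so the prompt stays inside the
    model's context window, cutting at a word boundary when possible."""
    if len(text) <= max_context_chars:
        return text
    return text[:max_context_chars].rsplit(" ", 1)[0]
```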
- Neo4j GraphAcademy courses
- Computer Science Ontology (CSO), used for topic validation





