Healthcare NLP Classifier – Urgency Detection from Clinical Notes

This project builds a custom NLP pipeline to classify clinical text notes into four urgency levels: EMERGENCY, URGENT, ROUTINE, and NON-URGENT. It uses synthetic data containing no real patient information to simulate a real-world clinical triage support system.

This tool accepts a CSV dataset with any number of columns or rows. It automatically removes empty columns and rows, isolates the relevant text column, cleans it, and classifies each entry into one of four urgency levels using machine learning. No special formatting or structure is required beyond a CSV file with one column containing clinical text.
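The loading step described above could look roughly like the sketch below. The function name `load_text_column` and the rule for picking the text column (the non-numeric column with the longest average entry) are assumptions made for illustration; the notebook may isolate the column differently.

```python
import io

import pandas as pd


def load_text_column(csv_source, text_column=None):
    """Load a CSV, drop empty rows/columns, and return the clinical-text column."""
    df = pd.read_csv(csv_source)
    # Drop columns, then rows, that are entirely empty.
    df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
    if text_column is None:
        # Assumed heuristic: the free-text column is the non-numeric column
        # whose entries are longest on average.
        candidates = [c for c in df.columns
                      if not pd.api.types.is_numeric_dtype(df[c])]
        if not candidates:
            raise ValueError("No text column found in the CSV")
        text_column = max(
            candidates,
            key=lambda c: df[c].dropna().astype(str).str.len().mean(),
        )
    return df[text_column].dropna().astype(str).str.strip().reset_index(drop=True)


# Invented sample input with an empty column and an empty row.
sample_csv = (
    "id,note,empty\n"
    "1,Patient reports chest pain and shortness of breath.,\n"
    ",,\n"
    "2,Routine follow-up for medication refill.,\n"
)
notes = load_text_column(io.StringIO(sample_csv))
print(notes.tolist())
```

Passing `text_column` explicitly skips the heuristic, which is the safer choice when a dataset has several free-text columns.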

Key Takeaways

Built a full NLP classification pipeline from scratch, including data cleaning, feature engineering, model training, and evaluation
Implemented a composite scoring framework to select statistically robust models using both test and cross-validation metrics
Engineered a flexible, column-agnostic pipeline capable of handling messy real-world datasets, which is critical in applied AI workflows
Demonstrated practical experience with text vectorization, label encoding, and performance tradeoff analysis
Gained hands-on exposure to model interpretability challenges in NLP, an essential skill in real-world AI applications
Designed the project structure for modularity and future deployment (e.g., as an API or Streamlit app), reflecting industry best practices

Features

Preprocessing using NLTK and spaCy
Label encoding for multi-class classification
TF-IDF vectorization
Models tested: Multinomial Naive Bayes, SGD Classifier, Random Forest, and Logistic Regression
Evaluation using accuracy, confusion matrix, and classification report
Clean, modular codebase for reuse and scalability
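As a rough illustration of how the listed pieces fit together, the sketch below chains label encoding, TF-IDF vectorization, and a Multinomial Naive Bayes classifier with scikit-learn, then prints the three evaluation outputs named above. The eight example notes are invented for this sketch and are not rows from the project's dataset, the hyperparameters are library defaults rather than the project's settings, and the metrics are computed on the training examples only to keep the sketch short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Toy notes invented for illustration only.
texts = [
    "crushing chest pain and shortness of breath",
    "unresponsive patient with severe bleeding",
    "high fever and persistent vomiting since yesterday",
    "worsening abdominal pain over two days",
    "follow-up visit for blood pressure check",
    "annual physical exam scheduled",
    "request for medication refill",
    "question about vaccination records",
]
labels = [
    "EMERGENCY", "EMERGENCY", "URGENT", "URGENT",
    "ROUTINE", "ROUTINE", "NON-URGENT", "NON-URGENT",
]

# Map the four class names to integer ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# TF-IDF features feed directly into the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, y)

# Evaluated on the training notes here; a real run would use a held-out split.
pred = model.predict(texts)
matrix = confusion_matrix(y, pred)
print("Accuracy:", accuracy_score(y, pred))
print(matrix)
print(classification_report(y, pred, target_names=encoder.classes_, zero_division=0))

# Classify a new note and map the integer id back to its class name.
new_note = ["Patient reports chest pain and shortness of breath."]
predicted = encoder.inverse_transform(model.predict(new_note))[0]
print("Predicted urgency:", predicted)
```

Swapping `MultinomialNB()` for `SGDClassifier()`, `RandomForestClassifier()`, or `LogisticRegression()` in the same pipeline is one way to compare the other models listed.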

Model Performance

Best Test Accuracy: 85.00%
Cross-Validation Accuracy: 78.75% ± 4.54%
Top Performing Model: Multinomial Naive Bayes (based on composite scoring of accuracy and CV stats)
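The exact composite formula is not given here, so the following is only one plausible form: a weighted sum of test accuracy and mean cross-validation accuracy, minus a penalty for cross-validation spread. The function `composite_score`, its weights, and the second candidate's numbers are all assumptions made up for illustration; only the Multinomial Naive Bayes figures come from the results above.

```python
def composite_score(test_acc, cv_mean, cv_std, w_test=0.5, w_cv=0.5, w_std=1.0):
    """Hypothetical composite: reward accuracy, penalize unstable CV results."""
    return w_test * test_acc + w_cv * cv_mean - w_std * cv_std


candidates = {
    # Figures reported above: 85.00% test, 78.75% +/- 4.54% cross-validation.
    "Multinomial Naive Bayes": (0.85, 0.7875, 0.0454),
    # Made-up numbers, included only to show how a comparison would work.
    "hypothetical_candidate": (0.86, 0.74, 0.09),
}

scores = {name: composite_score(*stats) for name, stats in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Selected:", best)
```

Under these assumed weights, a model with slightly higher test accuracy but a much wider cross-validation spread scores lower, which is the kind of tradeoff a composite score is meant to capture.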

Dataset

Because real medical data is protected under HIPAA and not publicly available, a synthetic dataset was generated with OpenAI and used to train all models in this project.

Source: Custom synthetic dataset
Format: CSV with text and label columns
Classes: EMERGENCY, URGENT, ROUTINE, NON-URGENT
This dataset contains no real patient data.

Requirements

Install all required dependencies using:

pip install -r requirements.txt

Main libraries used:

pandas
scikit-learn
nltk
spacy
tqdm

Sample Output

Input Text:  "Patient reports chest pain and shortness of breath."
Predicted Urgency:  EMERGENCY

Real-World Use

This classifier simulates the core logic of a triage support tool used in hospitals or digital health platforms. In a production setting, it could:

Automatically flag urgent clinical notes from large volumes of incoming text
Prioritize patient cases based on urgency (e.g., EMERGENCY vs NON-URGENT)
Support clinicians, nurses, or telehealth platforms in real-time decision-making
Be integrated into EHR systems or patient intake apps to improve workflow efficiency
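The prioritization idea can be sketched as a simple sort over predicted labels. The rank mapping, the `prioritize` helper, and the note ids below are hypothetical and only illustrate the ordering; they are not part of the project's code.

```python
# Lower rank means the case should be reviewed sooner.
URGENCY_RANK = {"EMERGENCY": 0, "URGENT": 1, "ROUTINE": 2, "NON-URGENT": 3}


def prioritize(cases):
    """Sort (note_id, predicted_label) pairs so the most urgent come first."""
    return sorted(cases, key=lambda case: URGENCY_RANK[case[1]])


# Invented predictions for four incoming notes.
incoming = [
    ("note-3", "ROUTINE"),
    ("note-1", "EMERGENCY"),
    ("note-4", "NON-URGENT"),
    ("note-2", "URGENT"),
]
queue = prioritize(incoming)
print(queue)
```

Because Python's sort is stable, notes with the same predicted label keep their arrival order.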

While this project uses synthetic data, the pipeline design mirrors what would be required to run on real clinical data handled under HIPAA, after proper validation.

Future Improvements

Streamlit deployment for interactive prediction
Integration with BERT or other transformer models
Model explainability with SHAP or LIME

This project serves as a foundation for more advanced work in clinical NLP, including transformer-based models (e.g., BERT) and production-grade model deployment.

Engineering Highlights

Composite scoring system combining test and cross-validation metrics to select the statistically best model
Custom clean_text() preprocessing pipeline built from scratch using regex, NLTK, and spaCy
Dynamically loads and filters any dataset shape (column-agnostic design)
Designed with reproducibility and modularity in mind for easy handoff or deployment
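A minimal sketch of the kind of cleaning a function like clean_text() performs is shown below, using only the regex part. It is named `clean_text_sketch` to make clear it is not the project's implementation: the small inline stop-word set stands in for the NLTK stop-word list, and spaCy-based steps such as lemmatization are left out so the example runs without downloading language resources.

```python
import re

# Tiny stand-in stop-word set; the project is described as using NLTK and spaCy.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "with", "to", "for"}


def clean_text_sketch(text):
    """Lowercase, strip non-letters, and remove stop words (illustrative only)."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


cleaned = clean_text_sketch("Patient reports chest pain and shortness of breath.")
print(cleaned)  # patient reports chest pain shortness breath
```

Stripping digits this way also discards numeric values such as vital signs, so a version intended for real notes would likely need to treat numbers more carefully.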
