This project builds a custom NLP pipeline to classify clinical text notes into urgency levels: EMERGENCY, URGENT, ROUTINE, and NON-URGENT. It uses synthetic, HIPAA-compliant data to simulate a real-world clinical triage support system.
This tool is built to accept any CSV dataset, no matter the number of columns or rows. It automatically removes empty columns and rows, isolates the relevant text column, cleans it, and classifies each entry into one of four urgency levels using machine learning. No special formatting or structure is required. Just a CSV file with a column containing clinical text.
- Built a full NLP classification pipeline from scratch, including data cleaning, feature engineering, model training, and evaluation
- Implemented a composite scoring framework to select statistically robust models using both test and cross-validation metrics
- Engineered a flexible, column-agnostic pipeline capable of handling real-world messy datasets—critical in applied AI workflows
- Demonstrated practical experience with text vectorization, label encoding, and performance tradeoff analysis
- Gained hands-on exposure to model interpretability challenges in NLP—an essential skill in real-world AI applications
- Designed the project structure for modularity and future deployment (e.g., as an API or Streamlit app), reflecting industry best practices
- Preprocessing using NLTK and spaCy
- Label encoding for multi-class classification
- TF-IDF vectorization
- Models tested: Multinomial Naive Bayes, SGD Classifier, Random Forest, and Logistic Regression
- Evaluation using accuracy, confusion matrix, and classification report
- Clean, modular codebase for reuse and scalability
- Best Accuracy Achieved:
85.00%
test accuracy - Cross-Validation Accuracy:
78.75% ± 4.54%
- Top Performing Model:
Multinomial Naive Bayes
(based on composite scoring of accuracy and CV stats)
Since real medical data is protected under HIPAA and not publicly available, OpenAI was used to generate a synthetic dataset for training all models in this project.
- Source: Custom synthetic dataset
- Format: CSV with
text
andlabel
columns - Classes: EMERGENCY, URGENT, ROUTINE, NON-URGENT
- This dataset contains no real patient data.
Install all required dependencies using:
pip install -r requirements.txt
Main libraries used:
pandas
scikit-learn
nltk
spacy
tqdm
Input Text: "Patient reports chest pain and shortness of breath."
Predicted Urgency: EMERGENCY
This classifier simulates the core logic of a triage support tool used in hospitals or digital health platforms. In a production setting, it could:
- Automatically flag urgent clinical notes from large volumes of incoming text
- Prioritize patient cases based on urgency (e.g., EMERGENCY vs NON-URGENT)
- Support clinicians, nurses, or telehealth platforms in real-time decision-making
- Be integrated into EHR systems or patient intake apps to improve workflow efficiency
While this project uses synthetic data, the pipeline design mirrors what would be required for deployment on real HIPAA-compliant clinical data after proper validation.
- Streamlit deployment for interactive prediction
- Integration with BERT or other transformer models
- Model explainability with SHAP or LIME
- This project serves as a foundation for more advanced work in clinical NLP, including transformer-based models (e.g., BERT) and production-grade model deployment.
- Composite scoring system combining test and cross-validation metrics to select the statistically best model
- Custom
clean_text()
preprocessing pipeline built from scratch using regex, NLTK, and spaCy - Dynamically loads and filters any dataset shape (column-agnostic design)
- Designed with reproducibility and modularity in mind for easy handoff or deployment