This project applies machine learning techniques to predict Coronary Heart Disease (CHD) risk using medical data. The goal is to identify the most accurate classification model and discover the most predictive factors for CHD diagnosis.
- Support Vector Machine (SVM) - F1: 0.858, Accuracy: 83.9%
- K-Nearest Neighbors (KNN) - F1: 0.851, Accuracy: 83.0%
- Logistic Regression - F1: 0.837, Accuracy: 82.6%
The analysis revealed the following features as most predictive of CHD:
- AP (Angina Pectoris) - Most significant predictor
- RZ (ST Depression) - Cardiac stress indicator
- BloodSugar - Diabetes-related risk factor
- HFmax (Maximum Heart Rate) - Cardiovascular fitness indicator
- ECG_ST - Specific ECG abnormality pattern
- Angina Pectoris emerged as the strongest single predictor, confirming its clinical importance as a CHD symptom
- Traditional risk factors like cholesterol and blood pressure were less discriminative than symptomatic indicators
- The ECG "ST" pattern was the most relevant among ECG classifications
- Age, while important, was not among the top 5 predictors when combined with other factors
- Size: 270 patients (after removing 1 invalid entry)
- Features: 11 medical parameters
- Target: Binary CHD classification (55% positive, 45% negative)
- Data Quality: Minimal missing values, primarily cholesterol (handled via imputation)
- Age, Blood Pressure, Cholesterol, Blood Sugar
- Maximum Heart Rate (HFmax), ST Depression (RZ)
- Gender, ECG patterns (Normal, LVH, ST), Angina Pectoris (AP)
- Missing Value Treatment: Cholesterol zeros imputed using similar patient profiles
- Categorical Encoding: Label encoding for binary variables, one-hot encoding for ECG
- Data Splitting: 75% training, 25% testing with stratified sampling
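The three preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's exact code; the column names (`Cholesterol`, `ECG`, `AP`, `CHD`) and median imputation are assumptions, since the notebook imputes from similar patient profiles:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mirroring the dataset's schema (column names assumed)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(30, 80, 200),
    "Cholesterol": rng.integers(0, 400, 200),   # zeros mark missing values
    "ECG": rng.choice(["Normal", "LVH", "ST"], 200),
    "AP": rng.choice([0, 1], 200),              # binary, already label-encoded
    "CHD": rng.choice([0, 1], 200),
})

# 1) Missing value treatment: treat cholesterol zeros as missing, impute
#    (median here for brevity; the notebook uses similar patient profiles)
chol = df["Cholesterol"].replace(0, np.nan)
df["Cholesterol"] = chol.fillna(chol.median())

# 2) Categorical encoding: one-hot encode the multi-class ECG column
df = pd.get_dummies(df, columns=["ECG"], prefix="ECG")

# 3) Data splitting: stratified 75/25 split preserving the CHD class balance
X, y = df.drop(columns="CHD"), df["CHD"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Stratifying on `y` keeps the 55/45 class ratio roughly intact in both partitions, which matters for a dataset this small.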
Evaluated 7 classification algorithms:
- K-Nearest Neighbors, Support Vector Machine, Logistic Regression
- Decision Tree, Random Forest, Gradient Boosting, Neural Network
- Hyperparameter Tuning: Grid search for top 3 models
- Feature Engineering: Tested polynomial features, interactions, dimensionality reduction
- Cross-Validation: 5-fold CV for robust performance estimation
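A grid search over a scaled SVM pipeline, scored with 5-fold CV, looks like the sketch below. The parameter grid and synthetic data are placeholders, not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; the notebook uses the CHD feature matrix instead
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Scaling inside the pipeline keeps CV folds leakage-free
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`grid.best_estimator_` is then refit on the full training set and evaluated once on the held-out 25%.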
- Accuracy, Precision, Recall, F1-Score
- ROC-AUC and Precision-Recall curves
- Confusion matrices for detailed error analysis
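All of these metrics are available from `sklearn.metrics`; a minimal sketch on synthetic data (a logistic regression stand-in, not the tuned models from the table below):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # ROC-AUC needs scores, not labels

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f1       :", f1_score(y_te, y_pred))
print("roc-auc  :", roc_auc_score(y_te, y_prob))
print(confusion_matrix(y_te, y_pred))    # rows: true class, cols: predicted
```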
| Model | Accuracy | F1-Score | Precision | Recall | AUC |
|---|---|---|---|---|---|
| SVM (Optimized) | 83.9% | 0.858 | 83.5% | 88.2% | 0.882 |
| KNN (Optimized) | 83.0% | 0.851 | 84.6% | 84.3% | 0.845 |
| Logistic Regression | 82.6% | 0.837 | 85.1% | 83.5% | 0.894 |
- SVM: Highest sensitivity (88.2%) - best for screening applications
- KNN: Most balanced performance across all metrics
- Logistic Regression: Best overall discriminative ability (AUC: 0.894)
```
conda install scikit-learn conda-forge::umap-learn seaborn numpy=1.26.4
```

- scikit-learn: Core ML algorithms and evaluation
- pandas/numpy: Data manipulation and analysis
- seaborn/matplotlib: Visualization
- umap-learn: Advanced dimensionality reduction
- PCA: Explained 34% variance in first 2 components
- t-SNE: Revealed cluster patterns in patient data
- UMAP: Best separation of CHD classes
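The PCA step can be sketched as below; the figure for explained variance depends on standardizing first, since PCA is scale-sensitive. t-SNE (`sklearn.manifold.TSNE`) and UMAP (`umap.UMAP`) follow the same fit/transform pattern. Synthetic data stands in for the CHD features here:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the 270-patient, 11-feature CHD matrix
X, _ = make_classification(n_samples=270, n_features=11, random_state=0)

# Standardize before PCA so no single feature dominates the variance
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print("variance explained by 2 components:",
      round(pca.explained_variance_ratio_.sum(), 3))
```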
Models showed different decision-making patterns:
- SVM: Complex non-linear boundaries
- KNN: Local neighborhood-based decisions
- Logistic Regression: Linear separation with clear probabilistic interpretation
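The "clear probabilistic interpretation" of logistic regression comes from its coefficients being log-odds contributions: exponentiating a coefficient gives an odds ratio per unit change in that feature. A small sketch on synthetic data (not the fitted CHD model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = multiplicative change in odds per unit of the feature
odds_ratios = np.exp(clf.coef_[0])

# predict_proba yields a class probability directly, not just a label
p = clf.predict_proba(X[:1])[0, 1]
print(odds_ratios, p)
```

An odds ratio above 1 marks a feature that raises CHD odds; this is why the model's top coefficients line up naturally with the key-predictor list above.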
- Screening Priority: Patients with angina pectoris require immediate attention
- Risk Assessment: Combine ST depression, blood sugar, and max heart rate for comprehensive evaluation
- ECG Interpretation: ST pattern abnormalities are highly indicative of CHD risk
- Traditional risk factors (cholesterol, blood pressure) may need contextual interpretation
- Symptom-based indicators outperform demographic factors in this dataset
- Model could support clinical decision-making but requires validation on larger, diverse populations
- Small dataset size (270 patients)
- Single-center data (potential selection bias)
- Limited external validation
- Data Expansion: Include additional clinical parameters and larger patient cohorts
- Advanced Models: Explore ensemble methods and deep learning approaches
- Temporal Analysis: Incorporate patient history and disease progression
- External Validation: Test on independent datasets from different medical centers
- Feature Interactions: Deeper analysis of risk factor combinations
The complete analysis is available in chd_classifier.ipynb. The notebook runs end-to-end in seconds and includes:
- Comprehensive data exploration
- Model training and optimization
- Detailed performance analysis
- Clinical interpretation of results
```
chd-classifier/
├── chd_classifier.ipynb        # Main analysis notebook
├── CHD_Classification.csv      # Dataset
├── helpers/                    # Utility functions
│   ├── eval_model.py           # Model evaluation
│   ├── optimizer.py            # Hyperparameter optimization
│   ├── plotters.py             # Visualization functions
│   └── feature_engineering.py  # Feature engineering utilities
└── README.md                   # This file
```