This project explores and analyzes the Titanic dataset and applies various machine learning models to predict passenger survival. The analysis includes data cleaning, feature engineering, and model training using Logistic Regression, SVM, and Decision Tree algorithms. The project also involves creating and saving pipelines for the Decision Tree and Logistic Regression models. The accuracy achieved is not great because the dataset is better suited for classification models; this implementation is intended solely to demonstrate the use of regression models and the creation and saving of pipelines.
- Overview
- Dataset
- Installation
- Usage
- Exploratory Data Analysis
- Model Training
- Pipelines
- Results
- Contributing
The dataset used for this project is the Titanic dataset, which can be downloaded from Kaggle. It contains information about the passengers on the Titanic, including whether they survived or not.
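As a quick sanity check, the downloaded CSV can be loaded with pandas. The file name `train.csv` and its location in the project root are assumptions; adjust the path to wherever you saved the Kaggle file:

```python
import pandas as pd

# Load the Kaggle Titanic training data (file name and location are assumed)
titanic = pd.read_csv("train.csv")

# Inspect the first rows and the per-column missing-value counts
print(titanic.head())
print(titanic.isnull().sum())
```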
To run this project locally, follow these steps:
Clone the repository:

```bash
git clone https://github.com/Ich-Asadullah/Exploratory_Analysis_and_Training_different_models_on_Titanic_data
```

Navigate to the project directory:

```bash
cd Exploratory_Analysis_and_Training_different_models_on_Titanic_data
```

Install the required packages (note that scikit-learn is installed as `scikit-learn`, not `sklearn`):

```bash
pip install scikit-learn pandas seaborn matplotlib
```
- Open the Jupyter Notebook
- Run the notebook cells to perform EDA and train the models.
The EDA section includes the following steps (an illustrative sketch follows the list):
- Data Cleaning: Handling missing values and outliers.
- Feature Engineering: Creating new features from existing ones.
- Data Visualization: Visualizing the distribution of features and their relationships with the target variable.
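The snippet below is a minimal sketch of these three steps, assuming the standard Kaggle column names (`Age`, `Embarked`, `SibSp`, `Parch`, `Sex`, `Survived`); the exact cleaning choices and engineered features in the notebook may differ:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

titanic = pd.read_csv("train.csv")  # file name assumed, as above

# Data cleaning: fill missing ages with the median and missing ports with the mode
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])

# Feature engineering: family size from siblings/spouses and parents/children aboard
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"] + 1

# Data visualization: survival counts split by sex
sns.countplot(data=titanic, x="Sex", hue="Survived")
plt.title("Survival by Sex")
plt.show()
```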
The following models are trained and evaluated in the notebook (a brief training sketch follows the list):
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
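A minimal training sketch for these three models is shown below. It continues from the cleaned `titanic` DataFrame in the EDA sketch above, and the feature subset and train/test split are illustrative rather than the notebook's exact setup:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative numeric feature subset and target (assumes the EDA sketch has run)
X = titanic[["Pclass", "Age", "Fare", "FamilySize"]]
y = titanic["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and report accuracy on the held-out split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```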
Pipelines for the Decision Tree and Logistic Regression models have been created and saved to ensure smooth model training and evaluation. These pipelines include all the necessary preprocessing steps as well as the model itself.
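The sketch below shows one way such pipelines could be built and persisted with `joblib`; the preprocessing steps, column lists, and output file names are assumptions and may not match the pipelines shipped with the repository:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

numeric_features = ["Age", "Fare", "SibSp", "Parch"]    # illustrative
categorical_features = ["Pclass", "Sex", "Embarked"]    # illustrative

def make_pipeline_for(model):
    """Bundle imputation, scaling, and encoding with the given model."""
    preprocessor = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_features),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
    ])
    return Pipeline([("prep", preprocessor), ("model", model)])

titanic = pd.read_csv("train.csv")  # file name assumed
features = numeric_features + categorical_features

log_reg_pipeline = make_pipeline_for(LogisticRegression(max_iter=1000))
tree_pipeline = make_pipeline_for(DecisionTreeClassifier(random_state=42))
log_reg_pipeline.fit(titanic[features], titanic["Survived"])
tree_pipeline.fit(titanic[features], titanic["Survived"])

# Persist the fitted pipelines to disk (file names are assumptions)
joblib.dump(log_reg_pipeline, "logistic_regression_pipeline.pkl")
joblib.dump(tree_pipeline, "decision_tree_pipeline.pkl")
```

A saved pipeline can later be reloaded with `joblib.load(...)` and used to predict on raw passenger rows without repeating the preprocessing by hand.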
The performance of each model is evaluated, with accuracy as the main metric, and the results are compared to determine the best-performing model for this dataset.
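For example, a cross-validated accuracy comparison along the following lines could be used; it reuses the illustrative `models`, `X`, and `y` from the training sketch above and is not the notebook's exact evaluation code:

```python
from sklearn.model_selection import cross_val_score

# Compare the models with 5-fold cross-validated accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```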
Contributions are welcome! Please feel free to submit a Pull Request.