Car Price Prediction
In this article, we discuss two models, Linear Regression and Decision Tree Regression, that can be used to predict car prices from a given dataset. The entire notebook is available on GitHub.
Preparing the machine learning model follows the steps below:
- Importing the libraries & loading the data
- Data Analysis
- Model Selection
- Data preparation (data cleaning & modification)
- Feature Engineering
- Model Fitting
- Model Prediction
- Model Accuracy
- Comparing the models
We use a dataset from Kaggle, which can be downloaded from here - CarPrice.csv
We need the basic Python libraries NumPy and Pandas, and for data analysis and visualization we use Matplotlib and Seaborn.
To load the data we use the pd.read_csv method, as the dataset is in CSV (comma-separated values) format.
# importing the library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (20.0, 10.0)
# importing the data
data = pd.read_csv("car_price_assignment.csv")
Data analysis is a crucial part of any Machine Learning model. We will be doing a brief analysis (including visual analysis) of the dataset that has been provided to us.
# method to get the required info out about the data
data.info()
# method to describe the data
data.describe()
Interpreting the output
As seen in the notebook output above, our dataset has 26 columns: 8 of type float64, 8 of type int64, and 10 of type object. Each column has 205 non-null entries, so no column contains missing values.
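The same facts can be read off programmatically instead of from the printed info() output. The following is a minimal sketch: the small `sample` frame is made up for illustration and only borrows a few column names from the dataset, so the counts it prints are not those of the real data.

```python
import pandas as pd

# Made-up stand-in for the car dataset (illustrative values only)
sample = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "sedan", "wagon"],
    "horsepower": [111, 154, 102, 115],
    "wheelbase": [88.6, 94.5, 99.8, 99.4],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Number of columns per dtype (the summary info() prints at the bottom)
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)

# Missing values per column; all zeros means every column is fully populated
missing = sample.isnull().sum()
print(missing)
```

On the real dataset, replacing `sample` with `data` should reproduce the dtype distribution and confirm that no column has missing values.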
Notably, the mean car price is 13276.710571, with a standard deviation of 7988.852332.
Prices are concentrated mostly in the 5000 - 20000 range. To check for outliers in the price column, we can use a box plot.
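A minimal sketch of the outlier check is given below. The prices are made up for illustration (with one deliberately extreme value), and the 1.5 x IQR rule shown is the conventional box plot whisker rule, not necessarily the exact check used in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up prices (illustrative only), including one deliberately extreme value
prices = pd.Series(
    [5500, 7200, 8900, 10300, 12500, 13900, 16500, 18900, 45400],
    name="price",
    dtype=float,
)

# Box plot: points drawn beyond the whiskers are potential outliers
ax = sns.boxplot(x=prices)
ax.set_title("Distribution of car prices")

# The same 1.5 * IQR rule, applied numerically
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [45400.0]
```

With the real data, the same code would run on data['price'] instead of the made-up series.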
To see the correlation among the columns of the dataset, we use a heatmap.
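One way to draw such a heatmap is sketched below. The `sample` frame is made up for illustration, and the styling arguments (`annot`, `cmap`, `vmin`, `vmax`) are choices made here rather than settings taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in for the car dataset (illustrative values only)
sample = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "sedan", "wagon", "hatchback", "sedan"],
    "enginesize": [130, 152, 109, 136, 97, 181],
    "horsepower": [111, 154, 102, 115, 69, 160],
    "citympg": [21, 19, 24, 18, 31, 19],
    "price": [13495, 16500, 13950, 17450, 7349, 18150],
})

# Keep numeric columns only; text columns such as carbody have no correlation
corr = sample.select_dtypes(include="number").corr()

# Annotated heatmap with a fixed -1..1 colour scale so colours are comparable
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
```

Restricting the frame to numeric columns first avoids errors from the object columns when computing the correlation matrix.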
As we can see, we have a number of features, and our target is to predict the price. From the above analysis, it is evident that we can proceed with a Multiple Linear Regression (MLR) model to predict car prices.
Several models can handle a multi-feature regression problem like this one. Here we use Linear Regression and Decision Tree Regression, two widely used choices for predicting a continuous value such as price.
To prepare the data, we need to select the features for the MLR model. How should they be chosen? A common starting point is each column's correlation with the target variable: the stronger the correlation, positive or negative, the more useful the feature is likely to be. We also need to check how strongly the features correlate with one another, because features that are highly correlated with each other (multicollinearity) make the regression coefficients unstable and can cause prediction problems.
Based on the heatmap, we select the following features for our dataset:
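The two checks described above can be sketched as follows. The `sample` frame is made up for illustration, and the 0.9 threshold is a hypothetical cut-off chosen here, not a value from the notebook.

```python
import pandas as pd

# Made-up stand-in for the car dataset (illustrative values only)
sample = pd.DataFrame({
    "enginesize": [130, 152, 109, 136, 97, 181],
    "horsepower": [111, 154, 102, 115, 69, 160],
    "citympg": [21, 19, 24, 18, 31, 19],
    "highwaympg": [27, 26, 30, 22, 37, 25],
    "price": [13495, 16500, 13950, 17450, 7349, 18150],
})

corr = sample.corr()

# 1) Strength of each feature's correlation with the target, strongest first
target_corr = corr["price"].drop("price").abs().sort_values(ascending=False)
print(target_corr)

# 2) Feature pairs that are strongly correlated with each other
threshold = 0.9  # hypothetical cut-off for "too correlated"
features = corr.drop(index="price", columns="price")
redundant_pairs = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > threshold
]
print(redundant_pairs)
```

A pair flagged in the second step, such as city and highway mileage, carries largely the same information, so keeping only one of the two is a reasonable option.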
reg_features = ['wheelbase','carlength','carwidth','horsepower','curbweight','enginesize','citympg','highwaympg']
reg_target = ['price']
Anyone with a good knowledge of vehicle mechanics can derive new features from the existing ones. For example, the dataset has two columns, horsepower and peakrpm, from which a new feature named torque can be derived using the following formula:
data['torque'] = (data['horsepower']*5252)/data['peakrpm']
The new feature torque has a very good positive correlation with the target column price.
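The constant 5252 comes from the standard relation horsepower = torque (in lb-ft) x rpm / 5252, so the formula gives the torque at the engine speed where peak power is reached. A small worked example, using made-up rows for illustration:

```python
import pandas as pd

# Made-up rows (illustrative values only)
sample = pd.DataFrame({
    "horsepower": [111, 154, 102],
    "peakrpm": [5000, 5000, 5500],
})

# torque (lb-ft) = horsepower * 5252 / rpm
sample["torque"] = (sample["horsepower"] * 5252) / sample["peakrpm"]
print(sample)

# First row: 111 * 5252 / 5000 = 116.5944
first_torque = sample.loc[0, "torque"]
```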
Several more features, such as fuel efficiency, can be derived from the dataset, but we limit ourselves to this one because our aim is to understand the model.
For the linear regression model, we use the sklearn library in Python. The code is shown below:
# importing a method to split the data into train and test set
from sklearn.model_selection import train_test_split
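# building the feature matrix and the target from the columns selected above
# (assumption: the definitions of X and y are not shown in this article)
X = data[reg_features]
y = data[reg_target]
# using the method to split the data into train and test data (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)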
# importing the linear regression model (inbuilt)
from sklearn.linear_model import LinearRegression
# creating a linear regression object
lm = LinearRegression()
# fitting the model with the training data
lm.fit(X_train,y_train)
For the decision tree regression model, we again use the sklearn library in Python. The code is shown below:
# importing the decision tree regressor from sklearn
from sklearn.tree import DecisionTreeRegressor
# create a decision tree regressor object
regressor = DecisionTreeRegressor(random_state = 3)
# fit the regressor with training data
regressor.fit(X_train, y_train)
To check the model accuracy, we run the trained models on the test set and collect their predictions, as shown below:
# generating the predictions of the Linear Regression model
predictions = lm.predict(X_test)
# generating the predictions of the Decision Tree Regression Model
y_pred = regressor.predict(X_test)
This step is as important as the others, because after checking the accuracy we may do more feature engineering or data manipulation to improve it.
To measure the accuracy, we use the metrics functions defined in the sklearn library.
# importing the metrics methods
from sklearn import metrics
# the metrics below score the Linear Regression predictions;
# pass y_pred instead of predictions to score the Decision Tree model
# printing the mean absolute error
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
# printing the mean squared error
print('MSE:', metrics.mean_squared_error(y_test, predictions))
# printing the root mean squared error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
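The snippet above scores only the Linear Regression predictions. A minimal sketch of scoring both models with the same metrics is given below. The synthetic data and the `report` helper are assumptions made for illustration, so the numbers it prints say nothing about the car dataset.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the car dataset): two features, noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(50, 250, size=(200, 2))
y = 60 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 500, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)


def report(name, y_true, y_hat):
    """Print and return MAE, MSE and RMSE for one model's predictions."""
    mae = metrics.mean_absolute_error(y_true, y_hat)
    mse = metrics.mean_squared_error(y_true, y_hat)
    rmse = np.sqrt(mse)
    print(f"{name}: MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}


models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=3),
}

# Fit each model on the same split and score it on the same test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = report(name, y_test, model.predict(X_test))
```

Scoring both models on the same split with the same function keeps the comparison like for like.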
The fit of the regression models can be checked using a scatter plot, for example of predicted against actual prices.
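One common form of this check plots predicted against actual values, with a diagonal reference line marking perfect predictions. The sketch below uses made-up arrays in place of y_test and the model predictions; the exact plot used in the notebook may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual and predicted prices (illustrative only)
y_true = np.array([7349, 13495, 13950, 16500, 17450, 18150], dtype=float)
y_hat = np.array([8100, 12900, 14800, 15700, 18200, 17600], dtype=float)

fig, ax = plt.subplots()
ax.scatter(y_true, y_hat)

# Diagonal y = x: points on this line are predicted exactly
lims = [min(y_true.min(), y_hat.min()), max(y_true.max(), y_hat.max())]
ax.plot(lims, lims, linestyle="--")

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Predicted vs actual prices")
```

The closer the points lie to the dashed line, the better the fit; a systematic drift away from it at high or low prices suggests the model is missing some structure.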
The Root Mean Square Error (RMSE) can be used to compare any two regression models evaluated on the same test data. The lower the RMSE, the better the predictions.
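For reference, with $y_i$ the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of test samples, the three metrics used in this article have the following standard definitions (the notation is chosen here for illustration):

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because RMSE is expressed in the same units as the price, it can be read directly as a typical prediction error, and the squaring means large errors are penalised more heavily than in MAE.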
From the two models above we obtained the following RMSE values:
For the Linear Regression Model - 3363.3345325181976
For the Decision Tree Regression Model - 2330.660260354084
So the decision tree model gives better predictions on this test set, with an RMSE roughly 31% lower than that of the linear regression model.
[This article is contributed by Abhijit Tripathy, Python & ML Developer]
copyright @Abhijit Tripathy