Car Price Prediction

Car Price Prediction Problem

In this article, we discuss two models, Linear Regression and Decision Tree Regression, that can be used to predict car prices from a given dataset. The entire notebook is available at GITHUB.

Preparing the machine learning model follows the steps below:

Importing the libraries & loading the data
Data Analysis
Model Selection
Data preparation (data cleaning & modification)
Feature Engineering
Model Fitting
Model Prediction
Model Accuracy
Comparing the models

Importing the libraries & Loading the data

We have taken a dataset from Kaggle and the dataset can be downloaded from here - CarPrice.csv

We will need the basic Python libraries NumPy and Pandas, and for data analysis and visualization we will use Matplotlib and Seaborn.

To load the data we will use the pd.read_csv method, as the dataset is in CSV (comma-separated values) format.

# importing the library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (20.0, 10.0)

# loading the data (adjust the file name to match your downloaded CSV)
data = pd.read_csv("car_price_assignment.csv")

Data Analysis

Data analysis is a crucial part of any Machine Learning model. We will be doing a brief analysis (including visual analysis) of the dataset that has been provided to us.

# method to summarize the columns, their dtypes and their non-null counts
data.info()

# method to describe the data
data.describe()

Interpreting the output

As the output above shows, our dataset has 26 columns: 8 of type float64, 8 of type int64 and 10 of type object. Every column has 205 non-null entries, so we can conclude that there are no missing values.
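The column-type counts and the no-missing-values conclusion can also be checked directly instead of being read off the data.info() output. The following is a minimal sketch, assuming the DataFrame is named data as above; the helper name summarize_columns is hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> dict:
    """Return dtype counts, row count and missing-value count for a DataFrame."""
    return {
        # number of columns per dtype name
        "dtype_counts": {str(k): int(v) for k, v in df.dtypes.value_counts().items()},
        "rows": len(df),
        # total number of missing cells across all columns
        "missing": int(df.isnull().sum().sum()),
    }

# usage with the car dataset loaded earlier:
# summarize_columns(data)
```

A missing count of 0 confirms that no cleaning of empty cells is needed before modelling.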

One important thing to notice is that the mean car price is 13276.710571, with a standard deviation of 7988.852332.

Prices are concentrated in the 5000 - 20000 range. To check for outliers in the price column, we can use boxplots.
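By default, a Matplotlib or Seaborn boxplot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule numerically; the helper name iqr_outliers and its parameter k are hypothetical, and the quartiles are computed with the pandas default interpolation, which may differ slightly from the plotting library's.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# usage with the car dataset loaded earlier:
# iqr_outliers(data['price'])
# sns.boxplot(x=data['price'])   # draws the corresponding boxplot
```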

To see the correlation among the columns of the dataset, we are going to use a heatmap.
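The numbers behind such a heatmap come from the pairwise correlation matrix of the numeric columns. As a rough sketch (the helper name correlation_with_target is hypothetical, and the target column is assumed to be named price), the correlations with the target can be ranked like this:

```python
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    # object (text) columns are excluded because correlation needs numeric data
    corr = df.select_dtypes(include="number").corr()
    ranked = corr[target].drop(target)
    # order by absolute value so strong negative correlations rank high too
    return ranked.reindex(ranked.abs().sort_values(ascending=False).index)

# usage with the car dataset loaded earlier:
# correlation_with_target(data)
# sns.heatmap(data.select_dtypes(include="number").corr(), annot=True)
```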

Model Selection

As we can see, we have a number of features and our target is to predict the price. Based on the above analysis, we can proceed with a Multiple Linear Regression (MLR) model, which predicts the price from several features at once.

There are many regression models to choose from, but Linear Regression and Decision Tree Regression are two well-suited choices for predicting the price in this use case.

Data preparation (data cleaning & modification)

To prepare the data for the model, we need to select the features for the MLR. The main question is how to select them. Features should be chosen according to how strongly each column correlates with the target variable: the stronger the correlation, the better the feature is likely to be. We also need to look at the correlation of each feature with the other features, because features that are strongly correlated with one another can make the fitted coefficients unstable and give rise to prediction problems.

Based on the heatmap, we may select the following features for our dataset:

reg_features = ['wheelbase','carlength','carwidth','horsepower','curbweight','enginesize','citympg','highwaympg']
reg_target = ['price']
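The check for strongly correlated features described above can be sketched as follows. The helper name correlated_feature_pairs is hypothetical, and the 0.9 threshold is an arbitrary illustration rather than a value taken from the original notebook.

```python
import pandas as pd

def correlated_feature_pairs(df: pd.DataFrame, features: list, threshold: float = 0.9) -> list:
    """List (feature_a, feature_b, correlation) for pairs whose |correlation| exceeds threshold."""
    corr = df[features].corr()
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:      # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs

# usage with the feature list chosen above:
# correlated_feature_pairs(data, reg_features)
```

Any pair this returns is a candidate for dropping one of its two features.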

Feature Engineering

Anyone with a good knowledge of vehicle mechanics can derive a few new features from some core concepts. For example, among the columns we have two features, horsepower and peakrpm. A new feature named torque can be derived from them using the following formula:

data['torque'] = (data['horsepower']*5252)/data['peakrpm']

The new feature torque has a very good positive correlation with the target column price.
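The constant 5252 comes from the standard relation between power and torque in imperial units: horsepower = torque (lb-ft) x rpm / 5252. The small sketch below restates the formula as a function; the name torque_lbft is hypothetical. It assumes that horsepower is the value measured at peakrpm, so the result is the torque at that engine speed, which is not necessarily the engine's maximum torque.

```python
def torque_lbft(horsepower, rpm):
    """Torque in pound-feet from horsepower at a given engine speed in rpm."""
    return horsepower * 5252 / rpm

# at 5252 rpm, horsepower and torque in lb-ft are numerically equal
print(torque_lbft(100, 5252))   # 100.0
```

The same function works element-wise on pandas columns, for example torque_lbft(data['horsepower'], data['peakrpm']).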

Several more features, such as fuel efficiency, can be derived from the dataset, but we will limit ourselves to this one because our aim is to understand the model.

Model Fitting

Linear Regression Model

For the linear regression model, we are going to use the sklearn library in Python. The complete code is shown below:

# importing a method to split the data into train and test set
from sklearn.model_selection import train_test_split

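# building the feature matrix and the target
# (assumed here: the reg_features and reg_target lists chosen above)
X = data[reg_features]
y = data[reg_target]

# using the method to split the data into train and test data (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)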

# importing the linear regression model (inbuilt)
from sklearn.linear_model import LinearRegression

# creating a linear regression object
lm = LinearRegression()

# fitting the model with the training data
lm.fit(X_train,y_train)

Decision Tree Regression Model

For the decision tree regression model, we are again going to use the sklearn library in Python. The complete code is shown below:

# importing the decision tree regressor from sklearn
from sklearn.tree import DecisionTreeRegressor  
  
# create a decision tree regressor object 
regressor = DecisionTreeRegressor(random_state = 3)  
  
# fit the regressor with training data
regressor.fit(X_train, y_train)

Model Predictions

To predict the output, we run our trained models on the test dataset. The predicted values are then used to check the model accuracy. This can be done as below:

# generating the predictions of the Linear Regression model
predictions = lm.predict(X_test)

# generating the predictions of the Decision Tree Regression Model
y_pred = regressor.predict(X_test)

Model Accuracy

This part of model building is equally important, because after checking the accuracy we may go back and do more feature engineering or data manipulation to improve it.

To check the accuracy, we normally use the metrics functions defined in the sklearn library.

# importing the metrics methods 
from sklearn import metrics

# printing the mean absolute error
print('MAE:', metrics.mean_absolute_error(y_test, predictions))

# printing the mean square error
print('MSE:', metrics.mean_squared_error(y_test, predictions))

# printing the root mean square error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
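The calls above evaluate only the Linear Regression predictions. To obtain the same three metrics for the Decision Tree predictions, the computation can be wrapped in a small helper. This is a sketch; the function name regression_errors is hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn import metrics

def regression_errors(y_true, y_pred) -> dict:
    """Return MAE, MSE and RMSE for one set of predictions."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "MAE": metrics.mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),   # RMSE is the square root of the MSE
    }

# usage with the two sets of predictions generated earlier:
# regression_errors(y_test, predictions)   # Linear Regression
# regression_errors(y_test, y_pred)        # Decision Tree Regression
```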

The fit of the regression models can be checked visually using a scatterplot.
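One common form of such a plot puts the actual prices on the x-axis and the predicted prices on the y-axis, so that a well-fitted model places its points close to the diagonal. The sketch below assumes this layout; the original notebook may plot something different, and the function name plot_predicted_vs_actual is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_predicted_vs_actual(y_true, y_pred, title="Predicted vs actual price"):
    """Scatter predicted against actual values with a y = x reference line."""
    y_true = np.ravel(y_true)   # flatten (n, 1) column vectors to 1-D
    y_pred = np.ravel(y_pred)
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, alpha=0.6)
    # points on this diagonal are predicted exactly
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")
    ax.set_xlabel("Actual price")
    ax.set_ylabel("Predicted price")
    ax.set_title(title)
    return ax

# usage with the predictions generated earlier:
# plot_predicted_vs_actual(y_test, predictions, "Linear Regression")
# plot_predicted_vs_actual(y_test, y_pred, "Decision Tree Regression")
```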

Comparing The Models

The Root Mean Square Error (RMSE) can be used to compare the accuracy of two regression models evaluated on the same test data. The lower the RMSE, the better the model's predictions.
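For reference, the RMSE over a test set can be written as below. The symbols are notation introduced here, not taken from the original notebook: n is the number of test samples, y_i is the actual price and \hat{y}_i is the predicted price of the i-th sample.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared before averaging, large individual errors weigh more heavily, and the result is expressed in the same units as the price.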

The two models above produced the following RMSE values:

For the Linear Regression Model - 3363.3345325181976

For the Decision Tree Regression Model - 2330.660260354084

The decision tree model has the lower RMSE, so it gives the better predictions on this test set.

[This article is contributed by Abhijit Tripathy, Python & ML Developer]

copyright @Abhijit Tripathy
