
We are going to build an end-to-end machine learning project. This demo project will show you how to think while working on an end-to-end machine learning project, and it will help you analyze details from a business point of view rather than just modeling.

Let’s say you are working as a data scientist in a company. These are the main steps we will go through:

  1. Look at the big picture.
  2. Get the data.
  3. Discover and visualize the data to gain insights.
  4. Prepare the data for Machine Learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Test your model on the test dataset.

What is Machine Learning?

Machine learning is a subset of artificial intelligence concerned with developing algorithms and statistical models that allow computers to learn and make predictions or decisions without being explicitly programmed.

What do we really do when we say we are working on a machine learning model?

Basically, our end goal is to create a working model from given data that will predict the target values. That means you will have some input data and usually some output values as well; by finding the patterns between these inputs and outputs, you can create a model that will work on new data too.

Learn from an Example

Let’s take an example where I will try to cover all the necessary end-to-end machine learning model training steps. So shall we start?

Download the given dataset of student placements: Placement

We will work on this dataset. It is a dummy dataset for demo purposes only.

The given dataset has three columns: the first two are input (independent) and the third one is output (dependent). We will split our dataset into two parts, training and testing. First we will train our model on the training data, and then we will use that trained model on the test data to check its accuracy. Simple, right? At this point you will need a little knowledge of the Pandas library. I would suggest checking out my article on Pandas first: Pandas.

Steps to Create a Model:

  1. Preprocess + EDA + Feature Selection
  2. Extract input and output cols
  3. Scale the values (mean and std. dev)
  4. Train test split (cross validation)
  5. Train the model
  6. Evaluate the model/model selection
  7. Deploy the model

Note: These are not fixed steps. These are the steps I usually follow in my approach, and for any beginner an understanding of them is good enough. A compact preview of how several of them chain together follows.
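As that preview, here is a minimal sketch of steps 3 to 5 using scikit-learn’s Pipeline. The data here is synthetic placeholder data generated for illustration, not the placement dataset we use below:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# placeholder data: 200 rows, 2 input features, binary target
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# step 4: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# steps 3 and 5: scale the features, then fit a classifier
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)

# step 6: accuracy on the held-out rows
print(pipe.score(X_test, y_test))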

Open the Dataset and Preview

import numpy as np
import pandas as pd

# load the dataset into a DataFrame
df = pd.read_csv('placement.csv')

# preview the first five rows
df.head()
   Unnamed: 0  cgpa     iq  placement
0           0   6.8  123.0          1
1           1   5.9  106.0          0
2           2   5.3  121.0          0
3           3   7.4  132.0          1
4           4   5.8  142.0          0

Here we are reading our CSV file and showing the data in tabular form. By looking at the data we can see the three columns, independent and dependent, plus a leftover index column. We can quickly check whether the dataset contains any null values in two ways: first, df.isnull().sum(), and second, df.info(). As there are no null values in our dataset, we are good to go. If we had null values, we would have handled them with some special techniques.
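For reference, here is what those two checks look like, along with one common way a missing value could be filled if we had any (the fillna line is illustrative only; our data needs no such step):

# count missing values per column
df.isnull().sum()

# column types and non-null counts in one view
df.info()

# illustrative only: if 'cgpa' had gaps, one option is filling with the mean
# df['cgpa'] = df['cgpa'].fillna(df['cgpa'].mean())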

Our data also has an unnecessary column (Unnamed: 0) that we have to drop, so just select the necessary ones.

# keep every row and every column from the second one onward
df = df.iloc[:, 1:]
df.head()
   cgpa     iq  placement
0   6.8  123.0          1
1   5.9  106.0          0
2   5.3  121.0          0
3   7.4  132.0          1
4   5.8  142.0          0

As you can see, the extra column has been dropped. Remember, cgpa and iq are the input columns and placement is our target column. That completes our preprocessing step.

EDA (Exploratory Data Analysis)

This is the process of examining and understanding a dataset to uncover patterns, trends, anomalies, and relationships between variables. It involves data collection, data cleaning, univariate analysis, visualization, and many more steps. We will mainly use visualization here to get an understanding of the data.
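Before plotting, a quick univariate summary is often worth a look; a minimal sketch using the df we loaded above:

# summary statistics for each numeric column
df.describe()

# how many students fall in each placement class
df['placement'].value_counts()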

import matplotlib.pyplot as plt

# color each point by its placement label (0 or 1)
plt.scatter(df['cgpa'], df['iq'], c=df['placement'])
plt.xlabel('cgpa')
plt.ylabel('iq')
plt.show()

By executing this code, you obtain a scatter plot where the x-coordinate is the ‘cgpa’ value, the y-coordinate is the ‘iq’ value, and the color of each data point is determined by the ‘placement’ variable. This plot helps visualize potential relationships or patterns between ‘cgpa’ and ‘iq’ while considering ‘placement’ as an additional factor. If the two classes separate along a roughly linear boundary, a linear classifier is a sensible choice, so according to my understanding we should apply logistic regression here.

As for how I decided which algorithm we should use, please don’t worry about that right now. Our only goal is to understand the model building process.

Extract Input and Output Columns

Separate independent and dependent variables.

# all rows, first two columns (cgpa, iq) as input
X = df.iloc[:, 0:2]
# all rows, last column (placement) as output
y = df.iloc[:, -1]

Here capital X is the input and lowercase y is the output.

Train Test Split

Now do a train-test split on our data to create subsets, so that we can train on one subset and test on the other. More about train-test split here.

from sklearn.model_selection import train_test_split

# hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now scale the values. Again, this step is not mandatory; it mainly helps algorithms that are sensitive to feature scale, so apply it only when it suits your data.

from sklearn.preprocessing import StandardScaler

# standardize features to zero mean and unit variance
scaler = StandardScaler()

This demonstrates the usage of the StandardScaler class from the sklearn.preprocessing module in scikit-learn. It is common practice to import and instantiate this class when performing data preprocessing tasks. The line scaler = StandardScaler() creates an instance of the StandardScaler class and assigns it to the variable scaler.

X_train = scaler.fit_transform(X_train)

scaler.fit_transform(X_train): This line applies the fit_transform method on the scaler object using X_train as the input. The fit_transform method performs two steps, fitting the scaler to the training data and applying the transformation to it, in a single operation:

  • fit(): This step calculates the parameters needed to perform the desired transformation. For example, in the case of a scaler object, it calculates the mean and standard deviation (or other scaling parameters) from the training data.
  • transform(): This step applies the calculated parameters to the data, transforming it based on the specified method. For example, in this case, transform() scales or standardizes the features in X_train based on the mean and standard deviation calculated during the fitting step.
X_test = scaler.transform(X_test)

Note that on the test set we call only transform, reusing the mean and standard deviation learned from the training data. Fitting the scaler on the test data would leak information about it into the model.
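If you are curious what fit() actually learned, the fitted scaler exposes the per-column parameters as attributes; a quick optional check:

# the per-feature mean and standard deviation learned from X_train
print(scaler.mean_)   # one mean per input column
print(scaler.scale_)  # the standard deviations used for scaling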

Now use the Logistic Regression algorithm and train the model.

Train the Model

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

# model training
clf.fit(X_train, y_train)

clf: It represents the machine learning model, which could be any classifier or regressor from scikit-learn or another library. This model has specific methods and attributes that allow it to learn patterns and make predictions from the provided data.

fit: It is a method of the clf model that trains the model using the provided training data. The fit method takes two arguments: the feature matrix X_train and the target vector y_train. The model learns from the provided data by adjusting its internal parameters or coefficients to minimize the discrepancy between the predicted outputs and the actual target values.

By executing this code the machine learning model specified by clf is trained using the training data X_train and the corresponding target values y_train. The model learns patterns, relationships, or decision boundaries in the data to make accurate predictions or estimates.
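Once fitted, the logistic regression model exposes the parameters it learned as attributes; inspecting them is a quick, optional sanity check:

# the learned weight for each input column (cgpa, iq) and the bias term
print(clf.coef_)
print(clf.intercept_)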

Evaluate the Model

# generate predictions for the unseen test rows
y_pred = clf.predict(X_test)
y_pred

predict: It is a method of the trained model (clf) that makes predictions on new, unseen data. The predict method takes the test feature matrix (X_test) as input and returns the predicted target values based on the learned patterns in the training data.

By executing this code the trained model (clf) is used to generate predictions for the test data (X_test). The predicted target values are stored in the variable y_pred, which can be used for evaluation or further analysis.
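Because logistic regression is a probabilistic classifier, you can also ask for class probabilities instead of hard 0/1 labels; a short optional sketch:

# probability of each class for every test row
# (columns follow clf.classes_, i.e. 0 = not placed, 1 = placed)
y_proba = clf.predict_proba(X_test)
y_proba[:5]  # first five rows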

Check Accuracy Score

Now this is the score that we want. It tells us how accurately our model is working. The accuracy_score function from scikit-learn compares the predicted values (y_pred) with the true values (y_test) and returns the accuracy, which is the proportion of correctly predicted values.

# Check accuracy score
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
0.95
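As a quick cross-check, accuracy is just the fraction of predictions that match the true labels, so you can compute the same number by hand:

# manual accuracy: proportion of predictions equal to the true labels
(y_pred == y_test).mean()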

Our accuracy score is 95%, which is a good sign. That’s it, you did it: you have created a model that can predict the placement of students just by providing input data.

Full Code is Here.

Conclusion

I hope you learned something from these very basic steps to create a model and train it. You can try different datasets and calculate the accuracy score there. Building an end-to-end machine learning (ML) project involves a series of steps that transform raw data into a deployed ML model. Throughout this process, you engage in data preprocessing, model selection and training, evaluation, and deployment.

By Akshay Tekam

Software developer, data science enthusiast, content creator.
