InsideSherpa KPMG Virtual Internship Data Exploration and Model Development
If you just stumbled here, this is the second post in my series on the KPMG Virtual Internship on InsideSherpa. To know more about InsideSherpa and virtual internships, visit my earlier post, Inside InsideSherpa Virtual Internship.
You can find my earlier post on Data Quality Issues here.
Module 2, Data Exploration and Model Development, is the toughest
module in the KPMG virtual internship, but it is also the one where
you learn the most. The solution is expected in three steps: Data
Exploration, Model Development and Interpretation.
Data Exploration is the step in which you get to know your
data well. It is mostly about understanding what each column of data conveys.
This step involves the use of basic statistics like mean, median and mode to
better understand the distribution of the data. Use Excel pivot tables,
histograms, scatter plots, frequency tables, and relative frequency tables.
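As a rough illustration of these summaries outside Excel, here is a minimal pandas sketch. The data and the column names (age, wealth_segment) are made up for the example and are not taken from the internship dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 52, 29, 35, 60],
    "wealth_segment": ["Mass", "High", "Mass", "Affluent",
                       "Mass", "High", "Mass", "Affluent"],
})

# Central tendency of a numeric column
age_mean = df["age"].mean()
age_median = df["age"].median()
age_mode = df["age"].mode().iloc[0]  # mode() can return several values; take the first

# Frequency table and relative frequency table of a categorical column
freq = df["wealth_segment"].value_counts()
rel_freq = df["wealth_segment"].value_counts(normalize=True)

print(age_mean, age_median, age_mode)
print(freq)
print(rel_freq)
```

If matplotlib is installed, df["age"].plot.hist() draws the corresponding histogram.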
Classify each variable as either a predictor or a response/outcome
variable. To understand the data better, you can then look at ways to transform
the current variables into more meaningful data. Use binning, combine sheets, or
add or subtract two or more columns to get information that might be more
important to the business.
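The transformations mentioned above could look something like this in pandas. This is only a sketch: the two tables, the column names and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical "sheets"; names and values are illustrative assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [22, 37, 48, 65],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4, 4],
    "list_price": [100.0, 250.0, 80.0, 500.0, 60.0, 120.0],
    "standard_cost": [60.0, 200.0, 50.0, 350.0, 45.0, 70.0],
})

# Binning: turn a numeric age into age groups (bin edges chosen arbitrarily)
customers["age_group"] = pd.cut(
    customers["age"],
    bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", "61+"],
)

# Subtracting two columns: profit per transaction
transactions["profit"] = transactions["list_price"] - transactions["standard_cost"]

# Combining sheets: total profit per customer joined onto the customer table
profit_per_customer = transactions.groupby("customer_id", as_index=False)["profit"].sum()
combined = customers.merge(profit_per_customer, on="customer_id", how="left")

print(combined)
```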
The next step, Model Development, is where you start
to form links between different variables in the dataset. Use correlation to
identify whether there is a significant relationship between a predictor
variable and the outcome variable. For a categorical variable, you can use ANOVA
(Analysis of Variance) to see whether the mean of the outcome differs
significantly between two or more groups.
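Both checks can be sketched with SciPy as below. The data is invented, and using Pearson correlation and one-way ANOVA here is my assumption about which variants fit; other choices (for example Spearman correlation) may suit your data better.

```python
import pandas as pd
from scipy import stats

# Hypothetical data; column names are illustrative assumptions.
df = pd.DataFrame({
    "past_purchases": [10, 20, 30, 40, 50, 60],
    "profit": [110.0, 190.0, 320.0, 390.0, 510.0, 600.0],
    "segment": ["A", "A", "B", "B", "C", "C"],
})

# Pearson correlation between a numeric predictor and the outcome
r, p_corr = stats.pearsonr(df["past_purchases"], df["profit"])

# One-way ANOVA: does mean profit differ across the categories of "segment"?
groups = [g["profit"].to_numpy() for _, g in df.groupby("segment")]
f_stat, p_anova = stats.f_oneway(*groups)

print(f"correlation r={r:.3f}, p={p_corr:.4f}")
print(f"ANOVA F={f_stat:.2f}, p={p_anova:.4f}")
```

A small p-value (commonly below 0.05) suggests the relationship or group difference is unlikely to be due to chance alone.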
The next possible step could be to use the current customer data
to identify which customers are more valuable, using the RFM model (Recency,
Frequency and Monetary Value, three important parameters for any
business) and K-Means Clustering. This detailed article provides a
step-by-step procedure for applying the RFM model using Python.
The output of the RFM model, i.e., the overall score, can then be
a possible candidate for the response variable.
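To make the RFM idea concrete, here is a small sketch under stated assumptions: the transaction table, the column names, the 1-to-3 scoring scale and the choice of three clusters are all mine, and the linked article may do it differently.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transactions; names and values are illustrative assumptions.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6],
    "transaction_date": pd.to_datetime([
        "2017-10-01", "2017-11-15", "2017-12-30", "2017-03-10",
        "2017-08-05", "2017-09-20", "2017-09-01", "2017-10-10",
        "2017-11-20", "2017-12-25", "2017-01-15", "2017-06-01",
        "2017-07-04",
    ]),
    "amount": [200, 300, 500, 50, 100, 150, 400, 400, 400, 400, 80, 120, 180],
})

# Reference date: one day after the last transaction in the data
snapshot = tx["transaction_date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customer_id").agg(
    last_purchase=("transaction_date", "max"),
    frequency=("transaction_date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days


def score(series, labels):
    # rank(method="first") breaks ties so qcut always gets unique bin edges
    ranked = series.rank(method="first")
    return pd.qcut(ranked, q=len(labels), labels=labels).astype(int)


# Low recency is good, so its labels run high-to-low
rfm["r_score"] = score(rfm["recency"], [3, 2, 1])
rfm["f_score"] = score(rfm["frequency"], [1, 2, 3])
rfm["m_score"] = score(rfm["monetary"], [1, 2, 3])
rfm["rfm_score"] = rfm["r_score"] + rfm["f_score"] + rfm["m_score"]

# K-Means on the standardised R, F, M values (3 clusters is an arbitrary choice)
scaled = StandardScaler().fit_transform(rfm[["recency", "frequency", "monetary"]])
rfm["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

print(rfm[["recency", "frequency", "monetary", "rfm_score", "cluster"]])
```

The cluster numbers themselves carry no meaning; look at the average recency, frequency and monetary value within each cluster to decide which group is the most valuable.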
Now, using the predictor variables and the response variable,
you can use multiple linear regression to obtain a linear equation relating the
predictors to the response. For categorical variables, use dummy variables
(one-hot encoding) to convert them from qualitative to quantitative variables.
Also, use a train-test split (train_test_split in scikit-learn) to divide the
current customer dataset into training and test sets. Finally, evaluate your
model by analysing the R-squared and RMSE (root mean squared error).
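Putting these pieces together, a minimal scikit-learn sketch might look like the following. The data is synthetic and generated only so the example runs; the column names, the 80/20 split and the random seeds are assumptions, not part of the internship task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic customer data; names and coefficients are illustrative assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "past_purchases": rng.integers(0, 100, size=n),
    "wealth_segment": rng.choice(["Mass", "High", "Affluent"], size=n),
})
segment_effect = df["wealth_segment"].map({"Mass": 0.0, "High": 1.0, "Affluent": 2.0})
df["rfm_score"] = (
    0.02 * df["age"]
    + 0.05 * df["past_purchases"]
    + segment_effect
    + rng.normal(0, 0.3, size=n)  # random noise
)

# One-hot encode the categorical predictor; drop_first avoids a redundant column
X = pd.get_dummies(
    df[["age", "past_purchases", "wealth_segment"]],
    columns=["wealth_segment"],
    drop_first=True,
    dtype=float,
)
y = df["rfm_score"]

# Hold out 20% of the customers for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out test set
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("coefficients:", dict(zip(X.columns, model.coef_.round(3))))
print(f"R-squared={r2:.3f}, RMSE={rmse:.3f}")
```

The printed coefficients are the terms of the linear equation, which is what you would interpret in the final step.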
Below are a few links that I found helpful for
understanding multiple linear regression and other concepts:
- Statistics 101: Multiple Linear Regression, The Very Basics
- Machine Learning Tutorial Python - 3: Linear Regression Multiple Variables
- Multiple Linear Regression With Python
Hope you got some idea as to how you can proceed.
Note: The above method is neither the best approach nor guaranteed to be 100% correct. I am also a learner, so if you find something more useful or a better approach, please let me know by commenting.
Thanks and Happy Learning :)
Reviewed by KnowMore on April 26, 2020