InsideSherpa KPMG Virtual Internship Data Exploration and Model Development



If you have just stumbled upon this page, this is the second post in my series on the KPMG Virtual Internship on InsideSherpa. To learn more about InsideSherpa and virtual internships, see my earlier post, Inside InsideSherpa Virtual Internship.

You can find my earlier post on Data Quality Issues here.

Module 2, Data Exploration and Model Development, is the toughest module in the KPMG virtual internship, but it is also the one where you learn the most. The solution is expected in three steps: Data Exploration, Model Development, and Interpretation.

Data Exploration is the step in which you get to know your data well. It is mostly about understanding what each column of data conveys to you. It involves the use of basic statistics such as mean, median, and mode to better understand the distribution of the data. Use Excel pivot tables, histograms, scatter plots, frequency tables, and relative frequency tables.
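
As a rough illustration of these basic statistics, here is a minimal Python sketch. The list of ages is made up for the example and is not taken from the internship dataset, and I have used only the standard library so it runs anywhere (pandas would be the more usual choice on a real spreadsheet).

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical sample: customer ages (not from the internship dataset).
ages = [23, 35, 35, 41, 29, 35, 52, 41, 29, 60]

# Basic statistics that describe the centre of the distribution.
summary = {
    "mean": mean(ages),
    "median": median(ages),
    "mode": mode(ages),
}

# Frequency table: how often each value occurs.
frequency = Counter(ages)

# Relative frequency table: each count as a share of all observations.
relative_frequency = {
    value: count / len(ages) for value, count in frequency.items()
}

print(summary)
for value in sorted(frequency):
    print(value, frequency[value], f"{relative_frequency[value]:.0%}")
```

Comparing the mean with the median is a quick first check for skew: if the two are far apart, a histogram of that column is worth a look.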

Classify each variable as either a predictor or a response/outcome variable. To understand the data better, you can look for ways to transform the current variables into more meaningful ones. Use binning, combine sheets, or add or subtract two or more columns to get information that might be more important to the business.
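
The transformations above could look something like the sketch below. The records, the field names (dob, list_price, standard_cost), the age bands, and the reference date are all assumptions I picked for illustration, so adjust them to whatever your sheets actually contain.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"id": 1, "dob": date(1985, 6, 14), "list_price": 1200.0, "standard_cost": 800.0},
    {"id": 2, "dob": date(1999, 1, 3), "list_price": 450.0, "standard_cost": 300.0},
    {"id": 3, "dob": date(1962, 11, 27), "list_price": 2100.0, "standard_cost": 1500.0},
]

def age_on(dob, today):
    """Age in whole years on the given reference date."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

def age_band(age):
    """Bin a numeric age into a coarse category (cut-offs are arbitrary)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"

reference_date = date(2020, 4, 26)  # fixed so the result is reproducible
for c in customers:
    c["age"] = age_on(c["dob"], reference_date)
    c["age_band"] = age_band(c["age"])
    # Subtracting two columns gives a new, more meaningful one: profit.
    c["profit"] = c["list_price"] - c["standard_cost"]

for c in customers:
    print(c["id"], c["age"], c["age_band"], c["profit"])
```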

The next step, Model Development, is where you start to form links between the different variables in the dataset. Use correlation to identify whether there is a significant relationship between a predictor variable and the outcome variable. For a categorical variable, you can use ANOVA (Analysis of Variance) to see whether the mean of the outcome variable differs significantly between its groups.
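
To make the two checks concrete, here is a small sketch that computes the Pearson correlation coefficient and the one-way ANOVA F statistic by hand. The numbers and the variable names (tenure, spend, spend_by_segment) are invented for the example. Note that this only gives the F statistic; to decide significance you would still need a p-value, which a statistics package such as SciPy (scipy.stats.f_oneway) can provide.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical numeric predictor and outcome.
tenure = [1, 3, 5, 7, 9]
spend = [120, 180, 260, 310, 400]
r = pearson_r(tenure, spend)

# Hypothetical outcome values grouped by a categorical predictor.
spend_by_segment = [
    [100, 120, 110],
    [200, 220, 210],
    [300, 310, 320],
]
f_stat = anova_f(spend_by_segment)

print(f"r = {r:.3f}, F = {f_stat:.1f}")
```

A value of r close to +1 or -1 suggests a strong linear relationship, and a large F suggests that the group means differ by more than the spread within each group would explain.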
 
A possible next step is to use the current customer data to identify which customers are more valuable, using the RFM model (Recency, Frequency, and Monetary Value, three important parameters for any business) and K-Means clustering. This detailed article provides a step-by-step procedure for applying the RFM model using Python.
The output of the RFM model, i.e. the overall score, can then be a possible candidate for the response variable.
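
Here is one simple way the RFM overall score could be computed. This is only a sketch under my own assumptions: the transactions and the snapshot date are made up, and I score each dimension by rank into four bins and add the three scores together, rather than using K-Means clustering as the linked article does.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2020, 4, 1), 250.0),
    ("A", date(2020, 3, 15), 100.0),
    ("A", date(2020, 1, 10), 400.0),
    ("B", date(2019, 11, 5), 80.0),
    ("C", date(2020, 2, 20), 300.0),
    ("C", date(2020, 3, 30), 150.0),
    ("D", date(2019, 8, 1), 60.0),
]
snapshot = date(2020, 4, 26)  # assumed reference date for recency

# Aggregate recency (days since last purchase), frequency, and monetary value.
rfm = {}
for cid, when, amount in transactions:
    row = rfm.setdefault(cid, {"recency": None, "frequency": 0, "monetary": 0.0})
    days = (snapshot - when).days
    row["recency"] = days if row["recency"] is None else min(row["recency"], days)
    row["frequency"] += 1
    row["monetary"] += amount

def score(values, higher_is_better, bins=4):
    """Rank-based score from 1 (worst) to `bins` (best) per customer.

    Simplification: tied values are ranked in input order.
    """
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {cid: 1 + (rank * bins) // len(ordered) for rank, cid in enumerate(ordered)}

r = score({c: v["recency"] for c, v in rfm.items()}, higher_is_better=False)
f = score({c: v["frequency"] for c, v in rfm.items()}, higher_is_better=True)
m = score({c: v["monetary"] for c, v in rfm.items()}, higher_is_better=True)

# Overall score: a plain sum of the three dimension scores.
overall = {cid: r[cid] + f[cid] + m[cid] for cid in rfm}
print(overall)
```

Summing the three scores treats recency, frequency, and monetary value as equally important. If the business cares more about one of them, a weighted sum is an easy variation.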

Now, using the predictor variables and the response variable, you can use multiple linear regression to obtain a linear equation relating them. For categorical variables, use dummy variables or one-hot encoding to convert them from qualitative to quantitative variables. Also, use a train/test split (for example, scikit-learn's train_test_split) to divide the current customer dataset into training and test sets. Finally, evaluate your model by analysing the R-squared and RMSE.
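
The whole pipeline described above (encoding, splitting, fitting, and evaluating) can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the data is synthetic with a relationship I chose myself, the segment names are hypothetical, and the regression is solved by hand through the normal equations. In practice you would more likely use pandas get_dummies together with scikit-learn's train_test_split and LinearRegression.

```python
import random
from math import sqrt

SEGMENTS = ["Mass", "Affluent", "High Net Worth"]  # hypothetical categories

def one_hot(value, categories):
    """Dummy-encode a category; the first level is dropped as the baseline."""
    return [1.0 if value == c else 0.0 for c in categories[1:]]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(xs, ys):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    rows = [[1.0] + x for x in xs]  # the leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, x):
    """Apply the fitted linear equation to one feature row."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

# Synthetic data with a known relationship, so the fit can be sanity-checked.
rng = random.Random(0)
features, target = [], []
for _ in range(200):
    tenure = rng.randint(1, 20)
    segment = rng.choice(SEGMENTS)
    bonus = {"Mass": 0.0, "Affluent": 1.5, "High Net Worth": 3.0}[segment]
    features.append([float(tenure)] + one_hot(segment, SEGMENTS))
    target.append(2.0 + 0.5 * tenure + bonus + rng.gauss(0, 0.5))

# Train/test split: shuffle the row indices, keep 80% for training.
indices = list(range(len(features)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train, test = indices[:cut], indices[cut:]

beta = fit([features[i] for i in train], [target[i] for i in train])

# Evaluate on the held-out test rows only.
actual = [target[i] for i in test]
predicted = [predict(beta, features[i]) for i in test]
mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot
rmse = sqrt(ss_res / len(actual))

print("coefficients:", [round(b, 2) for b in beta])
print(f"R-squared: {r2:.3f}, RMSE: {rmse:.3f}")
```

One level of the categorical variable is dropped when encoding because keeping all of them alongside an intercept makes the columns perfectly collinear, and the regression then has no unique solution. R-squared and RMSE are computed on the test rows only, so they show how the model behaves on data it has not seen.
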
Below is a list of a few links that I found helpful for understanding multiple linear regression and other concepts:
  1. Statistics 101: Multiple Linear Regression, The Very Basics
  2. Machine Learning Tutorial Python - 3: Linear Regression Multiple Variables
  3. Multiple Linear Regression With Python
Hope you got some idea as to how you can proceed. 

Note: The above method is neither the best approach nor 100% correct. I am also a learner, so if you find something more useful or a better approach, please let me know by commenting.

Thanks and Happy Learning :)
Reviewed by KnowMore on April 26, 2020
