InsideSherpa KPMG Data Analytics Virtual Internship

If you arrived here while searching for a virtual internship and want to know more about the InsideSherpa Virtual Internship Platform, visit my earlier article, Inside InsideSherpa Virtual Internship.

If you are a participant in the course looking for solutions to the modules, you will be disappointed: this post only describes some ways to move ahead with a module if you are stuck. It might not be 100% accurate and should be taken only as guidance.

The first module is Data Quality Assessment, a step in data analysis where you look for issues in the data set provided to you. These issues can have any number of causes, but rectifying them before proceeding is very important; otherwise they can lead to highly inaccurate results and a loss of time, effort and money.

One of the most common data quality issues is missing values. In the three sheets provided, look for columns with empty cells. Missing values can be treated in multiple ways:

· Generally, in the case of categorical data (for example “wealth_segment” in the “CustomerDemographic” sheet), you should not assign a random value or the most frequent value, as this will skew the data. If the rows with missing data are a small percentage of the total, say ~1%, you can consider deleting or ignoring them. If the number of rows with missing data is large, it is better to exclude that column from further analysis.

· In the case of continuous data, you can fill the empty cells with the average value of the column.
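If you prefer code to spreadsheets, the two treatments above can be sketched in plain Python. This is only a minimal illustration: the sample rows are made up, "tenure" is an assumed name for a continuous column, and the helper functions are hypothetical names chosen for this example.

```python
from statistics import mean

# Made-up sample rows. "wealth_segment" is the categorical column mentioned
# above; "tenure" is an assumed continuous column used for illustration.
rows = [
    {"customer_id": 1, "wealth_segment": "Mass Customer", "tenure": 11.0},
    {"customer_id": 2, "wealth_segment": None, "tenure": 16.0},
    {"customer_id": 3, "wealth_segment": "High Net Worth", "tenure": None},
    {"customer_id": 4, "wealth_segment": "Affluent Customer", "tenure": 9.0},
]

def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""

def missing_share(rows, column):
    """Fraction of rows whose value in `column` is missing."""
    return sum(1 for r in rows if is_missing(r[column])) / len(rows)

def drop_missing(rows, column):
    """Categorical treatment: keep only rows that have a value in `column`."""
    return [r for r in rows if not is_missing(r[column])]

def fill_with_mean(rows, column):
    """Continuous treatment: replace missing values with the column average."""
    average = mean(r[column] for r in rows if not is_missing(r[column]))
    return [
        {**r, column: average if is_missing(r[column]) else r[column]}
        for r in rows
    ]

share = missing_share(rows, "wealth_segment")
without_gaps = drop_missing(rows, "wealth_segment")
filled = fill_with_mean(rows, "tenure")
```

Check `share` first: whether deleting rows is acceptable depends on how small that fraction is in the real sheet.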

Consistency is another issue. It arises because the data is collected at different points of interaction and by different people, who tend to enter values with the same meaning in different ways. Every column in the dataset should have only one representation for values with the same meaning (for example, the gender Male may appear as both Male and M).
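As a rough sketch of how such a column could be brought to a single representation, the snippet below maps known variants to one label and leaves anything unrecognised untouched so it can be reviewed by hand. The mapping and the sample values are assumptions made for illustration; check the actual variants in your sheet before relying on it.

```python
from collections import Counter

# Assumed variants; extend this after inspecting the real column.
GENDER_MAP = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}

def normalise_gender(value):
    """Map known variants to one label; return unknown values unchanged."""
    key = str(value).strip().lower()
    return GENDER_MAP.get(key, value)

raw = ["Male", "M", "F", "Female", "male ", "U"]  # made-up sample values
cleaned = [normalise_gender(v) for v in raw]

# Counting distinct values before and after makes the inconsistency visible.
before = Counter(raw)
after = Counter(cleaned)
```

Here `before` has six distinct spellings and `after` has three labels, one of which ("U") is not in the mapping and still needs a decision.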

Digging deeper, look at the association between the sheets: the data in each sheet should be consistent with the data in the others. Start with the columns that are present in multiple sheets. Any mismatch is a probable data quality issue and must be reported.
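One way to check this association is to compare the values of a shared column across two sheets using sets. The sketch below assumes the shared column is a customer identifier and uses made-up values; substitute whichever column actually appears in more than one sheet.

```python
# Made-up identifiers standing in for a column shared by two sheets.
demographic_ids = {1, 2, 3, 4}        # e.g. from a customer sheet
transaction_ids = [1, 2, 2, 5, 4, 7]  # e.g. from a transactions sheet

# Identifiers used in transactions that have no matching customer record.
unknown_customers = sorted(set(transaction_ids) - demographic_ids)

# Customers that never appear in the transactions.
inactive_customers = sorted(demographic_ids - set(transaction_ids))
```

The first list is the more likely candidate for a data quality issue; the second may simply be customers with no activity, so report it with that caveat.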

Most of the data cleaning described above can easily be done in Microsoft Excel by using filters (shortcut: Ctrl+Shift+L) and carefully going through each column.

If you want to learn how to perform data cleaning using Python, you can go through this link.

I hope you found this helpful. If so, please share it with your friends and colleagues. If you want some ideas about Data Exploration and Model Development, have a look at this post.

Thanks
