InsideSherpa KPMG Data Analytics Virtual Internship
If you arrived here while searching for a virtual internship and want to know more about the InsideSherpa Virtual Internship Platform, visit my earlier article, Inside InsideSherpa Virtual Internship.
If you are a participant in the course and are looking for solutions to the modules, you will be disappointed: this post only describes some ways to move ahead with a module if you are stuck. It might not be 100% accurate and should be taken only as guidance.
The first module is the Data Quality Assessment, a step in data analysis where you look for issues in the data set provided to you. There can be any number of reasons for those issues, but rectifying them before proceeding is very important; otherwise they can lead to highly inaccurate results and a loss of time, effort and money.
One of the most common data quality issues is missing values. In the three sheets provided, look for columns with empty cells. Missing values can be treated in multiple ways:
· Generally, in the case of categorical data (for example “wealth_segment” in the “CustomerDemographic” sheet), you cannot assign a random value or the most frequent value, as this will skew the data. In such cases, if the rows with missing data are a small percentage of the total, say ~1%, you can consider deleting or ignoring them. If the number of rows with missing data is large, it is better to exclude that column from further analysis.
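The rule of thumb above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample rows, the helper names (`is_missing`, `missing_share`, `handle_missing`) and the choice to treat anything above the ~1% mark as "large" are all made up for the example and are not part of the internship material.

```python
# Hypothetical sample rows; the real sheet has many more rows and columns.
rows = [
    {"customer_id": 1, "wealth_segment": "Mass Customer"},
    {"customer_id": 2, "wealth_segment": ""},
    {"customer_id": 3, "wealth_segment": "High Net Worth"},
    {"customer_id": 4, "wealth_segment": None},
]

def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_share(rows, column):
    """Fraction of rows (0.0 to 1.0) whose value in `column` is missing."""
    if not rows:
        return 0.0
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)

def handle_missing(rows, column, threshold=0.01):
    """Drop the affected rows when few are missing, otherwise drop the column.

    The 1% default mirrors the "~1%" guideline; the either/or split is an
    assumption made for this sketch.
    """
    if missing_share(rows, column) <= threshold:
        return [r for r in rows if not is_missing(r.get(column))]
    return [{k: v for k, v in r.items() if k != column} for r in rows]

share = missing_share(rows, "wealth_segment")
cleaned = handle_missing(rows, "wealth_segment")
```

In this toy sample half of the values are missing, so the column is excluded rather than the rows. On real data you would inspect `share` for each column first and decide case by case.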
Consistency is another issue. It arises because the data is collected at different points of interaction and by different people, who tend to enter values with the same meaning in different ways (for example, the gender Male can be entered as Male or M). Every column in the dataset should have only one representation for values with the same meaning.
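As a rough sketch of how such a consistency check might look in Python: count the distinct values in a column, then map the variants onto one label. The sample values and the `CANONICAL` mapping below are assumptions for illustration; the variants in the actual sheets have to be found by inspecting the column.

```python
from collections import Counter

# Hypothetical raw values as they might appear in a gender column.
raw_gender = ["Male", "M", "male ", "Female", "F", "U"]

# Assumed mapping from each lowercased variant to one canonical label.
CANONICAL = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}

def normalize(value):
    """Map a raw value to its canonical label; leave unknown values untouched."""
    key = value.strip().lower()
    return CANONICAL.get(key, value)

# Counting distinct values before and after makes the variants visible.
before = Counter(raw_gender)
after = Counter(normalize(v) for v in raw_gender)
```

Values that are not in the mapping (here "U") are left as they are, so they still show up in the counts and can be reported instead of being silently changed.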
Digging deeper, look at the association between the different sheets, i.e. the data given in each sheet should be consistent with the data in the others. Any mismatch is a probable candidate for a data quality issue and must be reported. Look for the columns that are present in multiple sheets.
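One way to check this association, sketched in Python with sets: take a column that appears in two sheets and list the values present in one but not in the other. The ID values and the helper `unmatched_keys` are hypothetical; which column is shared between the real sheets is something you need to confirm yourself.

```python
# Hypothetical ID values from a key column shared by two sheets.
demographic_ids = {1, 2, 3, 4}
transaction_ids = [1, 2, 2, 5, 7]

def unmatched_keys(keys, reference_keys):
    """Return the sorted keys that appear in `keys` but not in `reference_keys`."""
    return sorted(set(keys) - set(reference_keys))

# IDs that have transactions but no demographic record.
orphans = unmatched_keys(transaction_ids, demographic_ids)

# IDs that have a demographic record but no transactions.
unused = unmatched_keys(demographic_ids, transaction_ids)
```

Here `orphans` is the more suspicious list, since those records refer to customers that the other sheet does not know about. Whether IDs in `unused` are a problem depends on the data.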
Most of the data cleaning described above can easily be done in Microsoft Excel by applying filters (shortcut: Ctrl+Shift+L) and carefully going through each column.
If you want to learn how to perform data cleaning using Python, you can go through this link.
I hope you found this helpful. If so, please share it with your friends and colleagues. If you want to get some idea about Data Exploration and Model Development, have a look at this post.
Thanks
Reviewed by KnowMore on April 24, 2020