Real-world data is rarely of good quality. It often contains incorrect values caused by human error, transmission errors, collection errors, and so on. In this article we will look at the common data quality errors and how to handle each of them in detail.
Errors fall into the following categories:
- Incomplete (Missing)
Lacking attribute values, e.g. Age = ""
- Noisy
Containing noise or outliers, e.g. Age = "-50"
- Inconsistent
Containing discrepancies
E.g. Age = "10" while Birthdate = "03/03/2003"
A rating once recorded as "1, 2, 3" now being recorded as "A, B, C"
- Duplicate
Containing records with the same attribute values
E.g. two records with the same information for a person named "Nanda"
Missing (Incomplete) Data:
Data is not always available.
It could be because of any of the following:
- Equipment malfunction
- Inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data not considered important during data collection
How to handle missing data?
- Ignore the records with missing values
- Fill in the missing values manually
- Fill in automatically with a global constant, the attribute mean, or the most probable value (inferred using Bayes' theorem or a decision tree)
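The "ignore" and "fill in automatically with the attribute mean" strategies above can be sketched with pandas. The data set and the names in it are hypothetical, made up for illustration:

```python
import pandas as pd

# Hypothetical data set: Nanda's Age is missing (Age = "")
df = pd.DataFrame({"Name": ["Asha", "Nanda", "Ravi"],
                   "Age": [25.0, None, 31.0]})

# Option 1: ignore the records with missing values
dropped = df.dropna()

# Option 2: fill in automatically with the attribute mean
mean_age = df["Age"].mean()           # mean() skips missing values: (25 + 31) / 2 = 28.0
filled = df.fillna({"Age": mean_age})

print(filled["Age"].tolist())         # [25.0, 28.0, 31.0]
```

Which option is appropriate depends on how much data you can afford to lose: dropping records is safe only when missing values are rare and random.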
Noisy Data:
Noise is a random error or variance in a measured variable.
It could be caused by any of the following:
- Equipment malfunction
- Data entry problems
- Data transmission problems
- Technology limitation
Noise refers to the modification of original values.
Outliers are data objects with characteristics that differ significantly from most of the other data objects in the data set.
How to handle noisy data?
- Binning
First sort the data and partition it into equal-frequency bins, then smooth by replacing each value with its bin mean, bin median, or nearest bin boundary.
- Regression
Smooth by fitting the data into regression functions
- Clustering
Detect and remove outliers
- Combined computer and human inspection
Detect suspicious values automatically and have a human verify them
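The binning technique above can be sketched in plain Python. The data values are hypothetical; the list is assumed to be sorted and to divide evenly into the requested number of bins:

```python
def bin_means(values, n_bins):
    """Equal-frequency binning: replace each value by its bin mean."""
    size = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), size):
        bin_ = values[i:i + size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

def bin_boundaries(values, n_bins):
    """Equal-frequency binning: replace each value by its nearest bin boundary."""
    size = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), size):
        bin_ = values[i:i + size]
        lo, hi = bin_[0], bin_[-1]
        smoothed.extend(lo if v - lo <= hi - v else hi for v in bin_)
    return smoothed

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

print(bin_means(data, 3))
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
print(bin_boundaries(data, 3))
# [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```

The smoothed output keeps the same number of values but flattens local fluctuations, which is exactly what makes binning useful against noise.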
Inconsistent Data:
A data set may contain discrepancies or mismatches between related attribute values.
How to handle inconsistent data?
- Data Scrubbing:
Use simple domain knowledge to detect errors and make corrections
- Data Auditing:
Analyze the data to discover rules and relationships, then detect violators of those rules and correct them
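A minimal sketch of an auditing rule, using the Age/Birthdate discrepancy from earlier as the rule to enforce. The records, names, and reference date are hypothetical:

```python
from datetime import date

# Hypothetical records; Nanda's Age = 10 contradicts the Birthdate 03/03/2003
records = [
    {"Name": "Nanda", "Age": 10, "Birthdate": date(2003, 3, 3)},
    {"Name": "Asha",  "Age": 21, "Birthdate": date(2003, 5, 1)},
]

def audit(records, as_of=date(2024, 6, 1)):
    """Rule: recorded Age must equal the age computed from Birthdate."""
    violators = []
    for r in records:
        b = r["Birthdate"]
        # Subtract 1 if the birthday has not yet occurred this year
        computed = as_of.year - b.year - ((as_of.month, as_of.day) < (b.month, b.day))
        if computed != r["Age"]:
            violators.append(r["Name"])
    return violators

print(audit(records))  # ['Nanda']
```

Once the violators are flagged, the corrections themselves (here, deciding whether Age or Birthdate is wrong) usually need domain knowledge or human review.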
Duplicate Data:
A data set may include data objects that are duplicates of one another.
How to handle duplicate data:
- Eliminate the duplicate records, keeping one copy of each
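Duplicate elimination can be sketched with pandas, again using hypothetical records including the duplicated "Nanda" entry from earlier:

```python
import pandas as pd

# Two records carry the same information for "Nanda"
df = pd.DataFrame({"Name": ["Nanda", "Nanda", "Ravi"],
                   "Age": [21, 21, 31]})

# Eliminate exact duplicates, keeping the first copy of each record
deduped = df.drop_duplicates()
print(deduped["Name"].tolist())  # ['Nanda', 'Ravi']
```

Note that `drop_duplicates` only catches exact matches; near-duplicates (e.g. "Nanda" vs "nanda ") need normalization or fuzzy matching first.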
I hope this article was helpful for you.