Real-world data is rarely of good quality. It often contains incorrect values caused by human error, transmission error, collection error and so on. This article looks at the common data quality problems and how to handle each of them.

Data can be of poor quality because it is:

  • Incomplete (Missing)

Lacking attribute values, e.g. Age = “”

  • Noisy

Containing noise or outliers, e.g. Age = “-50”

  • Inconsistent

Containing discrepancies between values

E.g. Age = “10”, Birthdate = “03/03/2003”

       Rating previously recorded as “1, 2, 3”, now recorded as “A, B, C”

  • Duplicate

Containing the same attribute values in more than one record

E.g. Two records with the same information about a person named “Nanda”

Missing (Incomplete) Data:

Data is not always available.

Values may be missing for the following reasons:

  • Equipment malfunction
  • Values deleted because they were inconsistent with other recorded data
  • Data not entered due to misunderstanding
  • Certain data not considered important during data collection

How to handle missing data?

  • Ignore those values
  • Fill in manually
  • Fill in automatically with a global constant, the attribute mean, or the most probable value (inferred with Bayes' theorem or a decision tree)
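
As a rough illustration of the first and third options, here is a minimal Python sketch. The function name fill_missing, the use of None to mark a missing value, and the sample ages are all assumptions made for this example. It does not cover the "most probable value" option, which needs a trained model.

```python
from statistics import mean


def fill_missing(values, strategy="mean", constant=0):
    """Return a new list in which missing entries (None) are handled.

    strategy="ignore"   drops the missing entries
    strategy="constant" replaces them with `constant`
    strategy="mean"     replaces them with the mean of the known values
    """
    known = [v for v in values if v is not None]
    if strategy == "ignore":
        return known
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        # Raises statistics.StatisticsError if every value is missing.
        fill = mean(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [25, None, 35, 30, None]          # hypothetical sample data
print(fill_missing(ages, "ignore"))        # [25, 35, 30]
print(fill_missing(ages, "constant", -1))  # [25, -1, 35, 30, -1]
print(fill_missing(ages, "mean"))          # [25, 30, 35, 30, 30]
```

Filling with the mean keeps the attribute average unchanged but reduces its variance, so it suits cases where only a small share of the values is missing.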

Noisy Data:

Random error or variance in a measured variable

Noise can be caused by:

  • Equipment malfunction
  • Data entry problems
  • Data transmission problems
  • Technology limitation

Noise refers to the modification of original values.

Outliers are data objects whose characteristics differ from those of most other data objects in the data set.

How to handle noisy data?

  • Binning

First sort the data and partition it into equal-frequency bins, then smooth by bin means, bin medians or bin boundaries.
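
The binning steps can be sketched in a few lines of Python. The function names and the sample prices are assumptions made for this example. For boundary smoothing, ties are sent to the lower boundary, which is one possible convention.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into bins of `bin_size` items each.

    The last bin is shorter if len(values) is not a multiple of bin_size.
    """
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [15, 4, 21, 8, 24, 21, 34, 25, 28]  # hypothetical sample data
bins = equal_frequency_bins(prices, bin_size=3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```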

  • Regression

Smooth by fitting the data to a regression function
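
As one possible illustration, the sketch below fits a straight line by ordinary least squares and replaces each measured value with the value on the line. A straight line is an assumption made for this example, and other regression functions can be used in the same way. The names fit_line and smooth, and the sample readings, are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # fails with ZeroDivisionError if all xs are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


hours = [1, 2, 3, 4, 5]                  # hypothetical sample data
readings = [2.1, 3.9, 6.2, 7.8, 10.1]    # noisy measurements
print(fit_line(hours, readings))
print(smooth(hours, readings))
```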

  • Clustering

Group similar values into clusters, then detect and remove the outliers that fall outside the clusters
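
A very simple way to see the idea on a single attribute is shown below. This is only a sketch: it uses a gap-based grouping chosen for this example, not a standard clustering algorithm, and the function names, the thresholds max_gap and min_size, and the sample ages are all assumptions.

```python
def gap_clusters(values, max_gap):
    """Group sorted values so that neighbours in a cluster differ by at most max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # large gap: start a new cluster
    return clusters


def split_outliers(values, max_gap, min_size):
    """Treat values in clusters smaller than min_size as outliers."""
    clusters = gap_clusters(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_size for v in c]
    outliers = [v for c in clusters if len(c) < min_size for v in c]
    return kept, outliers


ages = [23, 25, 24, -50, 27, 26, 140]    # hypothetical sample data
kept, outliers = split_outliers(ages, max_gap=5, min_size=2)
print(kept)      # [23, 24, 25, 26, 27]
print(outliers)  # [-50, 140]
```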

  • Combined computer and human inspection

Detect suspicious values automatically and have a human check them

Inconsistent Data:

A data set may contain discrepancies or mismatches

How to handle inconsistent data?

  • Data Scrubbing:

Use simple domain knowledge to detect errors and make corrections
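
For example, a domain rule such as "the recorded age must match the birthdate" can be checked with a few lines of Python. In this sketch the field names, the day/month/year date format and the reference date are all assumptions made for illustration.

```python
from datetime import date, datetime


def age_on(birthdate, as_of):
    """Return the age in whole years on the date `as_of`."""
    years = as_of.year - birthdate.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def check_age(record, as_of):
    """Return (is_consistent, expected_age) for one record."""
    # Assumption: birthdates are stored as day/month/year text.
    born = datetime.strptime(record["birthdate"], "%d/%m/%Y").date()
    expected = age_on(born, as_of)
    return int(record["age"]) == expected, expected


record = {"age": "10", "birthdate": "03/03/2003"}   # hypothetical record
ok, expected = check_age(record, as_of=date(2020, 1, 1))
print(ok, expected)  # False 16
```

Whether to correct the age, the birthdate, or flag the record for review is a decision that depends on which field is more trustworthy in the given data set.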

  • Data Auditing

Analyze the data to discover rules and relationships, then detect the records that violate them and make corrections.

Duplicate Data:

A data set may include data objects that are duplicates of one another.

How to handle duplicate data?

  • Eliminate the duplicate records
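
A minimal Python sketch of this step is shown below. It keeps the first occurrence of each record and drops later ones. The function name, the field names and the choice to ignore letter case and surrounding spaces when comparing are assumptions made for this example.

```python
def deduplicate(records, key_fields):
    """Return the records with later duplicates removed.

    Two records count as duplicates when all `key_fields` match after
    trimming spaces and ignoring letter case.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


people = [  # hypothetical sample data
    {"name": "Nanda", "city": "Chennai"},
    {"name": "nanda ", "city": "Chennai"},
    {"name": "Ravi", "city": "Chennai"},
]
result = deduplicate(people, key_fields=["name", "city"])
print([p["name"] for p in result])  # ['Nanda', 'Ravi']
```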

I hope this article is helpful to you.