Is Your Data Clean or Dirty?

October 19, 2016 | Blog

Over the weekend I read an incredible post from SAS Big Data evangelist Tamara Dull. I love her down-to-earth, real-life perspectives on Big Data, and her analogy of cleaning the car hit home for me. She is spot on – clean data pays dividends in the form of better insights.

But, what is clean data? What is that threshold that says your data is clean versus dirty?

Could data even be “too clean”?

(pause to hear gasps from my OCD readers)

Clean data and clean houses

Taking this to a real-life example, I can say firsthand that there are often different definitions of what clean is. For example, my wife is very keen on keeping excess items off our kitchen counters, to the point where she'll see something that doesn't belong and put it in the first cabinet or drawer she finds that has space for it. Me, on the other hand, I'm big on finding what I believe is the right place for it. Both of us have the same goal in mind – get the counters clean.

Each of our approaches has value, and that value is efficiency: hers is optimized at the front end, mine at the back end. However, the end result of each style of "cleaning" can have negative impacts. With my approach, it's my wife not being able to find where I put something; with hers, it's items falling out of a cabinet on me when I open it.

Is "clean" to one person the same as it is to everyone else?

The life lesson above teaches something critical about data – "clean" isn't a cut-and-dried threshold. And taking a page from Tamara's post, it's also not a static definition.

The trap you can quickly fall into is thinking of all data in the same terms you would apply to structured data. Yes, part of the challenge is understanding what the data is and how it relates, but the more crucial challenge is how you intend to consume and then use the data. This is a shift from RDBMS thinking, which focuses on normalization and structure first and usage second. With Big Data-style ways of consuming and processing data (streaming, ML, AI, IoT) combined with velocity, variability, and volume, the use-case mindset is exactly where your focus should be.
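To make that shift concrete, here is a minimal sketch in Python with pandas, using made-up clickstream events. Instead of normalizing every field up front, it extracts and cleans only the fields that one specific question needs – a schema-on-read style. The field names and the "purchases per user per day" question are purely illustrative assumptions, not part of any particular platform.

```python
import pandas as pd

# Hypothetical semi-structured events: nested fields, extra attributes,
# and one record with a bad timestamp ("dirty" by this use case's standard).
raw_events = pd.json_normalize([
    {"user": {"id": 42}, "event": "view",     "ts": "2016-10-19T08:00:00",
     "device": {"ua": "Mozilla/5.0"}},
    {"user": {"id": 42}, "event": "purchase", "ts": "2016-10-19T08:05:00"},
    {"user": {"id": 7},  "event": "view",     "ts": "not-a-timestamp"},
])

# "Clean" is defined by the question being asked: purchases per user per day.
# Pull only the columns this use case needs and ignore the rest (e.g. device.ua).
use_case = raw_events[["user.id", "event", "ts"]].copy()

# Clean only what matters here: coerce timestamps, drop records we can't parse.
use_case["ts"] = pd.to_datetime(use_case["ts"], errors="coerce")
use_case = use_case.dropna(subset=["ts"])
use_case["day"] = use_case["ts"].dt.date

purchases_per_day = (
    use_case[use_case["event"] == "purchase"]
    .groupby(["user.id", "day"])
    .size()
)
print(purchases_per_day)
```

A different question (say, browser usage) would pull different fields and apply a different notion of "clean" to the same raw events – which is the point of leading with the use case.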

A "use case first" approach is how we look at these technologies at ODPi. We start with questions like "Here is the data I have, and this is what I'm trying to find out – what are the right approaches, tools, and patterns to use?" and work out how they can be answered. We ensure all of our compliant platforms, interoperable apps, and specifications have the components needed to enable successful business outcomes. This gives companies peace of mind that they are making a safe investment, and that switching tools won't leave their clean data less than optimal for how they want to leverage it.

This parallels the discussion of cleaning in our house – are we trying to tidy up quickly because company is coming over, or are we trying to go through an entire room and organize it? Approaching data cleaning involves the same thought process.