Monday, November 12, 2007

The Black Swan and Data Quality

First published online 19th October 2007

Nassim Nicholas Taleb, in his book The Black Swan, makes an interesting point which helped me clarify why companies are so often willing to spend millions attempting to cleanse and correct data after it has been collected, rather than spending a fraction of that amount on preventing the errors in the first place, for example by validating data on input. Prevention will always result in better quality data than any cleansing process applied later can produce.
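The principle of validating on input can be sketched in a few lines. The record fields and rules below are hypothetical, chosen purely to illustrate the idea of rejecting bad data at the point of entry rather than repairing it downstream:

```python
import re

# A deliberately simple email pattern for illustration only;
# real-world validation rules would be far more thorough.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record):
    """Return a list of problems found; an empty list means the record is acceptable."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("missing name")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("invalid email")
    return problems

def accept(record):
    """Reject bad data at the point of entry instead of cleansing it later."""
    problems = validate_record(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record
```

The point is not the particular rules but where they sit: a check like this at data entry stops the pollution before it exists, whereas a cleansing program can only guess at what the correct values should have been.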

What it boils down to is that reaction to problems can be perceived, measured and rewarded. Prevention results only in a lack of problems, which cannot be measured or rewarded. Most employees and managers feel more comfortable with the former than the latter.

To paraphrase Taleb, let us imagine that, before 11th September 2001, a legislator had introduced a law obliging airlines to fit armoured doors to the cockpits of their planes, to be kept locked during flights. This would be unpopular with the airlines due to the extra costs involved: they might have to increase fares or reduce staffing levels to cover them, and the legislator would be unpopular too. This preventative measure would probably have ensured that the aircraft used to attack New York and Washington on 11th September 2001 were never hijacked, and that the attacks never took place. Yet there would be a general perception that the measure was pointless, and it would be impossible to measure its effect, because the very thing it was preventing would not be happening to be measured.

My own favourite example of this is the millennium computer bug. Untold amounts of money were spent to prevent systems failing when the year changed to 2000, and doomsday scenarios were described if the changes weren't made. As the clocks struck midnight on 31st December 1999, no planes fell from the skies and no nuclear power plants went into meltdown. Was there cheering and jubilation, back-slapping and congratulations for the workers who had prevented this? On the contrary, they were labelled as a bunch of panic merchants who had exaggerated the problem to make a quick buck.

And so it is with data quality in almost all companies. An employee who analyses badly collected data and can show his bosses that, for example, 50% of all records contain errors, and that after purchasing cleansing programs this could be reduced to 15%, is rewarded for producing measurable and verifiable results.

The better employee, who persuades his bosses to install systems that prevent data pollution earlier in the process, can show errors in, for example, only 10% of records. But as there is no worse starting point against which to compare this figure, the employee is either criticised because there are still so many errors despite the amount spent on prevention; or, more likely, overlooked, because nothing is perceived to be happening in his part of the company to draw managers' attention.

I have seen this situation time and again in many different contexts. It is a rare manager indeed who can resist the pressure to justify his position with figures in this way, and data quality suffers as a result.

(c) 2007 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.
