Friday, July 16, 2010

Definition drift

A number of posts recently have drawn my attention again to the persistent problem of definitions and of definition drift. We are rarely able to agree a definition of any word or phrase in the data quality world before a new buzz term comes along to grab our attention. A great deal of this is due to fashion and marketing. Software and solution providers are constantly searching for new terms to launch in order to "persuade" (I'm being nice here) executives that they need to upgrade their current installations. Try as I might, for example, I cannot find a definition for Master Data Management that does not tally with what I understand to be good old plain data management, something we've been doing for years. Mark Goloboy noticed how some people are trying to replace the term "data quality" with "information quality". Though it is often not obvious to information workers, there is a huge difference between information and data (as we data workers know).

Altering terms in this way because of marketing, fashion or internal political needs is pernicious and does little to help data quality or, ultimately, your customers, for whom you should be working to improve your management of their data. The definition saga led me to start to build my own glossary of terms and to place that online for all to enjoy, in the hope that we can arrest some of the worst excesses before they take off. A recent post by Jim Harris exposed again how definitions can affect our working practices, and I felt the need to expand on his post and try to clarify my own thinking on data quality, what it is, what contributes to it, and how it affects information quality.

Data quality tends to have three main definitions: fitness for purpose (we'll come onto that later); data accurately representing the real world entity to which it refers (my own preference); and data being complete, current, consistent and accurate (or relevant, up to date and accurate; or complete, valid, consistent and timely; or accurate, correct, timely, complete, and relevant; or any of a number of similar properties...).

Without wishing to write a three-volume novel about this, let's have a look and see how some of these parameters do affect data quality, starting with continuing the discussion from Jim's post:

Validity versus Accuracy

Validity means that a piece of data satisfies a rule relating to the data itself. Accuracy means that the valid data applies to the entity for/about which the data is being gathered and stored. For example, US, CA, DE, and RU are all valid ISO 3166-1 alpha-2 country codes. XX is not. None of the valid codes listed is an accurate code for the country in which I live - in that case, only NL would be an accurate code.

1st January 1833 is a valid (Gregorian) date. It is a valid date for the date of birth of a human being. It is (currently) not a valid date as a date of birth for a human being still alive. 1st January 1961 is a valid date of birth for a human being still alive, but it is not an accurate date of birth if it were to be applied to me.

For these reasons, dashboards and data profiling tools need to be used with caution. They can check that every country code or date within a data file is valid, but they cannot check their accuracy.
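The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set of codes is a tiny hand-picked subset of the two-letter ISO country codes, the function names are my own, and the 125-year age limit is an arbitrary rule chosen for the example. The point is that the validity checks need only the data and a rule, while the accuracy check needs a fact obtained from outside the data file.

```python
from datetime import date

# Assumption: a small illustrative subset of two-letter ISO country codes.
# A real check would load the complete, current list.
VALID_COUNTRY_CODES = {"US", "CA", "DE", "RU", "NL"}


def is_valid_country_code(code: str) -> bool:
    """Validity: the value satisfies a rule about the data itself."""
    return code in VALID_COUNTRY_CODES


def is_accurate_country_code(code: str, true_code: str) -> bool:
    """Accuracy: the value matches the real-world entity.

    true_code has to come from outside the data file (for example, by
    asking the customer); no profiling tool can derive it from the data.
    """
    return is_valid_country_code(code) and code == true_code


def could_be_living_birth_date(born: date, today: date,
                               max_age_years: int = 125) -> bool:
    """Validity rule for the date of birth of someone still alive.

    max_age_years is a hypothetical business rule, not a standard.
    """
    if born > today:
        return False
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - born.year - (
        (today.month, today.day) < (born.month, born.day)
    )
    return age <= max_age_years
```

A record holding DE passes the validity check but fails the accuracy check for someone who lives in The Netherlands, and only the second check requires knowing where that person actually lives.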

Currency versus Timeliness versus Up to Date versus Of Its Time

Yes, even these terms vary in their definitions. For the most part Timeliness is regarded as a processing aspect, where data is made available to the worker at the time it is required - not a data quality issue, in my opinion. Currency is often regarded as synonymous with up-to-date, but data which is up to date (i.e. valid now) is not necessarily fit for a purpose. If your purpose is to know what I bought from your online shop in 2003, you'll need data from that time and of that time, rather than data from this year. Being of its time is also a definition of currency, and it does need to be mentioned. For me, this is all part of the accuracy and completeness of data - if I move then the address you may have for me is valid (the building still exists), the data is valid of its time (I did live there when you added it to your database), but I don't live there now, so it's not current and it's not accurate because the house is there but the entity for/about which you're collecting the data (me) isn't there any more.
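One way to keep data both current and of its time is to store the period during which each value applied, rather than overwriting it. The following is a minimal sketch, assuming a hypothetical AddressRecord structure with valid_from and valid_to fields; the addresses and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressRecord:
    address: str
    valid_from: date                 # first day the address applied
    valid_to: Optional[date] = None  # first day it no longer applied; None = still current


def address_on(history: List[AddressRecord], when: date) -> Optional[str]:
    """Return the address that applied on the given day, if any is known."""
    for record in history:
        started = record.valid_from <= when
        not_ended = record.valid_to is None or when < record.valid_to
        if started and not_ended:
            return record.address
    return None


# Invented example data: one move, in June 2008.
history = [
    AddressRecord("Old Street 1", date(2000, 1, 1), date(2008, 6, 1)),
    AddressRecord("New Street 2", date(2008, 6, 1)),
]

address_in_2003 = address_on(history, date(2003, 5, 5))   # data of its time
address_today = address_on(history, date(2010, 7, 16))    # current data
```

With this layout the old address stays accurate for questions about 2003 and is never mistaken for the current one.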

Consistency and fitness for purpose

Consistency is NOT a pre-condition for data quality. If your database contains the information that I live in The Netherlands in a variety of forms (NL, NLD, The Netherlands, Nederland, Holland, Pays Bas, Niederlande) then the data is accurate, though represented in any number of ways. The data has quality but it is difficult to work with and process, and is therefore not fit for purpose. Counts to find the number of customers in The Netherlands will produce poor information leading to poor business decisions. Data which is made consistent can be used, and is therefore fit, for any purpose. If I know all records for entities within The Netherlands use the code NL, then I can print "The Netherlands" onto an envelope being sent from the USA, or "Niederlande" if it's from Germany, and that data need not be stored anywhere - it is derived from consistent data. I reject the definition of data quality being fitness for purpose - fitness for purpose is a consequence of data quality, not a definition of it.
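To illustrate, here is a rough Python sketch of making the country values consistent and then deriving the printed name from the code. The lookup tables and the normalise_country function are my own assumptions for the example, covering only the variants mentioned above; a real mapping would be far larger.

```python
from collections import Counter
from typing import Optional

# Assumption: a hand-built mapping covering only the variants listed above.
COUNTRY_VARIANTS = {
    "nl": "NL",
    "nld": "NL",
    "the netherlands": "NL",
    "nederland": "NL",
    "holland": "NL",
    "pays bas": "NL",
    "niederlande": "NL",
}

# Display names are derived from the code and need not be stored per record.
DISPLAY_NAMES = {
    ("NL", "en"): "The Netherlands",
    ("NL", "de"): "Niederlande",
}


def normalise_country(raw: str) -> Optional[str]:
    """Map a free-text country value to a single code, or None if unknown."""
    return COUNTRY_VARIANTS.get(raw.strip().lower())


raw_values = ["NL", "NLD", "The Netherlands", "Nederland",
              "Holland", "Pays Bas", "Niederlande"]

# Counting the raw values splits one country across seven buckets.
raw_counts = Counter(raw_values)

# Counting the normalised values gives one bucket holding all seven records.
clean_counts = Counter(normalise_country(value) for value in raw_values)

# Name to print on an envelope sent from Germany.
envelope_name = DISPLAY_NAMES[("NL", "de")]
```

All seven raw values were accurate; only after normalisation does a count of customers in The Netherlands give the right answer.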

OK, so I've tossed this off on a Friday morning, and may have missed some logical connections. What do you think? Please join the debate - I'll update this entry with any nuggets that are suggested.

Tuesday, July 6, 2010

Prevention or cure?

I was looking through a pile of my old and dusty university essays a few weeks ago, nicely typed on a 40-year-old manual typewriter (at that time the university's only computer was in a huge, well guarded room and the only way we were allowed to interact with it was with punch cards ...) and I found an essay with this title:

"'Regional Water Authorities in Britain are Dominated by Engineers Trained Largely to Solve Water Supply Problems by Constructing New Facilities Rather Than by Minimising the Need for Them' (D.J. Parker and E.C. Penning-Rowsell). Discuss."

I had discussed, as directed, and agreed: instead of tempering our profligate and ever-increasing use of water, we just kept tapping into new resources to increase supply.

It wasn't just water that had this issue. And little has changed over the past 30 years.

Looking around, you see this pattern almost everywhere. Health services, for example, spend a little on prevention and a huge amount on curing. Police try to solve crimes after they occur but rarely attempt to prevent the crimes from occurring (and in many countries they are not allowed so to do). In fact, our whole society is based on the use/consume/experience now and the resolve/cure/clean up later paradigm.

So it's hardly surprising that businesses work the same way when it comes to data quality. Like the water authorities, they are dominated by people who are trained (and indoctrinated) to resolve problems as they arise rather than to prevent the problems from arising; and when they see the problems they envisage only technical solutions without any consideration for process or business structure changes. The bigger, more expensive and flashier the product, the more likely it is to be bought, regardless of its effectiveness at reducing the problem.

Shifting spending to prevention will reduce spending on the cure, and we'd be healthier for it - stopping us from taking up smoking will always be better than treating us for lung cancer. There'll always be a need for cures - like our bodies, data decays and we have to work on it to keep it healthy - but prevention works better. And is cheaper.