Friday, July 16, 2010

Definition drift

A number of recent posts have drawn my attention again to the persistent problem of definitions and of definition drift. We are rarely able to agree a definition of any word or phrase in the data quality world before a new buzz term comes along to grab our attention. A great deal of this is due to fashion and marketing. Software and solution providers are constantly searching for new terms to launch in order to "persuade" (I'm being nice here) executives that they need to upgrade their current installations. Try as I might, for example, I cannot find a definition for Master Data Management that does not tally with what I understand to be good old plain data management, something we've been doing for years. Mark Goloboy noticed how some people are trying to replace the term "data quality" with "information quality". Though it is often not obvious to information workers, there is a huge difference between information and data (as we data workers know).

Altering terms in this way because of marketing, fashion or internal political needs is pernicious and does little to help data quality or, ultimately, your customers, for whom you should be working to improve your management of their data. The definition saga led me to start to build my own glossary of terms and to place that online for all to enjoy, in the hope that we can arrest some of the worst excesses before they take off. A recent post by Jim Harris exposed again how definitions can affect our working practices, and I felt the need to expand on his post and try to clarify my own thinking on data quality, what it is, what contributes to it, and how it affects information quality.

Data quality tends to have three main definitions: fitness for purpose (we'll come onto that later); data accurately representing the real world entity to which it refers (my own preference); and data being complete, current, consistent and accurate (or relevant, up to date and accurate; or complete, valid, consistent and timely; or accurate, correct, timely, complete, and relevant; or any of a number of similar properties...).

Without wishing to write a three-volume novel about this, let's have a look and see how some of these parameters do affect data quality, starting with continuing the discussion from Jim's post:

Validity versus Accuracy

Validity means that a piece of data satisfies a rule relating to the data itself. Accuracy indicates that the valid data applies to the entity for/about which the data is being gathered and stored. For example, US, CA, DE, and RU are all valid ISO 3166-1 alpha-2 country codes. XX is not. None of the valid codes listed is an accurate country code for the country in which I live - in that case, only NL would be an accurate code.
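A minimal sketch of that distinction in Python. The code set is a small illustrative subset rather than the full standard, and the function names and the `known_country` argument are hypothetical. The point is that a validity check needs only the value and a rule, whereas an accuracy check needs trusted knowledge of the entity itself.

```python
# Illustrative subset of two-letter country codes; not the full standard.
VALID_CODES = {"US", "CA", "DE", "RU", "NL"}


def is_valid(code: str) -> bool:
    """Validity: the value satisfies a rule about the data itself."""
    return code in VALID_CODES


def is_accurate(code: str, known_country: str) -> bool:
    """Accuracy: the valid value also matches the real-world entity.

    `known_country` stands in for trusted knowledge about the entity,
    which a tool looking only at the data file does not have.
    """
    return is_valid(code) and code == known_country


if __name__ == "__main__":
    for code in ["US", "DE", "XX", "NL"]:
        print(code, is_valid(code), is_accurate(code, known_country="NL"))
```

Only the first function can be run against a data file on its own; the second depends on a source of truth that has to come from somewhere else.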

1st January 1833 is a valid (Gregorian) date. It is a valid date for the date of birth of a human being. It is (currently) not a valid date as a date of birth for a human being still alive. 1st January 1961 is a valid date of birth for a human being still alive, but it is not an accurate date of birth if it were to be applied to me.
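The layers of validity in the date example can be sketched as successive checks. This is only an illustration: the 125-year age ceiling, the function names and the `actual_dob` argument are assumptions made for the sketch, and no real date of birth is implied.

```python
from datetime import date

# Assumption for illustration only: an upper bound on the age of a living person.
MAX_PLAUSIBLE_AGE_YEARS = 125


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Rule 1: the values form a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def is_valid_dob_for_living_person(dob: date, today: date) -> bool:
    """Rule 2: the date is plausible as the birth date of someone alive today."""
    if dob > today:
        return False
    # Whole years elapsed, minus one if this year's birthday has not yet passed.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age <= MAX_PLAUSIBLE_AGE_YEARS


def is_accurate_dob(dob: date, actual_dob: date) -> bool:
    """Accuracy: needs the person's actual date of birth, which no rule supplies."""
    return dob == actual_dob


if __name__ == "__main__":
    today = date(2010, 7, 16)
    for dob in (date(1833, 1, 1), date(1961, 1, 1)):
        print(dob, is_valid_dob_for_living_person(dob, today))
```

Each rule narrows what counts as valid, but none of them can say whether the date belongs to the person in the record.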

For these reasons, dashboards and data profiling tools need to be used with caution. They can check that every country code or date within a data file is valid, but they cannot check their accuracy.

Currency versus Timeliness versus Up to Date versus Of Its Time

Yes, even these terms vary in their definitions. For the most part Timeliness is regarded as a processing aspect, where data is made available to the worker at the time it is required - not a data quality issue, in my opinion. Currency is often regarded as synonymous with up-to-date, but data which is up to date (i.e. valid now) is not necessarily fit for a purpose. If your purpose is to know what I bought from your online shop in 2003, you'll need data from that time and of that time, rather than data from this year. This is also a definition of currency, and one that needs to be mentioned. For me, this is all part of the accuracy and completeness of data - if I move then the address you may have for me is valid (the building still exists), the data is valid of its time (I did live there when you added it to your database), but I don't live there now, so it's not current and it's not accurate because the house is there but the entity for/about which you're collecting the data (me) isn't there any more.
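One way to keep data that is "of its time" alongside current data is to store a validity period with each record. The following is a rough sketch of that idea, not a recommended schema: the `AddressRecord` class, its field names and the sample addresses and dates are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AddressRecord:
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the record is still current


def address_on(history: List[AddressRecord], when: date) -> Optional[str]:
    """Return the address that applied on the given date, if any is known."""
    for record in history:
        started = record.valid_from <= when
        not_ended = record.valid_to is None or when < record.valid_to
        if started and not_ended:
            return record.address
    return None


# Invented sample data: one move, in June 2008.
history = [
    AddressRecord("Old Street 1", date(2000, 1, 1), date(2008, 6, 1)),
    AddressRecord("New Street 2", date(2008, 6, 1)),
]

if __name__ == "__main__":
    print(address_on(history, date(2003, 5, 1)))   # the address of its time
    print(address_on(history, date(2010, 7, 16)))  # the current address
```

With the period recorded, the old address stays accurate for questions about 2003 without being mistaken for the current one.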

Consistency and fitness for purpose

Consistency is NOT a pre-condition for data quality. If your database contains the information that I live in The Netherlands in a variety of forms (NL, NLD, The Netherlands, Nederland, Holland, Pays Bas, Niederlande) then the data is accurate, though represented in any number of ways. The data has quality but it is difficult to work with and process, and is therefore not fit for purpose. Counts to find the number of customers in The Netherlands will produce poor information leading to poor business decisions. Data which is made consistent can be used, and is therefore fit, for any purpose. If I know all records for entities within The Netherlands use the code NL, then I can print "The Netherlands" onto an envelope being sent from the USA, or "Niederlande" if it's from Germany, and that data need not be stored anywhere - it is derived from consistent data. I reject the definition of data quality being fitness for purpose - fitness for purpose is a consequence of data quality, not a definition of it.
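The Netherlands example can be sketched in a few lines of Python. The mapping tables and function names below are hypothetical and cover only the variants mentioned above; a real lookup would need far more entries.

```python
from collections import Counter
from typing import Optional

# Hypothetical lookup covering only the variants mentioned in this post.
VARIANT_TO_CODE = {
    "nl": "NL",
    "nld": "NL",
    "the netherlands": "NL",
    "nederland": "NL",
    "holland": "NL",
    "pays bas": "NL",
    "niederlande": "NL",
}

# Display names derived from the code and the sending country; never stored.
DISPLAY_NAME = {
    ("NL", "US"): "The Netherlands",
    ("NL", "DE"): "Niederlande",
}


def to_country_code(raw: str) -> Optional[str]:
    """Map a free-text country value onto one consistent code."""
    return VARIANT_TO_CODE.get(raw.strip().lower())


def envelope_name(code: str, sent_from: str) -> Optional[str]:
    """Derive the name to print on an envelope from the consistent code."""
    return DISPLAY_NAME.get((code, sent_from))


raw_values = ["NL", "NLD", "The Netherlands", "Nederland",
              "Holland", "Pays Bas", "Niederlande"]

raw_counts = Counter(raw_values)                               # seven groups of one
clean_counts = Counter(to_country_code(v) for v in raw_values)  # one group of seven

if __name__ == "__main__":
    print(raw_counts["NL"], clean_counts["NL"])
    print(envelope_name("NL", "US"), envelope_name("NL", "DE"))
```

Every raw value was accurate, yet a count on the raw column finds one customer under NL where there are seven. Only after the values are made consistent does the count become usable.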

OK, so I've tossed this off on a Friday morning, and may have missed some logical connections. What do you think? Please join the debate - I'll update this entry with any nuggets that are suggested.

1 comment:

Jim Harris said...

Great post, Graham.

I applaud your ongoing efforts with the DQ Glossary since it is imperative that the key concepts of data quality are clearly defined and in a language that everyone can understand.

And I think that you did a great job defining the terms mentioned in this blog post.

As Socrates said: “The beginning of wisdom is the definition of terms.”

However, as you noted at the beginning of the post, definitions vary for marketing purposes. Vendors are competing fiercely in a lucrative and growing market for data-related software and solutions, and the easiest way to differentiate an offering is to give it a shiny new name while simultaneously imbuing existing terminology with ambiguity and resisting the application of any attempted definition.

I saw this happen at a conference in the United States earlier this year. Four leading vendors in a data-related software and solutions market were on stage in front of over 200 prospective customers, and the moderator tried to get the vendors not to discuss their offerings but simply to define some of the common terminology. The result was comically sad. None of the vendors would define anything--not because they couldn't--but because they were afraid that if they defined a term in a way that, although accurate to their offering, alienated any prospective customers in the audience, they might lose potential business.

Therefore, "definition drift" is here to stay. After all, Socrates didn't say anything about acceptance of one set of definitions over another or what the next step in wisdom was after we are done defining terms--or if we would ever be done defining terms.

Best Regards,