Tuesday, February 7, 2012

Blind Angel Egg The Dog

A little aside on the topic of linguistics - sort of.  I could think up some parable linking this to data quality, but I'll leave that to you.

Languages vary a lot.  In my mother tongue, English, we separate words with spaces.  In my second language, Dutch, words are grouped together into long strings.  These strings sometimes need a little time to decipher.

On a metro station a few days ago a poster caught my eye, especially the word BLINDENGELEIDEHOND.  I didn't immediately recognise it, so I automatically started splitting up the string in my head.

BLIND|ENGEL|EI|DE|HOND

Blind Angel Egg The Dog.  Sounds great, but it doesn't make a lot of sense. Except that the poster has a Labrador puppy on it, so maybe the dog part is close.



Let's try again.

BLIND|EN|GELEI|DE|HOND

Blind and Jelly The Dog.  No, that doesn't make sense either.

BLINDEN|GELEI|DE|HOND

Blinds Jelly The Dog. No, not getting any warmer.

BLIND|EN|GE|LEI|DE|HOND

Blind And You Slate The Dog.  With a Flemish accent. No no no, unless somebody was on drugs when they made the poster.

Oh, hang on ....

BLINDEN|GELEIDEHOND

Guide Dog For The Blind!

It's not just me.  I know quite a number of people who see

BOMMELDING

and read BOMMEL|DING (something that putters along, like an old diesel locomotive) instead of BOM|MELDING (bomb alert).

Well, it kept me amused until the train arrived!
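
For the programmers among you, the splitting I was doing in my head can be sketched in a few lines of Python. This is only an illustration: the lexicon is a made-up mini word list containing just the fragments mentioned above, and picking the split with the fewest pieces is my own rough heuristic, not a rule of Dutch.

```python
from functools import lru_cache

# Hypothetical mini-lexicon: only the fragments mentioned in this post.
LEXICON = {
    "BLIND", "BLINDEN", "ENGEL", "EI", "DE", "HOND",
    "EN", "GELEI", "GE", "LEI", "GELEIDEHOND",
}


def segmentations(text, lexicon):
    """Return every way to split text into pieces found in lexicon."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string means one complete split was found.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in lexicon:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


candidates = segmentations("BLINDENGELEIDEHOND", LEXICON)
for words in candidates:
    print("|".join(words))

# Assumed heuristic: the reading with the fewest pieces is the likeliest.
best = min(candidates, key=len)
print("Fewest pieces:", "|".join(best))
```

With this word list the candidates include all five readings above, and the fewest-pieces heuristic settles on BLINDEN|GELEIDEHOND. A real splitter would need a full dictionary and some notion of which compounds actually occur.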

Friday, January 20, 2012

Step One: Acquiring the Knowledge

My guest blog post at PostcodeAnywhere, about the first step in improving international data quality, has been posted at http://blog.postcodeanywhere.co.uk/index.php/step-one-acquiring-the-knowledge/

Thursday, December 22, 2011

Effin' obscenity screening!

You can read my new guest post for PostcodeAnywhere, about obscenity screening in an international environment, here: http://blog.postcodeanywhere.co.uk/index.php/effin-obscenity-screening/ 

Wednesday, October 5, 2011

The Speed of Change

I often warn people about the speed of change.  Not just the speed of change of data within your database, but the speed of change of that data's main context - the world in which we live.  People know that change happens but tend to underestimate how fast and how far reaching many of these changes are.

The world changes so fast that I have to upload a new version of the Global Sourcebook for Name and Address Data Management almost weekly.

To put some numbers to that change I looked at the past nine years and summarised the speed at which change is occurring.  On average, changes occur in this many countries every year in each of these areas:
  • Telephone number systems: changes in 60 countries per annum
  • Administrative districts (top level): 14.4 countries per annum
  • Currencies: 4.7 countries per annum
  • Postal code and addressing systems: 4.3 countries per annum
  • Country names: 1.4 countries per annum
  • New countries: 0.8 per annum
This is a surprisingly fast rate of change.  Are you keeping up with our dynamic world?
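
To get a feel for what those averages mean day to day, here is a small sketch in Python that converts each yearly rate into "one change every so many days". The figures are the ones listed above; adding them together assumes that every change is a separate event, which is a simplification.

```python
# Average number of countries changing per year, as listed above.
CHANGES_PER_YEAR = {
    "Telephone number systems": 60,
    "Administrative districts (top level)": 14.4,
    "Currencies": 4.7,
    "Postal code and addressing systems": 4.3,
    "Country names": 1.4,
    "New countries": 0.8,
}

for area, rate in CHANGES_PER_YEAR.items():
    print(f"{area}: roughly one change every {365 / rate:.0f} days")

# Simplifying assumption: each change is a separate event.
total_per_year = sum(CHANGES_PER_YEAR.values())
days_between_changes = 365 / total_per_year
print(f"All areas together: one change every {days_between_changes:.1f} days")
```

Taken together that is a change somewhere in the world every four days or so, which fits with having to publish updates almost weekly.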

Tuesday, September 20, 2011

The Kaleidoscope of Data and Information

John Owens, whom I much admire, made this comment on a Henrik Liliendahl Sørensen blog post about the difference between data quality and information quality. Taken completely out of context, he said:

“…it is the quality of the information (i.e. data in a context) that is the real issue, rather than the items of data themselves. The data items might well be correct, but in the wrong context, thus negating the overall quality of the information, which is what the enterprises uses. It will be interesting to see how long it is before data quality industry arrives at this conclusion. But, if they ever do, who will be courageous enough to say so?”

I agree entirely, yet disagree profoundly. Data and information are not the same thing yet are inextricably linked – one without the other isn’t possible but they still must not be confused. Data and information are as different as chickens and eggs, but are equally dependent upon each other.

Basically, data is stored information whilst information is perceived data.

Data and information are immutably linked – I have never found data which isn’t stored information nor information that isn’t rooted in data – but as they are different parts of a cycle they need to be defined, understood and managed as two separate entities. The challenge with data is keeping it complete, accurate and consistent. The challenge with information is to perceive the information that the data is a stored version of without alteration and in a way that gives clarity. It is at the information stage that we should be thinking about fitness for purpose, not at the data stage.

Let me give you an example from a recent episode of the BBC’s science programme Bang Goes the Theory. A presenter went to a shopping centre and prepared two plates of bacon sandwiches. One was accompanied by the message that regularly eating processed meats increases the chances of getting bowel cancer by 20%. The other was accompanied by the message that regularly eating processed meats increases the chances of getting bowel cancer from 5% to 6%. Though the data underlying both pieces of information is identical, as is the information provided, the audience were understandably worried when seeing the first message but happy to tuck in after seeing the second.  The first message would be fit for the purposes of the health authority, the second for the bacon marketing board, but in neither case is the fitness for purpose related to the data - it is related to the information provision.
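
The arithmetic behind the two messages is easy to check. The sketch below, in Python, takes the 5% and 6% figures from the programme and derives both ways of stating the same change; the function name is mine, purely for illustration.

```python
def risk_change(baseline_pct, exposed_pct):
    """Return (absolute change in percentage points, relative change as a fraction)."""
    absolute_points = exposed_pct - baseline_pct
    relative = absolute_points / baseline_pct
    return absolute_points, relative


# Figures from the example: risk rises from 5% to 6%.
absolute_points, relative = risk_change(5, 6)
print(f"Absolute increase: {absolute_points} percentage point")
print(f"Relative increase: {relative:.0%}")
```

One percentage point and 20% are the same data seen from two angles: the first is the difference between the two risks, the second is that difference expressed as a share of the starting risk.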

It is at the points where information becomes data and data becomes information that the potential for corruption and misunderstanding of the data and its perception is at its highest. We also know that once data is stored, inert though it may appear to be, it cannot be ignored as the real world entities to which the data refers may change, and that change needs to be processed to update the data.

Those of a certain age may remember having kaleidoscopes as children. Tubes of tin or cardboard with a clear bottom in which there were chips of coloured glass or plastic, a section of which could be viewed and with mirrors creating a symmetrical pattern from that section. Move the kaleidoscope and patterns form, patterns which change and are always different, though the coloured chips themselves never change their inherent properties when being viewed. Whether your data is a shopping list or a data warehouse containing hundreds of tables and millions of records, working with data and information is much like looking through the kaleidoscope.

Depending on how we view it, we tend to see something different every time we look. Reports, dashboards, views, queries, forms, software, hardware, your cultural background and the way your brain is wired will all alter the perception of the data for us and thus have an enormous influence on the information we’re receiving from the data.

As with a kaleidoscope, we tend to extrapolate what we see to the whole universe. If a report shows a positive result in one part of the operation, the tendency is to assume this result is valid throughout. In these examples square green chips represent accurate data whilst red chips or other shapes represent errant data.

Both human nature and data and information systems tend to filter out the negative and boost the positive, so often data looks better than it really is, and so, then, does the information derived from it.

But sometimes the data looks entirely bad, though it is not so. The way we look dictates the apparent quality of the data.

Yet data has tangible and innate qualities, its accuracy, completeness and consistency, which together are an indication of its quality. And any data which has these qualities provides a foundation for better information quality because the perversion caused by the view of the data is ameliorated. In these examples the data has been made consistent and accurate – the coloured chips have the same colour and shape.

And regardless of how we view that data, we see green square data. Data quality and information quality are different and yet rooted in each other. Data cannot be good if it represents the information that it is the stored version of incorrectly. Information cannot be good if it is based on incomplete, inaccurate and inconsistent data.

Data quality ensures that the data represents its real world information entity accurately, completely and consistently. Information quality is working to ensure that the context in which the data is presented provides a realistic picture of the original information that the data is representing.

There’s a general feeling that only data which has a purpose should be stored. I would not agree as purpose, as with so much, depends on context and our viewpoint. Data which has no purpose now may be required to fulfil an information requirement in the future or be related to occurrences in the past; whilst anybody who is being paid to manage data, whether it is used or not, finds his or her salary very meaningful!

Ultimately data is used to source information, and information quality is important. But we should not confuse data quality with information quality. Both are essential, and they are separate disciplines.

Thursday, July 28, 2011

Have you checked your country drop down recently?

When visiting a data quality software supplier’s site recently to download a white paper, I noticed that the country list on the sign up form didn’t contain South Sudan (which became a new country on 9th July 2011) or the new territories which came into existence when the Netherlands Antilles were dissolved on 10th October 2010 (Bonaire, Curaçao, Saba, Sint Eustatius and Sint Maarten).

I shot off a tweet to the company concerned and they told me they were using the United States’ Department of State list. That list has added South Sudan but, shockingly, at the date of writing this blog, has failed to make the changes required by the dissolution of the Netherlands Antilles.

As I mentioned in my post here, most companies rely on external sources for their country names and code lists, such as the World Bank and the United Nations (both of which use lists which exclude most of the world's territories); or the ISO (International Organization for Standardization), which still has not added South Sudan to its list.

Relying on other organisations for your country lists is problematic. To start with, unless you are aware of global changes (and too many people aren’t), or you check the list every day, you won’t notice changes as they happen, as with the data quality company I mentioned above. Secondly, maintenance of country code lists is not the core business of the United Nations, the World Bank and so on. They maintain a list in order to facilitate their own business – and that is rarely likely to coincide with your business. Many of these organisations are very heavily politically dependent or influenced, such as the United Nations or ISO (which doesn’t include Kosovo in its listing, for example), whereas you are likely to need to manage the reality of the situation on the ground, with less emphasis on political niceties. Finally these lists are often updated only long after a country has come into being – it can take ISO many months to assign a country code – whereas you will ideally need to be ready to make changes to your data before the country comes into existence.

When you’re managing international data your country code is likely to be linked to other data, such as currency, international telephone number access code, address format or postal code structure, which is not taken account of in country name lists being maintained purely for political purposes. Using lists which exclude Kosovo, for example, which is a de facto entity and has language, currency and addressing differences with Serbia, will cause problems for your data quality.

Maintenance of country lists and codes needs to be given more thought and more attention. If you’re not in a position to manage your own lists, take a look at the one we offer: http://www.grcdi.nl/countrycodes.htm . It may not suit your needs, but it is one of the few lists created without a political agenda, which is updated ahead of requirement, and with name and address data management specifically in mind. Using a correct and up-to-date country list will improve your data quality and can save you from considerable embarrassment.
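
If you want to check your own drop down, a quick audit might look like the Python sketch below. The expected and retired names are the ones mentioned in this post; the function name and the sample form list are invented for illustration, and a real check would compare against whichever reference list you have chosen to trust.

```python
# Entities mentioned in this post that a current list should contain.
EXPECTED = [
    "South Sudan",
    "Bonaire",
    "Curaçao",
    "Saba",
    "Sint Eustatius",
    "Sint Maarten",
    "Kosovo",  # de facto entity, absent from some official lists
]

# Dissolved on 10th October 2010, so it should no longer be offered.
RETIRED = ["Netherlands Antilles"]


def audit_country_list(names):
    """Return (missing, stale) entries for a list of country names."""
    present = {name.strip().casefold() for name in names}
    missing = [name for name in EXPECTED if name.casefold() not in present]
    stale = [name for name in RETIRED if name.casefold() in present]
    return missing, stale


# Hypothetical drop-down contents, made up for this example.
form_list = ["Netherlands", "Netherlands Antilles", "Serbia", "South Sudan", "Sudan"]
missing, stale = audit_country_list(form_list)
print("Missing:", ", ".join(missing))
print("Should be removed:", ", ".join(stale))
```

Matching on names is fragile, because spellings and official forms differ between lists; matching on codes is sturdier once you have settled on a code list you can maintain.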

Tuesday, March 29, 2011

Data silos - learn to live with them.



Mention data silos (separate data, databases or sets of data files that are not part of an organization's enterprise-wide data administration) and the blood pressure of any self-respecting data quality grafter will rise. Data silos are unconnected to data elsewhere in an organisation and thus are unable to share common types of information, which may create inconsistencies across multiple enterprise applications and may lead to poor business decisions.

One thing, though, that we need to think about: in almost all cases data is in an organisation to facilitate the work that that organisation does. The organisation is not there to improve data quality. My proviso is that every organisation has an obligation to the data owners (usually its customers) to manage that data correctly. Thus any improvement in data quality must serve the organisation and its customers, not the data in and of itself.

Data silos are inevitable, and that will never change. Organisation-wide data management systems are overcomplex and put too great a distance between the staff and the data they need. An employee's main focus is (or should be) to do the work that they are being paid to do as well as they can. If an organisation cannot provide ease of access to the tools that they need (hardly possible unless every member of staff has an accompanying IT employee at their beck and call) then they will reach for the tools that they can use - pen and paper, local spreadsheets, documents and database tables. Few employees (or work groups or departments) are going to wait around and do nothing while an attempt is made to hook their requirements into the cumbersome company database systems.

So whilst a CEO depending on a central database system to provide his or her business intelligence may suffer from data silos (only may, because it depends what is being stored in those silos), the troops doing the real work find having the data they need at their fingertips a distinct advantage.

I've been on both sides of the silos debate. I dug my heels in hard when attempts were made to integrate my data island into the company database because I knew it would not fulfil requirements, it would not enable me to do my job, it would be a waste of money, and I would not use it. I was right on all counts. At the same time I was trying to get the sales staff to give up their data for integration into a central marketing database. I failed for the same reasons (and because the staff were not going to give up the data which gave them the edge on their colleagues, improved their commission and enabled them to spend two weeks every year in the Caribbean as a reward from the company - incentives can be very bad for data quality).

Naturally data silos in certain parts of a company are very damaging, without dispute. You can't have a call centre employee jotting a caller's details onto a piece of paper or into an Excel spreadsheet instead of into their centralised system, for example. But not all data silos are evil.

As Jim Harris points out here, it's not so much having data silos that is the problem: it's what is done with that data. A much more open and accepting attitude to data silos is required, because we may wail and gnash our teeth as much as we like - they're not going to go away.