Friday, January 20, 2012

Step One: Acquiring the Knowledge

My guest blog post at PostcodeAnywhere, about the first step in improving international data quality, has been posted at http://blog.postcodeanywhere.co.uk/index.php/step-one-acquiring-the-knowledge/

Thursday, December 22, 2011

Effin' obscenity screening!

You can read my new guest post for PostcodeAnywhere, about obscenity screening in an international environment, here: http://blog.postcodeanywhere.co.uk/index.php/effin-obscenity-screening/ 

Wednesday, October 5, 2011

The Speed of Change

I often warn people about the speed of change.  Not just the speed of change of data within your database, but the speed of change of that data's main context - the world in which we live.  People know that change happens but tend to underestimate how fast and how far-reaching many of these changes are.

The world changes so fast that I have to upload a new version of the Global Sourcebook for Name and Address Data Management almost weekly.

To put some numbers to that change I looked at the past nine years and summarised the speed at which change is occurring.  On average, there are changes in this many countries every year in these areas:
  • Telephone number systems: 60 countries per annum
  • Administrative districts (top level): 14.4 countries per annum
  • Currencies: 4.7 countries per annum
  • Postal code and addressing systems: 4.3 countries per annum
  • Country names: 1.4 countries per annum
  • New countries: 0.8 per annum
This is a surprisingly fast rate of change.  Are you keeping up with our dynamic world?
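
For the curious, the averages above are plain arithmetic: the total number of recorded changes in each area divided by the nine-year observation window. A minimal sketch in Python, with hypothetical change counts chosen only so that the division reproduces the figures quoted:

    # Per-annum change rates: recorded changes divided by the observation window.
    # The counts below are hypothetical, chosen only to illustrate the calculation.
    YEARS_OBSERVED = 9
    recorded_changes = {
        "Currencies": 42,      # hypothetical: 42 currency changes in nine years
        "Country names": 13,   # hypothetical: 13 country renamings in nine years
        "New countries": 7,    # hypothetical: 7 new countries in nine years
    }
    for area, count in recorded_changes.items():
        print(f"{area}: {count / YEARS_OBSERVED:.1f} per annum")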

Tuesday, September 20, 2011

The Kaleidoscope of Data and Information

John Owens, whom I much admire, made this comment on a Henrik Liliendahl Sørensen blog post about the difference between data quality and information quality. Taken completely out of context, he said:

“…it is the quality of the information (i.e. data in a context) that is the real issue, rather than the items of data themselves. The data items might well be correct, but in the wrong context, thus negating the overall quality of the information, which is what the enterprises uses. It will be interesting to see how long it is before data quality industry arrives at this conclusion. But, if they ever do, who will be courageous enough to say so?”

I agree entirely, yet disagree profoundly. Data and information are not the same thing yet are inextricably linked – one without the other isn’t possible but they still must not be confused. Data and information are as different as chickens and eggs, but are equally dependent upon each other.

Basically, data is stored information whilst information is perceived data.

Data and information are inextricably linked – I have never found data which isn’t stored information, nor information that isn’t rooted in data – but as they are different parts of a cycle they need to be defined, understood and managed as two separate entities. The challenge with data is keeping it complete, accurate and consistent. The challenge with information is to perceive, without alteration and with clarity, the information of which the data is a stored version. It is at the information stage that we should be thinking about fitness for purpose, not at the data stage.

Let me give you an example from a recent episode of the BBC’s science programme Bang Goes the Theory. A presenter went to a shopping centre and prepared two plates of bacon sandwiches. One was accompanied by the message that regularly eating processed meats increases the chances of getting bowel cancer by 20%. The other was accompanied by the message that regularly eating processed meats increases the chances of getting bowel cancer from 5% to 6%. Though the data underlying both messages is identical – a rise from 5% to 6% is a 20% relative increase – the audience were understandably worried when seeing the first message but happy to tuck in after seeing the second.  The first message would be fit for the purposes of the health authority, the second for the bacon marketing board, but in neither case is the fitness for purpose related to the data - it is related to the information provision.
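
The two messages are arithmetically identical: a rise from a 5% baseline to 6% is a relative increase of (6 − 5) ÷ 5 = 20%. A minimal sketch of the two framings, using the figures quoted in the programme:

    # Both messages describe the same underlying data:
    # an absolute risk rising from 5% to 6%.
    baseline_risk = 0.05  # risk without regularly eating processed meats
    raised_risk = 0.06    # risk when regularly eating processed meats

    relative_increase = (raised_risk - baseline_risk) / baseline_risk
    print(f"Relative framing: risk increases by {relative_increase:.0%}")  # 20%
    print(f"Absolute framing: risk rises from {baseline_risk:.0%} to {raised_risk:.0%}")  # 5% to 6%

Same data, two presentations – which is precisely the gap between data quality and information quality.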

It is at the points where information becomes data and data becomes information that the potential for corruption and misunderstanding of the data and its perception is at its highest. We also know that once data is stored, inert though it may appear to be, it cannot be ignored, as the real-world entities to which the data refers may change, and that change needs to be processed to update the data.

Those of a certain age may remember having kaleidoscopes as children: tubes of tin or cardboard with a clear bottom containing chips of coloured glass or plastic, a section of which could be viewed, with mirrors creating a symmetrical pattern from that section. Move the kaleidoscope and patterns form – patterns which change and are always different, though the coloured chips themselves never change their inherent properties when being viewed. Whether your data is a shopping list or a data warehouse containing hundreds of tables and millions of records, working with data and information is much like looking through the kaleidoscope.

Depending on how we view the data, we tend to see something different every time we look. Reports, dashboards, views, queries, forms, software, hardware, your cultural background and the way your brain is wired will all alter the perception of the data for us and thus have an enormous influence on the information we’re receiving from the data.

As with a kaleidoscope, we tend to extrapolate what we see to the whole universe. If a report shows a positive result in one part of the operation, the tendency is to assume this result is valid throughout. In this analogy, square green chips represent accurate data whilst red chips or other shapes represent errant data.

Both human nature and data and information systems tend to filter out the negative and boost the positive, so data often looks better than it really is, and so does the information derived from it.

But sometimes the data looks entirely bad, though it is not so. The way we look dictates the apparent quality of the data.

Yet data has tangible and innate qualities – its accuracy, completeness and consistency – which together are an indication of its quality. And any data which has these qualities provides a foundation for better information quality, because the distortion caused by the view of the data is ameliorated. In this analogy the data has been made consistent and accurate – the coloured chips all have the same colour and shape.

And regardless of how we view that data, we see green square data. Data quality and information quality are different and yet rooted in each other. Data cannot be good if it incorrectly represents the information of which it is the stored version. Information cannot be good if it is based on incomplete, inaccurate and inconsistent data.

Data quality ensures that the data represents its real-world information entity accurately, completely and consistently. Information quality ensures that the context in which the data is presented provides a realistic picture of the original information that the data represents.

There’s a general feeling that only data which has a purpose should be stored. I would not agree, as purpose, like so much else, depends on context and our viewpoint. Data which has no purpose now may be required to fulfil an information requirement in the future, or be related to occurrences in the past; and anyone being paid to manage data, whether that data is used or not, finds it very meaningful – it pays his or her salary!

Ultimately data is used to source information, and information quality is important. But we should not blur the differences between data quality and information quality. Both are essential, and they are separate disciplines.

Thursday, July 28, 2011

Have you checked your country drop down recently?

When visiting a data quality software supplier’s site recently to download a white paper, I noticed that the country list on the sign-up form didn’t contain South Sudan (which became a new country on 9th July 2011) or the new territories which came into existence when the Netherlands Antilles were dissolved on 10th October 2010 (Bonaire, Curaçao, Saba, Sint Eustatius and Sint Maarten).

I shot off a tweet to the company concerned and they told me they were using the United States Department of State list. That list has added South Sudan but, shockingly, at the date of writing this blog post, has failed to make the changes required by the dissolution of the Netherlands Antilles.

As I mentioned in my post here, most companies rely on external sources for their country names and code lists, such as the World Bank and the United Nations (both of which use lists which exclude most of the world's territories); or the ISO (International Organization for Standardization), which still has not added South Sudan to its list.

Relying on other organisations for your country lists is problematic. To start with, unless you are aware of global changes (and too many people aren’t), or you check the list every day, you won’t notice changes as they happen, as with the data quality company I mentioned above. Secondly, maintenance of country code lists is not the core business of the United Nations, the World Bank and so on. They maintain a list in order to facilitate their own business – and that is rarely likely to coincide with your business. Many of these organisations are very heavily politically dependent or influenced, such as the United Nations or ISO (which doesn’t include Kosovo in its listing, for example), whereas you are likely to need to manage the reality of the situation on the ground, with less emphasis on political niceties. Finally these lists are often updated only long after a country has come into being – it can take ISO many months to assign a country code – whereas you will ideally need to be ready to make changes to your data before the country comes into existence.

When you’re managing international data your country code is likely to be linked to other data, such as currency, international telephone access code, address format or postal code structure, which is not taken account of in country name lists maintained purely for political purposes. Using lists which exclude Kosovo, for example – a de facto entity with language, currency and addressing differences from Serbia – will cause problems for your data quality.
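
To make that concrete, here is a minimal sketch of a country reference record with the country code linked to the attributes a data manager actually needs, plus a simple audit that flags a form's country list as out of date. The field names and sample values are illustrative only, not a real reference source:

    # A country reference linking each code to related data attributes,
    # plus an audit of a form's country drop-down. Illustrative data only.
    COUNTRY_REFERENCE = {
        "SS": {"name": "South Sudan", "currency": "SSP", "phone_prefix": "+211"},
        "CW": {"name": "Curaçao", "currency": "ANG", "phone_prefix": "+599"},
        "XK": {"name": "Kosovo", "currency": "EUR", "phone_prefix": None},  # "XK" is a user-assigned placeholder; no ISO code exists
    }
    DISSOLVED = {"Netherlands Antilles"}  # entities which should no longer appear

    def audit_country_dropdown(form_countries):
        """Flag current countries missing from a form, and dissolved ones lingering in it."""
        missing = [r["name"] for r in COUNTRY_REFERENCE.values()
                   if r["name"] not in form_countries]
        stale = [c for c in form_countries if c in DISSOLVED]
        return missing, stale

    missing, stale = audit_country_dropdown(["Netherlands Antilles", "Serbia"])
    print("Missing:", missing)  # ['South Sudan', 'Curaçao', 'Kosovo']
    print("Stale:", stale)      # ['Netherlands Antilles']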

Maintenance of country lists and codes needs to be given more thought and more attention. If you’re not in a position to manage your own lists, take a look at the one we offer: http://www.grcdi.nl/countrycodes.htm . It may not suit your needs, but it is one of the few lists created without a political agenda, updated ahead of requirement, and with name and address data management specifically in mind. Using a correct and up-to-date country list will improve your data quality and can save you from considerable embarrassment.

Tuesday, March 29, 2011

Data silos - learn to live with them.



Mention data silos (separate data, databases or sets of data files that are not part of an organization's enterprise-wide data administration) and the blood pressure of any self-respecting data quality grafter will rise. Data silos are unconnected to data elsewhere in an organisation and thus are unable to share common types of information, which may create inconsistencies across multiple enterprise applications and may lead to poor business decisions.

One thing, though, that we need to think about: in almost all cases data is in an organisation to support the work that that organisation does. The organisation is not there to improve data quality. My proviso is that every organisation has an obligation to the data owners (usually its customers) to manage that data correctly. Thus any improvement in data quality must serve the organisation and its customers, not the data in and of itself.

Data silos are inevitable, and that will never change. Organisation-wide data management systems are overcomplex and put too great a distance between the staff and the data they need. An employee's main focus is (or should be) to do the work that they are being paid to do as well as they can. If an organisation cannot provide easy access to the tools that its staff need (hardly possible unless every member of staff has an accompanying IT employee at their beck and call), then they will reach for the tools that they can use - pen and paper, local spreadsheets, documents and database tables. Few employees (or work groups or departments) are going to wait around and do nothing while an attempt is made to hook their requirements into the cumbersome company database systems.

So whilst a CEO depending on a central database system to provide his or her business intelligence may suffer from data silos (only may, because it depends on what is being stored in those silos), the troops doing the real work find having the data they need at their fingertips a distinct advantage.

I've been on both sides of the silo debate. I dug my heels in hard when attempts were made to integrate my data island into the company database, because I knew it would not fulfil requirements, it would not enable me to do my job, it would be a waste of money, and I would not use it. I was right on all counts. At the same time I was trying to get the sales staff to give up their data for integration into a central marketing database. I failed for the same reasons (and because the staff were not going to give up the data which gave them the edge on their colleagues, improved their commission and enabled them to spend two weeks every year in the Caribbean as a reward from the company - incentives can be very bad for data quality).

Naturally data silos in certain parts of a company are very damaging, without dispute. You can't have a call centre employee jotting a caller's details onto a piece of paper or into an Excel spreadsheet instead of into their centralised system, for example. But not all data silos are evil.

As Jim Harris points out here, it's not so much having data silos that is the problem: it's what is done with that data. A much more open and accepting attitude to data silos is required, because we may wail and gnash our teeth as much as we like - they're not going to go away.

Sunday, March 20, 2011

Are technology advances damaging your data quality?



Way back on a Friday evening in 1989 I took home a computer, a software package called FoxPro and a manual. By Monday morning I had written a package to manage the employees and productivity of the telephone interviewing department I was running at the time.

I've been using FoxPro ever since. It's fast, powerful, easy to use and it has unparalleled string manipulation capabilities. Unfortunately, Microsoft have stopped developing the product, and will stop supporting it in a few years' time, so I recently started looking for a good replacement.

At first I thought I wasn't getting it. I checked site after site, product after product, expert after expert, and instead of finding products which were easier to use – more accessible, more flexible, more agile, more data-centric – I found products which were technically challenging: over-complex, cumbersome, putting wall after wall between me and my data, and requiring reams of coding to do the simplest action, like adding a field, removing a field or expanding a field. And most were priced to suit the budgets of companies with turnovers matching the GDP of a large African country. Yes, there are some applications (very few) that try to make the process easier, but they are primitive and clunky.

Most are based on SQL (that's a query language, ladies and gentlemen - the clue's in the name - and really very difficult to use for any type of string processing) and on client-server setups that require a high level of technical knowledge. You can get a record, display a record and store the modifications (what SQL was designed to do), but if you want to do more than that it gets tough.
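
To illustrate the string-processing point: routine record-level cleanup that takes a few lines in a procedural language is painful to express in pure SQL. A minimal sketch in Python (my stand-in here for any procedural language; the cleanup rules are deliberately simplistic):

    # Routine name-field cleanup: trivial procedurally, awkward in pure SQL.
    import re

    def normalise_name(raw):
        """Collapse whitespace, strip stray punctuation, and recase each word."""
        cleaned = re.sub(r"\s+", " ", raw).strip()       # collapse runs of whitespace
        cleaned = re.sub(r"[^\w\s'.-]", "", cleaned)     # drop stray punctuation
        return " ".join(word.capitalize() for word in cleaned.split())

    print(normalise_name("  mC^donald ,   jOHN  "))  # "Mcdonald John"
    # Even this is imperfect ("McDonald" needs more rules) - exactly the kind
    # of processing a pure query language leaves you struggling with.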

I tried to work with some of them and just couldn't get my head around them. Never mind a whole application in a weekend - creating a database in a weekend was a challenge. My synapses may not be working at the speeds that they did in 1989, but I'm not in my dotage just yet.

Many support packages - data profilers and so on - have been created to work solely with these high-end packages, even the free and open-source variants, cutting out a huge chunk of the data market.

I wasn't getting it. And then I realised I was getting it. This is the state of play in the data industry at the moment. A chasm has grown between easy-to-use, cheap but less scalable products (Microsoft Access, Visual FoxPro, FileMaker and so on) and those scalable but complex (and far too expensive) client-server applications.

So how does this work in practice? Joe from sales wants to create a database of his contacts. He can't manage the complexity of the SQL system in which the company is working, so he'd have to send e-mails, set up meetings to request this database, get the IT department to put it into their planning, wait six months and watch his sales plummet. Or he could open Access or Excel and start typing. Guess which option most people take?

These systems encourage the creation of data silos (more about those in my next post).

These databases also adversely affect data quality because they put a distance between users and their data. Being unable to browse through the raw data is a handicap in data quality terms. Data which is filtered through systems – even to the extent of any SQL query other than SELECT * – will be less useful and of lower quality.

The data software industry needs to take a close look at what they're up to and ask themselves if they should really be producing data products which can only be used and understood by highly trained technical staff, because they're not giving the data quality industry an iota of help. They should open up their programs to those common file formats that they're currently ignoring, such as .dbf. Their products need to be made easier, more flexible, more agile and a good deal cheaper.
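
To show how low that bar could be: reading a .dbf table takes a couple of lines using, for example, the third-party Python package dbfread (named here as one illustration; any similar library would make the point):

    # A minimal sketch: reading a FoxPro/dBASE .dbf table with the third-party
    # dbfread package (pip install dbfread). "customers.dbf" is a hypothetical file.
    from dbfread import DBF

    for record in DBF("customers.dbf"):
        print(record)  # each record is a mapping of field name to value

If a two-line read is possible, there is little excuse for the big packages ignoring the format.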

As for me, I'm back with FoxPro, and I have decided to stop apologising for using it. It allows me to produce top-quality data, and that's what it's all about.