GRC Database Information: 2011

Thursday, December 22, 2011

Effin' obscenity screening!

You can read my new guest post for PostcodeAnywhere, about obscenity screening in an international environment, here: http://blog.postcodeanywhere.co.uk/index.php/effin-obscenity-screening/

Wednesday, October 5, 2011

I often warn people about the speed of change. Not just the speed of change of data within your database, but the speed of change of that data's main context - the world in which we live. People know that change happens but tend to underestimate how fast and how far reaching many of these changes are.

The world changes so fast that I have to upload a new version of the Global Sourcebook for Name and Address Data Management almost weekly.

To put some numbers to that change I looked at the past nine years and summarised the speed at which change is occurring. On average, there are changes in this many countries every year in these areas:

Telephone number systems: changes in 60 countries per annum
Administrative districts (top level): 14.4 countries per annum
Currencies: 4.7 countries per annum
Postal code and addressing systems: 4.3 countries per annum
Country names: 1.4 countries per annum
New countries: 0.8 per annum

This is a surprisingly fast rate of change. Are you keeping up with our dynamic world?
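As a rough illustration of why "almost weekly" updates follow from these figures, the averages above can be added up and spread over a year. This is only a sketch: it assumes the categories can simply be summed (a country that changes its currency and its telephone numbering in the same year is counted twice), and the function names are made up for the example.

```python
# Average number of countries per year with a change in each area,
# copied from the list above.
CHANGES_PER_YEAR = {
    "Telephone number systems": 60.0,
    "Administrative districts (top level)": 14.4,
    "Currencies": 4.7,
    "Postal code and addressing systems": 4.3,
    "Country names": 1.4,
    "New countries": 0.8,
}


def total_changes_per_year(rates):
    """Sum the per-area averages (assumes the areas can simply be added)."""
    return sum(rates.values())


def changes_per_week(rates, weeks_per_year=52):
    """Spread the yearly total evenly over the weeks of a year."""
    return total_changes_per_year(rates) / weeks_per_year
```

Under those assumptions the total comes to roughly 85 country-level changes a year, or somewhere between one and two a week.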

Tuesday, September 20, 2011

The Kaleidoscope of Data and Information

John Owens, whom I much admire, made this comment on a Henrik Liliendahl Sørensen blog post about the difference between data quality and information quality. Taken completely out of context, he said:

“…it is the quality of the information (i.e. data in a context) that is the real issue, rather than the items of data themselves. The data items might well be correct, but in the wrong context, thus negating the overall quality of the information, which is what the enterprises uses. It will be interesting to see how long it is before data quality industry arrives at this conclusion. But, if they ever do, who will be courageous enough to say so?”

I agree entirely, yet disagree profoundly. Data and information are not the same thing yet are inextricably linked – one without the other isn’t possible but they still must not be confused. Data and information are as different as chickens and eggs, but are equally dependent upon each other.

Basically, data is stored information whilst information is perceived data.

Data and information are immutably linked – I have never found data which isn’t stored information nor information that isn’t rooted in data – but as they are different parts of a cycle they need to be defined, understood and managed as two separate entities. The challenge with data is keeping it complete, accurate and consistent. The challenge with information is to perceive the information that the data is a stored version of without alteration and in a way that gives clarity. It is at the information stage that we should be thinking about fitness for purpose, not at the data stage.

Let me give you an example from a recent episode of the BBC’s science program Bang Goes the Theory. A presenter went to a shopping centre and prepared two plates of bacon sandwiches. One was accompanied with the message that regularly eating processed meats increases the chances of getting bowel cancer by 20%. The other was accompanied by the message that regularly eating processed meats increases the chances of getting bowel cancer from 5% to 6%. Though the data underlying both pieces of information is identical, as is the information provided, the audience were understandably worried when seeing the first message but happy to tuck in after seeing the second. The first message would be fit for the purposes of the health authority, the second for the bacon marketing board, but in neither case is the fitness for purpose related to the data - it is related to the information provision.
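The arithmetic behind the two messages can be sketched in a few lines. The only inputs are the 5% and 6% figures quoted above (presumably rounded); the function names are invented for the illustration.

```python
def absolute_increase(baseline, exposed):
    """Difference in risk, in the same units as the inputs (percentage points if scaled)."""
    return exposed - baseline


def relative_increase(baseline, exposed):
    """Increase in risk expressed as a fraction of the baseline risk."""
    return (exposed - baseline) / baseline


baseline_risk = 0.05  # 5% chance without regular processed meat
exposed_risk = 0.06   # 6% chance with regular processed meat

# The same two numbers yield both messages:
# an absolute rise of about 0.01 (one percentage point) ...
rise_in_points = absolute_increase(baseline_risk, exposed_risk)
# ... and a relative rise of about 0.2 (the "20%" headline).
rise_relative = relative_increase(baseline_risk, exposed_risk)
```

Neither calculation changes the stored data; only the presentation differs.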

It is at the points where information becomes data and data becomes information that the potential for corruption and misunderstanding of the data and its perception are at their highest. We also know that once data is stored, inert though it may appear to be, it cannot be ignored as the real world entities to which the data refers may change, and that change needs to be processed to update the data.

Those of a certain age may remember having kaleidoscopes as children. Tubes of tin or cardboard with a clear bottom in which there were chips of coloured glass or plastic, a section of which could be viewed and with mirrors creating a symmetrical pattern from that section. Move the kaleidoscope and patterns form, patterns which change and are always different, though the coloured chips themselves never change their inherent properties when being viewed. Whether your data is a shopping list or a data warehouse containing hundreds of tables and millions of records, working with data and information is much like looking through the kaleidoscope.

Depending on how we view it, we tend to see something different every time we look. Reports, dashboards, views, queries, forms, software, hardware, your cultural background and the way your brain is wired will all alter the perception of the data for us and thus have an enormous influence on the information we’re receiving from the data.

Like a kaleidoscope we tend to extrapolate what we see to the whole universe. If a report shows a positive result in one part of the operation, the tendency is to assume this result is valid throughout. In these examples square green chips represent accurate data whilst red chips or chips of other shapes represent errant data.

Both human nature and data and information systems tend to filter out the negative and boost the positive, so often data looks better than it really is, and so then is the information derived from it.

But sometimes the data looks entirely bad, though it is not so. The way we look dictates the apparent quality of the data.

Yet data has tangible and innate qualities, its accuracy, completeness and consistency, which together are an indication of its quality. And any data which has these qualities provides a foundation for better information quality because the perversion caused by the view of the data is ameliorated. In these examples the data has been made consistent and accurate – the coloured chips have the same colour and shape.

And regardless of how we view that data, we see green square data. Data quality and information quality are different and yet rooted in each other. Data cannot be good if it represents the information that it is the stored version of incorrectly. Information cannot be good if it is based on incomplete, inaccurate and inconsistent data.

Data quality ensures that the data represents its real world information entity accurately, completely and consistently. Information quality is working to ensure that the context in which the data is presented provides a realistic picture of the original information that the data is representing.

There’s a general feeling that only data which has a purpose should be stored. I would not agree as purpose, as with so much, depends on context and our viewpoint. Data which has no purpose now may be required to fulfil an information requirement in the future or be related to occurrences in the past; whilst anybody who is being paid to manage data, whether it is used or not, finds his or her salary very meaningful!

Ultimately data is used to source information, and information quality is important. But we should not confuse data quality with information quality. Both are essential, and they are separate disciplines.

Thursday, July 28, 2011

Have you checked your country drop down recently?

When visiting a data quality software supplier’s site recently to download a white paper, I noticed that the country list on the sign up form didn’t contain South Sudan (which became a new country on 9th July 2011) or the new territories which came into existence when the Netherlands Antilles were dissolved on 10th October 2010 (Bonaire, Curaçao, Saba, Sint Eustatius and Sint Maarten).

I shot off a tweet to the company concerned and they told me they were using the United States’ Department of State list. That list has added South Sudan but, shockingly, at the date of writing this blog, has failed to make the changes required by the dissolution of the Netherlands Antilles.

As I mentioned in my post here, most companies rely on external sources for their country names and code lists, such as the World Bank and the United Nations (both of which use lists which exclude most of the world's territories), or the ISO (International Organization for Standardization), which still has not added South Sudan to its list.

Relying on other organisations for your country lists is problematic. To start with, unless you are aware of global changes (and too many people aren’t), or you check the list every day, you won’t notice changes as they happen, as with the data quality company I mentioned above. Secondly, maintenance of country code lists is not the core business of the United Nations, the World Bank and so on. They maintain a list in order to facilitate their own business – and that is rarely likely to coincide with your business. Many of these organisations are very heavily politically dependent or influenced, such as the United Nations or ISO (which doesn’t include Kosovo in its listing, for example), whereas you are likely to need to manage the reality of the situation on the ground, with less emphasis on political niceties. Finally these lists are often updated only long after a country has come into being – it can take ISO many months to assign a country code – whereas you will ideally need to be ready to make changes to your data before the country comes into existence.

When you’re managing international data your country code is likely to be linked to other data, such as currency, international telephone number access code, address format or postal code structure, which is not taken account of in country name lists being maintained purely for political purposes. Using lists which exclude Kosovo, for example, which is a de facto entity and has language, currency and addressing differences with Serbia, will cause problems for your data quality.

Maintenance of country lists and codes needs to be given more thought and more attention. If you’re not in a position to manage your own lists, take a look at the one we offer: http://www.grcdi.nl/countrycodes.htm . It may not suit your needs, but it is one of the few lists created without a political agenda, which is updated ahead of requirement, and with name and address data management specifically in mind. Using a correct and up-to-date country list will improve your data quality and can save you from considerable embarrassment.
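For anyone who wants to automate the check described at the start of this post, a minimal sketch follows. It tests a drop-down's entries only against the names mentioned here; the function names are hypothetical, and the exact-name comparison is a simplifying assumption (a list that groups several territories into one entry would need its own rule).

```python
import unicodedata

# Entities named in this post that a country list should contain by mid-2011.
REQUIRED = [
    "South Sudan",
    "Bonaire",
    "Curaçao",
    "Saba",
    "Sint Eustatius",
    "Sint Maarten",
]

# Entity named in this post that was dissolved on 10th October 2010.
OBSOLETE = ["Netherlands Antilles"]


def _key(name):
    """Comparison key: ignore case and accents so 'Curacao' matches 'Curaçao'."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


def audit_country_list(names):
    """Return (missing, stale): required names absent, obsolete names still present."""
    present = {_key(n) for n in names}
    missing = [n for n in REQUIRED if _key(n) not in present]
    stale = [n for n in OBSOLETE if _key(n) in present]
    return missing, stale
```

Calling `audit_country_list` with the option labels of a sign-up form returns the names that still need adding and those that should be retired. The two constant lists would themselves need maintaining as the world changes.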

Tuesday, March 29, 2011

Data silos - learn to live with them.

Mention data silos (separate data, databases or sets of data files that are not part of an organization's enterprise-wide data administration) and the blood pressure of any self-respecting data quality grafter will rise. Data silos are unconnected to data elsewhere in an organisation and thus are unable to share common types of information, which may create inconsistencies across multiple enterprise applications and may lead to poor business decisions.

One thing, though, that we need to think about: in almost all cases data is in an organisation to facilitate the work that that organisation does. The organisation is not there to improve data quality. My proviso is that every organisation has an obligation to the data owners (usually its customers) to manage that data correctly. Thus any improvement in data quality must serve the organisation and its customers, not the data in and of itself.

Data silos are inevitable, and that will never change. Organisation-wide data management systems are overcomplex and put too great a distance between the staff and the data they need. An employee's main focus is (or should be) to do the work that they are being paid to do as well as they can. If an organisation cannot provide ease of access to the tools that they need (hardly possible unless every member of staff has an accompanying IT employee at their beck and call) then they will reach for the tools that they can use - pen and paper, local spreadsheets, documents and database tables. Few employees (or work groups or departments) are going to wait around and do nothing while an attempt is made to hook their requirements into the cumbersome company database systems.

So whilst a CEO depending on a central database system to provide his or her business intelligence may suffer from data silos (only may, because it depends what is being stored in those silos), the troops doing the real work find having the data they need at their fingertips a distinct advantage.

I've been on both sides of the silos debate. I dug my heels in hard when attempts were made to integrate my data island into the company database because I knew it would not fulfil requirements, it would not enable me to do my job, it would be a waste of money, and I would not use it. I was right on all counts. At the same time I was trying to get the sales staff to give up their data for integration into a central marketing database. I failed for the same reasons (and because the staff were not going to give up the data which gave them the edge on their colleagues, improved their commission and enabled them to spend two weeks every year in the Caribbean as a reward from the company - incentives can be very bad for data quality).

Naturally data silos in certain parts of a company are very damaging, without dispute. You can't have a call centre employee jotting a caller's details onto a piece of paper or into an Excel spreadsheet instead of into their centralised system, for example. But not all data silos are evil.

As Jim Harris points out here, it's not so much having data silos that is the problem: it's what is done with that data. A much more open and accepting attitude to data silos is required, because we may wail and gnash our teeth as much as we like - they're not going to go away.

Sunday, March 20, 2011

Are technology advances damaging your data quality?

Way back on a Friday evening in 1989 I took home a computer, a software package called Foxpro and a manual. By Monday morning I had written a package to manage the employees and productivity of the telephone interviewing department I was running at the time.

I've been using Foxpro ever since. It's fast, powerful, easy to use and it has unparalleled string manipulation capabilities. Unfortunately, Microsoft have stopped developing the product, and will stop supporting it in a few years' time, so I recently started looking for a good replacement.

At first I thought I wasn't getting it. I checked site after site, product after product, expert after expert, and instead of finding products which were easier to use: more accessible, more flexible, more agile, more data-centric, I found products which were technically challenging: over-complex, cumbersome, which put wall after wall between me and my data, which required reams of coding to do the simplest action like add a field, remove a field, expand a field, and so on. And most of which were priced to suit the budgets of companies with turnovers matching the GDP of a large African country. Yes, there are some applications (very few) that try to make the process easier, but they are primitive and clunky.

Most are based on SQL (that's a query language, ladies and gentlemen - the clue's in the name - and really very difficult to use to do any type of string processing) and based on client-server setups that require a high level of technical knowledge. You can get a record and display a record and store the modifications (what SQL was designed to do), but if you want to do more than that it gets tough.

I tried to work with some of them and just couldn't get my head around them. Never mind a whole application in a weekend - creating a database in a weekend was a challenge. My synapses may not be working at the speeds that they did in 1989, but I'm not in my dotage just yet.

Many support packages - data profilers and so on - have been created to work solely with these high-end packages, even those free and open source variants, cutting out a huge chunk of the data market.

I wasn't getting it. And then I realized I was getting it. This is the state of play in the data industry at the moment. A chasm has grown between easy to use, cheap but less scaleable products (Microsoft Access, Visual Foxpro, FileMaker and so on) and those scaleable but complex (and far too expensive) client-server applications.

So how does this work in practice? Joe from sales wants to create a database of his contacts. He can't manage the complexity of the SQL system in which the company is working so he'd have to send e-mails, set up meetings to request this database, get the IT department to put it into their planning, wait six months and watch his sales plummet. Or he could open Access or Excel and start typing. Guess which is the option most people take?

These systems encourage the creation of data silos (more about those in my next post).

Data quality is adversely affected by these databases also because they put a distance between the user and their data. Being unable to actually browse through the raw data is a handicap in data quality terms. Data which is filtered through systems, even to the extent of any SQL query other than SELECT *, will be less useful and have less quality.

The data software industry needs to take a close look at what they're up to and ask themselves if they should really be producing data products which can only be used and understood by highly trained technical staff, because they're not giving the data quality industry an iota of help. They should open up their programs to those common file formats that they're currently ignoring, such as .dbf. They need to be made easier, more flexible, more agile and a good deal cheaper.

As for me, I'm back with Foxpro, and I have decided to stop apologising for using it. It allows me to produce top quality data, and that's what it's all about.

Wednesday, February 23, 2011

Data quality and information perception

One of my mantras for ensuring data quality is to look at the data. Not at profiles or analyses or graphics but at the raw data. Browse through it and the best data quality tool that there is - your brain - will highlight data quality issues in seconds.

Information is rooted in, and derived from, data. When information is based upon data which is poor in quality the information will mislead.

This was highlighted by an attempt to bring together opinions about the current tensions in North Africa and the Middle East from blogs, feeds, social media and so on in a graphic form on The Guardian website (here).

We see Muammar Gaddafi there, nice and large, and .... but hang on a second. There's Moammar Gadhafi too. And Muammar Qaddafi. And then just plain old Gaddafi. The same person represented in four different places on the graphic because of transliteration issues. If these 4 entries had been brought together in a single place the graphic would look different.

As I mentioned, this issue is mainly caused by different transliterations of a person's name from Arabic, but this sort of variance within data is very common. Place names, for example, are often found written in many different ways within the same databases. Basing decisions on such variant data would be unwise. Yet decisions are based on data like this, from profiles and graphics like this, every minute of every day.
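To show how the four entries could be brought together, here is a deliberately crude sketch. It builds a matching key from the consonants of the last word of each name, which happens to unify the four spellings above. It is an illustration only, not a transliteration standard: the rules are assumptions chosen for this one example, and a key this loose would wrongly merge many unrelated names in real data.

```python
from collections import Counter

# The four spellings seen in the graphic.
VARIANTS = [
    "Muammar Gaddafi",
    "Moammar Gadhafi",
    "Muammar Qaddafi",
    "Gaddafi",
]


def surname_key(name):
    """Crude key: last word, q/k treated as g, vowels and h dropped, repeats collapsed."""
    surname = name.split()[-1].lower()
    key = []
    for ch in surname:
        if ch in "qk":
            ch = "g"
        if ch in "aeiouh" or not ch.isalpha():
            continue
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key)


def merge_counts(mentions):
    """Add up mention counts for names that share the same key."""
    merged = Counter()
    for name, count in mentions.items():
        merged[surname_key(name)] += count
    return merged
```

All four spellings map to one key, so their mention counts would be added together rather than shown as four separate bubbles. Browsing the merged groups by eye afterwards is still the only way to catch false matches.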

Go on, grab a coffee and go and look at your data. You'll be amazed.

Thursday, February 10, 2011

I see bubbles

We're humans, by nature short-termist and usually not capable of learning from history.

There was a time when advertising was a blanket phenomenon - you broadcast your message on television, the radio or in print media and hoped that it was seen by as many people as possible. If they were people who had an interest in your message, so much the better.

Then along came direct marketing as a way of showing your message only to those who might be interested. Inevitably, through various shortcomings including poor data quality, it wasn't long before direct marketing got a bad reputation - "junk mail" and so on.

And now we have the Internet. Yet, despite its opportunities, we seem to have regressed back to the blanket coverage approach prominent before direct marketing.

When somebody hits your page they tell you a remarkable amount about themselves: where they are, the language of their browser and operating system, how they reached your page and so on. Unless their system is infected with a virus, you shouldn't know any more about them as individuals. But what is also known is the contents of the page being viewed and where it is hosted.

How is this information being used by the Internet advertising giants? In a non-scientific but pretty revealing study, I analysed over a period of some weeks those advertisements being shown on the various pages that I visited. What I found was:

- 35.6% of advertisements were in the language of the website and suitable for somebody living in the country where the site was being hosted, but not for me. Adverts for German school reunion sites or the American Girl Guides association are just not correctly targeted.

- 38.8% were in Dutch (the language of the country in which I reside, though not the language of my browser or operating system) and were aimed in the same scatter gun way as broadcast adverts are - adverts for holidays, electricity companies and so on.

- 19.4% were those deliberately misleading or downright criminal adverts asking you to count how many bouncing balls you see, or informing you that you are today's lucky 1 millionth visitor. If I won prizes on each of these clicks I'd have a GDP larger than China's.

- Shockingly, at least to me, only 4.4% were relevant to the contents of the website being viewed - baby clothing adverts on baby name sites, for example. Almost as many were the antithesis of what should have been shown - adverts for Thai brides by post sites on a site listing baby names is putting carts before horses!

- 0%: the percentage of advertisements I was tempted to click on.

And then the language of the adverts:

- 49.6% were in Dutch
- 44% were in the language of the website
- 4.7% were in English (on non-English language sites)
- 1.6% were in the language of the country where the site was hosted.

If nobody clicks on the adverts then no payment need be made, so companies and criminals don't feel bound to think of the consequences of the current scatter gun approach. Short-term thinking. It does, though, have negative consequences. In the hope of getting enough click throughs to get a decent return, many webmasters are placing so much advertising on their sites that it has become difficult to locate the content. And we, the users, are becoming less and less inclined to look at, let alone click on, the adverts.

Facebook

So much for the web. But how about social communities where people have provided information and where even the simplest of data mining techniques could improve the relevance of advertising? How about that doyenne of the Internet: Facebook?

I am very careful and specific in my use of Facebook, which makes my profile a good test of their targeted advertising. They know my age, gender, the university I went to, that I speak English, Dutch and French (in that order) and that I and 95% of my "friends" play squash, and that almost every communication I have on that site is about squash.

That being the case I would expect Facebook to show me adverts in English, possibly for sports equipment or similar. What I actually get:

- 71.4% of adverts in Dutch - they're looking again at my location, not my profile
- 26.8% in English
- 1.8% in German (eh? Did I say I speak German?? I don't.)
- 0% in French

- 0% relevant advertisements
- 100% irrelevant advertisements.

It is no surprise that Facebook ad performance is abysmal.

How many billions was Facebook deemed to be worth?

Are they mad?

I see bubbles.

Friday, February 4, 2011

Localization is evil!

OK, now I've got your attention let me modify that statement: over-localization is undesirable.

Localisation (with the spelling localised to my own culture) is the process of adapting or modifying a product, service or website for a given language, culture or region. However, in almost everybody's mind, localisation is synonymous with translation, and any other modifications, such as making web forms suitable for local address structures, are quickly overlooked. Localisation is intimately bound up with the concept of locales, which give a country/region and language combination which can be used in the localisation process.

The whole problem with this system is that PLACES DO NOT SPEAK LANGUAGES - PEOPLE DO.

There are many indigenous people who do not speak a nationally recognised language in their country, and in our mobile world many of us do not speak the language of the place where we live, or have a preference for another. Yet this fact is overlooked in almost every attempt at localization made.

A case in point. This blog site is owned by Google, and Google win award after award for their localisation, presumably based on a count of the number of languages their interfaces and programs are available in rather than any intelligent application of the idea. Yet Google assume, like most of the rest, that places speak languages rather than people. Though I state clearly in my preferences that I wish to use this site in English, Google sees where my IP address is situated and changes the interface into Dutch. When I go to Prague and attempt to log in I am expected to master Czech. When I reach Athens I will be asked for my details in Greek.

HP.com is similar. Attempt to get into hp.com from outside the USA and you will be taken to a local page (or local to somewhere - I get to the UK page for some unknown reason). This might be regarded as a service by some, but it has consequences. A user in Bulgaria, for example, searching for information about an HP product may click on the link in the search engine referring to the hp.com site. The site then switches the user back to a local Bulgarian site, where that information page is not available, and the user is presented with a 404 page not found error.

Look, I am absolutely in favour of translating information and I regard myself as a reasonable polyglot. But a stop must be put to translating on the basis of place. Users must be given the ability to override locale settings so that they can use their own languages and not those attached to a place.
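The order of precedence being argued for can be sketched in a few lines: what the user has explicitly chosen comes first, then what their browser asks for, and the language guessed from their location only as a last resort. Everything here is hypothetical (the function name, the parameters, the default of English); it is a sketch of the idea, not a description of how any of the sites mentioned actually work.

```python
def choose_language(explicit_preference, browser_languages, geo_language,
                    supported, default="en"):
    """Pick an interface language, letting people override places.

    explicit_preference: language the user set in their profile, or None
    browser_languages:   language tags sent by the browser, in order of preference
    geo_language:        language guessed from the IP address location, or None
    supported:           set of two-letter codes the site is available in
    """
    candidates = []
    if explicit_preference:
        candidates.append(explicit_preference)
    candidates.extend(browser_languages)
    if geo_language:
        candidates.append(geo_language)

    for tag in candidates:
        # Reduce tags such as "nl-NL" to the bare language code "nl".
        code = tag.lower().split("-")[0]
        if code in supported:
            return code
    return default
```

With this ordering a user who has asked for English keeps English in Prague or Athens, and the location-based guess is used only when nothing better is known.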

Wednesday, January 26, 2011

IAIDQ Blog carnival for Information/Data Quality, December 2010

It's my pleasure and privilege to again be hosting the IAIDQ blog carnival, this time for posts made in December 2010. I'm glad to say that the festivities didn't reduce the flow of top quality posts, some of which I have highlighted here.

If you work in business and you're attempting to get a data quality initiative off the ground, William Sharp, over at the Data Quality Chronicle, has outlined a data quality program's 10 best practices. A useful checklist whatever your data quality background.

Jim Harris' steady stream of wisdom over at the Obsessive-Compulsive Data Quality Blog hardly let up in December, with a look back at the best data quality posts of 2010 (which I have to mention, because Jim kindly included not one but two of mine in the list!).

I've been thinking (probably rather more than is healthy) quite a lot recently about data silos, and Jim casts his usual sensible eye over just that subject when discussing The Good Data.

I don't know where Jim finds the time to read, but he's read "A Confederacy of Dunces" by John Kennedy Toole and he discusses our propensity to delete data in A Confederacy of Data Defects.

And finally, whilst Jim was looking back over 2010, Steve Sarsfield, in his data governance and data quality insider blog, was looking forward to 2011. The brave Steve has made six data management predictions for 2011. I'm wondering if that's a misprint - won't we be waiting until 2051 for most of those things to happen? ;-) What do you think?

Thursday, January 13, 2011

The myth of deliverability.

I am often asked whether an address is "deliverable", and not everybody is happy with my usual response that it depends on the mood of the postal staff on duty that day.

The point is that "deliverability" is unmeasurable, unscientific and has little basis in reality. Address validation software will often give a figure for the number of deliverable addresses within a file, such as 80%, but don't be fooled - this does not mean that, if you send a letter to each address within that file, 80% will get there and 20% will not. These numbers have been created to give some sort of feedback to the user, and to make files comparable with each other when run through the same software processes.

So how come there's no way to measure deliverability?

You could look at a country like The Netherlands, with its neat address system, and boldly state that a deliverable address is one where the postal code and the building number are present. Mind you, if either contains a typo, the mail may be deliverable, but not to the desired recipient. Equally, whilst TNT Post is happy with that much information, because it will get them to a letter box, to make a mailpiece deliverable to a person, more information is required - a sub-building indicator (as many addresses may share the same house number/postal code), a name, department and so on.

So, I've got all that. So the address is deliverable. Right?

A mailpiece containing my correct postal code and house number took 6 months to reach me. Not because the information was wrong, but because a second, stray, postal code had wormed its way into the address block, sending the mailpiece around the system ad infinitum. It was only when a particularly awake postal worker saw and crossed through the stray code that the mailpiece could get sent on its (correct) way.
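A stray second code of this kind is at least something software can flag. The sketch below assumes the usual Dutch postal code shape of four digits (the first not zero) followed by two letters, and simply reports every distinct code it finds in an address block. The function names and the sample addresses in the usage note are made up, and passing the check says nothing about whether the mail will arrive.

```python
import re

# Assumed shape of a Dutch postal code: 4 digits (first one 1-9), optional space, 2 letters.
POSTCODE = re.compile(r"\b[1-9][0-9]{3} ?[A-Z]{2}\b")


def find_postcodes(address_block):
    """Return the distinct postal codes in the block, without spaces, sorted."""
    found = POSTCODE.findall(address_block.upper())
    return sorted({code.replace(" ", "") for code in found})


def has_stray_postcode(address_block):
    """True when more than one distinct postal code appears in the block."""
    return len(find_postcodes(address_block)) > 1
```

For an invented block such as "Kerkstraat 12, 1234 AB Amsterdam" one code is found; add a second, different code on another line and `has_stray_postcode` returns True, which is a prompt for a human to look at the address.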

But not having a full address, or even any address, does not make a mailpiece undeliverable! I remember the TV program That's Life! on the BBC successfully receiving mail sent with just a drawing of a prominent set of teeth on the front - an allusion to Esther Rantzen's somewhat toothy smile.

This Christmas a card was sent with this address on it:

Mr & Mrs T Burlingham?
Near the golf course in Thetford,
Norfolk.
Trevor is a photographer. This might help

It did help. This undeliverable address took just 2 days to arrive. But this isn't just a British thing. How about one of my favourites:

Translation:

Vukasin 6 years and Jelisaveta 3 years
I don't know the family name
PRUSKA GORA
SERBIA
Their father is big, he drives a Citroën Belingo.
He works on the little trains for tourists.
Postman: Please find them!

And he did!

Tuesday, January 4, 2011

Praise for a form

You know that I have a thing about forms, and I spend a lot of time criticising them. But I did always say that if I found an example of a good form, I'd sing its praises on high.

So, this is me singing.

The form is not an international form - it's for the Dutch market and is in Dutch - but it shows some good practices that I'd like to highlight. The form is from ShopPartners.

Dutch addresses are among the few in the world where a whole address can be derived from very little information - in our case a postal code and a house number - and the form starts by asking for that information to save the customer from typing their whole address - and the less you make your customer type, the happier they are. They show very clearly the number of steps you'll have to go through to order, and the section to the right gives clear examples of what they want should there be any doubt in the mind of the customer.

The second step expands the form with the address autocompleted. Nothing earth shattering there.

The form asks whether the customer is a private individual or a business - a great way of preventing customers from having to fill in or skip over fields which are not relevant to them, such as VAT numbers.

But what did make me smile - no, it made me get up and jig around my office - was that this is the first Dutch form I have come across which does not ask your form of address (Mr or Mrs - no other choices are given) and use it to assume your gender. Somebody has thought about this, and apart from Mr and Mrs the customer can also specify a department name or the name of a family. Still no chance for me to add Dr, Professor, Lord, Sir, Dipl. Ing., Mag. or any other one of the hundreds of forms of address that exist; but I do get the chance to choose NONE (my own preference) - an escape route for customers not covered by the options provided - and this is rarely allowed on forms.

And again, descriptive examples of what is required are provided to the right of the fields.

Now, plenty of usability experts will pull me up and mention element placement, colours and numerous other "problems" with this form, but for me it's a huge step forward! Well done ShopPartners!

Want to know more about collecting good data from your web form? Download my free e-book here. And no form to fill in either!
