Sunday, May 31, 2009

What about the rest of the data?

I'm currently reading a most worthy book about data quality, and, like most other books about data quality I've come across, its gaze is fixed completely on data held in large corporate entities. Large companies where data is amassed across myriad systems, current and legacy; where there is a separate IT department; where there are enough staff to create a data quality working group; where money is no object when it comes to tools to assess, process and cleanse that data.

But is that really where most of the world's data is held? Obviously it's where we most come in contact with it - when a large utility messes up its invoicing procedure, we know about it very quickly - but I would guess that more data is held in small spreadsheets and databases and documents in small- and medium-sized companies than is held in large corporations. I'm obviously not typical, but my own databases hold around 40 million records. In these small companies, there are unlikely to be distinct marketing or IT departments, a budget for data tools, or enough staff to create DQ teams.

Has anybody ever estimated how much data is held outside large corporations, or written about how they will go about improving their data quality? It must be the sun, as this is playing on my mind ...

Thursday, May 21, 2009

How Microsoft Access can damage your data quality

OK, so I've bitten the bullet. Much against my better judgement I'm making a concerted effort to learn Microsoft Access 2007, to allow me to produce some data tables with Unicode characters.

Plugging away, I found an interesting aspect of field input masks which is guaranteed to produce data quality issues. When adding a field mask to Access 2007, it "helpfully" provides a number of ready-made options:



Because this is a UK edition of the software, Microsoft have helpfully provided a telephone number mask and a postal code mask that they think cover the UK. Looking at the postal code mask itself:



and we see how it is made up. '0' indicates a required digit, 'L' a required letter, and '>' forces upper case.

Now, in the UK, AA99 9AA is indeed a valid postal code format. It is, however, only one of seven valid formats:

A9 9AA
A99 9AA
A9A 9AA
AA9 9AA
AA99 9AA
AA9A 9AA
AAA 9AA

Thus, whilst this mask can take OX19 6RY, it can't take OX9 6RY or SW1A 4WW or S1 1AA or N45 1AP or ... well, most of them.
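
To make the gap concrete, here is a minimal Python sketch (not Access mask syntax) that checks a code against all seven layouts listed above, and against the single AA99 9AA layout for comparison. The function and variable names are made up for illustration, and the check covers layout only: it says nothing about whether a postal code actually exists.

```python
import re

# The seven layouts listed above: A = letter, 9 = digit.
LAYOUTS = [
    "A9 9AA", "A99 9AA", "A9A 9AA",
    "AA9 9AA", "AA99 9AA", "AA9A 9AA",
    "AAA 9AA",
]

# Translation of each layout symbol into a regular-expression fragment.
_TOKENS = {"A": "[A-Z]", "9": "[0-9]", " ": " "}


def layout_to_regex(layout):
    """Turn a layout such as 'AA99 9AA' into a regular expression."""
    return "".join(_TOKENS[ch] for ch in layout)


# One pattern that accepts any of the seven layouts.
UK_LAYOUT_RE = re.compile("|".join(layout_to_regex(l) for l in LAYOUTS))


def matches_uk_layout(text):
    """True if text has one of the seven layouts (case-insensitive)."""
    return UK_LAYOUT_RE.fullmatch(text.strip().upper()) is not None


def matches_single_layout(text, layout="AA99 9AA"):
    """True only if text has the one given layout, as a fixed mask would require."""
    return re.fullmatch(layout_to_regex(layout), text.strip().upper()) is not None


if __name__ == "__main__":
    for code in ["OX19 6RY", "OX9 6RY", "SW1A 4WW", "S1 1AA", "N45 1AP"]:
        print(code, matches_single_layout(code), matches_uk_layout(code))
```

Of the five examples, only OX19 6RY passes the single-layout check, while all five pass the seven-layout check.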

Microsoft may feel that they are being helpful by adding this sample mask, but we all know that programmers, like most of us, will take any route that makes their lives easy, and are unlikely to make any attempt to alter this input mask to make it valid for the UK, let alone valid for postal codes in every other country. And we know that this happens - my God, don't we all suffer regularly at the hands of forms designed like this? Many programmers would not even be aware that the mask is not valid for most UK postal codes - they trust that the software provider has done their homework.

Back to the drawing board, please, Microsoft. This is not helping in the fight for better data quality.

Friday, May 15, 2009

Eurotunnelvision

A slight move off topic, if I may. It's that time of the year again when the Eurovision Song Contest takes place, accompanied, as always, by the sounds of weeping, wailing and gnashing of teeth from the losers, who always try to find anything to blame apart from their own performances or the quality of their songs.

One of the most common complaints is the idea of block-voting, accompanied by a rather embarrassing lack of geographical knowledge. This year's Dutch song is doubtless very suitable for, and popular in, the local pubs of The Jordaan in Amsterdam, but is hardly material that will be popular elsewhere in Europe. Inevitably it failed to get through the semi-final. René Froger, one of the Dutch singers, immediately complained that this was due to block voting: "it's so strange that all the Balkan countries are through to the final", he wailed. Worse still, no report I've seen pulled him up on it, seeding and strengthening the myth that everybody hates us/loves them.

Grabbing a pencil and a scrap of paper, it took me no time at all to work out that six Balkan countries indeed made it to the final. And five did not. Not quite "all", and certainly no evidence of the great conspiracy so many profess to see. He'd have done better to point to the Nordic block, or the Caucasus block, or the Mediterranean block. If anybody should complain, it's the Central Europeans, none of whom got through.

Back to school, René!

For a map of those through, check Wikipedia here

Thursday, May 14, 2009

Flying share


Another great example of how ignorance affects data quality from web forms (and can lose customers and money!) has come my way, this time courtesy of the site Flying Share. Flying Share are offering their users in the US, UK, Canada and Australia, via this form, a free USB drive.

All well and good until you get to the "ZIP" field (ZIP? Not outside the USA, good people of Flying Share - you probably mean "Postal code"). And there you find that the field will accept five characters and no more. So anybody in the UK or Canada wanting a free drive must either give up at this point, or provide a truncated code.

As drives are being sent out postally, there will clearly be a huge problem of undeliverable and returned drives. I know that this error has been pointed out to the company concerned, but no action has yet been taken, though correcting it would take seconds.
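
As a rough idea of what a fix could look like, here is a minimal Python sketch of a postal code field that adapts to the four countries the offer covers, rather than cutting everything off at five characters. The names and the patterns are illustrative assumptions, not Flying Share's code; the patterns check layout only and do not confirm that a code exists.

```python
import re

# Layout-only patterns for the four countries in the offer (illustrative assumptions).
POSTAL_PATTERNS = {
    "US": r"[0-9]{5}(?:-[0-9]{4})?",                 # ZIP or ZIP+4
    "CA": r"[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]",        # A9A 9A9
    "AU": r"[0-9]{4}",                                # 9999
    "GB": r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}"   # A9 9AA ... AA9A 9AA
          r"|[A-Z]{3} [0-9][A-Z]{2}",                 # AAA 9AA
}


def field_label(country):
    """'ZIP code' is a US term; everywhere else use 'Postal code'."""
    return "ZIP code" if country == "US" else "Postal code"


def is_plausible(country, code):
    """True if code has a plausible layout for the given country."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return True  # unknown country: accept rather than reject
    return re.fullmatch(pattern, code.strip().upper()) is not None


if __name__ == "__main__":
    for country, code in [("US", "90210"), ("CA", "K1A 0B1"),
                          ("AU", "2000"), ("GB", "OX9 6RY")]:
        truncated = code[:5]  # what a five-character field keeps
        print(field_label(country), code, is_plausible(country, code),
              truncated, is_plausible(country, truncated))
```

The truncated Canadian and UK codes fail the check, which is exactly the data a five-character field ends up storing.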

I do wonder at what point anybody at Flying Share will feel the need to act on this. Perhaps when the cost of returns reaches astronomical proportions? How many (potential) customers might they have lost before then?

Tuesday, May 5, 2009

I made a blog entry in March about Vietnam, where a postal code exists but without the populace knowing about it.

Two more examples caught my eye this week, in both cases where a country announced the introduction of a postal code system: Nigeria and Dominican Republic.

In both countries a postal code exists - in Nigeria's case since 2000 - and in both cases this information has been placed at one time or another on the national postal authority's website. It is very interesting that in all of the comments placed in reaction to these news items up to this point, only one person suggested that they thought that a postal code system was already in place.

Regardless of the reasons why the existing postal code systems in either country had not been fully publicised before now, this is another indicator of how careful form designers need to be in their use of required fields. Nigeria may have a postal code system, but if nobody knows about it or uses it, requiring that field on a form will only lead to customer loss and data quality reduction.
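
One way a form could reflect this is to make the postal code requirement depend on the country selected. The Python sketch below is only an illustration: the field names are invented, and the set of countries is limited to the three mentioned in these posts, where a code exists but is not widely known. A real list would need proper research.

```python
# ISO 3166 codes for Nigeria, the Dominican Republic and Vietnam.
# Assumption for illustration: a postal code is accepted but never required here.
POSTAL_CODE_OPTIONAL = {"NG", "DO", "VN"}


def missing_required_fields(form):
    """Return the names of required fields that are empty in the form dict."""
    required = ["name", "street", "city", "country"]
    country = form.get("country", "").strip().upper()
    # Only demand a postal code where people can be expected to know theirs.
    if country and country not in POSTAL_CODE_OPTIONAL:
        required.append("postal_code")
    return [field for field in required if not form.get(field, "").strip()]


if __name__ == "__main__":
    lagos = {"name": "A. Customer", "street": "1 Example Road",
             "city": "Lagos", "country": "NG"}
    oxford = {"name": "B. Customer", "street": "2 Example Street",
              "city": "Oxford", "country": "GB"}
    print(missing_required_fields(lagos))
    print(missing_required_fields(oxford))
```

The Nigerian address is accepted without a postal code, while the UK address is flagged as missing one.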

For more information about web forms for an international audience, download the free e-book here.