Tuesday, November 27, 2007

Fit for purpose

People rarely build a database system just for fun: they have a goal in mind, usually related to the data that the database will hold. They therefore need to keep their objective(s) constantly in view to ensure that the system they are building is fit for purpose.

It seems, however, that human nature often makes people concentrate on the process rather than the objective, and in the data world this directly affects data quality. I often come across companies and individuals who, when deciding between software A, which is slower but produces good quality data, and software B, which is faster but produces poor quality data, plump for software B because they have lost sight of the objective and are considering the process.

I often use the example of the Brussels underground system. The trains kept a strict timetable, which meant that each one would only stop at each station for a fixed number of seconds. The number of seconds did not appear to differ according to the volume of passengers at that station. As a result, the doors would close and trains would move off before all the passengers had had a chance to get on and off. Regular passengers knew about this, so when stepping onto the train would then simply stop where they were because they had to be in a position to get off again quickly, blocking the entrance and exit. At busy stations a scrum ensued as everybody tried to get on and off at the same time. Many a morning I had to walk to work from the stop after the one I had wanted because I hadn't managed to fight my way out in time.

Clearly, the transport company was concentrating on its process - sticking to its timetable - rather than its objective of moving passengers to where they needed to be.

I was reminded of this when I recently read of Amsterdam trams playing the same tricks. Some were whizzing past stops because they needed to stick to a timetable. Compounding this was the fact that the city of Amsterdam fines the tram company for each late-running tram. Clearly nobody had thought this through properly - the city was basing its assessment of the tram company on the process rather than the objective. The tram company would be better off running all its trams empty, thereby avoiding fines, but that's hardly fit for purpose.


(c) 2007 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.

Monday, November 12, 2007

The Black Swan and Data Quality

First published online 19th October 2007

Nassim Nicholas Taleb, in his book The Black Swan, makes an interesting point which helped me to clarify in my mind why companies are too often willing to spend millions attempting to cleanse and correct data after it has been collected, rather than spending a fraction of that amount on increasing data quality through prevention - for example, by validating data on input. Prevention will always result in better quality data than any cleansing process employed later in the data process can produce.

What it boils down to is that reaction to problems can be perceived, measured and rewarded. Prevention results only in a lack of problems, which cannot be measured or rewarded. Most employees and managers feel more comfortable with the former than the latter.

To paraphrase Taleb, let us imagine that a legislator, before 11th September 2001, had introduced a law obliging airlines to fit armoured cockpit doors to their planes, to be kept locked during flights. This would have been unpopular with the airlines because of the extra costs involved: they might have had to increase fares or reduce staffing levels to cover them, and the legislator would have been unpopular too. Yet this preventative measure would probably have ensured that the aircraft used to attack New York and Washington on 11th September 2001 could not have been hijacked, and the attacks would not have taken place. The general perception would have been that the measure was pointless, and it would have been impossible to measure its effect, because what it was preventing would never have happened.

My own favourite example of this is the millennium computer bug. Untold amounts of money were spent to prevent systems failing when the year changed to 2000, and doomsday scenarios were described if the changes weren't made. As the clocks struck midnight on 31st December 1999, no planes fell from the skies and no nuclear power plants went into meltdown. Was there cheering and jubilation, back-slapping and congratulation for the workers who had prevented this? On the contrary, they were labelled a bunch of panic merchants who had exaggerated the problem to make a quick buck.

And so it is with data quality in almost all companies. An employee who analyses badly collected data and can show his bosses that, for example, 50% of all records contain errors, and that after purchasing cleansing software this could be reduced to 15%, is rewarded for producing measurable and verifiable results.

The better employee, who persuades his bosses to install systems to prevent data pollution earlier in the process, can show errors in, for example, only 10% of records. But as there is no worse starting point against which to compare this figure, that employee will either be criticised because there are still so many errors despite the amount spent on prevention; or, more likely, be overlooked, because nothing is perceived to be happening in their part of the company to draw managers' attention.

I have seen this situation time and again in many different contexts. It is a rare manager indeed who can resist the pressure to justify their position with figures presented in this way, and data quality suffers as a result.


(c) 2007 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.

Road to nowhere

First published online 15th December 2006

I have been reading with interest the frequent reports of problems that motorists have been having with satellite navigation systems.

Having been able to inspect the data which these systems use, it comes as no surprise to me that there have been problems. Even had I not seen the data, we all know that even the smallest databases contain errors - databases of the size and complexity of those used in GIS systems, which attempt to represent an ever changing world, will inevitably contain many inaccuracies.

Many of the problems reported seem to revolve around motorists adopting a blind trust in their satellite navigation systems - something that almost everybody using data, including those who do so professionally, also tends to do. Show a person a piece of data, such as the string "London", and that person will make immediate assumptions about it. However, one of my rules for good data management is to take data in context - that string may be a city in one context, but may refer to a name in another, for example.

Furthermore, the databases used by these satellite navigation systems do not contain data that some drivers require when choosing routes - height and width restrictions on roads, for example.

If users of satellite navigation systems were to take the data they are given in context - i.e. use it alongside the information provided by road signs, maps and the evidence of their own eyes - we would find fewer of them wedged into bends, unable to get up steep hills, or having to cross fields to reach roads they thought they were on already.

Other data users and managers would be well advised to consider if this metaphor applies to them ...


(c) 2006 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.

The politics of personal names

First published online 19th May 2006

The case of Ayaan Hirsi Ali, or Ayaan Hirsi Magan, raises interesting points about ignorance about, and the politics of, personal names. Ayaan Hirsi Ali was an outspoken member of the Dutch parliament. Born in Somalia, she was later naturalised as a Dutch citizen. Though it had been known for a long while that she had been born Ayaan Hirsi Magan, a sudden move by one of her colleagues resulted in the removal of her Dutch passport because she had lied about certain things on her application, including her name.

Leaving aside the creeping structuralisation of racism within Dutch politics, this brings out some interesting points about personal names. Ayaan Hirsi Ali's name is Islamic. Muslim names have a greater flexibility than Western names, and they can change as a person's circumstances change - if they change jobs, for example, or have children. This is clearly not reflected in the decision taken against Ayaan Hirsi Ali.

Her case has also highlighted a number of others. In one case, for example, a Kurdish Iraqi by birth had his Dutch passport removed because he gave his Kurdish name, whereas the Kurdophobe regime of Saddam (or Saddam Hussein, or Saddam Hussein al Tikriti, or ..... - another flexible Muslim name) had registered the man under an Arabised first name.

Is this ignorance? It is time to understand how personal names are created differently in different cultures.


(c) 2006 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.

Reducing the need for scrap and rework with web data collection

First published online 8th February 2006

When collecting data on the web, companies must allow diverse visitors to record their information in a way that is familiar and comfortable to them.

Though the Internet was once heralded as a solution for enabling cheap and effective data collection, experience has shown that this data is often too polluted to be useful in any business intelligence sense. This is not due to the medium, but to the poor understanding that most companies have of how to achieve quality data collection on the web without expensive scrap and rework.

The path normally followed by companies when choosing to collect data on the web is defined by the company decision-making structure and general ignorance of global diversity; and it dictates that scrap and rework will be a necessity. The path normally looks like this:

When a decision is made to collect customer information on the web, the short-term view of how to achieve this is usually chosen. A web data collection form can be up and running within a few hours. It does not usually require any special budgetary measures and answers the pressures from other company departments to get the data as quickly and cheaply as possible. Normal company structures militate against budget being made available to research and implement good data collection practices at the start of the process.

Little or no thought is given to this data collection page. The employees concerned stick with what is familiar to them. They use the same fields, the same field labels and the same screen layout that they know from their own country. It is overlooked that a web page can be viewed from anywhere in the world, and that people from outside the company's home country are likely to want to enter their details too. Similarly, it is overlooked that these visitors have personal details that do not coincide with local norms. Ambiguous or country- or language-biased field labels will mean different things to different site visitors, causing them to provide different information based upon their interpretation. For example, a field entitled "Title" may be filled in by one visitor with a form of address, by another with a job title and by yet a third with an academic title. In other countries, not only do people's name and address details consist of different components, they are also written in different ways. Their information may be too long to fit into the given fields; and required fields - for state or postal code, for example - will force them to enter nonsense if their addresses do not contain such details. They may have more information than they can fit comfortably into the company's web form. Because of this, they are required to shoehorn their data into the available space.

Data is collected, but it arrives in the database confused, concatenated, abbreviated, mis-fielded and largely useless. As the quantity of data increases, it becomes clear that a major cleansing programme is necessary before the data can be used effectively. Expensive software is acquired, costing much more than it would have cost to create a good data entry system at the beginning. A better data structure must be identified, though this too would have been better tackled before any data gathering began. In fact, the best way of getting top quality information from a customer is to interact with him or her at the time of data collection - no amount of expensive software, hours of labour or downstream processing can match it.

The data is processed and a certain percentage is improved, but the stream of poor data from the data collection point continues. Thus, the data is assessed, scrapped and reworked as a continual process.

Companies do not often reach the end point of this path. Data remains of poor quality, with the resultant business process failures when the data, or information derived from it, is used. Although these results are clearly not effective, this cycle has been followed by almost all companies. Not only does this result in bad data and its consequences (such as a poor customer image), but also in an image and morale problem within the company. The data is not regarded as accurate, and is therefore under-utilised. Budget is difficult to pin down to correct the problem because people are not confident about the outcome, and expensive processes do not show enough improvement to increase confidence. As people consider what has been spent already, they are reluctant to spend more.

The budgeting for web data collection should be moved to the beginning of the process: the design and execution of the data collection applications themselves. With a small amount of investment and research, higher quality data can be collected from website visitors. Data collection pages which dynamically adapt their form structure, field order and language to the country and language of the visitor allow visitors to record their information in a way that is familiar and comfortable to them. Field labels and lengths can be adjusted; and validation - both full postal validation and individual component validation, such as postal code length - can be implemented to reduce data pollution as much as possible. This is the only way that data can be collected on the web accurately enough to be fully used for business intelligence, without expensive scrap and rework.
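The per-country adaptation described above can be sketched in code. This is a minimal illustration, not a complete reference: the `COUNTRY_RULES` table below contains only three example entries, and a real application would need a researched, maintained rule set covering every country served.

```python
import re

# Illustrative per-country form rules: which labels to show, and what a
# valid postal code looks like. These three entries sketch the approach;
# they are not an exhaustive or authoritative rule set.
COUNTRY_RULES = {
    "US": {"region_label": "State", "postcode_label": "ZIP code",
           "postcode": re.compile(r"^\d{5}(-\d{4})?$"), "postcode_required": True},
    "NL": {"region_label": None, "postcode_label": "Postcode",
           "postcode": re.compile(r"^\d{4} ?[A-Z]{2}$"), "postcode_required": True},
    "IE": {"region_label": "County", "postcode_label": None,
           "postcode": None, "postcode_required": False},  # no postcodes (as of 2006)
}

def validate_postcode(country: str, value: str) -> bool:
    """Return True if the value is acceptable for the visitor's country."""
    rules = COUNTRY_RULES.get(country)
    if rules is None:
        return True          # unknown country: accept rather than force nonsense
    if rules["postcode"] is None:
        return value == ""   # the field should not even be offered
    if value == "":
        return not rules["postcode_required"]
    return bool(rules["postcode"].match(value.strip()))
```

Note the design choice in the fallback cases: where no rule exists, or where the country has no postal codes, the form accepts the visitor's reality instead of demanding nonsense data - exactly the failure mode described above.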


(c) 2006 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.

Integrated systems hide data

First published online 29th March 2005

As data sets get larger and software systems more complex and powerful, data owners are being forced further and further away from their data.

Whether you hold data as an information resource, for marketing, CRM (Customer Relationship Management), database marketing, direct marketing or any other purpose, the success of your projects - and your ability to comply with the Sarbanes-Oxley corporate governance act - depends on the quality of your data.

Integrated data systems and software layers on top of your data often hide both the data and its quality problems, without resolving those problems. Technical aspects of large data projects, such as the use of SQL, conspire to isolate data from the user. Very few people now have the luxury of being able to browse through their data to allow their brains, the best data quality tool that exists, to work on it.

This is bad news for data quality. My own analyses of international address data, even after cleansing, show that sometimes more than 50% of the data can be incorrect. If a company is analysing its sales on that basis, for example, it could be making disastrous business decisions.

Data and information quality cannot be assured simply by adding new systems or software layers. The distance between data owners and their data needs to be reduced again. A few moments spent browsing their data can save enormous amounts of time and money, and immediately reveal the data quality issues extant within it.

(c) 2005 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.


No shoehorns - asking the right questions

First published online 12th November 2004

I recently gave a short seminar in London about global name and address management. Much of the time was taken up in emphasising how culturally different names and addresses are around the world, and how systems must be adapted to match these differences, rather than trying to force real-world information into an inadequate system.

At the end, one of the questions was: "OK, I accept everything you say, but, if I'm stuck with a system which has been designed to take only American addresses, how can I shoehorn international addresses into that format?". The questioner was using Siebel. SAP users at the seminar all nodded in understanding. They all felt that they had to try to change the world because their systems demanded it.

The answer to the question, by the way, is that you can't. You can try, but you will damage your data, pollute your database and face huge bills to clean data up after the event. What is mysterious to me is why the users, large corporations who can vote with their wallets, don't turn this problem around and confront Siebel, SAP and cohorts, and demand that, given what they are charging for their systems, they make them more valid for storing global data. I think a few choice words and hinted threats to move to other systems from a small number of large users would make the suppliers move fast to do this. I haven't heard this being done, and I think it's time somebody started the ball rolling.

The secret is to ask the right question.

I recently read an article about parsing international personal names. The question asked by the author was "How would you split these names in order to put the proper elements in the surname field and given name field?". He then listed six personal names of different cultural origins. What I noted immediately was that one name was Islamic, where family names are not used; one contained a generational name; one a patronym; one a preposition; and so on. The idea that these names could be shoehorned into an Anglo-Saxon given name/family name structure is not acceptable. In my eyes, the question being asked was wrong. Why try to force information into a system which is blatantly not suited to it? My question would have been: "How can I adjust my database structure to hold these personal name forms?"
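One possible answer to that better question is a structure that stores typed name components rather than two fixed fields. The sketch below is illustrative only: the component kinds listed are examples, not an exhaustive taxonomy, and a production design would also need to record script, sort order and forms of address.

```python
from dataclasses import dataclass, field

@dataclass
class NameComponent:
    value: str
    kind: str  # e.g. "given", "family", "patronym", "generational", "preposition"

@dataclass
class PersonalName:
    # Typed components in written order, plus the name exactly as the
    # person wrote it - never discard the original full form.
    components: list = field(default_factory=list)
    full_form: str = ""

    def parts_of_kind(self, kind):
        """All component values of a given kind (may be empty or plural)."""
        return [c.value for c in self.components if c.kind == kind]

# A Dutch name with a preposition ("tussenvoegsel"), kept intact rather
# than wedged into a surname field:
name = PersonalName(
    components=[
        NameComponent("Vincent", "given"),
        NameComponent("van", "preposition"),
        NameComponent("Gogh", "family"),
    ],
    full_form="Vincent van Gogh",
)
```

Because a name with no family name, or with a patronym, is simply a different list of components, nothing has to be shoehorned - the structure bends to the data rather than the other way round.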

Let's call time on shoehorning and start asking the right questions!

(c) 2004 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.


Postal authorities - who are their customers?

First posted online 31st May 2004

The days of thinking of only the person who pays you for your service as your customer should be over. On that theory, the supermarket's customer is the consumer, because he/she buys a product; and the producer's customer is the supermarket, which buys the product from them. It is clear, however, that producers regard consumers as much their customers as the supermarket. Whilst my local supermarket treats consumers like irritations (it belongs to a chain heading for its biggest loss ever - maybe somebody should point out a potential link), the producers do their utmost to attract the consumer through advertising, packaging, shelf placement and so on. There are no chains of customers; there are networks of them.

In October 2003 I attended the European Mail Users' Forum in Brussels. Speakers came mainly either from large mailers or postal authorities. Of great interest to me was the way that both parties viewed their industry. There were some interesting examples of tunnel vision concerning particularly their customer chain. With a single honourable exception, postal authority speakers, for example, when speaking of their customers, were always referring to the mailers. In one way this is understandable - the mailers, after all, are the ones who pay for the delivery. This view, however, has no place in an industry where mail volumes are declining and the need for creative thinking is paramount. Only one speaker, that I heard, was forward thinking enough to consider that in postal matters there are two customers - for the postal authorities the customer is the one who sends the mail. For the mailer, the customer is the one who receives the mail. This being the case, the postal authorities need to take as much care in their attitude to the mailers' customer as to their own - without them, they have no business.

It is clear to me every day that the mail recipient is the poor man in the whole process. Though I was born and brought up in Britain, I live in Amsterdam. I still read a British newspaper, and follow the stories about the Royal Mail with interest. Though the British amongst you may not believe it, the service that the loss-making Royal Mail provides to the mail recipient is vastly superior to that provided by some postal companies which actually make a profit. When I lived in England I received mail before I left the house in the mornings. Here I cannot expect any mail before 11 am (2 pm on a Monday). If I want to get my mail before 9 am, I need to rent a postbox (kerching! $$$ Extra income for TPG). If collecting the mail from the postbox itself is a nuisance (and the number of post offices here is declining, just as it is in the UK), then I can pay TPG to deliver my mail from my postbox to my door. Kerching! $$$ TPG get money from the mailer and TWICE from the recipient to provide a service which those of us used to something better would regard as normal. Any wonder that they make a profit? The situation was not much different in Belgium when I lived there, except that you could not expect the mail before 2 pm any day.

As a mail recipient, I would much prefer the service offered by Royal Mail to the one I get from TPG.

When the post does arrive, it often arrives at the wrong place. In the Netherlands several different habitations often share the same house number. In the case of my house number it is five habitations - four flats and a houseboat. In each case, the delivery point is shown in the address by a suffix to the house number (or, in the case of the houseboat opposite my front door, a prefix). Postmen and women often have a cavalier attitude to this information and throw the whole bundle into the nearest letterbox. If you stand at the end of the street and watch the postman or woman make his/her rounds, you see a sort of Benny Hill sketch developing behind them - people coming out of their houses, milling around, delivering the mail to its correct locations. I may have to wait several days to get my mail if my neighbour is not at home.

Apart from adopting a far more relevant attitude to the mail recipients, mail authorities would benefit immensely from improving their communication skills. This is almost universally applicable to postal authorities. I have spent a great deal of the last 14 years collecting information about postal systems and how they work - address formats, postal codes systems and so on. Many of my customers are very surprised that only a tiny amount of this information actually originated from postal authorities. There is a good reason for this. Postal authorities have been universally appalling at providing even basic information for their customers. Of all the letters, faxes and e-mails that I have sent in those 14 years requesting some basic information, I have received a reply to a single one (thumbs up Hong Kong Post, though I bruised myself when I fell off my chair in surprise).

I know that some of my colleagues are better at getting replies than I am, but it would be a shame if users had to learn a special technique to get replies to their communications. Even the Internet sites of postal authorities show how much higher a priority communication needs to be given in their business plans. In order to use a postal system effectively, users need information: how to address a letter, whether a postal code is required, what the postal code is, and so on; and this information needs to be provided in a language they understand. If more users followed the addressing rules laid down by the postal authorities, it would logically cost the authorities less to process each mail piece, lead to better use of their automation investment, and mean greater profit (or less loss). However, dozens of postal authorities have no website. Many of those which do fail to provide essential information, such as postal code information. Still others fail to understand that their postal system will be used by cross-border mailers as well as domestic ones, and post information only in their local language. It is very interesting (and quite depressing) in many cases to compare the local language site of a postal authority with its English language version, and to see how much the postal authority doesn't think you need to know if you don't speak the local language. It's hardly surprising that companies are forced to come to me or one of my industry colleagues when they need this information.

Of course, there are exceptions, but any move towards a better understanding of the needs of the mail recipient and a greater communication towards all users, both mailers and recipients, would be both welcome and profitable.

(c) 2004 Graham Rhind. Reproduction only allowed with permission. Comment and dialogue welcome.