Wednesday, February 23, 2011

Data quality and information perception

One of my mantras for ensuring data quality is to look at the data. Not at profiles or analyses or graphics but at the raw data. Browse through it and the best data quality tool that there is - your brain - will highlight data quality issues in seconds.

Information is rooted in, and derived from, data. When information is based upon data which is poor in quality the information will mislead.

This was highlighted by an attempt to bring together opinions about the current tensions in North Africa and the Middle East from blogs, feeds, social media and so on in graphic form on The Guardian website.

We see Muammar Gaddafi there, nice and large, and ... but hang on a second. There's Moammar Gadhafi too. And Muammar Qaddafi. And then just plain old Gaddafi. The same person represented in four different places on the graphic because of transliteration issues. If these four entries had been brought together in a single place the graphic would look different.

As I mentioned, this issue is mainly caused by different transliterations of a person's name from Arabic, but this sort of variance within data is very common. Place names, for example, are often found written in many different ways within the same databases. Basing decisions on such variant data would be unwise. Yet decisions are based on data like this, from profiles and graphics like this, every minute of every day.
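A minimal sketch of what bringing those four entries together could look like, assuming a hand-maintained alias table. The names `ALIASES`, `canonical_name` and `count_mentions` are hypothetical, the choice of "Muammar Gaddafi" as the canonical spelling is arbitrary, and the sample mentions are made up for illustration; this is not how the graphic in question was built.

```python
from collections import Counter

# Hypothetical alias table: each known variant maps to one canonical spelling.
# The four variants are those seen on the graphic; picking "Muammar Gaddafi"
# as the canonical form is an arbitrary assumption.
ALIASES = {
    "muammar gaddafi": "Muammar Gaddafi",
    "moammar gadhafi": "Muammar Gaddafi",
    "muammar qaddafi": "Muammar Gaddafi",
    "gaddafi": "Muammar Gaddafi",
}


def canonical_name(raw):
    """Return the canonical spelling for a known variant, else the cleaned input."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace
    return ALIASES.get(cleaned.lower(), cleaned)


def count_mentions(mentions):
    """Count mentions after mapping every variant to its canonical form."""
    return Counter(canonical_name(m) for m in mentions)


# Made-up sample mentions, for illustration only.
sample = [
    "Muammar Gaddafi",
    "Moammar Gadhafi",
    "Muammar Qaddafi",
    "Gaddafi",
    "Another Name",
]
counts = count_mentions(sample)
```

An alias table like this only covers variants somebody has already spotted, which is one more reason to browse the raw data first: the table is built from what you find there.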

Go on, grab a coffee and go and look at your data. You'll be amazed.

Thursday, February 10, 2011

I see bubbles

We're humans, by nature short-termist and usually not capable of learning from history.

There was a time when advertising was a blanket phenomenon - you broadcast your message on television, the radio or in print media and hoped that it was seen by as many people as possible. If they were people who had an interest in your message, so much the better.

Then along came direct marketing as a way of showing your message only to those who might be interested. Inevitably, through various shortcomings including poor data quality, it wasn't long before direct marketing got a bad reputation - "junk mail" and so on.

And now we have the Internet. Yet, despite its opportunities, we seem to have regressed back to the blanket coverage approach prominent before direct marketing.

When somebody hits your page they tell you a remarkable amount about themselves: where they are, the language of their browser and operating system, how they reached your page and so on. Unless their system is infected with a virus, you shouldn't know any more about them as individuals. But what is also known is the contents of the page being viewed and where it is hosted.

How is this information being used by the Internet advertising giants? In a non-scientific but pretty revealing study, I analysed over a period of some weeks those advertisements being shown on the various pages that I visited. What I found was:

- 35.6% of advertisements were in the language of the website and suitable for somebody living in the country where the site was being hosted, but not for me. Adverts for German school reunion sites or the American Girl Guides association are just not correctly targeted.

- 38.8% were in Dutch (the language of the country in which I reside, though not the language of my browser or operating system) and were aimed in the same scatter gun way as broadcast adverts are - adverts for holidays, electricity companies and so on.

- 19.4% were those deliberately misleading or downright criminal adverts asking you to count how many bouncing balls you see, or informing you that you are today's lucky 1 millionth visitor. If I won prizes on each of these clicks I'd have a GDP larger than China's.

- Shockingly, at least to me, only 4.4% were relevant to the contents of the website being viewed - baby clothing adverts on baby name sites, for example. Almost as many were the antithesis of what should have been shown - adverts for Thai brides by post sites on a site listing baby names is putting carts before horses!

- 0%: the percentage of advertisements I was tempted to click on.

And then the language of the adverts:

- 49.6% were in Dutch
- 44% were in the language of the website
- 4.7% were in English (on non-English language sites)
- 1.6% were in the language of the country where the site was hosted.

If nobody clicks on the adverts then no payment need be made, so companies and criminals don't feel bound to think of the consequences of the current scatter gun approach. Short-term thinking. It does, though, have negative consequences. In the hope of getting enough click-throughs to get a decent return, many webmasters are placing so much advertising on their sites that it has become difficult to locate the content. And we, the users, are becoming less and less inclined to look at, let alone click on, the adverts.


So much for the web. But how about social communities where people have provided information and where even the simplest of data mining techniques could improve the relevance of advertising? How about that doyenne of the Internet: Facebook?

I am very careful and specific in my use of Facebook, which makes my profile a good test of their targeted advertising. They know my age, gender, the university I went to, that I speak English, Dutch and French (in that order) and that I and 95% of my "friends" play squash, and that almost every communication I have on that site is about squash.

That being the case I would expect Facebook to show me adverts in English, possibly for sports equipment or similar. What I actually get:

- 71.4% of adverts in Dutch - they're looking again at my location, not my profile
- 26.8% in English
- 1.8% in German (eh? Did I say I speak German?? I don't.)
- 0% in French

- 0% relevant advertisements
- 100% irrelevant advertisements.

It is no surprise that Facebook ad performance is abysmal.

How many billions was Facebook deemed to be worth?

Are they mad?

I see bubbles.

Friday, February 4, 2011

Localization is evil!

OK, now I've got your attention let me modify that statement: over-localization is undesirable.

Localisation (with the spelling localised to my own culture) is the process of adapting or modifying a product, service or website for a given language, culture or region. However, in almost everybody's mind, localisation is synonymous with translation, and any other modifications, such as making web forms suitable for local address structures, are quickly overlooked. Localisation is intimately bound up with the concept of locales, which give a country/region and language combination which can be used in the localisation process.

The whole problem with this system is that PLACES DO NOT SPEAK LANGUAGES - PEOPLE DO.

There are many indigenous people who do not speak a nationally recognised language in their country, and in our mobile world many of us do not speak the language of the place where we live, or have a preference for another. Yet this fact is overlooked in almost every attempt at localisation.

A case in point. This blog site is owned by Google, and Google win award after award for their localisation, presumably based on a count of the number of languages their interfaces and programs are available in rather than any intelligent application of the idea. Yet Google assume, like most of the rest, that places speak languages rather than people. Though I state clearly in my preferences that I wish to use this site in English, Google sees where my IP address is situated and changes the interface into Dutch. When I go to Prague and attempt to log in I am expected to master Czech. When I reach Athens I will be asked for my details in Greek. is similar. Attempt to get into from outside the USA and you will be taken to a local page (or local to somewhere - I get to the UK page for some unknown reason). This might be regarded as a service by some, but it has consequences. A user in Bulgaria, for example, searching for information about an HP product may click on the link in the search engine referring to the site. The site then switches the user back to a local Bulgarian site, where that information page is not available, and the user is presented with a 404 page not found error.

Look, I am absolutely in favour of translating information and I regard myself as a reasonable polyglot. But a stop must be put to translating on the basis of place. Users must be given the ability to override locale settings so that they can use their own languages and not those attached to a place.
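One way to picture such an override is as an order of precedence: the language the user explicitly chose comes first, then the languages the browser announces in its `Accept-Language` header, and only then a guess based on where the IP address is located. The following is a rough sketch under those assumptions; `parse_accept_language`, `choose_language` and their parameters are hypothetical names, and no real site is claimed to work this way.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into lowercase language tags, best first."""
    weighted = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without a q-value counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not wanted"
        weighted.append((quality, tag.strip().lower()))
    # sorted() is stable, so equal weights keep the order the browser sent.
    ranked = sorted(weighted, key=lambda item: item[0], reverse=True)
    return [tag for quality, tag in ranked if quality > 0 and tag != "*"]


def choose_language(user_preference, accept_language, geo_language,
                    supported, default="en"):
    """Pick an interface language; the place-based guess is the last resort."""
    candidates = []
    if user_preference:
        candidates.append(user_preference.lower())  # explicit choice wins
    candidates.extend(parse_accept_language(accept_language or ""))
    if geo_language:
        candidates.append(geo_language.lower())  # guess from IP location
    for tag in candidates:
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # fall back from "en-gb" to "en"
        if primary in supported:
            return primary
    return default


supported_languages = {"en", "nl", "cs", "el"}
# A user who chose English, browsing from Prague with a Dutch browser.
language = choose_language("en", "nl-NL,nl;q=0.9", "cs", supported_languages)
```

With this ordering the location only matters when nothing is known about the person, which is the opposite of switching the interface to Czech the moment somebody logs in from Prague.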