
Friday, November 19, 2010

A Trail of Bread Crumbs...

The WSJ has an article today called "Insurers Test Data Profiles to Identify Risky Clients".

It turns out that "risky behavior" can be inferred from scanning the internet (including social web sites) for "bread crumbs" (or maybe cookie crumbs) you leave about yourself.  What sort of bread crumbs are these, you might ask?

For example, hunting permits, boat registrations and property transfers, purchase histories, credit data, forum discussions, blogs, and so forth - all public information or information you allowed someone to collect about you. Remarkably, according to the article, this data can be used to tell things like whether you like gourmet food, exercise regularly or sit on the couch, commute long distances, watch too much TV, and many other things.  There is a lengthy discussion of how this works in the article.

You'll probably also be surprised to know that this is not a new idea, either.

Twenty years ago companies like auto insurers, according to the link, began to use things like credit scores to decide how to price your policy (the higher the credit score, the less likely you are to file a claim).

Now the really interesting part of all this is that life insurers are thinking about how this can replace things like a "blood test" to determine what kind of "risk" you are in the case of insurance.  I think this replacement of an actual "blood test" with an internet "risk" test is itself risky business on the part of the insurance companies.

So let's talk about risk as it relates to this sort of data mining.  A few days ago I wrote this article about "Cholesterol, Heart Disease and Magical Thinking".  People do not understand how people like epidemiologists calculate risk - and that's a big problem.  You hear about risk all the time: on TV, in ads, from friends, from doctors.  Don't do this or that because it's "risky".

First of all, what is risk?  Well, for one thing it's not a predictor that something will happen.  A predictor is something that, when we observe it, tells us with a high degree of certainty that some other corresponding event will occur.  For example, a clap of thunder can be predicted from the observation of a lightning bolt.  Risk is also not a cause of something.  Causality is represented by a direct link between two events, e.g., lightning and thunder.  We can say the lightning bolt caused the clap of thunder to occur.

But risk is different. So how is this kind of risk defined?  What does it mean?  

Well, in epidemiology risk factors are calculated as follows:

We take a statistically significant group of people (you can use common sense here - for something like heart disease you wouldn't study just five people - you'd study a large number).  Just how large a number is not really important here; all we need to know is that the number is large enough for statistical purposes.

We'll pretend in this post that 100 people are subjects in the study because math with 100 is relatively easy.

So let's say (and we're making this up) that 20 people out of our 100 subjects have had heart attacks.  That's 20 / 100 = .20 = 20%.  So we say that in general you have a 20% risk of heart attack.

Let's also say that 25 people in our example buy pants with a waist size of 40 or above and we'll pretend that 15 people in this "large pant size group" also have had heart attacks.

So the number of people that have a "large pants size" and have had a heart attack is 15, or 15 / 100 or .15 or 15% of the population.

If we divide the 15% (people who purchase "large pants" and have had a heart attack) by the 20% who have had a heart attack, we get .75, or a 75% risk factor: of the people in our study who have had a heart attack, 75% also buy large pants.
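For anyone who wants to check the arithmetic, here is a minimal Python sketch using the made-up numbers from this post. The variable names are my own. The last three values (the heart attack rate among large pants buyers, the rate among everyone else, and the ratio of the two) are not part of the calculation above; they are added only as one common way to compare two groups, and they assume the same invented counts.

```python
# Made-up counts from the example in this post (not real data).
total_people = 100
heart_attacks = 20   # subjects who have had a heart attack
large_pants = 25     # subjects who buy waist size 40 or above
both = 15            # large pants buyers who have had a heart attack

overall_rate = heart_attacks / total_people   # 20 / 100
both_rate = both / total_people               # 15 / 100

# The 75% figure: 15% divided by 20%, which is the same as 15 / 20.
share_of_heart_attacks = both / heart_attacks

# Added for comparison (an assumption, not from the post):
# the heart attack rate inside each group, and the ratio between them.
rate_among_large_pants = both / large_pants
rate_among_others = (heart_attacks - both) / (total_people - large_pants)
relative_risk = rate_among_large_pants / rate_among_others

print(f"overall heart attack rate:        {overall_rate:.0%}")
print(f"large pants AND heart attack:     {both_rate:.0%}")
print(f"share of heart attacks (15%/20%): {share_of_heart_attacks:.0%}")
print(f"rate among large pants buyers:    {rate_among_large_pants:.0%}")
print(f"rate among everyone else:         {rate_among_others:.1%}")
print(f"ratio of the two rates:           {relative_risk:.1f}")
```

Note that the 75% and the rate among large pants buyers answer different questions: the first starts from people who had a heart attack, the second starts from people who buy large pants.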

But what does this risk factor really mean?

Nothing concrete.  It does not tell anyone what you will do - it just says that when a lot of people get together there is a chance that something will occur.  A risk factor represents this numerical chance that something might happen based on examination of a large group.  (Chance here is a number between zero and one, commonly shown as a percentage, i.e., .1 = 10%.)  Sort of like saying 10% of the people at a baseball game buy hot dogs.  We don't know which people will buy hot dogs but we can generally assume that for any given baseball game about 10% will buy hot dogs - everything else being equal (for example, there are no sales of hamburgers that day).  This is why stadium vendors can buy just about the right amount of food so none is wasted and they don't run out.
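The hot dog idea can be sketched with a small simulation. This is only an illustration under assumptions I made up: a crowd of 10,000, each person deciding independently with a 10% chance, and five games. The function name and the seed are hypothetical choices, not anything from the article.

```python
import random

def simulate_game(attendance, buy_rate, rng):
    """Count how many attendees buy a hot dog, each deciding independently."""
    return sum(1 for _ in range(attendance) if rng.random() < buy_rate)

rng = random.Random(2010)  # fixed seed so the run is repeatable
ATTENDANCE = 10_000        # hypothetical crowd size
BUY_RATE = 0.10            # the 10% figure from the text

# Simulate five games and report the total sold at each one.
totals = [simulate_game(ATTENDANCE, BUY_RATE, rng) for _ in range(5)]
for game, total in enumerate(totals, start=1):
    print(f"game {game}: {total} hot dogs ({total / ATTENDANCE:.1%} of the crowd)")
```

The totals land close to 1,000 each game, which is what lets the vendor plan, yet nothing in the model says which seat will buy a hot dog.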

So using purchase histories, information about permits, and so on, statisticians can develop an entire profile about you that tells them what sort of risk you are relative to whatever insurance policy you are applying for.  But, if you are clever, you will also realize something else.

Just because you purchase large pants doesn't mean you wear them. 

For example, you may have an elderly relative at home who you care for and you go online to purchase their clothing for them and not yourself.

There is a big difference here between a true epidemiological study and "skimming and mining" data from internet sites and data providers.  In a true epidemiological study we can establish definite links between buying and wearing the pants, i.e., we can design the study to collect definite data.  Here we don't know who is using the products we are buying.  We're assuming the purchaser is the user - and we all know what assuming does.

So to some degree I see this entire process as "magical thinking" on the part of the insurance companies correlating internet data with personal risk.  Correlation means, in this case, that when one thing happens there is an observed relationship with some other thing happening.  A correlation is an observation.

Dogs make correlations: If I walk to the container holding the dog food they think I am going to feed them - so they stick close by me.  The dog mind predicts that I will feed them when I do this.  But walking to the dog food container does not cause me to feed them.  Similarly if I walk by the dog food container all the time and don't feed them the dogs will soon realize that their correlation is not useful and abandon it.

So one problem here is that, using this sort of system, some behavior you have, for example purchasing pants for an elderly person you take care of, may be correlated with you instead of the actual user of the purchase, i.e., the elderly relative.

Another potential problem here is that even though a system like this may be making invalid individual correlations about you, the overall predictive ability of the model may still work.  For example, caring for an elderly relative might cause you a lot of stress and you have a heart attack because of it.  Effectively this becomes some form of discrimination.

Hence the bread crumbs you leave behind may be leading others on a false trail.
