The plot at right shows a line that roughly matches the points which surround it. The outliers (in this case a radio comedy group) fall off to the right.
So in a mathematical sense we say that, given the data (the points), there is some function f(x) that draws the line we see, where x is a set of points that may or may not include the outliers.
Now typically the line follows the "core" data and the outliers are off to one side or the other.
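To make the "core" versus outlier split concrete, here is a minimal Python sketch. It assumes f(x) is a straight line fitted by ordinary least squares, and it treats as an outlier any point whose residual is more than three times the median absolute residual. The data, the threshold, and the function names are all invented for illustration; they are not taken from the plot.

```python
from statistics import mean, median


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def split_outliers(points, threshold=3.0):
    """Split points into (core, outliers) by distance from the fitted line.

    Assumption: a point is an outlier when its absolute residual exceeds
    `threshold` times the median absolute residual of all points.
    """
    a, b = fit_line(points)
    residuals = [abs(y - (a * x + b)) for x, y in points]
    scale = median(residuals)
    core = [p for p, r in zip(points, residuals) if r <= threshold * scale]
    outliers = [p for p, r in zip(points, residuals) if r > threshold * scale]
    return core, outliers


# Made-up data: ten points on the diagonal plus one point far off it.
points = [(x, x) for x in range(10)] + [(5, 30)]
core, outliers = split_outliers(points)

# Refit using only the core data to see how much the outlier pulled the line.
a_all, b_all = fit_line(points)
a_core, b_core = fit_line(core)
print("outliers:", outliers)
print("fit with outlier:", a_all, b_all)
print("fit on core only:", a_core, b_core)
```

Note what the sketch does and does not tell us: it says which point is far from f(x), but nothing about why that point is there.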
The question, I think, is this - which is more important? The "core" data which is described by f(x) or the outliers?
Now of course to some degree it depends on your interest - if I care about the quality of f(x) then I may only be concerned that it matches the "core" data (in this case the points running diagonally).
But what about the outliers? Maybe I don't care or maybe they are critical.
Now statistics as a field is concerned with the significance of the data, e.g., are the outliers significant mathematically. This is fine so long as you understand that significant outliers may mean that your theory or hypothesis or your mathematics are wrong or have significant errors.
But what I see today, particularly in many areas of science, is that rather than thinking about the reason behind the outliers, people follow along with the "core data." As an example I wrote about Daniel Everett and the Piraha. Chomsky's theories are, if you will, the main "points" of data. Everett has perhaps found an outlier that is significant.
The question is why does science seem to always rally around the core data?
(A good example, besides other things I write about here, is the Nobel prize winners Dr. Marshall and Dr. Warren, who discovered that stomach ulcers were caused by H. pylori. Dr. Marshall had to go to the extreme of infecting himself and then curing himself before other doctors would believe them.)
So I have spent a number of years thinking about outliers and why, when you study something, they are perhaps the most important thing.
The reason is that they represent, essentially, what your f(x) does not represent, i.e., your missing knowledge.
As far as I can tell there isn't a science of "missing knowledge" - yet there should be. Statistics tells us about the significance of the outlying data in a mathematical sense. But what about the meaning of the outliers? (This is the "meaningfulness" I wrote about in part I.)
So let's imagine we have a game of checkers that occurs on a table.
Above the game we have a very large piece of cardboard through which a square the size of one square on the underlying game board is visible. So all we can see is what is in that one square at any point in time.
Now the world champion at checkers is a computer because the game is fully vetted mathematically, i.e., the best you can do with the computer is fixed by mathematical rules, e.g., who starts first, you must never make a mistake, etc.
So we watch a game of checkers through the single hole.
What do we see? Either an empty square, a red or black piece, or a red or black king.
As time (t) advances we see changes: t = 1 - empty, t = 2 - red, t = 3 - empty, ...
Now I ask is this enough information to deduce the game of checkers?
I think the answer here is no. A single square's visibility is not enough.
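One way to see why a single square is not enough is the following rough Python sketch. It assumes a board state can be written as a dictionary from (row, column) to the piece on that square, and that a "history" is just a list of such states. The three histories are hypothetical piece movements, not complete legal games, and the names are my own.

```python
EMPTY = "empty"


def observe(history, hole):
    """Return what is visible through a one-square cutout at each time step."""
    return [board.get(hole, EMPTY) for board in history]


# Hypothetical history A: a red piece moves diagonally down and to the right.
history_a = [{(2, 1): "red"}, {(3, 2): "red"}, {(4, 3): "red"}]

# Hypothetical history B: a red piece moves diagonally down and to the left.
history_b = [{(2, 3): "red"}, {(3, 2): "red"}, {(4, 1): "red"}]

# Hypothetical history C: a red piece moves straight down a column, which
# the real rules of checkers do not allow.
history_c = [{(2, 2): "red"}, {(3, 2): "red"}, {(4, 2): "red"}]

hole = (3, 2)  # the one square we are allowed to see
seen_a = observe(history_a, hole)
seen_b = observe(history_b, hole)
seen_c = observe(history_c, hole)
print(seen_a)
print(seen_a == seen_b == seen_c)
```

All three histories produce the same sequence through the hole: empty, red, empty. From that one square we cannot tell a diagonal move from a straight one, so we cannot recover the rule that pieces move diagonally.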
And obviously a cardboard with a cutout that shows the whole board provides 100% understanding of the moves and games and would allow us to deduce the rules of the game from the observations.
But what about cutouts that show less than 100% - is there a point at which we can deduce the entire game?
I think, for example, that a centered cutout that's smaller than the board by exactly one square all the way around is sufficient - provided we know the underlying board is bounded.
But what if we don't know the board is bounded or if the board is a closed loop in that it wraps around on top and bottom (so you could move off the right and appear on the left, or off the top and appear on the bottom)?
In that case I don't think we could differentiate.
And what about the other cases?
Say four squares worth of holes? Or four randomly placed holes?
Since checkers is a closed, bounded mathematical system what can we tell by such observations?
What this says is that we can take all the information that we know about the system and we can construct a decision tree representing the observations. At each node in the tree we can say what are the inputs, what is the state, and what are the possible new states, e.g., a Turing machine.
If all the leaves of the tree are deterministic, i.e., they can only take on a value from a certain set even if we do not know which value from the set that is, then we know any outliers must work toward filling in the gaps, e.g., eliminating elements from the set of things we are not sure about. In the sense of programming we have the logic but are missing the data.
On the other hand, if the tree is not deterministic, i.e., we get to a point where we don't have a set of possible values known or unknown, then our knowledge about the system can never be complete with the data we have. Again, in the sense of programming we are missing source files containing the instructions for the computer (and perhaps data as well).
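The two cases can be sketched in a few lines of Python. This is only one reading of the distinction, with made-up names and a toy system: assume the unknown part of the system is a single rule that turns an input into an output, and that we hold a finite set of candidate rules. Each observation eliminates the candidates it contradicts. If a candidate survives, we were only missing data; if none survive, the logic itself is missing from our set.

```python
# Hypothetical candidate rules for a hidden one-input, one-output system.
CANDIDATES = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "add_two": lambda x: x + 2,
}


def consistent(candidates, observations):
    """Return the names of rules that reproduce every (input, output) pair."""
    return {
        name
        for name, rule in candidates.items()
        if all(rule(x) == y for x, y in observations)
    }


# One observation is not enough: all three rules map 2 to 4.
ambiguous = consistent(CANDIDATES, [(2, 4)])

# Deterministic case: a further observation eliminates elements of the set
# until only one rule is left. We had the logic and were missing data.
closed_case = consistent(CANDIDATES, [(2, 4), (3, 6)])

# Non-deterministic case: an observation that no candidate explains leaves
# an empty set. No amount of elimination helps; the rule is not in our set.
open_case = consistent(CANDIDATES, [(2, 4), (3, 10)])

print(sorted(ambiguous))
print(sorted(closed_case))
print(sorted(open_case))
```

In the last case the observation (3, 10) is the outlier, and it is the only thing telling us that our set of candidate rules is incomplete.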
Now in this latter case what can we really say about our system?
My concern about modern science is that, even in the latter case, there is a tendency to treat the non-deterministic tree as deterministic for the sake of having something to publish.
Now why do I think this is so important?
For one thing imagine all the software and computers in the world - billions and billions of computers (cell phones, servers, iClouds, S3, and so on).
Yet there is no "science of debugging" (which is what I am describing above).
There is no theory of debugging, and, as far as I know, not even a hypothesis about what one might be.
This is like much of medical science or climate science.
So if we have no theory of debugging software (which I argue is equivalent to making observations about the operation of Turing machines with the intent of filling in missing information about their data or programming), how can we do it for open systems?
Now humans are pretty good at debugging and fixing things - even complex software or hardware (mechanical or computer) systems. I am good at this and I am able to use heuristics to do the job. But for closed systems like a computer program with known input and output it should be possible, at least in some cases, to compute the fix.
What I am also arguing is that for non-closed systems, e.g., a human body or the climate of a planet, I think it's possible to prove that it's impossible to compute a fix. The reason is that even for small, relatively simple, closed, bounded systems (the checkers through the single square) it cannot be done.
So why isn't there a science for this?
And without a science for this how can we be so certain when we know we are missing key components of the determinism tree?
To me this is the "meaningfulness" measurement.