When is an Outlier an Outlier? The O3 plot

with Antony Unwin

useR!2017

Whether a case might be identified as an outlier depends on the other cases in the dataset and on the variables available. A case can stand out as unusual on one or two variables, while appearing middling on the others. If a case is identified as an outlier, it is useful to find out why. This paper introduces a new display, the O3 plot (Overview Of Outliers), for supporting outlier analyses, and describes its implementation in R.

Figure 1 shows an O3 plot for four German demographic variables recorded for the 299 Bundestag constituencies. The plot has one row for each variable combination for which outliers were found, and two blocks of columns. The left block shows which variable combination defines each row: with 4 variables there are 4 columns, one per variable, and a cell is coloured grey if that variable is part of the combination. The rows are grouped by the number of variables in the combination, with blue dotted lines separating the groups, and within each group they are sorted by the number of outliers found. The columns in the left block are sorted by how often the variables occur. A boundary column separates this block from the right block, which records the outliers found by whichever outlier identification algorithm was used (in this case Wilkinson's HDoutliers with alpha=0.05). The right block has one column for each case identified as an outlier at least once, and these columns are sorted by the number of times the cases are identified as outliers.
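The layout described above can be sketched in a few lines of Python. This is only an illustration, not the R implementation: the detector is a simple stand-in (root mean square of robust z-scores with a hypothetical cut-off of 3.5) rather than HDoutliers, the names `robust_z`, `o3_table`, and `render` are invented for the example, the toy data are made up, and the sort directions (fewest outliers first, most frequent variables and cases first) are assumptions.

```python
from itertools import combinations
from statistics import median


def robust_z(values):
    """Robust z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad or 1.0  # fall back to 1.0 if the MAD is zero
    return [(v - med) / scale for v in values]


def o3_table(data, threshold=3.5):
    """Build the rows and column orders of an O3-style table.

    data maps each variable name to a list of values (equal lengths).
    The detector is a stand-in, NOT HDoutliers: a case is flagged for a
    combination if the root mean square of its robust z-scores over the
    variables in that combination exceeds the (assumed) threshold.
    """
    names = list(data)
    z = {v: robust_z(data[v]) for v in names}
    n = len(z[names[0]])

    rows = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            flagged = {
                i for i in range(n)
                if (sum(z[v][i] ** 2 for v in combo) / k) ** 0.5 > threshold
            }
            if flagged:  # only combinations with outliers get a row
                rows.append((combo, flagged))

    # Rows: grouped by combination size, then by number of outliers.
    rows.sort(key=lambda r: (len(r[0]), len(r[1])))

    # Left block: variables ordered by how often they occur in the rows.
    var_freq = {v: sum(v in combo for combo, _ in rows) for v in names}
    var_cols = sorted(names, key=lambda v: -var_freq[v])

    # Right block: only cases flagged at least once, most frequent first.
    case_freq = {}
    for _, flagged in rows:
        for i in flagged:
            case_freq[i] = case_freq.get(i, 0) + 1
    case_cols = sorted(case_freq, key=lambda i: (-case_freq[i], i))
    return rows, var_cols, case_cols


def render(rows, var_cols, case_cols):
    """Text version: '#' marks a variable in the combination, 'X' an outlier."""
    lines = []
    for combo, flagged in rows:
        left = "".join("#" if v in combo else "." for v in var_cols)
        right = "".join("X" if i in flagged else "." for i in case_cols)
        lines.append(left + " | " + right)
    return "\n".join(lines)


# Made-up toy data: case 7 is extreme on x, case 4 is extreme on y.
toy = {
    "x": [1, 2, 3, 4, 5, 6, 7, 50],
    "y": [10, 11, 9, 10, 30, 10, 11, 9],
}
rows, var_cols, case_cols = o3_table(toy)
print(render(rows, var_cols, case_cols))
```

Each printed line corresponds to one row of the plot, with the `|` playing the role of the boundary column. Swapping in a different detector only requires changing how `flagged` is computed.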

Given \(n\) cases and \(p\) variables, there would be \(p+1+n\) columns (\(p\) variable columns, one boundary column, and \(n\) case columns) if every case were an outlier on some combination of variables. If outliers were identified for every possible combination, there would be \(2^p-1\) rows, one for each non-empty subset of the variables. In practice, combinations are only reported if outliers are found for them, and cases are only reported if they occur at least once as an outlier. Even so, an O3 plot has too many rows if there are many variables and many of their combinations have outliers, and it has too many columns if many cases are identified as outliers on at least one variable combination.
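To make these bounds concrete, the following short Python sketch computes them for the dimensions of the Figure 1 data (299 cases, 4 variables). The function names are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def o3_max_dimensions(n_cases, n_vars):
    """Upper bounds on the size of an O3 plot, as (rows, columns)."""
    # One row per non-empty subset of the variables.
    max_rows = 2 ** n_vars - 1
    # p variable columns + 1 boundary column + n case columns.
    max_cols = n_vars + 1 + n_cases
    return max_rows, max_cols


def all_combinations(variables):
    """Every non-empty variable combination, smallest combinations first."""
    return [
        combo
        for k in range(1, len(variables) + 1)
        for combo in combinations(variables, k)
    ]


max_rows, max_cols = o3_max_dimensions(n_cases=299, n_vars=4)
print(max_rows, max_cols)  # 15 304
```

Because the row bound doubles with each added variable (10 variables already allow 1023 combinations), the number of rows is usually the first limit to be hit.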

O3 plots show which cases are identified often as outliers, which are identified in single dimensions, and which are only identified in higher dimensions. They highlight which variables and combinations of variables may be affected by possible outliers.