Data Preparation - Outliers
In financial markets, the term “black swan” refers to a rare, extreme event that defies expectations: an observation that sits far outside anything the rest of the data would suggest. Sample data can harbor similar surprises.
In this issue, the fifth tutorial in our data preparation series, we unwrap another troubling phenomenon commonly encountered in sample data: what happens when a few observations look off and don’t quite fit in with the rest of the sample?
In this tutorial, we will discuss the problem of outliers, how to detect them, and what we can do with them.
In statistics, an outlier is an observation that is numerically distant from the rest of the data. In other words, an outlier is one that appears to deviate markedly from other members of the sample in which it occurs.
Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error or simply through natural deviations in populations (occurring by chance in any distribution, or when the population has a heavy-tailed distribution).
In time series analysis, we should examine for the presence of outliers only in a stationary process.
Please note that outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations.
Why should I care?
The observed series may be contaminated by so-called outliers. These outliers may change the mean level of the uncontaminated series. Furthermore, the presence of outliers in the sample suggests that the underlying distribution has fat tails or excess kurtosis.
In statistics, estimators that are capable of coping with outliers are said to be robust. For instance, the median is a robust statistic, while the mean is not. Naive interpretations of statistics derived from data sets that include outliers may be misleading.
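As a quick illustration of robustness (using made-up numbers), a single contaminating value drags the mean far from the bulk of the data, while the median barely moves:

```python
import statistics

# A small sample with one contaminating outlier (hypothetical values).
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [100.0]  # one aberrant observation

# The mean shifts dramatically; the median barely moves.
print(statistics.mean(clean), statistics.mean(contaminated))    # ~10.0 vs ~25.0
print(statistics.median(clean), statistics.median(contaminated))  # 10.0 vs ~10.05
```

A single value moved the mean from about 10 to about 25, while the median changed by only 0.05; this is what it means for the median to be a robust statistic.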
On the other hand, we should always search for the causes of the identified outliers, paying special attention to level changes, variance changes, and those outliers that can’t be explained.
Are all outliers the same?
Not all outliers are created equal; more importantly, outliers don’t always exhibit the same degree of influence on the parameter values of the proposed model.
In a time series analysis, we need to ask the following question: what is the impact on the model parameters if we leave an outlier in the sample data vs. dropping it?
In regression, we’d use Cook’s distance to measure this influence, excluding only those values that exhibit a large influence.
In sum, we need to evaluate outliers not by the magnitude of their values, but by the influence they exhibit on the model’s parameter values.
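The leave-in vs. drop-out question above can be answered directly with a leave-one-out check. The sketch below (hypothetical data, simple linear trend rather than a full time series model) refits the model with each observation removed in turn and measures how much the slope moves:

```python
import numpy as np

# Hypothetical trending series with one aberrant final observation.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
x = np.arange(len(y), dtype=float)

# Slope of the full-sample fit (polyfit returns [slope, intercept]).
slope_full = np.polyfit(x, y, 1)[0]

# Leave-one-out influence: refit without each point, record the slope shift.
influence = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    slope_i = np.polyfit(x[mask], y[mask], 1)[0]
    influence.append(abs(slope_i - slope_full))

# The point whose removal moves the slope the most is the most influential.
print(int(np.argmax(influence)))  # -> 6, the aberrant last observation
```

This is the same idea Cook’s distance formalizes for regression: we rank observations by how much the fitted parameters change when each is deleted, not by how extreme their raw values look.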
How do I detect those outliers?
In general, we wish for an outlier detection method that can answer the following questions:
- Are there outliers?
- What are their locations?
- What are their types and magnitudes?
The outlier detection methods
- Model-based methods, which are commonly used for identification when we assume the data come from a normal distribution: Grubbs’ test, Peirce’s criterion, Chauvenet’s criterion, etc.
- Distance-based methods – Cook’s distance
- Other measure-based – Inter-quartile range, etc.
- Adaptive filtering.
You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short term model of the signal source, thereby reducing mean square error output. This then gives you a low-level output signal (the residual error) except for when you get an outlier, which will result in a spike, which will be easy to detect (threshold).
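The adaptive-filtering idea can be sketched with a one-tap LMS (least-mean-squares) predictor on a synthetic, hypothetical signal. The filter weight adapts so that each value predicts the next; the prediction residual stays small everywhere except at the injected outlier, where it spikes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1)-like signal with one injected outlier at t = 150.
n = 300
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()
x[150] += 5.0  # the outlier

# One-tap LMS predictor: weight w adapts so w * x[t-1] tracks x[t].
w, mu = 0.0, 0.05   # initial weight and step size
residual = np.zeros(n)
for t in range(1, n):
    pred = w * x[t - 1]
    e = x[t] - pred           # prediction error (the residual)
    residual[t] = e
    w += mu * e * x[t - 1]    # LMS weight update

# The largest absolute residual lands on the outlier.
print(int(np.argmax(np.abs(residual))))  # -> 150
```

In practice you would compare each residual against a threshold (e.g. a multiple of the residuals’ running standard deviation) rather than just taking the maximum.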
Please bear in mind that the methods above detect a potential outlier, but it is your responsibility to verify, and to some extent explain, their values.
I have a few (possible) outliers in my data; what’s next?
Golden rule: we should always search for the causes of the identified outliers: level changes and variance, and those outliers that can’t be explained demand special attention.
Case 1: We can explain the outliers by exogenous disturbances
In this case, a more appropriate strategy would be to specify a general model in some form based on causes of the exogenous disturbances and the time series parameters. This strategy allows for the use of prior information of the disturbances. It can also reduce the possibility of over-parametrization that arises from the abuse of the detection procedure.
Case 2: We can’t explain the outliers
Once we have detected a few (candidate) outliers in our sample, we are left with two options:
- Retain: Assume they are a genuine outcome of the underlying process and proceed with our analysis.
- Exclude: Assume they are bad values (e.g. data entry errors), scrap them, and treat them as missing.
Note that deletion of outlier data is a controversial practice.
NOTE: In time series, we require our sample data to be equally spaced, so dropping an outlier will create a gap (missing value) in your time series. To retain the equal spacing, we refer you to the interpolation methods discussed in an earlier issue. Keep in mind that we are fundamentally altering the time series regardless of what values we plug in, so a great deal of discretion is in order.
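As a minimal sketch of plugging the gap left by a dropped outlier, here is linear interpolation over an equally spaced series (hypothetical values, with NaN marking the removed observation):

```python
import numpy as np

# Equally spaced series; NaN marks the slot where an outlier was dropped.
t = np.arange(8, dtype=float)
y = np.array([1.0, 1.2, 1.1, np.nan, 1.3, 1.4, 1.35, 1.5])

# Linearly interpolate the gap from the neighboring observations,
# preserving the equal spacing of the series.
mask = np.isnan(y)
y_filled = y.copy()
y_filled[mask] = np.interp(t[mask], t[~mask], y[~mask])
print(y_filled[3])  # midpoint of its neighbors, ~1.2
```

Any of the interpolation methods from the earlier issue (nearest-neighbor, spline, etc.) can be substituted here; whichever you choose, the filled value is your construction, not an observation.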
Furthermore, you should differentiate among outliers through their influence on the underlying model’s parameters, and start with those that exhibit the greatest degrees of influence.
Outlier processing is a big and complex subject, and the answer will depend on how much effort you want to invest in it, and how effective your means of outlier detection prove to be.