R Tutorial: Anomalies in time series

In this lesson, we'll explore methods for detecting anomalies in time series data.

A time series is a set of observations collected in sequence and ordered by a regular time index. For example, the msales data, shown here printed with the head function, contains a time series of monthly revenue in millions of dollars in the column named sales, and a monthly time index in the column named month.
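The msales data itself is not included with this transcript, so the sketch below builds a synthetic stand-in with the same two columns. The seasonal shape, the noise level, the 48-month length, and the low value placed at month 14 are all assumptions made for illustration; only the column names and the 1.561 figure come from the lesson.

```r
# Synthetic stand-in for msales (NOT the lesson's real data).
set.seed(1)
month <- 1:48
# Assumed shape: a 12-month seasonal cycle plus a little noise.
sales <- 5 + 2 * sin(2 * pi * month / 12) + rnorm(48, sd = 0.2)
# Plant an unusually low value at month 14, mimicking the lesson.
sales[14] <- 1.561
msales <- data.frame(sales = sales, month = month)

# Print the first six rows, as the lesson does.
head(msales)
```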

Grubbs' test would not be appropriate for finding anomalies in the msales data. Firstly, Grubbs' test only checks the point that lies farthest from the overall mean, and the overall mean is often not a meaningful reference for a time series because the data may contain repeating seasonal patterns. Secondly, Grubbs' test can only test one anomaly at a time, but we may expect to find multiple anomalies in the time series.
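To make the first point concrete, here is a small base R sketch on made-up, noise-free seasonal data. It compares the point farthest from the overall mean (the one Grubbs' test would examine) with the point farthest from a simple per-calendar-month median. The per-month median is only an illustrative seasonal reference chosen for this sketch; it is not the Seasonal-Hybrid ESD algorithm, and all names and values here are hypothetical.

```r
# Made-up seasonal series: 36 months, 12-month cycle, no noise.
month <- 1:36
sales <- 5 + 2 * sin(2 * pi * month / 12)
# Plant an anomaly that stays inside the normal range of the data:
# month 19 would normally be about 4, here it is 6.
sales[19] <- 6

# Grubbs-style reference: distance from the overall mean.
dist_mean <- abs(sales - mean(sales))
farthest_from_mean <- which.max(dist_mean)
grubbs_stat <- max(dist_mean) / sd(sales)  # the Grubbs test statistic

# Illustrative seasonal reference: median of the same calendar month.
seasonal_ref <- ave(sales, (month - 1) %% 12, FUN = median)
farthest_from_seasonal <- which.max(abs(sales - seasonal_ref))

farthest_from_mean      # an ordinary seasonal trough, not the planted point
farthest_from_seasonal  # the planted point at month 19
```

In this toy series the overall mean singles out a normal seasonal trough, while the seasonal reference singles out the planted point.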

It's always important to first visualize a time series, to see whether seasonal patterns are present and to informally assess whether there are any anomalous points.

The plot function can be used to generate a time series plot. The first argument is a formula, sales ~ month, that specifies sales as the y-axis variable and month as the x-axis variable. The argument type = "o" results in a plot with points connected by lines.
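The plotting call described here might look like the following. The data frame is a synthetic stand-in built for this sketch (assumed shape and values), so the resulting picture will only resemble the lesson's plot, not reproduce it.

```r
# Synthetic stand-in for msales (NOT the lesson's real data).
set.seed(1)
month <- 1:48
sales <- 5 + 2 * sin(2 * pi * month / 12) + rnorm(48, sd = 0.2)
sales[14] <- 1.561  # planted low value, mimicking the lesson
msales <- data.frame(sales = sales, month = month)

# Formula interface: sales on the y-axis, month on the x-axis.
# type = "o" draws points overplotted on connecting lines.
plot(sales ~ month, data = msales, type = "o")
```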

The plot shows a clear seasonal pattern that repeats every 12 months. What's more, there is an unusually low revenue value at month 14, which might be considered an anomaly.

The Seasonal-Hybrid ESD algorithm is a statistical test that can find multiple anomalies in time series that have seasonal patterns. The algorithm is implemented by the AnomalyDetectionVec function from the AnomalyDetection package, which takes three main arguments.

The x argument is the time series values, in this case the revenues in the column msales$sales.

The second argument is the period of the repeating pattern, which depends on the time intervals between observations in the data. For the msales data, we previously noted a pattern that repeats every 12 months and so the period is 12.

The direction argument indicates whether the algorithm should look for unusually small values, unusually large values, or both. Here it is set to direction = "both", as we don't yet know where anomalies will occur.
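Putting the three arguments together, the call might look like the sketch below. It runs on a synthetic stand-in for msales (assumed values), and it is guarded so that it is skipped when the AnomalyDetection package is not installed. As far as I know, the package has been distributed through GitHub (twitter/AnomalyDetection) rather than CRAN, so install.packages may not find it.

```r
# Synthetic stand-in for msales (NOT the lesson's real data).
set.seed(1)
month <- 1:48
sales <- 5 + 2 * sin(2 * pi * month / 12) + rnorm(48, sd = 0.2)
sales[14] <- 1.561  # planted low value, mimicking the lesson
msales <- data.frame(sales = sales, month = month)

if (requireNamespace("AnomalyDetection", quietly = TRUE)) {
  sales_ad <- AnomalyDetection::AnomalyDetectionVec(
    x = msales$sales,    # the time series values
    period = 12,         # the pattern repeats every 12 months
    direction = "both"   # look for both low and high anomalies
  )
  print(sales_ad$anoms)
} else {
  message("AnomalyDetection is not installed; skipping the call.")
}
```

Because the data are synthetic, the anomalies reported here need not match the two found in the lesson.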

The output object sales_ad is a list whose most important element is called anoms. To print the contents of anoms, use the syntax sales_ad$anoms.

anoms is a data frame with columns called index and anoms, which correspond to the row number and value of the anomalies identified by the Seasonal-Hybrid ESD algorithm. Here the data frame contains details of two anomalies that have been identified. The first of these is in row 14 of the msales data, with a revenue of 1.561 million dollars.

Finally, the additional argument plot = TRUE instructs the algorithm to display any anomalies identified on a time series plot.
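A sketch of the same call with plot = TRUE follows, again on synthetic stand-in data and guarded against the package being absent. It also uses the index column of anoms to pull the flagged rows out of the original data frame, which is one convenient way to see the month alongside each anomalous value. The name flagged is introduced here for illustration, and the plotting code may behave differently depending on the installed ggplot2 version.

```r
# Synthetic stand-in for msales (NOT the lesson's real data).
set.seed(1)
month <- 1:48
sales <- 5 + 2 * sin(2 * pi * month / 12) + rnorm(48, sd = 0.2)
sales[14] <- 1.561  # planted low value, mimicking the lesson
msales <- data.frame(sales = sales, month = month)

if (requireNamespace("AnomalyDetection", quietly = TRUE)) {
  sales_ad <- AnomalyDetection::AnomalyDetectionVec(
    x = msales$sales,
    period = 12,
    direction = "both",
    plot = TRUE          # also build a plot marking the anomalies
  )
  # The plot is returned as an element of the result list.
  print(sales_ad$plot)
  # Use the row numbers in anoms$index to look up the original rows.
  flagged <- msales[sales_ad$anoms$index, ]
  print(flagged)
} else {
  message("AnomalyDetection is not installed; skipping the call.")
}
```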

The two anomalies identified by the algorithm are shown as blue points on the time series plot. Moving from left to right, the first anomaly is the point we suspected from visual inspection, which is now confirmed by the algorithm.

The second anomaly is more interesting and lies between a seasonal peak and a trough. Note that this point could never be found using Grubbs' test, as it doesn't lie near the extremes of the data.

Now it's your turn to apply anomaly detection to time series data!

Comments

I am unable to find the AnomalyDetection package. I have installed anomalize instead, but when I use the command library(anomalize) it gives me an error.

khyatidavda