Z-score for anomaly detection (2024)

Small-bites data science

Towards Data Science

Sep 3, 2020

Most of the time I write longer articles on data science topics but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, hoping people will like these pieces.

In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.

Anomaly detection is the process of identifying unexpected data, events or behavior that require some examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.

What is Z-score

Simply speaking, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells you how many standard deviations away from the mean a given observation is.

For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. Since it is far from the center, it can be flagged as an outlier/anomaly, depending on the cutoff you choose.
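To make the 2.5 example concrete, here is a minimal sketch with made-up numbers. The helper name `z_score` and the values 15, 10 and 2 are my own assumptions for illustration, not taken from any real dataset.

```python
def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std


# Hypothetical values: an observation of 15 in data with mean 10 and std 2
z = z_score(15.0, mean=10.0, std=2.0)
print(z)  # 2.5
```

The observation sits 5 units above the mean, and each standard deviation is 2 units wide, hence 2.5.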

How does it work?

Z-score is a parametric measure and it takes two parameters: mean and standard deviation.

Once you calculate these two parameters, finding the Z-score of a data point is easy.
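Written out, the calculation is the standard Z-score formula. The symbols for the mean and the standard deviation (μ and σ) are my notation here; x is a single data point.

```latex
z = \frac{x - \mu}{\sigma}
```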

Note that the mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, while the mean and standard deviation stay the same everywhere.
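A small sketch of that idea, using a toy array I picked because its mean and standard deviation come out as round numbers. The variable names are assumptions, and the standard deviation is the population one (NumPy's default, `ddof=0`).

```python
import numpy as np

# Toy data chosen for illustration only
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# One mean and one standard deviation for the whole dataset
mean = values.mean()  # 5.0
std = values.std()    # 2.0 (population standard deviation)

# One Z-score per data point, all sharing the same mean and std
z_scores = (values - mean) / std
print(z_scores)
```

The result has the same length as the input: eight data points in, eight Z-scores out.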

Example

Below is a Python implementation of Z-score with a few sample data points. I’m adding notes in each line of code to explain what’s going on.

# import numpy
import numpy as np

# random data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# calculate mean
mean = np.mean(data)

# calculate standard…
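The listing above is cut off, so what follows is not the original code but a minimal sketch of how the remaining steps might look with the same sample data. The cutoff of 2.0 and the names `threshold`, `is_anomaly` and `anomalies` are my assumptions, and I use the population standard deviation (NumPy's default).

```python
import numpy as np

# Same sample data as in the example above
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5])

# Mean and standard deviation of the whole dataset
mean = np.mean(data)
std = np.std(data)  # population standard deviation (ddof=0)

# Z-score of every data point
z_scores = (data - mean) / std

# Assumed cutoff: flag points more than 2 standard deviations from the mean
threshold = 2.0
is_anomaly = np.abs(z_scores) > threshold
anomalies = data[is_anomaly]

print(anomalies)  # [-99  88]
```

One thing worth noticing: with a stricter cutoff of 3, nothing in this sample would be flagged, because the two extreme values inflate the standard deviation themselves. The cutoff is a choice you tune to your data, not a fixed rule.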

The article "Small-bites data science" by Mahbub Alam, published on September 3, 2020, in Towards Data Science, is the first in a planned set of concise pieces on specific data science concepts, algorithms, and applications. In this particular "small-bite," the author discusses the Z-score in the context of anomaly detection.

The article defines anomaly detection as the process of identifying unexpected data, events, or behavior that require further examination, emphasizing its significance in the field of data science. Furthermore, it highlights that there are various algorithms for anomaly detection depending on the data type and business context.

The central concept explored in this piece is the Z-score, described as a statistical measure that quantifies how far a data point deviates from the rest of the dataset. The author offers a clear and straightforward explanation, stating that the Z-score indicates how many standard deviations a given observation is from the mean. This measure becomes crucial in identifying outliers or anomalies in the data.

To elucidate the functioning of the Z-score, the article explains that it is a parametric measure requiring two parameters: mean and standard deviation. Once these parameters are calculated for the entire dataset, determining the Z-score for a specific data point becomes a straightforward process. Importantly, the mean and standard deviation remain constant for the entire dataset, while each data point is assigned its own Z-score.

The author provides a Python implementation of the Z-score with a few sample data points, showcasing a practical application of the discussed concept. The code includes the use of the NumPy library for efficient numerical operations and demonstrates how to calculate the mean and subsequently determine the Z-score for each data point.

In summary, this "small-bite" offers a brief overview of the Z-score in the context of anomaly detection, combining theoretical understanding with practical implementation through Python code. It is a useful starting point for readers who want to grasp a fundamental data science concept quickly.