The distribution of data describes how its values are spread out. This article covers some essential concepts of the normal distribution.
Let’s get started!
If you work in statistics, you know how vital data distribution is: we almost always sample from a population whose full distribution is unknown. As a result, the distribution of our sample may limit the statistical techniques available to us.
The normal distribution is a frequently observed continuous probability distribution.
When a dataset conforms to the normal distribution, you can employ additional techniques to explore the data further.
In some cases, it can be beneficial to transform a skewed dataset so that it conforms to the normal distribution. This is most relevant when your data is almost normally distributed except for some distortion.
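As a minimal sketch of such a transformation, the snippet below applies a logarithm to a right-skewed sample. The log transform is only one common option and is an assumption here, not something this article prescribes; the sample values and the helper name mean_median_gap are made up for illustration. The gap between mean and median serves as a rough signal of skew.

```python
import math
import statistics

# Hypothetical right-skewed sample: a few large values stretch the right tail.
raw = [1, 2, 4, 8, 16, 32, 64]

# Log transform (base 2 here, chosen only so the result is easy to read).
logged = [math.log2(x) for x in raw]

def mean_median_gap(values):
    """Return mean minus median; a large positive gap hints at right skew."""
    return statistics.mean(values) - statistics.median(values)

print(mean_median_gap(raw))     # positive: the mean is pulled toward the long tail
print(mean_median_gap(logged))  # zero: the transformed sample is symmetric
```

Real data will rarely become perfectly symmetric, so it is worth re-checking the shape after any transformation rather than assuming it worked.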
Here are the basic features of the normal distribution:
Source: M.W. Toews via Wikipedia
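Two well-known properties of the normal curve, symmetry around the mean and a peak at the mean, can be checked numerically. This is a small sketch using NormalDist from Python's standard-library statistics module; the mean of 100 and SD of 15 are arbitrary illustrative values, not figures from this article.

```python
from statistics import NormalDist

# Illustrative parameters (assumed, not taken from the article).
dist = NormalDist(mu=100, sigma=15)

# Symmetry: the density is the same at equal distances on either side of the mean.
left = dist.pdf(100 - 20)
right = dist.pdf(100 + 20)

# The density is highest at the mean, and half of the probability lies below it.
peak = dist.pdf(100)
below_mean = dist.cdf(100)

print(left, right)   # two equal densities
print(peak > left)   # True
print(below_mean)    # 0.5
```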
Important terms you need to know as a general overview of the normal distribution:
Ways to Use Normal Distribution
If the dataset you have does not conform to the normal distribution, you could apply these tips.
Let’s also review some normality measures and how you would use them in a data science project.
Skewness
It is a measure of asymmetry relative to the mean.
Source: Rodolfo Hermans via Wikipedia
The graph above has negative skewness, meaning the tail of the distribution is longer on the left side. Counterintuitively, most of the data points are clustered on the right side. Do not confuse this with right (positive) skewness, which would be represented by this graph’s mirror image.
A Brief on How to Use Skewness
Skewness is a significant factor in model performance. To measure it, you can use skew from the scipy.stats module.
Source: SciPy
The skewness measure can point us to potential deviations in model performance across the range of feature values. A positively skewed feature, such as the second array in the image above, can enable better performance on lower values, since more of the data falls in that range.
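To make the measure concrete, here is a rough pure-Python sketch of sample skewness: the third central moment divided by the second central moment raised to the power 1.5. This should correspond to what scipy.stats.skew computes with its default settings, but treat that equivalence as an assumption and prefer the library call in real projects. The helper name sample_skewness and both sample arrays are illustrative.

```python
def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # illustrative values

print(sample_skewness(symmetric))     # 0.0 for a symmetric sample
print(sample_skewness(right_skewed))  # positive: one large value stretches the right tail
```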
Kurtosis
Kurtosis is a measure of the tailedness of a distribution. It is typically measured relative to 0, the kurtosis value of the normal distribution under Fisher’s definition. A positive kurtosis value indicates “fatter” tails.
The Laplace distribution has kurtosis > 0. Source: John D. Cook Consulting.
A Guide to Using Kurtosis
Understanding kurtosis supplies a lens on the presence of outliers in a dataset. To measure kurtosis, you can use kurtosis from the scipy.stats module. Negative kurtosis indicates data that is grouped tightly around the mean with fewer outliers.
Via SciPy
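As a hedged illustration of Fisher's definition, the sketch below computes excess kurtosis by hand: the fourth central moment divided by the squared second central moment, minus 3. This is meant to mirror the default behavior of scipy.stats.kurtosis, though that correspondence is stated here as an assumption. The helper name excess_kurtosis and the two samples are hypothetical.

```python
def excess_kurtosis(values):
    """Biased excess kurtosis (Fisher's definition): m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values with no extreme points.
tight = [1, 2, 3, 4, 5]

# Mostly identical values plus two extreme points (outliers).
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

print(excess_kurtosis(tight))         # negative: thinner tails than the normal
print(excess_kurtosis(heavy_tailed))  # positive: the outliers fatten the tails
```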
A Caution about the Normal Distribution
Many naturally occurring datasets are said to conform to the normal distribution. This claim has been made for everything from IQ to human heights. While it is true that the normal distribution is drawn from observations of nature and occurs frequently, we risk oversimplification by applying this assumption too liberally.
Often the normal model won’t fit well in the extremes, and it underestimates the probability of rare events.
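One way to see how much a normal assumption can understate rare events is to compare its tail probabilities with those of a fatter-tailed distribution that has the same standard deviation. The sketch below uses the Laplace distribution for that comparison; choosing Laplace is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def normal_tail(k):
    """P(|X| > k) for a normal distribution with mean 0 and SD 1."""
    return math.erfc(k / math.sqrt(2))

def laplace_tail(k):
    """P(|X| > k) for a Laplace distribution with mean 0 and SD 1.

    A Laplace distribution with scale b has variance 2 * b**2, so SD 1 means
    b = 1 / sqrt(2), and the two-sided tail probability is exp(-k / b).
    """
    b = 1 / math.sqrt(2)
    return math.exp(-k / b)

for k in (2, 3, 4):
    ratio = laplace_tail(k) / normal_tail(k)
    print(f"beyond {k} SD: normal={normal_tail(k):.6f}, "
          f"laplace={laplace_tail(k):.6f}, ratio={ratio:.1f}")
```

The ratio grows quickly with distance from the mean, so if the data really has fatter tails, a normal model would treat events several SDs out as far rarer than they are.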
Calculate the Share of Values within SD
As a dataset grows larger, manually calculating the standard deviation (SD) and counting the values that fall within each section of the bell-shaped curve becomes tedious. An empirical rule calculator can make the process faster. It calculates the share of values that fall within a given number of SDs of the mean (the dataset average). To calculate these shares, we only need the mean and SD values handy.
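The calculation behind such a calculator can be sketched in a few lines. This is a minimal example using NormalDist from Python's standard library, assuming the data is normally distributed; the mean of 50, the SD of 10, and the helper names share_within and band are hypothetical.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution lying within k SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

def band(mean, sd, k):
    """Interval covering the mean plus or minus k standard deviations."""
    return (mean - k * sd, mean + k * sd)

mean, sd = 50, 10  # hypothetical dataset summary
for k in (1, 2, 3):
    low, high = band(mean, sd, k)
    print(f"within {k} SD: {low} to {high}, share = {share_within(k):.4f}")
```

The printed shares are roughly 68%, 95%, and 99.7%, which is why the empirical rule is also called the 68-95-99.7 rule.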
Summary
This brief article covered the basics of the normal distribution: some fundamental concepts, how to measure them, and how to use them. Make sure not to over-apply the normal distribution, or you risk discounting the chances of outliers. Let us know how it helped you understand these concepts.
Comment
Normality is not always a concern. I have a template that applies the theory of the Boxplot to isolate, visualize and explore unusual data points vs. the median in any Cloud, database or flat file data source. Find Unusual Data: Explore Your Outliers - Cool Number Crunching Te...
Evan, great summary.