It is important to discover the degree to which variables in your
dataset depend on each other. Understanding these dependencies helps you
prepare a dataset that meets the expectations of your machine learning
algorithms; if the dataset is poorly prepared, performance will degrade.
In this tutorial, we are going to use correlation statistics to summarise the relationship between two variables.
After completing this tutorial, you will know:
- How to calculate a covariance matrix to summarise the linear relationship between two or more variables.
- How to calculate the Pearson’s correlation coefficient to summarise the linear relationship between two variables.
- How to calculate the Spearman’s correlation coefficient to summarise the monotonic relationship between two variables.
What is correlation?
Variables in a dataset can be related to each other for many reasons. For example:
- One variable could cause or depend on the values of another variable
- One variable could be lightly associated with another variable
- Two variables could depend on a third unknown variable
It can be useful in data analysis and modelling to better understand the
relationships between variables. The statistical relationship between
two variables is referred to as their correlation.
A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other’s decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.
- Positive Correlation - Both variables change in the same direction.
- Neutral Correlation - No relationship in the change of the variables.
- Negative Correlation - Variables change in opposite directions.
Dataset:
We will generate 1,000 samples of two variables with a
strong positive correlation. The first variable will be random numbers
drawn from a Gaussian distribution with a mean of 100 and a standard
deviation of 20. The second variable will be values from the first
variable with Gaussian noise added with a mean of 50 and a standard
deviation of 10.
The Python code for generating the test dataset is given below:
```python
import numpy as np
from matplotlib import pyplot

# prepare dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# summarise
print('data1: mean=%.3f stdv=%.3f' % (np.mean(data1), np.std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (np.mean(data2), np.std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()
```
A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables.
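The goals above also mention a covariance matrix, which this dataset lets us demonstrate. A minimal sketch using NumPy's `np.cov` (the fixed seed is an illustration choice for reproducibility, not part of the original code):

```python
import numpy as np

# reproducible version of the contrived dataset above
# (the fixed seed is an illustration choice, not part of the original code)
rng = np.random.RandomState(1)
data1 = 20 * rng.randn(1000) + 100
data2 = data1 + (10 * rng.randn(1000) + 50)

# np.cov returns a 2x2 matrix:
# [[var(data1),        cov(data1, data2)],
#  [cov(data1, data2), var(data2)       ]]
covariance = np.cov(data1, data2)
print(covariance)
```

The off-diagonal entry is large and positive, reflecting the strong positive relationship we built into the data. Its magnitude depends on the units of the variables, however, which is why the normalised Pearson coefficient discussed next is easier to interpret.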
Pearson’s Correlation
The Pearson correlation coefficient (named for Karl Pearson) can be used
to summarise the strength of the linear relationship between two data
samples.
The Pearson’s correlation coefficient is calculated as the covariance of
the two variables divided by the product of the standard deviation of
each data sample. It is the normalisation of the covariance between the
two variables to give an interpretable score.
The use of mean and standard deviation in the calculation suggests the
need for the two data samples to have a Gaussian or Gaussian-like
distribution.
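The definition above can be checked by hand. A minimal sketch (again with a fixed seed purely for reproducibility) that computes covariance divided by the product of the standard deviations, and compares the result against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

# reproducible contrived dataset (fixed seed is an illustration choice)
rng = np.random.RandomState(1)
data1 = 20 * rng.randn(1000) + 100
data2 = data1 + (10 * rng.randn(1000) + 50)

# Pearson's r = cov(X, Y) / (std(X) * std(Y))
# np.cov defaults to ddof=1, so use ddof=1 for the standard deviations too
covariance = np.cov(data1, data2)[0, 1]
r_manual = covariance / (np.std(data1, ddof=1) * np.std(data2, ddof=1))

# NumPy's built-in Pearson correlation for comparison
r_numpy = np.corrcoef(data1, data2)[0, 1]
print(r_manual, r_numpy)
```

The two values agree, confirming that the coefficient really is just the covariance normalised by the standard deviations.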
The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship.
The coefficient returns a value between -1 and 1, representing the full
range from perfect negative correlation to perfect positive correlation.
A value of 0 means no correlation. The value must be interpreted:
often a value below -0.5 or above 0.5 indicates a notable correlation,
while values between those thresholds suggest a weaker correlation.
We can calculate the correlation between the two variables in our test
problem. The pandas DataFrame provides a function to calculate the
correlation between two variables. The sample code is given below:
```python
import pandas as pd
import numpy as np

# preparing dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# creating data frame
data_set = pd.DataFrame(columns=['data1', 'data2'])
data_set['data1'] = data1
data_set['data2'] = data2
# calculating Pearson correlation
print('Pearson correlation =', data_set['data1'].corr(data_set['data2'], method='pearson'))
```
The result is given below:
```
Pearson correlation = 0.897870771565
```
We can see that the two variables are positively correlated, with a correlation of approximately 0.9. This suggests a high level of correlation, e.g. a value above 0.5 and close to 1.0.
The Pearson’s correlation coefficient can also be used to evaluate the relationship between more than two variables.
This can be done by calculating a matrix of the relationships between each pair of variables in the dataset. The result is a symmetric matrix called a correlation matrix, with a value of 1.0 along the main diagonal, as each column always correlates perfectly with itself.
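As a sketch of such a correlation matrix, we can add a third, unrelated variable (`data3` is an invented name for illustration, and the fixed seed is only for reproducibility) and call pandas' `DataFrame.corr()`:

```python
import pandas as pd
import numpy as np

# reproducible contrived data plus an unrelated third variable
# (data3 and the fixed seed are illustration choices)
rng = np.random.RandomState(1)
data1 = 20 * rng.randn(1000) + 100
data2 = data1 + (10 * rng.randn(1000) + 50)
data3 = rng.randn(1000)

df = pd.DataFrame({'data1': data1, 'data2': data2, 'data3': data3})
# pairwise Pearson correlation of every column with every other column
corr_matrix = df.corr(method='pearson')
print(corr_matrix)
```

The main diagonal is 1.0, the matrix is symmetric, the data1/data2 entry shows the strong contrived relationship, and the entries involving data3 sit near zero.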
Spearman’s Correlation
The Spearman’s correlation coefficient (named for Charles
Spearman) can be used to summarise the strength of the relationship
between two data samples. This test can also be used when there is a
linear relationship between the variables, but it will have slightly less
power (e.g. it may result in lower coefficient scores).
As with the Pearson correlation coefficient, the scores range between -1
and 1, for perfectly negatively correlated and perfectly positively
correlated variables respectively.
Instead of calculating the coefficient using covariance and standard
deviations on the samples themselves, these statistics are calculated
from the relative rank of values on each sample. This is a common
approach used in non-parametric statistics, e.g. statistical methods
where we do not assume a distribution of the data such as Gaussian.
A linear relationship between the variables is not assumed, although a
monotonic relationship is assumed. This is a mathematical name for an
increasing or decreasing relationship between the two variables.
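A minimal sketch of this rank idea (names and the fixed seed are illustration choices): Spearman's coefficient can be reproduced by rank-transforming the samples with pandas' `rank()` and then computing an ordinary Pearson correlation on the ranks.

```python
import pandas as pd
import numpy as np

# reproducible contrived dataset (fixed seed is an illustration choice)
rng = np.random.RandomState(1)
data1 = 20 * rng.randn(1000) + 100
data2 = data1 + (10 * rng.randn(1000) + 50)
df = pd.DataFrame({'data1': data1, 'data2': data2})

# Spearman's rho is Pearson's r computed on the ranks of the values
spearman_manual = df['data1'].rank().corr(df['data2'].rank(), method='pearson')

# pandas' built-in Spearman for comparison
spearman_builtin = df.corr(method='spearman').loc['data1', 'data2']
print(spearman_manual, spearman_builtin)
```

The two values agree, which is exactly why no distributional assumption is needed: the ranks carry the monotonic relationship regardless of how the raw values are distributed.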
If you are unsure of the distribution and possible relationships between
two variables, the Spearman correlation coefficient is a good tool to use.
The sample code is given below:
```python
import pandas as pd
import numpy as np

# preparing dataset
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)
# creating data frame
data_set = pd.DataFrame(columns=['data1', 'data2'])
data_set['data1'] = data1
data_set['data2'] = data2
# calculating Spearman correlation
print('Spearman correlation =', data_set['data1'].corr(data_set['data2'], method='spearman'))
```
The result for the above code is given below:
```
Spearman correlation = 0.873144093144
```
We know that the data is Gaussian and that the relationship between the variables is linear. Nevertheless, the non-parametric rank-based approach shows a strong correlation of about 0.87 between the variables.
Thanks for reading this post.