In this tutorial we are going to learn how to scale the independent variables (features) of a data set using Python. In data processing, this is also called data normalization.
What is the use of feature scaling in Machine Learning?
The range of values in raw data varies widely, and in some machine learning algorithms the objective function will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
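To make the distance argument concrete, here is a small sketch. The three points and the feature meanings (age and yearly income) are made-up values for illustration only, not data from this tutorial.

```python
import numpy as np

# Hypothetical points: column 0 = age in years, column 1 = yearly income.
points = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
])

def euclidean(p, q):
    """Euclidean distance between two 1-D points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw distances from the first point: the income column dominates.
raw_to_second = euclidean(points[0], points[1])
raw_to_third = euclidean(points[0], points[2])

# Min-max scale each column to [0, 1], then measure again.
col_min = points.min(axis=0)
col_max = points.max(axis=0)
scaled = (points - col_min) / (col_max - col_min)

scaled_to_second = euclidean(scaled[0], scaled[1])
scaled_to_third = euclidean(scaled[0], scaled[2])

print(raw_to_second, raw_to_third)
print(scaled_to_second, scaled_to_third)
```

On the raw values the two distances differ by roughly a factor of eight, driven almost entirely by income. After scaling they are nearly equal, because the age gap now counts as much as the income gap.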
We can do feature scaling in many ways. Here we are going to look at Min-Max Normalization and Mean Normalization.
Min-Max Normalization:
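This is the simplest method and consists of rescaling the range of each feature to [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula for the [0, 1] range is given below,
x' = (x - min(x)) / (max(x) - min(x))
where x is the original value and x' is the normalized value.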
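The rescaling described above can be sketched as a small function. The name min_max_scale and its new_min / new_max parameters are our own choices for illustration, and raising an error on a constant feature is an assumption about how you might want to handle that case.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # Assumption: refuse to scale a constant feature (would divide by zero).
        raise ValueError("cannot rescale a constant feature")
    unit = (values - old_min) / (old_max - old_min)  # now in [0, 1]
    return unit * (new_max - new_min) + new_min

x = [10, 20, 30, 50]
zero_one = min_max_scale(x)               # target range [0, 1]
minus_one_one = min_max_scale(x, -1, 1)   # target range [-1, 1]

print(zero_one)
print(minus_one_one)
```

The [−1, 1] version is just the [0, 1] result stretched and shifted, so the relative spacing of the values stays the same in both.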
Mean Normalization:
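This method rescales the range of features using the mean. The form used in this tutorial subtracts the mean and divides by the standard deviation, which is also called standardization (z-score); another common form divides by max(x) - min(x) instead. The general formula is given below,
x' = (x - mean(x)) / std(x)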
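As a rough comparison, the sketch below applies two common variants to the same made-up values: dividing by the range (max minus min) and dividing by the standard deviation. The numbers are illustrative only.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Variant 1: centre on the mean, divide by the range (max - min).
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Variant 2 (standardization / z-score): centre on the mean,
# divide by the standard deviation. ddof=1 is the sample standard
# deviation, which is the default of pandas' .std().
standardized = (x - x.mean()) / x.std(ddof=1)

print(mean_normalized)
print(standardized)
```

Both results are centred on zero. The first stays within [−1, 1], while the second has unit standard deviation and no fixed bounds.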
Now we are going to apply feature scaling with these methods to an actual data set. If you already have a data set, skip the section on preparing the sample data set.
Preparing the data set:
Here we create a sample data set with four columns, each holding random numeric values. Normalization cannot be applied to text data types. We can generate the random numbers using numpy (imported as np, with pandas imported as pd). The sample line for generating the data frame is given below,
    df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
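If you want the same random numbers on every run, one option is a seeded generator. This is a sketch rather than part of the original code, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A fixed seed (arbitrary value 42) makes the random sample repeatable.
rng = np.random.default_rng(42)

# 1000 rows and 4 numeric columns drawn from a standard normal distribution.
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=["a", "b", "c", "d"])

print(df.shape)
print(df.dtypes)
```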
You can inspect the first few rows of the sample data with df.head().
Now we have the data in hand. What remains is to apply feature scaling to this data set. We are going to do the feature scaling for the scenarios below:
- Feature scaling / normalizing a single column
- Feature scaling / normalizing the entire data frame
Normalizing a single column:
Here we pick one particular column of the data set and normalize it. Our example has four features: a, b, c, d. These column names are only placeholders for this example; use the column names from your own data set instead.
Min-Max Normalization for a single column:
The sample code is given below,
    df["a_min"] = (df["a"] - df["a"].min()) / (df["a"].max() - df["a"].min())
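One case the one-liner above does not cover is a column where every value is the same, since max minus min is then zero. Here is a minimal sketch of a guarded version. The helper name min_max_column is hypothetical, and mapping a constant column to all zeros is an assumption, not the only reasonable choice.

```python
import pandas as pd

def min_max_column(col):
    """Min-max scale one pandas Series to [0, 1]."""
    col_range = col.max() - col.min()
    if col_range == 0:
        # Assumption: map a constant column to zeros instead of dividing by zero.
        return pd.Series(0.0, index=col.index, name=col.name)
    return (col - col.min()) / col_range

# Small made-up frame: one ordinary column and one constant column.
df = pd.DataFrame({"a": [3.0, 7.0, 11.0], "flat": [5.0, 5.0, 5.0]})
df["a_min"] = min_max_column(df["a"])
df["flat_min"] = min_max_column(df["flat"])

print(df)
```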
Mean Normalization for a single column:
The sample code is given below,
    df["a_mean_method"] = (df["a"] - df["a"].mean()) / df["a"].std()
Normalizing the entire data frame:
This can be done in a pretty straightforward way, but watch out for text columns and missing (None) values. Text columns cannot be normalized, and if a column contains None values, fill them first, for example with 0.
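Here is a minimal sketch of that clean-up step on a small made-up data frame. The column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "label": ["x", "y", "z", "w"],  # text column, cannot be normalized
})

# Keep only the numeric columns, then replace missing values with 0.
numeric = df.select_dtypes(include="number").fillna(0)

# Min-max normalize every remaining column at once.
normalized_df = (numeric - numeric.min()) / (numeric.max() - numeric.min())

print(normalized_df)
```

Note that filling with 0 changes the column minimum in this example, which shifts every scaled value. Depending on your data, filling with the column mean or dropping the incomplete rows may be a better fit.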
Min-Max Normalization:
The sample code is given below,
    normalized_df = (df - df.min()) / (df.max() - df.min())
We can also do this using the MinMaxScaler class from the sklearn library; the full source code at the end of this tutorial includes that version.
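As a minimal sketch of the sklearn route, mirroring the calls used in the full source listing, the snippet below scales the whole data frame. Passing columns and index back to pd.DataFrame is our addition, so the result keeps the original labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

scaler = MinMaxScaler()                  # default feature_range is (0, 1)
scaled_array = scaler.fit_transform(df)  # returns a NumPy array
df_normalized = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)

print(df_normalized.head(4))
```

One practical advantage of the scaler object is that it remembers the minimum and maximum it was fitted on, so the same scaling can later be applied to new data with scaler.transform.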
Mean Normalization:
The sample code is given below,
    normalized_df = (df - df.mean()) / df.std()
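If you prefer sklearn for this step too, StandardScaler does the same kind of scaling with one difference worth knowing: pandas' .std() uses the sample standard deviation (ddof=1) by default, while StandardScaler uses the population standard deviation (ddof=0). The sketch below compares the two on random data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

# pandas version from this tutorial: sample standard deviation (ddof=1).
pandas_scaled = (df - df.mean()) / df.std()

# Same formula with the population standard deviation (ddof=0).
pandas_scaled_pop = (df - df.mean()) / df.std(ddof=0)

# sklearn version: matches the ddof=0 result.
sklearn_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns, index=df.index
)

print(np.allclose(sklearn_scaled, pandas_scaled_pop))
```

With 1000 rows the two versions differ only by a tiny constant factor, so either is fine in practice as long as you use the same one consistently.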
The full sample source code is given below,
    # coding: utf-8

    # In[4]:
    import pandas as pd
    import numpy as np

    # In[5]:
    # creating the data set
    df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])

    # In[7]:
    df.head(5)

    # ### Min-Max normalization for a single column using our own code

    # In[8]:
    df["a_min"] = (df["a"] - df["a"].min()) / (df["a"].max() - df["a"].min())

    # In[9]:
    df.head()

    # ### Mean normalization for a single column

    # In[11]:
    df['a_mean_method'] = (df['a'] - df['a'].mean()) / df['a'].std()

    # In[12]:
    df.head()

    # ### Min-Max normalization for the entire data frame

    # In[13]:
    normalized_df = (df - df.min()) / (df.max() - df.min())

    # ### Min-Max normalization for the entire data frame using the sklearn library

    # In[14]:
    from sklearn import preprocessing

    min_max_scaler = preprocessing.MinMaxScaler()
    scaled_array = min_max_scaler.fit_transform(df)
    # keep the original column names on the scaled result
    df_normalized = pd.DataFrame(scaled_array, columns=df.columns)

    # In[15]:
    df_normalized.head(4)
I hope this made feature scaling clearer. Drop your comments and suggestions below.
Thanks for reading. :)