Hey all, in this post I will explain the techniques available for feature engineering in machine learning. This post covers only the types of feature engineering and where each one is useful. The working guidelines and more details of each technique will be covered in subsequent posts.
What is meant by a Feature?
In a given dataset, an attribute or variable that is meaningful in the context of a problem is called a feature. If a feature has no impact on the problem, it is not part of the problem.
For example, in computer vision, an image is an observation, but a feature could be a line in the image. In natural language processing, a document or a tweet could be an observation, and a phrase or word count could be a feature. In speech recognition, an utterance could be an observation, but a feature might be a single word or phoneme.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the predictive
models, resulting in improved model accuracy on unseen data. It is so
important to how your model performs, that even a simple model with
great features can outperform a complicated algorithm with poor ones. In
fact, feature engineering has been described as easily the most
important factor in determining the success or failure of your
predictive model. Feature engineering really boils down to the human
element in machine learning. How much you understand the data, with your
human intuition and creativity, can make the difference.
Why do we need it?
There are many reasons, but I have listed the most common and important ones below:
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Reduces Training Time: Fewer data points reduce algorithm complexity, so algorithms train faster.
A common practice is to come up with as many features as possible (e.g. > 100 is not unusual), but the presence of irrelevant features hurts generalization.
Note: In practice we will use only some of these techniques. Feature engineering depends heavily on the domain and on data availability. Keep this in mind.
The Curse of Dimensionality
The number of samples required (to achieve the same accuracy) grows exponentially with the number of variables, while in practice the number of training examples is fixed. As a result, the classifier's performance usually degrades as the number of features grows. In many cases, the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.
Feature engineering has three components, listed below:
- Feature Construction or Generation
- Feature Extraction
- Feature Selection
1. Feature Construction
Feature construction is the manual creation of new features from raw data. This requires spending a lot of time with actual sample data (not aggregates) and thinking about the underlying form of the problem, the structures in the data, and how best to expose them to predictive modeling algorithms. With tabular data, it often means a mixture of aggregating or combining features to create new features, and decomposing or splitting features to create new features.
With textual data, it often means devising document- or context-specific indicators relevant to the problem. With image data, it can often mean enormous amounts of time prescribing automatic filters to pick out relevant structures. Below are the most common ways to create new features in machine learning.
1.1 Indicator Variables:
The first type of feature engineering involves using indicator variables
to isolate key information. Now, some of you may be wondering,
“shouldn’t a good algorithm learn the key information on its own?” Well,
not always. It depends on the amount of data you have and the strength
of competing signals.
You can help your algorithm “focus” on what’s important by highlighting it beforehand.
Indicator variable from threshold: Let’s say you’re studying alcohol preferences by U.S. consumers and your dataset has an age feature. You can create an indicator variable for age >= 21 to distinguish subjects who were over the legal drinking age.
Indicator variable from multiple features: You’re predicting real-estate prices and you have the features n_bedrooms and n_bathrooms. If houses with 2 beds and 2 baths command a premium as rental properties, you can create an indicator variable to flag them.
Indicator variable for special events: You’re modeling weekly sales for an e-commerce site. You can create two indicator variables for the weeks of Black Friday and Christmas.
Indicator variable for groups of classes: You’re analyzing website conversions and your dataset has the categorical feature traffic_source. You could create an indicator variable for paid_traffic by flagging observations with traffic source values of "Facebook Ads" or "Google Adwords".
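Here is a minimal pandas sketch of the indicator variables above. The DataFrame and its column names (age, n_bedrooms, n_bathrooms, traffic_source) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "age": [18, 25, 40],
    "n_bedrooms": [2, 3, 2],
    "n_bathrooms": [2, 1, 2],
    "traffic_source": ["Facebook Ads", "Organic", "Google Adwords"],
})

# Indicator variable from a threshold
df["over_21"] = (df["age"] >= 21).astype(int)

# Indicator variable from multiple features
df["two_bed_two_bath"] = ((df["n_bedrooms"] == 2) & (df["n_bathrooms"] == 2)).astype(int)

# Indicator variable for a group of classes
df["paid_traffic"] = df["traffic_source"].isin(["Facebook Ads", "Google Adwords"]).astype(int)
```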
1.2 Interaction Features
The next type of feature engineering involves highlighting interactions between two or more features. Have you ever heard the phrase "the whole is greater than the sum of its parts"? Well, some features can be combined to provide more information than they would individually.
Specifically, look for opportunities to take the sum, difference, product, or quotient of multiple features.
Note: We don’t recommend using an automated loop to create interactions for all your features. This leads to “feature explosion.”
Sum of two features: Let’s say you wish to predict revenue based on preliminary sales data. You have the features sales_blue_pens and sales_black_pens. You could sum those features if you only care about overall sales_pens.
Difference between two features: You have the features house_built_date and house_purchase_date. You can take their difference to create the feature house_age_at_purchase.
Product of two features: You’re running a pricing test, and you have the feature price and an indicator variable conversion. You can take their product to create the feature earnings.
Quotient of two features: You have a dataset of marketing campaigns with the features n_clicks and n_impressions. You can divide clicks by impressions to create click_through_rate, allowing you to compare across campaigns of different volume.
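The following is a minimal pandas sketch of the four interaction features above; the data values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sales_blue_pens": [120, 80],
    "sales_black_pens": [200, 150],
    "house_built_date": pd.to_datetime(["1990-01-01", "2005-06-15"]),
    "house_purchase_date": pd.to_datetime(["2010-01-01", "2015-06-15"]),
    "price": [9.99, 4.99],
    "conversion": [1, 0],
    "n_clicks": [30, 12],
    "n_impressions": [1000, 800],
})

df["sales_pens"] = df["sales_blue_pens"] + df["sales_black_pens"]            # sum
df["house_age_at_purchase"] = (
    df["house_purchase_date"] - df["house_built_date"]
).dt.days / 365.25                                                           # difference (in years)
df["earnings"] = df["price"] * df["conversion"]                              # product
df["click_through_rate"] = df["n_clicks"] / df["n_impressions"]              # quotient
```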
1.3 Feature Representation
This next type of feature engineering is simple yet impactful. It’s
called feature representation.Your data won’t always come in the ideal
format. You should consider if you’d gain information by representing
the same feature in a different way.
Date and time features: Let’s say you have the feature purchase_datetime. It might be more useful to extract purchase_day_of_week and purchase_hour_of_day. You can also aggregate observations to create features such as purchases_over_last_30_days.
Numeric to categorical mappings: You have the feature years_in_school. You might create a new feature grade with classes such as "Elementary School", "Middle School", and "High School".
Grouping sparse classes: You have a feature with many classes that have low sample counts. You can try grouping similar classes and then grouping the remaining ones into a single "Other" class.
Creating dummy variables: Depending on your machine learning implementation, you may need to manually transform categorical features into dummy variables. You should always do this after grouping sparse classes.
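Below is a minimal pandas sketch of these representation ideas, assuming a hypothetical DataFrame with purchase_datetime and years_in_school columns; the bin edges for the grade mapping are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_datetime": pd.to_datetime(["2021-11-26 14:30", "2021-12-24 09:05"]),
    "years_in_school": [4, 10],
})

# Date and time features
df["purchase_day_of_week"] = df["purchase_datetime"].dt.dayofweek
df["purchase_hour_of_day"] = df["purchase_datetime"].dt.hour

# Numeric to categorical mapping
df["grade"] = pd.cut(
    df["years_in_school"],
    bins=[0, 5, 8, 12],
    labels=["Elementary School", "Middle School", "High School"],
)

# Dummy variables for the new categorical feature
df = pd.get_dummies(df, columns=["grade"])
```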
1.4 External Data
An underused type of feature engineering is bringing in external data.
This can lead to some of the biggest breakthroughs in performance. For
example, one way quantitative hedge funds perform research is by
layering together different streams of financial data.
Many machine learning problems can benefit from bringing in external data. Here are some examples:
Time series data: The nice thing about time series data is that you only need one feature, some form of date, to layer in features from another dataset.
External API’s: There are plenty of API’s that can help you create features. For example, the Microsoft Computer Vision API can return the number of faces from an image.
Geocoding: Let’s say you have street_address, city, and state. Well, you can geocode them into latitude and longitude. This will allow you to calculate features such as local demographics (e.g. median_income_within_2_miles) with the help of another dataset.
Other sources of the same data: How many ways could you track a Facebook ad campaign? You might have Facebook’s own tracking pixel, Google Analytics, and possibly another third-party software. Each source can provide information that the others don’t track. Plus, any differences between the datasets could be informative (e.g. bot traffic that one source ignores while another source keeps).
2. Feature Extraction or Dimensionality Reduction:
Feature extraction is a process to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). This new, reduced set of features should be able to summarize most of the information contained in the original set. There are many feature extraction techniques, and we are going to look at the most widely used ones. They are:
- Principal Components Analysis (PCA)
- Independent Component Analysis (ICA)
- Linear Discriminant Analysis (LDA)
- Locally Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Autoencoders
2.1 Principal Components Analysis (PCA):
PCA is one of the most widely used linear dimensionality reduction techniques. When using PCA, we take our original data as input and try to find a combination of the input features that best summarizes the original data distribution, in order to reduce its original dimensionality. PCA does this by maximizing variance and minimizing the reconstruction error. In PCA, the original data is projected onto a set of orthogonal axes, and each of the axes is ranked in order of importance.
PCA is an unsupervised learning algorithm, therefore it doesn’t care about the data labels but only about variation. This can in some cases lead to misclassification of data. While using PCA, we can also explore how much of the original data variance was preserved using the explained_variance_ratio_ attribute in Scikit-learn. It is important to mention that principal components do not have any correlation with each other.
PCA Approach Overview:
- Take the whole dataset consisting of d-dimensional samples ignoring the class labels
- Compute the d-dimensional mean vector (i.e., the means for every dimension of the whole dataset)
- Compute the scatter matrix (alternatively, the covariance matrix) of the whole data set
- Compute the eigenvectors (e1, e2, ..., ed) and corresponding eigenvalues (λ1, λ2, ..., λd)
- Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix W (where every column represents an eigenvector)
- Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the equation y = Wᵀ × x (where x is a d×1 vector representing one sample, and y is the transformed k×1 sample in the new subspace)
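Here is a minimal Scikit-learn sketch of PCA; the random data and the choice of 3 components are arbitrary and only illustrate the API.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)           # 100 samples, 10 original features

pca = PCA(n_components=3)             # keep the 3 top-ranked orthogonal axes
X_reduced = pca.fit_transform(X)      # project the data onto the new subspace

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```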
2.2 Independent Component Analysis (ICA):
ICA is a linear dimensionality reduction method which takes as input a mixture of independent components and aims to correctly identify each of them (removing all the unnecessary noise). Two input features can be considered independent if both their linear and non-linear dependence is equal to zero.
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent from each other. ICA is a special case of blind source separation.
A common example application is the "cocktail party problem" of listening in on one person's speech in a noisy room. ICA is also used in medical applications such as EEG and fMRI analysis to separate useful signals from noise.
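The sketch below shows ICA (FastICA in Scikit-learn) separating a synthetic two-signal mixture in the spirit of the cocktail party problem; the source signals and mixing matrix are made up.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))               # source 2: square wave
S = np.column_stack([s1, s2])
S += 0.1 * rng.standard_normal(S.shape)   # a little observation noise

A = np.array([[1.0, 0.5], [0.5, 2.0]])    # mixing matrix
X = S @ A.T                               # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)        # recovered independent components
```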
2.3 Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction technique and a machine learning classifier. LDA aims to maximize the distance between the means of the classes and to minimize the spread within each class. LDA therefore uses the within-class and between-class scatter as measures. This is a good choice because maximizing the distance between the class means when projecting the data into a lower-dimensional space can lead to better classification results. When using LDA, it is assumed that the input data follows a Gaussian distribution, so applying LDA to non-Gaussian data can lead to poor classification results.
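Here is a minimal Scikit-learn sketch of LDA used for dimensionality reduction; the bundled iris dataset is used purely as an example (with 3 classes, LDA can produce at most 2 discriminant axes).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)   # the labels y are required (supervised)

print(X_reduced.shape)                # (150, 2)
```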
PCA vs LDA:
PCA has no concern with the class labels. In simple words, PCA summarizes the feature set without relying on the output. PCA tries to find the directions of maximum variance in the dataset. In a large feature set, there are many features that are merely duplicates of other features or have a high correlation with other features. Such features are basically redundant and can be ignored. The role of PCA is to find such highly correlated or duplicate features and to come up with a new feature set with minimum correlation between the features, in other words a feature set with maximum variance. Since the variance of the features doesn't depend on the output, PCA doesn't take the output labels into account.
Unlike PCA, LDA tries to reduce dimensions of the feature set while retaining the information that discriminates output classes. LDA tries to find a decision boundary around each cluster of a class. It then projects the data points to new dimensions in a way that the clusters are as separate from each other as possible and the individual elements within a cluster are as close to the centroid of the cluster as possible. The new dimensions are ranked on the basis of their ability to maximize the distance between the clusters and minimize the distance between the data points within a cluster and their centroids. These new dimensions form the linear discriminants of the feature set.
2.4 Locally Linear Embedding (LLE)
So far we have considered methods such as PCA and LDA, which perform really well when there are linear relationships between the features; we will now consider how to deal with non-linear cases. Locally Linear Embedding is a dimensionality reduction technique based on Manifold Learning. A manifold is an object of D dimensions which is embedded in a higher-dimensional space. Manifold Learning aims to make this object representable in its original D dimensions instead of in an unnecessarily larger space.
Some examples of Manifold Learning algorithms are: Isomap, Locally Linear Embedding, Modified Locally Linear Embedding, Hessian Eigen Mapping and so on.
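The sketch below applies Locally Linear Embedding to the classic swiss-roll toy dataset in Scikit-learn; the parameter values are illustrative.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D non-linear manifold

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)   # 2-D embedding that preserves local neighborhoods

print(X_unrolled.shape)             # (1000, 2)
```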
3. Feature Selection
Feature selection is one of the core concepts in machine learning and hugely impacts the performance of your model. Feature selection is the process where you automatically or manually select the features which contribute most to the prediction variable or output you are interested in. In other words, it is the process of reducing the number of input variables when developing a predictive model.
Statistics-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics, and selecting those input variables that have the strongest relationship with the target. These methods can be fast and effective, although the choice of statistical measure depends on the data types of both the input and output variables.
There are a lot of ways we can think about feature selection, but most feature selection methods fall into three major buckets:
Filter-based: We specify some metric and filter features based on it. An example of such a metric could be correlation or chi-square.
Wrapper-based: Wrapper methods consider the selection of a set of features as a search problem. Example: Recursive Feature Elimination.
Embedded: Embedded methods use algorithms that have built-in feature selection. For instance, Lasso and Random Forest have their own feature selection methods.
3.1 Filter Based Method
Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) the input variables that will be used in the model.
Basic idea: assign heuristic score to each feature to filter out the “obviously” useless ones.
- Does the individual feature seem to help prediction?
- Do we have enough data to use it reliably?
- Many popular scores exist [see Yang and Pedersen '97]
Classification with categorical data:
- Chi-squared
- information gain
- document frequency
Regression:
- correlation
- mutual information
These scores all depend on one feature at a time (and the data); you then pick how many of the highest-scoring features to keep. Some common filter methods are:
- Information gain
- Chi-square test
- Fisher score
- Correlation coefficient (Pearson correlation)
- Variance threshold
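A minimal Scikit-learn sketch of a filter method follows: keep the k features with the highest chi-squared score. The iris dataset is used only as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)   # score each feature independently
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # chi-squared score per original feature
print(X_selected.shape)   # (150, 2)
```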
3.2 Wrapper Based Method
Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. Sequential Forward Selection (SFS), a special case of sequential feature selection, is a greedy search algorithm that attempts to find the "optimal" feature subset by iteratively selecting features based on classifier performance. We start with an empty feature subset and add one feature at a time in each round; this feature is selected from the pool of all features that are not yet in our subset, and it is the feature that, when added, results in the best classifier performance.
Some common wrapper methods are:
- Recursive feature elimination
- Sequential feature selection algorithms
- Genetic algorithms
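Here is a minimal Scikit-learn sketch of a wrapper method, Recursive Feature Elimination wrapped around a logistic regression; iris is again used only as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=2)  # remove one feature per round
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier
```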
3.3 Embedded Method
Embedded methods have recently been proposed to combine the advantages of both previous methods. The learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, as in the FRMT algorithm. Some common algorithms are:
- L1 (LASSO) regularization
- Decision Tree or Random Forest
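A minimal Scikit-learn sketch of an embedded method follows: L1 (Lasso) regularization drives the coefficients of unhelpful features to exactly zero, performing selection as part of training. The synthetic data is made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)   # indices of features with non-zero weights
print(selected)
```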
Knowing where feature engineering fits into the process of applied machine learning highlights that it does not stand alone. It is an iterative process that interplays with data selection and model evaluation, again and again, until we run out of time on the problem. The process might look as follows:
Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
Evaluate models: Estimate model accuracy on unseen data using the chosen features.
You need a well defined problem so that you know when to stop this process and move on to trying other models, other model configurations, ensembles of models, and so on. There are gains to be had later in the pipeline once you plateau on ideas or the accuracy delta.
You need a well considered and designed test harness for objectively estimating model skill on unseen data. It will be the only measure you have of your feature engineering process, and you must trust it not to waste your time.
I will explain each of these methods with examples in upcoming posts. This post gives you an overview of the methods and their usage.