Feature Selection for Machine Learning

The data features used to train a machine learning model have a great impact on its ultimate performance. Irrelevant or partially relevant features can negatively influence the model.
Four common automatic feature selection techniques are:

  1. Univariate Selection
  2. Recursive feature elimination
  3. Principal Component Analysis
  4. Feature Importance

Benefits of feature selection techniques:
  • Reduces overfitting 
  • Improves accuracy 
  • Reduces training time

Univariate Selection:

  • Statistical tests can be used to select the features that have the strongest relationship with the output variable.
  • The example below uses scikit-learn's SelectKBest class in combination with the chi-squared (chi2) statistical test for non-negative features to select 4 of the best features from the Pima Indians Diabetes dataset.
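
A minimal sketch of this step, assuming the dataset is available locally as pima-indians-diabetes.csv with the usual 8 input columns plus a class column (the file name and column names are assumptions; adjust them to your copy of the data):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Assumed column layout of the Pima Indians Diabetes dataset.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]   # 8 input features (non-negative, as chi2 requires)
y = data.values[:, 8]     # output variable: onset of diabetes

# Score every feature against the output with the chi-squared test
# and keep the 4 highest-scoring features.
selector = SelectKBest(score_func=chi2, k=4)
fit = selector.fit(X, y)

print(fit.scores_)         # chi-squared score per feature
selected = fit.transform(X)
print(selected[0:5, :])    # first 5 rows of the 4 selected features
```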

Recursive Feature Elimination:

  • RFE works by recursively removing attributes and building a model on those attributes that remain.
  • The example below uses RFE with the Logistic Regression algorithm to select the top 4 features.
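
A minimal sketch under the same assumptions about the local CSV file. RFE fits the estimator, drops the weakest attribute, and refits until only the requested number of features remains:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X, y = data.values[:, 0:8], data.values[:, 8]

# Recursively eliminate features until 4 remain, refitting the
# logistic regression model at every step.
model = LogisticRegression(solver='liblinear')
rfe = RFE(estimator=model, n_features_to_select=4)
fit = rfe.fit(X, y)

print('Num features:', fit.n_features_)
print('Selected:', fit.support_)   # True for each of the 4 kept features
print('Ranking:', fit.ranking_)    # 1 = selected; higher = eliminated earlier
```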

Principal Component Analysis:

  • PCA uses linear algebra to compress the dataset, and is often described as a data reduction (or dimensionality reduction) technique.
  • We can choose the number of dimensions or principal components to keep in the transformed result, as in the sketch below.
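
A minimal sketch, again assuming the local Pima Indians CSV; keeping 3 principal components is an arbitrary choice for illustration:

```python
import pandas as pd
from sklearn.decomposition import PCA

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]

# Project the 8 original features onto 3 principal components.
pca = PCA(n_components=3)
fit = pca.fit(X)

print('Explained variance:', fit.explained_variance_ratio_)
print(fit.components_)   # each row is one principal component
```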

Feature Importance:

  • Ensembles of decision trees, such as Random Forest and Extra Trees, can be used to estimate the importance of features.
  • The example below uses the ExtraTreesClassifier class from the scikit-learn library.
  • A larger score indicates a more important feature.
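
A minimal sketch under the same data assumptions; n_estimators and random_state are arbitrary values chosen for reproducibility:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X, y = data.values[:, 0:8], data.values[:, 8]

# Fit an ensemble of randomized decision trees and read off the
# impurity-based importance score of each input feature.
model = ExtraTreesClassifier(n_estimators=100, random_state=7)
model.fit(X, y)

for name, score in zip(names[:8], model.feature_importances_):
    print(f'{name}: {score:.3f}')   # larger score = more important feature
```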
