Feature selection for Machine Learning

The features used to train a machine learning model have a large impact on its final performance. Irrelevant or partially relevant features can hurt the model.

Automatic feature selection techniques include:
- Univariate Selection
- Recursive Feature Elimination
- Principal Component Analysis
- Feature Importance

Benefits of feature selection techniques:
- Reduces overfitting
- Improves accuracy
- Reduces training time

Univariate Selection: Statistical tests can be used to select the features that have the strongest relationship with the output variable. The example uses scikit-learn, which provides the SelectKBest class. Combined with the chi-squared (chi2) statistical test for non-negative features, it selects the 4 best features from the Pima Indians dataset.

Recursive Feature Elimination: RFE works by recursively removing the attributes a...
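
As a rough sketch of univariate selection, the snippet below runs SelectKBest with chi2 and keeps 4 of 8 features. The data is a synthetic stand-in (an assumption, not the Pima Indians dataset), built so that features 1 and 5 drive the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data (assumption): 8 non-negative features, binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 8)).astype(float)
# The target depends only on features 1 and 5, so they should score highest.
y = (X[:, 1] + X[:, 5] > 50).astype(int)

# Score every feature with the chi-squared test and keep the 4 best.
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)

selected_indices = selector.get_support(indices=True)
print(np.round(selector.scores_, 2))  # one chi2 score per feature
print(selected_indices)               # column indices of the kept features
print(X_selected.shape)               # (200, 4)
```

chi2 requires non-negative feature values, which is why the synthetic features are drawn from a non-negative range.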

Pre-processing the data before applying a Machine Learning Algorithm

- Rescale data
- Standardize data
- Normalize data
- Binarize data

Rescale data: Rescaling puts all attributes on the same scale, most often the range between 0 and 1, which helps optimization. We can rescale data in scikit-learn with the MinMaxScaler class.

Standardize data: Standardization shifts and scales each attribute so that it has mean 0 and standard deviation 1, which also helps optimization. It is best suited to attributes that are roughly Gaussian. We can standardize data in scikit-learn with the StandardScaler class.

Normalize data: Normalizing rescales each observation (row) to have a length of 1 (unit norm). We can normalize data in scikit-learn with the Normalizer class. This pre-processing is used for sparse datasets (attributes with many zero values) ...
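
A minimal sketch of the four transforms on a tiny made-up array (the values and the Binarizer threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

# Toy data (assumption): 4 observations, 3 attributes on very different scales.
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 400.0, 0.0],
              [3.0, 600.0, 1.5],
              [4.0, 800.0, 2.0]])

# Rescale each column into the range [0, 1].
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardize each column to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Normalize each row to unit (L2) length.
normalized = Normalizer(norm="l2").fit_transform(X)

# Binarize: values above the threshold become 1, the rest become 0.
binarized = Binarizer(threshold=1.0).fit_transform(X)

print(rescaled)
print(standardized)
print(normalized)
print(binarized)
```

Note that MinMaxScaler and StandardScaler work column by column, while Normalizer works row by row.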

Data visualization

Univariate plots (visualizing each attribute on its own):
- Histograms
- Density plots
- Box and whisker plots

Histograms group data into bins and give the count of observations in each bin. From the shape we can judge whether an attribute looks Gaussian, skewed, or exponential.

Density plots are another way to get a quick idea of the distribution of each attribute.

We can also review the distribution using box and whisker plots. Box plots summarize the distribution of each attribute, drawing a line at the median and a box around the 25th and 75th percentiles. The green line indicates the median, or middle value. The whiskers give an idea of the spread of the data, and dots outside the whiskers show outlier values.

Multivariate plots (interactions between multiple variables):
- Correlation matrix plot
- Scatter plot matrix

Correlation gives an indication of how related the change...
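
The numbers behind these plots can be computed directly, which is a handy check before drawing anything. The sketch below uses synthetic columns (an assumption) and the common 1.5 x IQR rule for flagging outliers, which is also a convention assumed here rather than something the post specifies.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): one Gaussian column, one skewed column,
# and one column strongly related to the first.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "a": rng.normal(50, 10, 500),
    "b": rng.exponential(5, 500),
})
df["c"] = 2 * df["a"] + rng.normal(0, 5, 500)

# Histogram: counts of observations per bin.
counts, bin_edges = np.histogram(df["a"], bins=10)

# Box plot ingredients: 25th percentile, median, 75th percentile.
q1, median, q3 = df["b"].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the box are treated as outliers here.
outliers = df["b"][(df["b"] < q1 - 1.5 * iqr) | (df["b"] > q3 + 1.5 * iqr)]

# Correlation matrix: pairwise Pearson correlation between columns.
corr = df.corr(method="pearson")

print(counts)
print(q1, median, q3, len(outliers))
print(corr.round(2))
```

With pandas, the plots themselves are typically drawn with calls such as df.hist() or df.plot(kind="box"), given matplotlib is installed.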

Understand the 'DATA SET' before doing any machine learning project

1. Take a look at the raw data.
2. Check the dimensions of the data set.
3. Review the data types of the attributes in the data.
4. Check the class distribution.
5. Review the descriptive statistics of the data.
6. Understand the relationships in the data using correlations.
7. Review the skew of the distribution of each attribute.

1. Take a look at the raw data. Looking at the raw data can reveal insights into the data. Here we print the first 20 rows of our data using the head() function on the Pandas DataFrame.

2. Check the dimensions of the data set. We can review the shape and size of our dataset using the shape property of the Pandas DataFrame. The dimensions tell us whether there is too little or too much training data, and whether there are too many or too few features.

3. Review the data types of the attributes in the data....
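
The seven checks above can be sketched with pandas as follows. The DataFrame is a small synthetic stand-in (an assumption); its column names only mimic the style of the Pima Indians data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption), 100 rows with a binary "class" column.
rng = np.random.default_rng(7)
n = 100
df = pd.DataFrame({
    "preg": rng.integers(0, 10, n),
    "plas": rng.integers(70, 200, n),
    "mass": rng.normal(32, 6, n).round(1),
    "age": rng.integers(21, 70, n),
    "class": rng.integers(0, 2, n),
})

first_rows = df.head(20)                   # 1. raw data
dimensions = df.shape                      # 2. (rows, columns)
types = df.dtypes                          # 3. data type per attribute
class_counts = df.groupby("class").size()  # 4. class distribution
stats = df.describe()                      # 5. descriptive statistics
correlations = df.corr(method="pearson")   # 6. pairwise correlations
skew = df.skew()                           # 7. skew per attribute

for result in (first_rows, dimensions, types, class_counts,
               stats, correlations, skew):
    print(result, end="\n\n")
```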

Knowing sample dataset for use in Machine Learning

The Pima Indians Diabetes Dataset is used to illustrate the Machine Learning concepts. The dataset describes the medical records of Pima Indians and whether or not each patient will have an onset of diabetes within five years.

| Column | Description |
|---|---|
| Pregnancies (preg) | Number of times pregnant |
| Glucose (plas) | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure (pres) | Diastolic blood pressure (mm Hg) |
| SkinThickness (skin) | Triceps skin fold thickness (mm) |
| Insulin (test) | 2-hour serum insulin (mu U/ml) |
| BMI (mass) | Body mass index (weight in kg / (height in m)^2) |
| DiabetesPedigreeFunction (pedi) | Diabetes pedigree function |
| Age (age) | Age (years) |
| Outcome (class) | Class variable (0 or 1); 268 of 768 are 1, the other 500 are 0 |

...

Pandas Crash Course

Pandas provides data structures and functionality to quickly manipulate and analyze data. There are two pandas data structures to understand:
1. Series
2. DataFrames

Series and DataFrames: A Series is a one-dimensional array whose rows can be labeled. A DataFrame is a two-dimensional array whose rows and columns can be labeled.
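
A small sketch of both structures, with made-up labels and values chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional data with a label for each row.
values = np.array([1, 2, 3])
series = pd.Series(values, index=["a", "b", "c"])
print(series["a"])  # access by row label -> 1

# A DataFrame: two-dimensional data with row and column labels.
data = np.array([[1, 2, 3],
                 [4, 5, 6]])
frame = pd.DataFrame(data,
                     index=["row1", "row2"],
                     columns=["one", "two", "three"])
print(frame.loc["row2", "three"])  # access by row and column label -> 6

# Selecting a single column of a DataFrame gives back a Series.
column = frame["one"]
print(type(column).__name__)  # Series
```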