Posts

Showing posts from December, 2018
Feature selection for Machine Learning
The features used to train a machine learning model have a large impact on its final performance. Irrelevant or partially relevant features can degrade the model. The automatic feature selection techniques covered are: univariate selection, recursive feature elimination, principal component analysis, and feature importance.
Benefits of feature selection: it reduces overfitting, improves accuracy, and reduces training time.
Univariate Selection: Statistical tests can be used to select the features that have the strongest relationship with the output variable. The example in the post uses scikit-learn, which provides the SelectKBest class. Combined with the chi-squared (chi2) statistical test for non-negative features, it selects the 4 best features from the Pima Indians dataset.
Recursive Feature Elimination: RFE works by recursively removing the attributes a...
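The univariate selection step described above can be sketched roughly as follows. This is a minimal illustration and not the post's own code: it assumes scikit-learn and NumPy are installed, and it uses a small synthetic non-negative dataset (the names X and y, the sizes, and the random seed are made up for the example) in place of the Pima Indians file, which is not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: synthetic stand-in for the real data (8 non-negative features).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(100, 8)).astype(float)
y = rng.integers(0, 2, size=100)

# Score every feature against the class labels with the chi-squared test
# and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # one chi-squared score per input feature
print(selector.get_support())  # boolean mask marking the selected features
print(X_selected.shape)        # (100, 4)
```

Because the data here is random, the scores themselves carry no meaning; the point is only the shape of the workflow (fit, inspect scores, transform).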
Pre-processing the data before applying a Machine Learning Algorithm
The pre-processing steps covered are: rescale data, standardize data, normalize data, and binarize data.
Rescale Data: Rescale all the attributes to the same scale. Attributes are often rescaled into the range 0 to 1, which helps optimization. We can rescale the data with scikit-learn using the MinMaxScaler class.
Standardize Data: Standardization transforms each attribute to have a mean of 0 and a standard deviation of 1. It is most suitable for attributes that already have a roughly Gaussian distribution. We can standardize the data with scikit-learn using the StandardScaler class.
Normalize Data: Normalizing refers to rescaling each observation (row) to have a length of 1 (unit norm). We can normalize data with scikit-learn using the Normalizer class. This pre-processing is used for sparse datasets (attributes having a lot of zero values) ...
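The arithmetic behind the first three transforms can be sketched in plain Python. This illustrates the formulas only and is not scikit-learn's implementation; the function names and the sample values are made up for the example, and the sketch assumes no column is constant and no row is all zeros (otherwise it would divide by zero). Rescaling and standardizing work on columns (attributes), while normalizing works on rows (observations).

```python
import math

def rescale(column):
    """Min-max rescale a list of values into the range 0 to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    # Population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def normalize(row):
    """Scale one observation (row) to unit Euclidean length."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row]

ages = [20.0, 30.0, 40.0, 50.0]   # hypothetical attribute values
print(rescale(ages))               # smallest value maps to 0.0, largest to 1.0
print(standardize(ages))           # values now have mean 0 and std 1
print(normalize([3.0, 4.0]))       # [0.6, 0.8]
```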
Data visualization
Univariate Plots (visualization of each attribute on its own): histograms, density plots, and box and whisker plots.
Histograms group data into bins and give the count of observations in each bin. From the shape of the bins we can see whether an attribute is Gaussian, skewed, or exponentially distributed. A density plot is another way of getting a quick idea of the distribution of each attribute. We can also review the distribution using box and whisker plots. Box plots summarize the distribution of each attribute, drawing a line for the median and a box around the 25th and 75th percentiles. In the post's plots, the green line indicates the median, or middle value. The whiskers give an idea of the spread of the data, and dots outside the whiskers show the outlier values.
Multivariate Plots (interactions between multiple variables): correlation matrix plot and scatter plot matrix.
Correlation gives an indication of how related the change...
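As a rough sketch of the numbers that a histogram and a box plot are built from, the following uses only the Python standard library. The sample values and the bin width are made up for the example. Two assumptions are worth flagging: percentile conventions differ between libraries, so the quartiles from statistics.quantiles may not match a plotting library exactly, and the 1.5 x IQR rule for outliers is a common convention rather than something stated in the post.

```python
import statistics
from collections import Counter

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # hypothetical attribute values
bin_width = 2                             # assumption: fixed-width bins

# Histogram: assign each value to a bin and count the observations per bin.
counts = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(counts):
    print(f"[{start}, {start + bin_width}): {counts[start]}")

# Box plot ingredients: the 25th percentile, the median, the 75th percentile.
q1, median, q3 = statistics.quantiles(values, n=4)
print(q1, median, q3)

# Assumed convention: points further than 1.5 * IQR from the box are outliers.
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)
```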
Understand the 'DATA SET' before doing any machine learning project
The steps are:
1. Take a look at the raw data.
2. Check the dimensions of the data set.
3. Review the data types of the attributes in the data.
4. Check the class distribution.
5. Review the descriptive statistics of the data.
6. Understand the relationships in the data using correlations.
7. Review the skew of the distribution of each attribute.
1. Take a look at the raw data. Looking at the raw data can reveal insights into it. Here we print the first 20 rows of our data using the head() function on the Pandas DataFrame.
2. Check the dimensions of the data set. We can review the shape and size of our dataset using the shape property of the Pandas DataFrame. The dimensions tell us whether there is too little or too much training data, and whether there are too many or too few features.
3. Review the data types of the attributes in the data. ...
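The seven checks listed above map onto a handful of pandas calls. The sketch below assumes pandas is installed and uses a tiny hand-made DataFrame with illustrative column names and values instead of the real dataset, so the printed numbers mean nothing in themselves.

```python
import pandas as pd

# Assumption: a tiny hand-made frame standing in for the real dataset.
data = pd.DataFrame({
    "preg": [6, 1, 8, 1, 0],
    "plas": [148, 85, 183, 89, 137],
    "class": [1, 0, 1, 0, 1],
})

print(data.head(20))                 # 1. raw data: first 20 rows (here only 5 exist)
print(data.shape)                    # 2. dimensions as (rows, columns)
print(data.dtypes)                   # 3. data type of each attribute
print(data.groupby("class").size())  # 4. class distribution
print(data.describe())               # 5. descriptive statistics
print(data.corr())                   # 6. pairwise (Pearson) correlations
print(data.skew())                   # 7. skew of each attribute
```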
Knowing the sample dataset for use in Machine Learning
The Pima Indians Diabetes dataset is used to illustrate the machine learning concepts. The dataset describes the medical records of Pima Indians and whether or not each patient will have an onset of diabetes within five years.
Columns description:
Pregnancies (preg): Number of times pregnant
Glucose (plas): Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure (pres): Diastolic blood pressure (mm Hg)
SkinThickness (skin): Triceps skin fold thickness (mm)
Insulin (test): 2-hour serum insulin (mu U/ml)
BMI (mass): Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction (pedi): Diabetes pedigree function
Age (age): Age (years)
Outcome (class): Class variable (0 or 1); 268 of 768 are 1, the others are 0 ...
How to load Machine Learning Data
CSV files are the most common format for machine learning data. The most common ways to load CSV data in Python are: with the Python standard library, with NumPy, and with Pandas.
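The first of these options, the standard library, can be sketched as below. To keep the example self-contained it reads from an in-memory string rather than a file; the column names and values are made up, and the sketch assumes the file has a header row and only numeric fields.

```python
import csv
import io

# Assumption: a small in-memory CSV standing in for a file on disk. With a
# real file, use open("your_file.csv", newline="") instead of io.StringIO.
raw = io.StringIO("preg,plas,class\n6,148,1\n1,85,0\n")

reader = csv.reader(raw)
header = next(reader)                               # first row: column names
rows = [[float(v) for v in row] for row in reader]  # remaining rows as numbers

print(header)   # ['preg', 'plas', 'class']
print(rows)     # [[6.0, 148.0, 1.0], [1.0, 85.0, 0.0]]
```

For the other two options, numpy.loadtxt and pandas.read_csv play the same role and return an array or a DataFrame respectively.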
Pandas Crash Course
Pandas provides data structures and functionality to quickly manipulate and analyze data. There are two data structures to understand in pandas: Series and DataFrames.
Series and DataFrames: A Series is a one-dimensional array whose entries can be labeled. A DataFrame is a two-dimensional array where the rows and columns can be labeled.
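A short sketch of the two structures, assuming pandas is installed. The labels and values are invented for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with a label for each entry.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(s["b"])                   # access an entry by its label: 2

# A DataFrame: two-dimensional, with labeled rows and labeled columns.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["row1", "row2"],
    columns=["one", "two", "three"],
)
print(df["two"])                # select a whole column by its label
print(df.loc["row2", "three"])  # select one cell by row and column label: 6
```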
Matplotlib Crash Course
Matplotlib can be used for creating plots and charts. The basic workflow is: call a plotting function such as plot() with some data, set properties of the plot such as labels and colors, and make the plot visible using the show() function. The post covers the line plot and the scatter plot.
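That workflow might look like the sketch below, assuming matplotlib is installed. The data, labels and colors are arbitrary. Two choices here are mine rather than the post's: the example draws both plot types side by side on two axes, and it selects the non-interactive "Agg" backend so that it also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display; remove this line to open a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, (ax_line, ax_scatter) = plt.subplots(1, 2)

# Line plot: points joined by a line, with axis labels and a color set.
ax_line.plot(x, y, color="green")
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")

# Scatter plot: the same points drawn as separate markers.
ax_scatter.scatter(x, y, color="red")
ax_scatter.set_xlabel("x")

plt.show()  # displays the figure when an interactive backend is in use
```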
NumPy Crash Course
NumPy provides an array data structure that is efficient to define and manipulate. We can convert a Python list into a NumPy array, do arithmetic on NumPy arrays, and access data in NumPy arrays by index.
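The three operations can be sketched as follows, assuming NumPy is installed. The list contents are arbitrary.

```python
import numpy as np

# Convert a (nested) Python list into a NumPy array.
my_list = [[1, 2, 3], [4, 5, 6]]
arr = np.array(my_list)
print(arr.shape)    # (2, 3)

# Arithmetic is applied element by element.
print(arr + arr)
print(arr * 2)

# Access data by index: a single element, then a whole column.
print(arr[0, 2])    # 3
print(arr[:, 1])    # [2 5]
```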
Python crash course for machine learning - IV (for a quick start)
Topics: assignment, flow control, data structures, functions.
The post's example defines a sample function that calculates the sum of two numbers, as a quick illustration of the syntax for functions in Python.
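A function of that kind could look like this. The name add and the argument values are chosen for illustration and are not taken from the post.

```python
def add(x, y):
    """Return the sum of two numbers."""
    return x + y

# Call the function and keep the returned value.
result = add(1, 3)
print(result)   # 4
```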
Python crash course for Machine Learning - III
Topics: assignment, flow control, data structures, functions.
The most used and useful data structures are tuples, lists and dictionaries.
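A brief sketch of the three data structures follows. The variable names and contents are invented for illustration.

```python
# Tuple: a fixed, read-only collection of items.
point = (1, 2, 3)
print(point[0])            # 1

# List: an ordered collection that can grow and change.
values = [1, 2, 3]
values.append(4)
print(values)              # [1, 2, 3, 4]

# Dictionary: a mapping from keys to values.
ages = {"a": 1, "b": 2}
ages["c"] = 3
print(ages["b"])           # 2
print(list(ages.keys()))   # ['a', 'b', 'c']
```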
Python crash course for Machine Learning - II
Topics: assignment, flow control, data structures, functions.
Python crash course - I
Topics: assignment, flow control, data structures, functions.
Assignment:
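Since the excerpt stops at the heading, here is a hedged sketch of what assignment in Python looks like. The variable names and values are made up and are not the post's own example.

```python
# Strings, numbers and booleans are assigned with a plain "=".
text = "hello world"
count = 123
ratio = 2.5
flag = True

# Several names can be assigned in one statement.
a, b, c = 1, 2, 3

# None marks the absence of a value.
nothing = None

print(text[0], len(text))   # h 11
print(a, b, c)              # 1 2 3
```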
Python Ecosystem for Machine Learning?
The ecosystem consists of SciPy, with the functionality it provides together with NumPy, Matplotlib and Pandas, and scikit-learn, which provides the machine learning algorithms.