Machine Learning in Python

December 17, 2018

Understand the 'DATA SET' before doing any machine learning project

1. Take a look at the raw data.
2. Check the dimensions of the data set.
3. Review the data types of the attributes in the data.
4. Check the class distribution.
5. Review the descriptive statistics of the data.
6. Understand the relationships in the data using correlations.
7. Review the skew of the distribution of each attribute.

1. Take a look at the raw data.

Looking at the raw data can reveal insights that are hard to get any other way. Here we print the first 20 rows of our data by calling head(20) on the Pandas DataFrame; with no argument, head() returns only the first 5 rows.
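As a minimal sketch of this step, the snippet below builds a small made-up DataFrame (the column names and values are assumptions for illustration, not the data used in this post) and peeks at its first 20 rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the real data set.
data = pd.DataFrame({
    "preg": list(range(30)),
    "age": list(range(20, 50)),
    "class": [0, 1] * 15,
})

# head() returns the first 5 rows by default; pass 20 to get the first 20.
peek = data.head(20)
print(peek)
```

In practice the DataFrame would usually come from a file, for example via pd.read_csv, rather than being built by hand.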

2. Check the dimensions of the data set.

We can review the shape and size of our dataset using the shape property of the Pandas DataFrame, which gives the number of rows and columns. Checking the dimensions tells us whether:

There is enough training data.

There is too much training data.

There are too many features.

There are too few features.
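The dimension check can be sketched as follows, again on a made-up DataFrame (names and values are assumptions). The shape property is a (rows, columns) tuple, so it can be unpacked directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the real data set.
data = pd.DataFrame({
    "preg": list(range(30)),
    "age": list(range(20, 50)),
    "class": [0, 1] * 15,
})

# shape is a property, not a method: no parentheses.
n_rows, n_cols = data.shape
print(f"{n_rows} rows, {n_cols} columns")
```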

3. Review the data types of the attributes in the data.

The data type of each attribute is very important, as we may need to convert some attributes before modeling, for example from string to float.
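A small sketch of inspecting and converting types, assuming a hypothetical column named mass whose numbers were read in as strings. The dtypes property lists the type of every column, and astype performs the conversion.

```python
import pandas as pd

# Hypothetical toy data: "mass" holds numbers stored as strings.
data = pd.DataFrame({
    "mass": ["33.6", "26.6", "23.3"],
    "age": [50, 31, 32],
})

# Inspect the data type of every column.
print(data.dtypes)

# Convert the string column to float so it can be used numerically.
data["mass"] = data["mass"].astype(float)
print(data.dtypes)
```

Note that astype(float) raises an error if any value cannot be parsed as a number, which is itself a useful check on data quality.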

4. Check the class distribution

We can quickly get an idea of the distribution of the class attribute in Pandas. On classification problems this shows how balanced the classes are.
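One way to do this, sketched on made-up data with an assumed label column named class, is to group the rows by class and count the size of each group.

```python
import pandas as pd

# Hypothetical toy data with a binary class label.
data = pd.DataFrame({
    "age": [50, 31, 32, 21, 33, 30, 26, 29],
    "class": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Number of rows belonging to each class value.
class_counts = data.groupby("class").size()
print(class_counts)
```

Calling data["class"].value_counts() gives the same counts, sorted by frequency instead of by class value.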

5. Review the statistics of the data

We can gain great insight into each attribute from summary statistics such as the count, mean, and standard deviation.
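A minimal sketch using the describe method on made-up data (column names and values are assumptions). For numeric columns it reports the count, mean, standard deviation, minimum, quartiles, and maximum.

```python
import pandas as pd

# Hypothetical toy data standing in for the real data set.
data = pd.DataFrame({
    "preg": [6, 1, 8, 1, 0],
    "age": [50, 31, 32, 21, 33],
})

# One column of summary statistics per numeric attribute.
summary = data.describe()
print(summary)
```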

6. Understanding the relationships in the data using correlations

Correlation refers to the relationship between two variables and how they change together. The correlation coefficient ranges from -1 to 1:

A value of 0 indicates no correlation.

A value close to 1 indicates a strong positive correlation.

A value close to -1 indicates a strong negative correlation.

In the correlation matrix for our data, it is evident that the pregnancy attribute has a relationship with the age attribute.
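The correlation check can be sketched with the corr method, here using Pearson's coefficient on made-up values (the column names and numbers are assumptions chosen so that the relationships are easy to see, not results from the real data).

```python
import pandas as pd

# Hypothetical toy data: "age" rises with "preg", "inverse" falls with it.
data = pd.DataFrame({
    "preg": [0, 1, 2, 3, 4, 6, 8],
    "age": [21, 22, 25, 29, 33, 41, 50],
    "inverse": [10, 9, 8, 7, 6, 4, 2],
})

# Pairwise Pearson correlation between all numeric columns.
correlations = data.corr(method="pearson")
print(correlations)
```

The result is a square matrix with 1 on the diagonal, since every attribute is perfectly correlated with itself.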

7. Review the skew of the distribution for each attribute

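Skew describes a distribution that is assumed to be Gaussian (normal, or bell-shaped) but is shifted or stretched toward one side. A skew value close to 0 indicates a roughly symmetric distribution, a positive value indicates a longer right tail, and a negative value indicates a longer left tail. Many machine learning algorithms assume a Gaussian distribution, so knowing which attributes are skewed tells us where the data may need to be transformed.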
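A short sketch of the skew check using the skew method, on two made-up columns (names and values are assumptions): one symmetric, one with a long right tail.

```python
import pandas as pd

# Hypothetical toy data: one symmetric column, one with a long right tail.
data = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})

# Sample skewness of each numeric column.
skew = data.skew()
print(skew)
```

The symmetric column comes out at essentially zero, while the right-tailed column gets a positive value.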
