This page describes best practices for exploratory data analysis: what to do with a dataset in order to understand its content.
Exploratory data analysis (EDA) refers to the critical process of performing initial investigations of data to discover patterns, spot anomalies, test hypotheses, and check assumptions, using summary statistics and graphical representations.
It is good practice to understand the data first and to extract as much information from it as possible. EDA is about making sense of the data in hand before getting our hands dirty with modeling.
As an example, I will take the white variant of the Wine Quality dataset, which is available on the UCI Machine Learning Repository, and try to extract as much information from it as possible using EDA.
To start, I imported the necessary libraries (for this example pandas, numpy, matplotlib and seaborn) and loaded the dataset.
I found out the total number of rows and columns in the dataset using '.shape'.
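The loading step can be sketched as follows. The tiny stand-in DataFrame and its values are purely illustrative (an assumption for this sketch); the commented-out read_csv line reflects the real UCI file, which is semicolon-separated:

```python
import pandas as pd

# Real data: df = pd.read_csv("winequality-white.csv", sep=";")
# (the UCI file uses ';' as separator; the path is an assumption).
# A small stand-in frame keeps this example self-contained:
df = pd.DataFrame({
    "fixed acidity":  [7.0, 6.3, 8.1],
    "residual sugar": [20.7, 1.6, 6.9],
    "alcohol":        [8.8, 9.5, 10.1],
    "quality":        [6, 6, 6],
})

# (rows, columns) -- the full dataset gives (4898, 12)
print(df.shape)
```

With the real file in place, '.shape' on the loaded frame reports 4898 rows and 12 columns.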
The dataset includes 4898 observations and 12 features, one of which ('quality') is the dependent variable; the other 11 are independent variables describing physico-chemical characteristics.
It is also good practice to know the columns and their corresponding data types, as well as to determine whether or not they contain null values.
The data contains only float and integer values, and no column has null/missing values.
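A minimal sketch of that check, again on an illustrative stand-in frame (the real load line is shown in the comment):

```python
import pandas as pd

# Real data: df = pd.read_csv("winequality-white.csv", sep=";")
df = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1],
    "quality":       [6, 6, 6],
})

# Column names, dtypes, and non-null counts in one call
df.info()

# Per-column count of missing values -- all zeros for this dataset
print(df.isnull().sum())
```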
Description of quantitative values
The describe() function in pandas is very handy for getting summary statistics: it returns the count, mean, standard deviation, minimum and maximum values, and quantiles of the data.
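A short sketch of describe() on one illustrative column (the values are made up for the example):

```python
import pandas as pd

# Stand-in column; real data: pd.read_csv("winequality-white.csv", sep=";")
df = pd.DataFrame({"residual sugar": [20.7, 1.6, 6.9, 8.5, 1.2]})

# count, mean, std, min, 25/50/75 % quantiles, and max per column
summary = df.describe()
print(summary)

# The "50%" row is the median; comparing it with the mean hints at skew
print(summary.loc["mean"], summary.loc["50%"])
```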
Here, as you can see, the mean value is higher than the median (shown as "50%", the 50th percentile) for most columns. In particular, there is a large difference between the 75th percentile and the maximum for the "residual sugar", "free sulfur dioxide", and "total sulfur dioxide" predictors. Together, these two observations suggest that there are extreme outliers in our data set.
Seaborn is a Python visualization library built on top of matplotlib. It provides very attractive statistical graphics for performing univariate and multivariate analyses.
Selection of columns
To use the data for modeling, it is often necessary to remove highly correlated variables to improve the model. Correlations can be computed with pandas' '.corr()' function and the correlation matrix visualized as a heatmap in seaborn.
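A minimal sketch of that step; the three-column stand-in frame is fabricated so that it mimics the relationships described below (it is not the real data):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in numeric frame; real data: pd.read_csv("winequality-white.csv", sep=";")
df = pd.DataFrame({
    "density":        [0.998, 1.001, 0.994, 0.997],
    "residual sugar": [5.0,   12.0,  1.5,   4.0],
    "alcohol":        [10.5,  8.9,   12.0,  10.8],
})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```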
Here we can deduce that 'density' has a strong positive correlation with 'residual sugar' and a strong negative correlation with 'alcohol', while 'free sulfur dioxide' and 'citric acid' have almost no correlation with 'quality'.
Since the correlation is close to zero, there is little linear relationship between these predictors and the target. It is therefore reasonably safe to remove these features if you apply a linear regression model to the data set.
A boxplot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables. The box shows the quartiles of the data set, while the whiskers extend to show the rest of the distribution.
In the simplest box plot, the central rectangle extends from the first quartile to the third quartile (the interquartile range, or IQR). A segment inside the rectangle shows the median, and the "whiskers" above and below the box show the most extreme values that are not flagged as outliers (by the usual convention, points within 1.5 × IQR of the box).
Points more than 1.5 × IQR beyond the quartiles are flagged as outliers, and those more than 3 × IQR beyond are often called extreme outliers. In our dataset, except for "alcohol", all other feature columns show outliers.
Now, to check the distribution of the variables, it is recommended to plot distribution graphs and examine the skewness of the features. Kernel density estimation (KDE) is a very useful tool for plotting the shape of a distribution.
The "pH" column appears to be normally distributed. All remaining independent variables are right-skewed (positively skewed).
For the exploration of qualitative data, I invite you to return to the glossary of the descriptive analysis course and work through the corresponding exercises.