Exploratory data analysis (EDA) is an essential first step in most data science projects and often involves following the same steps to characterize a data set (e.g., finding data types, missing information, distribution of values, correlations). One of the latest tools for this is a new open-source Python library called Sweetviz.
After installing Sweetviz (using pip install sweetviz), simply load the pandas dataframes as you normally would, then call analyze(), compare() or compare_intra().
import sweetviz
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
We now have 2 dataframes (train and test), and we would like to analyze the “Survived” target value. I want to point out that in this case we know the name of the target column in advance, but specifying a target column is still optional. We can generate a report with this line of code:
my_report = sweetviz.compare([train, "Train"], [test, "Test"], "Survived")
Running this command will perform the analysis and create the report object. To get the result, just use the show_html() command:
my_report.show_html("Report.html") # Not providing a filename will default to SWEETVIZ_REPORT.html
The summary shows us the characteristics of the two dataframes side by side. We can immediately identify that the test set is roughly half the size of the training set, but contains the same features. The legend at the bottom shows us that the training set contains the target variable "Survived", but the test set does not.
Note that Sweetviz will make a best guess to determine the data type of each column, between numeric, categorical/boolean and text.
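To make the idea of a "best guess" concrete, here is a minimal sketch of a type-guessing heuristic. It is not Sweetviz's actual logic: the function name guess_column_type, the max_categories threshold and the rules themselves are assumptions made up for illustration.

```python
def guess_column_type(values, max_categories=10):
    """Guess 'categorical', 'numeric' or 'text' for a list of values (None = missing).

    Hypothetical heuristic for illustration only, not Sweetviz's real rules.
    """
    present = [v for v in values if v is not None]
    distinct = set(present)
    # Few distinct values: treat as categorical/boolean, even if they are numbers.
    if len(distinct) <= max_categories:
        return "categorical"
    # Many distinct values that are all numbers: numeric.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    # Everything else falls back to text.
    return "text"


print(guess_column_type([0, 1, 1, 0, None]))                      # Survived-like column
print(guess_column_type([i * 0.5 for i in range(100)]))           # Fare-like column
print(guess_column_type([f"Passenger {i}" for i in range(100)]))  # Name-like column
```

A heuristic like this can misclassify columns (a numeric code with few values looks categorical), which is why any automatic guess is worth checking in the report.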
This graph is a composite of visuals from Drazen Zaric's "Better Heatmaps and Correlation Matrix Plots in Python" and concepts from Shaked Zychlinski's "The Search for Categorical Correlation".
Basically, in addition to showing the traditional numerical correlations, it unifies in a single graph the numerical correlation, the uncertainty coefficient (for categorical-categorical) and the correlation ratio (for categorical-numerical). The squares represent the associations involving categorical features and the circles represent the numerical-numerical correlations. Note that the trivial diagonal is left blank, for clarity.
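For readers unfamiliar with the correlation ratio, here is a minimal sketch of the textbook definition (the square root of between-group variance over total variance). The function name and the toy data are made up for illustration, and this is not necessarily how Sweetviz computes it internally.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical sequence."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Variation explained by the category: group means vs. the overall mean.
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total variation of the numerical values.
    total = sum((v - overall_mean) ** 2 for v in values)
    return sqrt(between / total) if total > 0 else 0.0


# Made-up toy data: a class-like category and a fare-like number.
toy_class = [1, 1, 2, 2, 3, 3]
toy_fare = [80.0, 60.0, 30.0, 20.0, 8.0, 7.0]
print(correlation_ratio(toy_class, toy_fare))
```

The result ranges from 0 (the category tells us nothing about the number) to 1 (the category fully determines it).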
Categorical-categorical associations (provided by the uncertainty coefficient) are ASYMMETRIC, which means that each row represents how much the row title (left) gives information about each column. For example, "Sex", "Pclass" and "Fare" are the features that give the most information about "Survived". For the Titanic dataset, this information is rather symmetric but this is not always the case.
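The asymmetry is easier to see with a small example. The sketch below implements the standard definition of the uncertainty coefficient (Theil's U), U(X|Y) = (H(X) - H(X|Y)) / H(X), on made-up data; the function names are illustrative and this is not taken from Sweetviz's source.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) of a sequence of categories."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y)."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())


def uncertainty_coefficient(x, y):
    """U(X | Y): fraction of the uncertainty in x removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Made-up data: knowing the cabin tells us the deck, but not the other way around.
cabin = ["A1", "A2", "B1", "B2"]
deck = ["A", "A", "B", "B"]
print(uncertainty_coefficient(deck, cabin))  # how much cabin tells us about deck
print(uncertainty_coefficient(cabin, deck))  # how much deck tells us about cabin
```

Here the cabin fully determines the deck, while the deck only removes half of the uncertainty about the cabin, so the two directions give different values.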
Finally, it is worth noting these correlation/association methods should not be taken as gospel as they make assumptions about the underlying distribution of data and relationships. However, they can be a very useful starting point.
Selecting a variable
When a target variable is specified, it will appear first, in a special black box. Only numeric and Boolean features can currently be targets.
We can deduce from this summary that "Survived" has no missing data in the training set (891, 100%), that there are 2 distinct possible values (representing less than 1% of all values), and from the graph it can be estimated that about 60% did not survive.
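The numbers in this summary box are simple counts. As a rough sketch of how they relate to each other, the hypothetical helper below computes them for a plain Python list, using None for missing values. The column is a synthetic stand-in with 891 rows and roughly 62% zeros, not the real data.

```python
def summarize_column(values):
    """Count present, missing and distinct values in a column (None = missing)."""
    total = len(values)
    if total == 0:
        raise ValueError("column is empty")
    present = [v for v in values if v is not None]
    distinct = len(set(present))
    return {
        "values": len(present),
        "values_pct": 100.0 * len(present) / total,
        "missing": total - len(present),
        "distinct": distinct,
        "distinct_pct": 100.0 * distinct / total,
    }


# Synthetic stand-in for a "Survived"-like column: 891 rows, no missing values.
toy_survived = [0] * 549 + [1] * 342
stats = summarize_column(toy_survived)
print(stats)
print(round(100.0 * toy_survived.count(0) / len(toy_survived)), "% zeros")
```

With only 2 distinct values over 891 rows, the distinct percentage is well under 1%, which is the pattern described above.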
Qualitative and Boolean data
When you move the mouse over one of the variables, a box to the right will display the details. The content of the details depends on the type of variable analyzed. In the case of a categorical (or Boolean) variable, as is the case with the target, the analysis is as follows:
Here we can see the exact statistics for each class, where 62% did not survive and 38% survived. You also get the detail of the associations for each of the other features.
Numerical features show more information in their summary. Here we can see that about 20% of the data is missing (21% in the test data, which is very consistent).
Note that the target value ("Survived" in this case) is plotted as a line, just above the distribution graph. This allows instant analysis of the target distribution relative to other variables.
Interestingly, we can see from the graph on the right that the survival rate is quite constant at all ages, except for the youngest who have a higher survival rate. It would seem that “women and children first” was not just words.
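The target line is essentially the average of the target within each bin of the feature. Here is a minimal sketch of that idea on made-up ages; the function name, the bin edges and the data are assumptions for illustration, not Sweetviz internals.

```python
def target_rate_by_bin(values, target, edges):
    """Mean of a 0/1 target within each [edges[i], edges[i+1]) bin.

    Missing values (None) are skipped; empty bins yield None.
    Values outside the edges (including the last edge itself) are ignored.
    """
    n_bins = len(edges) - 1
    sums = [0] * n_bins
    counts = [0] * n_bins
    for v, t in zip(values, target):
        if v is None:
            continue
        for i in range(n_bins):
            if edges[i] <= v < edges[i + 1]:
                sums[i] += t
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]


# Made-up data: ages with one missing value, and a 0/1 survival flag.
toy_ages = [2, 4, 25, 30, 35, 60, None]
toy_target = [1, 1, 0, 1, 0, 0, 1]
print(target_rate_by_bin(toy_ages, toy_target, edges=[0, 10, 40, 80]))
```

Plotting these per-bin rates over the histogram is what makes a peak at the youngest ages stand out.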
Details of a quantitative variable
As with the categorical data type, the numeric data type displays additional information in its detail area. It is worth noting here the buttons at the top of the graph.
These buttons change the number of “bins” displayed in the graph. You can select from the following: Auto, 5, 15, 30.
To access these buttons, you must "lock in place" the current feature by clicking on it. The feature then has a RED OUTLINE to show that it is locked in place and you can access the detail area.
For now, anything that the system does not consider numeric or categorical will be considered "text". Text features currently only show the count (percentage) as statistics.
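To see why the number of bins matters, here is a small sketch of an equal-width histogram computed with 5, 15 and 30 bins on made-up values. The histogram function is a hypothetical helper, and the "Auto" setting is not covered since its rule is not described here.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bin
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up, deterministic "ages" between 0 and 79.
toy_ages = [(i * 37) % 80 for i in range(200)]
for n_bins in (5, 15, 30):
    print(n_bins, histogram(toy_ages, n_bins))
```

Fewer bins smooth out the distribution, while more bins reveal narrow peaks at the cost of noisier counts.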
Comparison of subpopulations
Even if you are looking at only one dataset, it can be very useful to study the characteristics of different subpopulations within that dataset.
For this, Sweetviz offers the compare_intra() function. To use it, you supply a boolean test that splits the population (here we try train["Sex"] == 'male', to get an idea of the different gender populations), and give each subpopulation a name. For example:
my_report = sweetviz.compare_intra(train, train["Sex"] == 'male', ["Male", "Female"], 'Survived')
my_report.show_html() # Not providing a filename will default to SWEETVIZ_REPORT.html
This gives the following analysis:
Note that the target value ("Survived" in this case) is now plotted as separate lines, one for each set of data being compared (e.g. male in blue, female in orange).
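Conceptually, this kind of comparison splits the rows by a true/false condition and summarizes each side separately. The sketch below shows the idea in plain Python on made-up rows; compare_subpopulations is a hypothetical helper and does not reflect how Sweetviz implements compare_intra().

```python
def compare_subpopulations(rows, condition, names, target):
    """Split rows by a boolean condition and return the mean target per group."""
    true_name, false_name = names
    groups = {true_name: [], false_name: []}
    for row in rows:
        key = true_name if condition(row) else false_name
        groups[key].append(row[target])
    return {
        name: (sum(vals) / len(vals) if vals else None)
        for name, vals in groups.items()
    }


# Made-up rows, not the real Titanic data.
toy_rows = [
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
]
rates = compare_subpopulations(
    toy_rows, lambda r: r["Sex"] == "male", ("Male", "Female"), "Survived"
)
print(rates)
```

The first name labels the rows where the condition is true and the second labels the rest, mirroring the ["Male", "Female"] argument in the call above.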
Example on the Titanic dataset
- The distribution of ID and survivability is uniform and orderly.
- No missing data
- Twice as many men as women
- Women survive 30% better than men
- Same distributions in training and test sets
- No missing data
- About 20% missing data, with missing data and distribution consistent between Train and Test
- Population centered on young adults, but ages 0 to 70 are well represented
- Survivability surprisingly evenly distributed, except for a peak at younger ages
- Age seems tied to Siblings, Pclass, and Fare, and, more surprisingly, to Embarked
- No missing data
- All names are distinct
- Survivability closely follows class (first class most likely to survive, third class least likely)
- Similar distribution between Train and Test
- No missing data
- There seems to be a survivability peak at 1 and to some extent at 2, but (looking at the details pane not shown here) there is a steep dip at 3 and up. Large families could not make it, or were perhaps poorer?
- Similar distribution between Train and Test
- No missing data
- ~80% distinct values, i.e. approximately 1 in 5 tickets is shared on average
- The highest ticket frequency was 7, which is generally consistent with the maximum number of siblings (8)
- No missing data, data looks pretty clean
- As expected, and similar to Pclass, passengers with higher fares survived better (although the sample size gets quite thin at higher levels)
- A correlation ratio of 0.26 for “Survived” is relatively high, so it would tend to support this theory
- About 30% distinct values seems a little high, as you would expect fewer distinct fixed prices, but there seems to be a lot of granularity, so that's fine
- Only 1 record missing in test set, fairly consistent data between Train and Test
- 78% missing data
- The maximum frequency is 4, which makes sense if a cabin holds 4 people at most
- 3 distinct values (S, C, Q)
- Slightly higher survivability at C; could this be a place with richer people?
- Anyway, "Embarked" shows an uncertainty coefficient of only 0.03 for "Survived", so it may not be very significant
- Overall, most of the data is there and seems consistent and logical; no major outliers or huge surprises
Test versus Training data
The test set contains about 50% fewer rows.
Train and Test are very close in the distribution of missing data.
Training and test data values are very consistent across the board.
- Gender, fare, and class give the most information about survivors
- As expected, Fare and Pclass are highly correlated
- Age seems to tell us a lot about Pclass, siblings and to some extent Fare, which would be somewhat expected. It also seems to tell us a lot about "Embarked", which is a bit more surprising.
- There are no significant missing data except for age (~20%) and cabin (~77%)