Automating EDA - Better Summaries from Pandas

The first thing that many of us do to quickly inspect a dataframe containing a fresh dataset is call the df.describe() method. This works well for quickly getting an idea of the size and distribution of the numerical columns, but it often needs to be supplemented with additional code to check for outliers or correlations, or to look at any non-numerical data stored in the dataframe.
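As a rough sketch of what that supplementary code tends to look like (using the Titanic CSV downloaded later in this post; note that df.corr's numeric_only flag needs pandas 1.5 or newer):

	import pandas as pd

	df = pd.read_csv('./titanic.csv')    # the dataset downloaded below
	df.describe()                        # numeric summaries only
	df.corr(numeric_only=True)           # pairwise correlations
	df.isna().sum()                      # missing values per column
	df.select_dtypes(include='object').nunique()  # distinct values in text columns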

A while ago I discovered the package pandas-profiling while browsing r/datascience and quickly realised I could perform a lot of basic exploratory analysis in less than three lines of code. In this post I will describe the key features of the package and how to use it.


Jupyter and Virtual Environments

I start off every new project by creating a new virtual environment and by creating a new kernel for my notebooks.

	$ virtualenv venv
	$ source venv/bin/activate
	$ pip install ipython ipykernel
	$ ipython kernel install --user --name=pandas-profiling

For the purposes of this tutorial, I will be using the Titanic dataset to highlight the features available within pandas-profiling. This can be downloaded with

	$ curl https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -o titanic.csv

I then installed the necessary packages before launching a notebook

	$ pip install pandas pandas-profiling jupyter

	$ jupyter notebook

Pandas Profiling

After reading your data into a dataframe, you can quickly build a report with

	import pandas as pd
	import pandas_profiling  # registers the .profile_report() method on DataFrames

	df = pd.read_csv('./titanic.csv')
	profile = df.profile_report()  # build the report
	profile  # render it inline in the notebook

This outputs the report into a Jupyter cell. Since notebooks run in the browser, you can also easily export the report as an HTML file. Check out the full report - I will go through and describe the different sections below.
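The export itself is a single call to to_file() on the report object (the filename here is just an example):

	profile.to_file('titanic_report.html')  # writes a standalone HTML report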


Overview

There are five main sections of the report. The first is an overview of the dataset as a whole, giving you things like the number of samples, the number of variables, the variable types, and the size of the data on disk. There is also a subsection that flags potential warnings, such as columns with many zeros, a lot of missing data, duplicate rows, or a large number of distinct values.
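If you want to sanity-check those overview numbers by hand, the raw ingredients are all one-liners in pandas (a rough sketch, not necessarily how the report computes them internally):

	df.shape                          # (number of rows, number of columns)
	df.dtypes.value_counts()          # how many columns of each type
	df.memory_usage(deep=True).sum()  # size in memory, in bytes
	df.duplicated().sum()             # number of duplicate rows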




Variables

The second section goes through each of the columns, or variables, in the dataframe and provides summaries such as the mean, median, missing values, and a plot of the distribution.

A toggle allows you to explore each variable in more depth, with details on the variance, skewness, and quartiles.
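Those per-variable numbers are also easy to pull out manually; here 'Age' is just one column from the Titanic CSV, used as an example (the histogram needs matplotlib installed):

	age = df['Age']
	age.describe()          # count, mean, std, min, quartiles, max
	age.skew()              # sample skewness
	age.isna().sum()        # missing values in this column
	age.plot.hist(bins=30)  # distribution plot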




Correlations

The third section gives information on the pairwise correlations between columns. There are several measures, including Pearson's r and Spearman's ρ, and for categorical columns there is Cramér's V.
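The first two are available directly from pandas; Cramér's V is not built in, so this sketch covers only the numeric measures (again, numeric_only needs pandas 1.5 or newer):

	df.corr(method='pearson', numeric_only=True)   # linear correlation
	df.corr(method='spearman', numeric_only=True)  # rank correlation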





Missing Values

The penultimate section is where you will find information on missing values, presented in the form of a bar graph.
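A quick home-made version of the same plot, if you want it outside the report (again assuming matplotlib is available):

	missing = df.isna().sum()                            # missing count per column
	missing.plot.bar(title='Missing values per column')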




Sample

The last section gives two samples of the data, taken from the top and bottom of the dataset. These are equivalent to the df.head() and df.tail() methods.




Summary

This tool has sped up a lot of my initial EDA work and presents the results in a nice and concise format to boot - would recommend!