Visualizing Data with Pairs Plots in Python

How to quickly create a powerful exploratory data analysis visualization

Once you’ve got yourself a nice cleaned dataset, the next step is Exploratory Data Analysis (EDA). EDA is the process of figuring out what the data can tell us, and we use it to find patterns, relationships, or anomalies to inform our subsequent analysis. While there are an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both the distribution of single variables and the relationships between two variables. Pairs plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python!

In this article we will walk through getting up and running with pairs plots in Python using the seaborn visualization library. We will see how to create a default pairs plot for a rapid examination of our data and how to customize the visualization for deeper insights.

The code for this project is available as a Jupyter Notebook on GitHub. We will explore a real-world dataset composed of country-level socioeconomic data collected by GapMinder.

To get started we need to know what data we have. We can load in the socioeconomic data as a pandas dataframe and look at the columns.

I’m still amazed that one simple line of code gives us this entire plot! The pairs plot builds on two basic figures, the histogram and the scatter plot. The histogram on the diagonal allows us to see the distribution of a single variable, while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables. For example, the left-most plot in the second row shows the scatter plot of life_exp versus year.

The default pairs plot by itself often gives us valuable insights. We see that life expectancy and GDP per capita are positively correlated, showing that people in higher-income countries tend to live longer (although this of course does not prove that one causes the other). It also appears that (thankfully) life expectancies worldwide are on the rise over time. From the histograms, we learn that the population and GDP variables are heavily right-skewed. To better show these variables in future plots, we can transform these columns by taking the logarithm of the values:

    # Take the log of population and gdp_per_capita
    df = np.log10(df)
    df = np.log10(df)
    # Drop the non-transformed columns
    df = df.drop(columns = )

While this plot alone can be useful in an analysis, we can make it more valuable by coloring the figures based on a categorical variable such as continent. This is also extremely simple in seaborn! All we need to do is use the hue keyword in the sns.pairplot function call:

    sns.pairplot(df, hue = 'continent')

This is starting to look pretty nice! If we were going to do modeling, we could use information from these plots to inform our choices. For example, we know that log_gdp_per_cap is positively correlated with life_exp, so we could create a linear model to quantify this relationship. For this post we’ll stick to plotting, and, if we want to explore our data even more, we can customize the pairplots using the PairGrid class.

In contrast to the sns.pairplot function, sns.PairGrid is a class, which means that it does not automatically fill in the plots for us. Instead, we create a class instance and then map specific functions to the different sections of the grid. To create a PairGrid instance with our data, we use the following code, which also limits the variables we will show:

    # Create an instance of the PairGrid class.