What is used to plot graphs in Python?
Chapter 4. Visualization with Matplotlib
We’ll now take an in-depth look at the Matplotlib tool for visualization in Python. Matplotlib is a multiplatform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython command line. IPython’s creator, Fernando Perez, was at the time scrambling to finish his PhD, and let John know he wouldn’t have time to review the patch for several months. John took this as a cue to set out on his own, and the Matplotlib package was born, with version 0.1 released in 2003. It received an early boost when it was adopted as the plotting package of choice of the Space Telescope Science Institute (the folks behind the Hubble Telescope), which financially supported Matplotlib’s development and greatly expanded its capabilities.
One of Matplotlib’s most important features is its ability to play well with many operating systems and graphics backends. Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you wish. This cross-platform, everything-to-everyone approach has been one of the great strengths of Matplotlib. It has led to a large userbase, which in turn has led to an active developer base and Matplotlib’s powerful tools and ubiquity within the scientific Python world.
In recent years, however, the interface and style of Matplotlib have begun to show their age. Newer tools like ggplot and ggvis in the R language, along with web visualization toolkits based on D3js and HTML5 canvas, often make Matplotlib feel clunky and old-fashioned. Still, I’m of the opinion that we cannot ignore Matplotlib’s strength as a well-tested, cross-platform graphics engine. Recent Matplotlib versions make it relatively easy to set new global plotting styles (see “Customizing Matplotlib: Configurations and Stylesheets”), and people have been developing new packages that build on its powerful internals to drive Matplotlib via cleaner, more modern APIs. For example, Seaborn (discussed in “Visualization with Seaborn”), ggplot, HoloViews, Altair, and even Pandas itself can be used as wrappers around Matplotlib’s API. Even with wrappers like these, it is still often useful to dive into Matplotlib’s syntax to adjust the final plot output. For this reason, I believe that Matplotlib itself will remain a vital piece of the data visualization stack, even if new tools mean the community gradually moves away from using the Matplotlib API directly.
General Matplotlib Tips
Before we dive into the details of creating visualizations with Matplotlib, there are a few useful things you should know about using the package.
Importing matplotlib
Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports: mpl for matplotlib and plt for matplotlib.pyplot. The plt interface is what we will use most often, as we will see throughout this chapter.
Setting Styles
We will use the plt.style directive to choose appropriate aesthetic styles for our figures.
Throughout this section, we will adjust this style as needed. Note that the stylesheets used here are supported as of Matplotlib version 1.5; if you are using an earlier version of Matplotlib, only the default style is available. For more information on stylesheets, see “Customizing Matplotlib: Configurations and Stylesheets”.
show() or No show()? How to Display Your Plots
A visualization you can’t see won’t be of much use, but just how you view your Matplotlib plots depends on the context. The best use of Matplotlib differs depending on how you are using it; roughly, the three applicable contexts are using Matplotlib in a script, in an IPython terminal, or in an IPython notebook.
Plotting from a script
If you are using Matplotlib from within a script, the function plt.show() is your friend: it opens one or more interactive windows that display your currently active figures. So, for example, you may have a file called myplot.py containing the following:
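The listing for myplot.py did not survive extraction. A minimal sketch of what such a script might contain, assuming only NumPy and Matplotlib are installed (the curves, labels, and variable names are illustrative choices, not necessarily the original ones):

```python
# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Draw two curves on the same (implicitly created) figure and axes.
sin_line, = plt.plot(x, np.sin(x), label="sin(x)")
cos_line, = plt.plot(x, np.cos(x), label="cos(x)")
plt.legend()

# Hand control to the graphical backend; in a script this call
# normally appears once, at the very end.
plt.show()
```

On a machine without a display, Matplotlib typically falls back to a non-interactive backend, in which case plt.show() only emits a warning instead of opening a window.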
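You can then run this script from the command-line prompt, which will result in a window opening with your figure displayed:
$ python myplot.py
The plt.show() command does a lot under the hood, as it must interact with your system’s interactive graphical backend. One thing to be aware of: the plt.show() command should be used only once per Python session, and is most often seen at the very end of the script. Multiple show() commands can lead to unpredictable, backend-dependent behavior and should mostly be avoided.
Plotting from an IPython shell
It can be very convenient to use Matplotlib interactively within an IPython shell (see Chapter 1). IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this mode, you can use the %matplotlib magic command after starting ipython. At this point, any plt plot command will cause a figure window to open, and further commands can be run to update the plot.
Plotting from an IPython notebook
The IPython notebook is a browser-based interactive data analysis tool that can combine narrative, code, graphics, HTML elements, and much more into a single executable document (see Chapter 1). Plotting interactively within an IPython notebook can be done with the %matplotlib command, which works in a similar way to the IPython shell.
For this book, we will generally opt for %matplotlib inline, which embeds static images of your plots in the notebook.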
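After you run this command (it needs to be done only once per kernel/session), any cell within the notebook that creates a plot will embed a PNG image of the resulting graphic (Figure 4-1).
Figure 4-1. Basic plotting example
Saving Figures to File
One nice feature of Matplotlib is the ability to save figures in a wide variety of formats. You can save a figure using the savefig() command, for example by calling fig.savefig('my_figure.png') on a figure object fig.
We now have a file called my_figure.png in the current working directory:
-rw-r--r-- 1 jakevdp staff 16K Aug 11 10:59 my_figure.png
To confirm that it contains what we think it contains, let’s use the IPython Image object to display the contents of this file (Figure 4-2).
Figure 4-2. PNG rendering of the basic plot
In savefig(), the file format is inferred from the extension of the given filename. Depending on which backends you have installed, many different file formats are available. You can list the supported file types for your system with the get_supported_filetypes() method of the figure canvas object: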
Out[8]: {'eps': 'Encapsulated Postscript', 'jpeg': 'Joint Photographic Experts Group', 'jpg': 'Joint Photographic Experts Group', 'pdf': 'Portable Document Format', 'pgf': 'PGF code for LaTeX', 'png': 'Portable Network Graphics', 'ps': 'Postscript', 'raw': 'Raw RGBA bitmap', 'rgba': 'Raw RGBA bitmap', 'svg': 'Scalable Vector Graphics', 'svgz': 'Scalable Vector Graphics', 'tif': 'Tagged Image File Format', 'tiff': 'Tagged Image File Format'}
Note that when saving your figure, it’s not necessary to use plt.show() or the related commands discussed earlier.
Two Interfaces for the Price of One
A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We’ll quickly highlight the differences between the two here.
MATLAB-style interface
Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact. The MATLAB-style tools are contained in the pyplot (plt) interface (Figure 4-3).
Figure 4-3. Subplots using the MATLAB-style interface
It’s important to note that this interface is stateful: it keeps track of the “current” figure and axes, which are where all plt commands are applied. While this stateful interface is fast and convenient for simple plots, it is easy to run into problems. For example, once the second panel is created, how can we go back and add something to the first? This is possible within the MATLAB-style interface, but a bit clunky. Fortunately, there is a better way.
Object-oriented interface
The object-oriented interface is available for these more complicated situations, and for when you want more control over your figure. Rather than depending on some notion of an “active” figure or axes, in the object-oriented interface the plotting functions are methods of explicit Figure and Axes objects (Figure 4-4).
Figure 4-4. Subplots using the object-oriented interface
For simpler plots, the choice of which style to use is largely a matter of preference, but the object-oriented approach can become a necessity as plots become more complicated. Throughout this chapter, we will switch between the MATLAB-style and object-oriented interfaces, depending on what is most convenient. In most cases, the difference is as small as switching plt.plot() to ax.plot().
Simple Line Plots
Perhaps the simplest of all plots is the visualization of a single function y=f(x). Here we will take a first look at creating a simple plot of this type. As with all the following sections, we’ll start by setting up the notebook for plotting and importing the functions we will use:
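The setup listing is missing at this point. As a rough sketch of the imports plus a first line plot drawn through the object-oriented interface, assuming only NumPy and Matplotlib (the colors, limits, and labels are illustrative choices):

```python
import matplotlib.pyplot as plt
import numpy as np

# Create a figure and a single axes explicitly.
fig, ax = plt.subplots()

x = np.linspace(0, 10, 1000)

# Calling plot() repeatedly on the same axes over-plots several lines.
ax.plot(x, np.sin(x), color="blue", linestyle="-", label="sin(x)")
ax.plot(x, np.cos(x), color="0.4", linestyle="--", label="cos(x)")

# Axis limits, labels, title, and a legend built from the label keywords.
ax.set_xlim(0, 10)
ax.set_ylim(-1.5, 1.5)
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.set_title("Sine and cosine")
ax.legend()
```

In a notebook this would normally be preceded by the %matplotlib inline magic; in a script, a final plt.show() would display the figure.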
For all Matplotlib plots, we start by creating a figure and an axes. In their simplest form, a figure and axes can be created with plt.figure() and plt.axes() (Figure 4-5).
Figure 4-5. An empty gridded axes
In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is a bounding box with ticks and labels, which will eventually contain the plot elements that make up the visualization. Once we have created an axes, we can use the ax.plot function to plot some data, such as a simple sinusoid (Figure 4-6).
Figure 4-6. A simple sinusoid
Alternatively, we can use the pylab interface and let the figure and axes be created for us in the background (Figure 4-7; see “Two Interfaces for the Price of One” for a discussion of these two interfaces).
Figure 4-7. A simple sinusoid via the object-oriented interface
If we want to create a single figure with multiple lines, we can simply call the plot function multiple times (Figure 4-8).
Figure 4-8. Over-plotting multiple lines
That’s all there is to plotting simple functions in Matplotlib! We’ll now dive into some more details about how to control the appearance of the axes and lines.
Adjusting the Plot: Line Colors and Styles
The first adjustment you might wish to make to a plot is to control the line colors and styles.
The plt.plot() function takes additional arguments that can be used to specify these. To adjust the color, you can use the color keyword, which accepts a string argument representing virtually any imaginable color (Figure 4-9).
Figure 4-9. Controlling the color of plot elements
If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines. Similarly, you can adjust the line style using the linestyle keyword (Figure 4-10).
Figure 4-10. Example of various line styles
If you would like to be extremely terse, these linestyle and color codes can be combined into a single nonkeyword argument to the plt.plot() function (Figure 4-11).
Figure 4-11. Controlling colors and styles with the shorthand syntax
These single-character color codes reflect the standard abbreviations in the RGB (Red/Green/Blue) and CMYK (Cyan/Magenta/Yellow/blacK) color systems, commonly used for digital color graphics. There are many other keyword arguments that can be used to fine-tune the appearance of the plot; for more details, I’d suggest viewing the docstring of the plt.plot() function.
Adjusting the Plot: Axes Limits
Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it’s nice to have finer control. The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods (Figure 4-12).
Figure 4-12. Example of setting axis limits
If for some reason you’d like either axis to be displayed in reverse, you can simply reverse the order of the arguments (Figure 4-13).
Figure 4-13. Example of reversing the y-axis
A useful related method is plt.axis() (note the potential confusion between axes with an e and axis with an i), which sets the x and y limits in a single call from a list [xmin, xmax, ymin, ymax] (Figure 4-14).
Figure 4-14. Setting the axis limits with plt.axis
The plt.axis() method goes beyond this, allowing you to automatically tighten the bounds around the current plot with plt.axis('tight') (Figure 4-15).
Figure 4-15. Example of a “tight” layout
It allows even higher-level specifications, such as ensuring an equal aspect ratio with plt.axis('equal') so that on your screen, one unit in x is equal to one unit in y (Figure 4-16).
Figure 4-16. Example of an “equal” layout, with units matched to the output resolution
For more information on axis limits and the other capabilities of the plt.axis() method, refer to its docstring.
Labeling Plots
As the last piece of this section, we’ll briefly look at the labeling of plots: titles, axis labels, and simple legends. Titles and axis labels are the simplest such labels; there are methods that can be used to quickly set them (Figure 4-17).
Figure 4-17. Examples of axis labels and title
You can adjust the position, size, and style of these labels using optional arguments to the function. For more information, see the Matplotlib documentation and the docstrings of each of these functions.
When multiple lines are being shown within a single axes, it can be useful to create a plot legend that labels each line type. Again, Matplotlib has a built-in way of quickly creating such a legend. It is done via the (you guessed it) plt.legend() method; the label of each line can be specified using the label keyword of the plot function (Figure 4-18).
Figure 4-18. Plot legend example
As you can see, the plt.legend() function keeps track of the line style and color, and matches these with the correct label.
Simple Scatter Plots
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape. We’ll start by setting up the notebook for plotting and importing the functions we will use:
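The scatter-plot listings are missing from this copy. A hedged sketch of the two approaches this section compares, using made-up random data (the seed, sizes, and colormap are assumptions for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# plot() with a marker code: every point shares one style.
ax1.plot(x, y, "o", color="black")

# scatter(): color and size can be set per point; alpha adds transparency.
points = ax2.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap="viridis")
fig.colorbar(points, ax=ax2)  # shows the color scale used for c
```

The left panel is the cheaper option when all markers look alike; the right panel is what per-point size and color require.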
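Scatter Plots with plt.plot
In the previous section, we looked at plt.plot/ax.plot to produce line plots. It turns out that this same function can produce scatter plots as well (Figure 4-20).
Figure 4-20. Scatter plot example
The third argument in the function call is a character that represents the type of symbol used for the plotting. Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of short string codes (Figure 4-21).
Figure 4-21. Demonstration of point numbers
For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them (Figure 4-22).
Figure 4-22. Combining line and point markers
Additional keyword arguments to plt.plot specify a wide range of properties of the lines and markers (Figure 4-23).
Figure 4-23. Customizing line and point numbers
This type of flexibility in the plt.plot function allows for a wide variety of possible visualization options.
Scatter Plots with plt.scatter
A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function (Figure 4-24).
Figure 4-24. A simple scatter plot
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data. Let’s show this by creating a random scatter plot with points of many colors and sizes. In order to better see the overlapping results, we’ll also use the alpha keyword to adjust the transparency level (Figure 4-25).
Figure 4-25. Changing size, color, and transparency in scatter points
Notice that the color argument is automatically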
mapped to a color scale (shown here by the colorbar() command), and the size argument sets the size of each marker. In this way, the color and size of points can be used to convey information in the visualization, in order to illustrate multidimensional data. For example, we might use the Iris data from Scikit-Learn, where each sample is one of three types of flowers that has had the size of its petals and sepals carefully measured (Figure 4-26).
Figure 4-26. Using point properties to encode features of the Iris data
We can see that this scatter plot has given us the ability to simultaneously explore four different dimensions of the data: the (x, y) location of each point corresponds to the sepal length and width, the size of the point is related to the petal width, and the color is related to the particular species of flower. Multicolor and multifeature scatter plots like this can be useful for both exploration and presentation of data.
plot Versus scatter: A Note on Efficiency
Aside from the different features available in plt.plot and plt.scatter, why might you choose to use one over the other? For small amounts of data it matters little, but as datasets grow beyond a few thousand points, plt.plot can be noticeably more efficient than plt.scatter, because plt.scatter must handle each point individually in order to allow per-point sizes and colors, whereas in plt.plot all points share the same appearance.
Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important as, if not more important than, accurate reporting of the number itself. For example, imagine that I am using some astrophysical observations to estimate the Hubble Constant, the local measurement of the expansion rate of the universe. I know that the current literature suggests a value of around 71 (km/s)/Mpc, and I measure a value of 74 (km/s)/Mpc with my method. Are the values consistent? The only correct answer, given this information, is this: there is no way to know.
Suppose I augment this information with reported uncertainties: the current literature suggests a value of around 71 ± 2.5 (km/s)/Mpc, and my method has measured a value of 74 ± 5 (km/s)/Mpc. Now are the values consistent? That is a question that can be quantitatively answered.
In visualization of data and results, showing these errors effectively can make a plot convey much more complete information.
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call (Figure 4-27):
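The errorbar listing is not present here. A minimal sketch of such a call, assuming synthetic noisy sine data with a constant error of 0.8 (both are illustrative assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
dy = 0.8                          # assumed constant measurement error
y = np.sin(x) + dy * rng.randn(50)

fig, ax = plt.subplots()
# fmt uses the same shorthand as plot(): '.' marker, 'k' for black.
container = ax.errorbar(x, y, yerr=dy, fmt=".k")

# The returned container unpacks into the data line, the cap lines,
# and the collections that draw the error bars themselves.
data_line, caplines, barlinecols = container
```

Passing ecolor, elinewidth, or capsize to the same call is one way to make the bars lighter than the points in crowded plots.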
Figure 4-27. An errorbar example
Here the fmt argument is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in plt.plot. In addition to these basic options, the errorbar function has many options to fine-tune the output; for example, in crowded plots it can help to make the errorbars lighter than the points themselves (Figure 4-28).
Figure 4-28. Customizing errorbars
In addition to these options, you can also specify horizontal errorbars (xerr), one-sided errorbars, and many other variants.
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not have a built-in convenience routine for this type of application, it’s relatively easy to combine primitives like plt.plot and plt.fill_between for a useful result.
Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API (see “Introducing Scikit-Learn” for details). This is a method of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty. We won’t delve into the details of Gaussian process regression at this point, but will focus instead on how you might visualize such a continuous error measurement:
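The regression listing is missing, so the sketch below does not fit a Gaussian process. It only illustrates the plotting side, shading a continuous uncertainty band with fill_between, using a stand-in curve and a hypothetical error model (uncertainty proportional to the distance from the nearest data point):

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in "measurements" and "fit": NOT a Gaussian process regression.
xdata = np.array([1, 3, 5, 6, 8])
ydata = xdata * np.sin(xdata)

xfit = np.linspace(0, 10, 1000)
yfit = xfit * np.sin(xfit)

# Hypothetical error model: grows with distance to the nearest data point.
dist = np.abs(xfit[:, np.newaxis] - xdata[np.newaxis, :]).min(axis=1)
dyfit = 2 * dist

fig, ax = plt.subplots()
ax.plot(xdata, ydata, "or")
ax.plot(xfit, yfit, "-", color="gray")
# Arguments: x values, lower y-bound, upper y-bound.
band = ax.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                       color="gray", alpha=0.2)
ax.set_xlim(0, 10)
```

With a real model, yfit and dyfit would come from its predicted mean and uncertainty; the fill_between call stays the same.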
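We now have the fitted curve and its uncertainty sampled continuously over the range of the data. Rather than drawing a thousand points with a thousand errorbars, we can use the plt.fill_between function with a light color to visualize this continuous error (Figure 4-29).
Figure 4-29. Representing continuous uncertainty with filled regions
Note what we’ve done here with the fill_between function: we pass an x value, then the lower y-bound, then the upper y-bound, and the area between these bounds is filled.
The resulting figure gives a very intuitive view into what the Gaussian process regression algorithm is doing: in regions near a measured data point, the model is strongly constrained and this is reflected in the small model errors. In regions far from a measured data point, the model is not strongly constrained, and the model errors increase.
For more information on the options available in plt.fill_between() (and the closely related plt.fill() function), see the function docstring or the Matplotlib documentation.
Finally, if this seems a bit too low level for your taste, refer to “Visualization with Seaborn”, where we discuss the Seaborn package, which has a more streamlined API for visualizing this type of continuous errorbar.
Density and Contour Plots
Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-coded regions. There are three Matplotlib functions that can be helpful for this task: plt.contour for contour plots, plt.contourf for filled contour plots, and plt.imshow for showing images.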
Visualizing a Three-Dimensional Function
We’ll start by demonstrating a contour plot using a function z=f(x,y), using the following particular choice for f (we’ve seen this before in “Computation on Arrays: Broadcasting”, when we used it as a motivating example for array broadcasting):
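The definition of f is missing from this copy. The sketch below uses one plausible choice (an assumption, not confirmed by the surviving text) and shows how np.meshgrid turns one-dimensional coordinate arrays into the grids a contour plot expects:

```python
import matplotlib.pyplot as plt
import numpy as np

def f(x, y):
    # Assumed example function; any smooth z = f(x, y) would do.
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)

# meshgrid builds 2D grids of x and y positions; Z holds one value per cell.
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig, ax = plt.subplots()
# With a single color, negative levels are dashed and positive ones solid.
ax.contour(X, Y, Z, colors="black")
```

Note that the grids have shape (len(y), len(x)), that is, rows follow y and columns follow x.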
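A contour plot can be created with the plt.contour function. It takes three arguments: a grid of x values, a grid of y values, and a grid of z values. The x and y values represent positions on the plot, and the z values are represented by the contour levels. A straightforward way to prepare such data is the np.meshgrid function, which builds two-dimensional grids from one-dimensional arrays.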
Now let’s look at this with a standard line-only contour plot (Figure 4-30).
Figure 4-30. Visualizing three-dimensional data with contours
Notice that by default when a single color is used, negative values are represented by dashed lines, and positive values by solid lines. Alternatively, you can color-code the lines by specifying a colormap with the cmap argument (Figure 4-31).
Figure 4-31. Visualizing three-dimensional data with colored contours
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice for centered data. Matplotlib has a wide range of colormaps available in the plt.cm module, which you can browse in IPython with tab completion on plt.cm.
Our plot is looking nicer, but the spaces between the lines may be a bit distracting. We can change this by switching to a filled contour plot using the plt.contourf() function (notice the f at the end), which uses largely the same syntax as plt.contour(). Additionally, we’ll add a plt.colorbar() command, which automatically creates an additional axis with labeled color information for the plot (Figure 4-32).
Figure 4-32. Visualizing three-dimensional data with filled contours
The colorbar makes it clear that the black regions are “peaks,” while the red regions are “valleys.”
One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather than continuous, which is not always what is desired. You could remedy this by setting the number of contours to a very high number, but this results in a rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of data as an image. Figure 4-33 shows the result of the following code:
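The listing itself is missing. A sketch of displaying a gridded function as an image, assuming the same illustrative function and grid as a contour example would use (the function, extent, and colormap are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) ** 10 + np.cos(10 + Y * X) * np.cos(X)  # assumed example

fig, ax = plt.subplots()
# imshow takes no x/y grids, so the data extent is given explicitly;
# origin="lower" puts row 0 at the bottom, matching contour plots.
im = ax.imshow(Z, extent=[0, 5, 0, 5], origin="lower", cmap="RdGy")
fig.colorbar(im, ax=ax)
```

Because the image is rendered as one raster rather than many polygons, the color variation appears continuous without the cost of a very high contour count.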
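There are a few potential gotchas with plt.imshow(), however. It doesn’t accept an x and y grid, so you must manually specify the extent [xmin, xmax, ymin, ymax] of the image on the plot. By default it follows the standard image array definition where the origin is in the upper left, not in the lower left as in most contour plots, so this must be changed when showing gridded data. And it automatically adjusts the axis aspect ratio to match the input data.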
Finally, it can sometimes be useful to combine contour plots and image plots. For example, to create the effect shown in Figure 4-34, we’ll use a partially transparent background image (with transparency set via the alpha parameter) and over-plot contours with labels on the contours themselves (using the plt.clabel() function).
Figure 4-34. Labeled contours on top of an image
The combination of these three functions (plt.contour, plt.contourf, and plt.imshow) gives a wide range of possibilities for displaying this sort of three-dimensional data within a two-dimensional plot.
Histograms, Binnings, and Density
A simple histogram can be a great first step in understanding a dataset. Earlier, we saw a preview of Matplotlib’s histogram function (see “Comparisons, Masks, and Boolean Logic”), which creates a basic histogram in one line, once the normal boilerplate imports are done (Figure 4-35):
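The histogram listing is not included in this copy. A minimal sketch with made-up normally distributed data (the seed, bin count, and styling are assumptions), which also shows NumPy's np.histogram for computing counts without drawing:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(1000)

fig, ax = plt.subplots()
# hist() returns the counts, the bin edges, and the drawn patches.
counts, bin_edges, patches = ax.hist(data, bins=30, alpha=0.5,
                                     histtype="stepfilled",
                                     color="steelblue", edgecolor="none")

# np.histogram computes counts and edges only; nothing is plotted.
np_counts, np_edges = np.histogram(data, bins=5)
```

The stepfilled style with some transparency tends to work well when several histograms are over-plotted on the same axes.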
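Figure 4-35. A simple histogram
The hist() function has many options to tune both the calculation and the display, allowing a more customized histogram (Figure 4-36).
Figure 4-36. A customized histogram
The plt.hist docstring has more information on other customization options available. Combining histtype='stepfilled' with some transparency via alpha is very useful when comparing histograms of several distributions (Figure 4-37).
Figure 4-37. Over-plotting multiple histograms
If you would like to simply compute the histogram (that is, count the number of points in a given bin) and not display it, the np.histogram() function is available: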
[ 12 190 468 301 29]
Two-Dimensional Histograms and Binnings
Just as we create histograms in one dimension by dividing the number line into bins, we can also create histograms in two dimensions by dividing points among two-dimensional bins. We’ll take a brief look at several ways to do this here. We’ll start by defining some data: an x and y array drawn from a multivariate Gaussian distribution.
plt.hist2d: Two-dimensional histogram
One straightforward way to plot a two-dimensional histogram is to use Matplotlib’s plt.hist2d function (Figure 4-38).
Figure 4-38. A two-dimensional histogram with plt.hist2d
Just as with plt.hist, plt.hist2d has a number of extra options to fine-tune the plot and the binning, which are outlined in the function docstring. Further, just as plt.hist has a counterpart in np.histogram, plt.hist2d has a counterpart in np.histogram2d.
For the generalization of this histogram binning in dimensions higher than two, see the np.histogramdd function.
plt.hexbin: Hexagonal binnings
The two-dimensional histogram creates a tessellation of squares across the axes. Another natural shape for such a tessellation is the regular hexagon. For this purpose, Matplotlib provides the plt.hexbin routine, which represents a two-dimensional dataset binned within a grid of hexagons (Figure 4-39).
Figure 4-39. A two-dimensional histogram with plt.hexbin
Kernel density estimation
Another common method of evaluating densities in multiple dimensions is kernel density estimation (KDE). This will be discussed more fully in “In-Depth: Kernel Density Estimation”, but for now we’ll simply mention that KDE can be thought of as a way to “smear out” the points in space and add up the result to obtain a smooth function. One extremely quick and simple KDE implementation exists in the scipy.stats package as gaussian_kde (Figure 4-40).
Figure 4-40. A kernel density representation of a distribution
KDE has a smoothing length that effectively slides the knob
between detail and smoothness (one example of the ubiquitous bias–variance trade-off). The literature on choosing an appropriate smoothing length is vast: SciPy’s gaussian_kde uses a rule of thumb to attempt to find a nearly optimal smoothing length for the input data.
Other KDE implementations are available within the SciPy ecosystem, each with its own various strengths and weaknesses; see, for example, sklearn.neighbors.KernelDensity and statsmodels.nonparametric.kernel_density.KDEMultivariate.
Customizing Colorbars
Plot legends identify discrete labels of discrete points. For continuous labels based on the color of points, lines, or regions, a labeled colorbar can be a great tool. In Matplotlib, a colorbar is a separate axes that can provide a key for the meaning of colors in a plot. Because the book is printed in black and white, this section has an accompanying online appendix where you can view the figures in full color (https://github.com/jakevdp/PythonDataScienceHandbook). We’ll start by setting up the notebook for plotting and importing the functions we will use:
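The setup and first colorbar listing are missing here. As a sketch under the assumption of a simple synthetic image (the data, colormap, and limits are illustrative), a colorbar can be attached to an image like this:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
# Broadcasting a row against a column gives a 1000 x 1000 image.
I = np.sin(x) * np.cos(x[:, np.newaxis])

fig, ax = plt.subplots()
im = ax.imshow(I, cmap="viridis")

# The colorbar is drawn in its own axes, keyed to the image's colors.
cbar = fig.colorbar(im, ax=ax)

# Color limits can be set explicitly, for example to ignore outliers.
im.set_clim(-1, 1)
```

Swapping cmap for another registered colormap name (for example "gray" or "RdBu") changes both the image and its colorbar.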
As we have seen several times throughout this section, the simplest colorbar can be created with the plt.colorbar function (Figure 4-49).
Figure 4-49. A simple colorbar legend
We’ll now discuss a few ideas for customizing these colorbars and using them effectively in various situations.
Customizing Colorbars
We can specify the colormap using the cmap argument to the plotting function that is creating the visualization (Figure 4-50).
Figure 4-50. A grayscale colormap
All the available colormaps are in the plt.cm namespace. But being able to choose a colormap is just the first step: more important is how to decide among the possibilities! The choice turns out to be much more subtle than you might initially expect.
Choosing the colormap
A full treatment of color choice within visualization is beyond the scope of this book, but for entertaining reading on this subject and others, see the article “Ten Simple Rules for Better Figures”. Matplotlib’s online documentation also has an interesting discussion of colormap choice.
Broadly, you should be aware of three different categories of colormaps:
Sequential colormaps
These consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps
These usually contain two distinct colors, which show positive and negative deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps
These mix colors with no particular sequence (e.g., rainbow or jet).
The jet colormap, which was the default in Matplotlib prior to version 2.0, is an example of a qualitative colormap. Qualitative maps are often a poor choice for representing quantitative data, because they usually do not display a uniform progression in brightness as the scale increases. We can see this by converting the jet colorbar into black and white (Figure 4-51).
Figure 4-51. The jet colormap and its uneven luminance scale
Notice
the bright stripes in the grayscale image. Even in full color, this uneven brightness means that the eye will be drawn to certain portions of the color range, which will potentially emphasize unimportant parts of the dataset. It’s better to use a colormap such as Figure 4-52. The viridis colormap and its even luminance scale If you favor rainbow schemes, another good option for continuous data is the Figure 4-53. The cubehelix colormap and its luminance For other situations, such as showing positive and negative deviations from some mean, dual-color colorbars such as Figure 4-54. The RdBu (Red-Blue) colormap and its luminance We’ll see examples of using some of these color maps as we continue. There are a large number of colormaps available in Matplotlib; to see a list of them, you can use IPython to explore the Color limits and extensionsMatplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of Figure 4-55. Specifying colormap extensions Notice that in the left panel, the default color limits respond to the noisy pixels, and the range of the noise completely washes out the pattern we are interested in. In the right panel, we manually set the color limits, and add extensions to indicate values that are above or below those limits. The result is a much more useful visualization of our data. Discrete colorbarsColormaps are by default continuous, but sometimes you’d like to represent discrete values. The easiest way to do this is to use the Figure 4-56. A discretized colormap The discrete version of a colormap can be used just like any other colormap. Example: Handwritten DigitsFor an example of where this might be useful, let’s look at an interesting visualization of some handwritten digits data. This data is included in Scikit-Learn, and consists of nearly 2,000 8×8 thumbnails showing various handwritten digits. 
For now, let’s start by downloading the digits data and visualizing several of the example images with Figure 4-57. Sample of handwritten digit data Because each digit is defined by the hue of its 64 pixels, we can consider each digit to be a point lying in 64-dimensional space: each dimension represents the brightness of one pixel. But visualizing relationships in such high-dimensional spaces can be extremely difficult. One way to approach this is to use a dimensionality reduction technique such as manifold learning to reduce the dimensionality of the data while maintaining the relationships of interest. Dimensionality reduction is an example of unsupervised machine learning, and we will discuss it in more detail in “What Is Machine Learning?”. Deferring the discussion of these details, let’s take a look at a two-dimensional manifold learning projection of this digits data (see “In-Depth: Manifold Learning” for details):
We’ll use our discrete colormap to view the results, setting the Figure 4-58. Manifold embedding of handwritten digit pixels The projection also gives us some interesting insights on the relationships within the dataset: for example, the ranges of 5 and 3 nearly overlap in this projection, indicating that some handwritten fives and threes are difficult to distinguish, and therefore more likely to be confused by an automated classification algorithm. Other values, like 0 and 1, are more distantly separated, and therefore much less likely to be confused. This observation agrees with our intuition, because 5 and 3 look much more similar than do 0 and 1. We’ll return to manifold learning and digit classification in Chapter 5. Multiple SubplotsSometimes it is helpful to compare different views of data side by side. To this end, Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure. These subplots might be insets, grids of plots, or other more complicated layouts. In this section, we’ll explore four routines for creating subplots in Matplotlib. We’ll start by setting up the notebook for plotting and importing the functions we will use:
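The setup listing is missing at this point. As a rough preview of the grid-building routines this section covers, the sketch below builds a shared-axis grid with plt.subplots and an irregular layout with plt.GridSpec (the grid shapes and spacing values are illustrative assumptions):

```python
import matplotlib.pyplot as plt

# A 2x3 grid where columns share x scales and rows share y scales.
fig, axes = plt.subplots(2, 3, sharex="col", sharey="row")
for i in range(2):
    for j in range(3):
        # axes is a NumPy array, indexed [row, col] from zero.
        axes[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha="center")

# GridSpec describes a layout; slicing it lets a subplot span cells.
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
fig2 = plt.figure()
fig2.add_subplot(grid[0, 0])
fig2.add_subplot(grid[0, 1:])   # spans two columns
fig2.add_subplot(grid[1, :2])   # spans two columns
fig2.add_subplot(grid[1, 2])
```

Sharing axes removes the inner tick labels automatically, which keeps dense grids readable.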
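plt.axes: Subplots by Hand
The most basic method of creating an axes is to use the plt.axes function. By default this creates a standard axes object that fills the entire figure. plt.axes also takes an optional argument that is a list of four numbers, [left, bottom, width, height], in the figure coordinate system, which ranges from 0 at the bottom left of the figure to 1 at the top right of the figure.
For example, we might create an inset axes at the top-right corner of another axes by setting the x and y position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure). Figure 4-59 shows the result of this code.
Figure 4-59. Example of an inset axes
The equivalent of this command within the object-oriented interface is fig.add_axes(), which can be used to create two vertically stacked axes (Figure 4-60).
Figure 4-60. Vertically stacked axes example
We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper panel (at position 0.5) matches the top of the lower panel (at position 0.1 + 0.4).
plt.subplot: Simple Grids of Subplots
Aligned columns or rows of subplots are a common enough need that Matplotlib has several convenience routines that make them easy to create. The lowest level of these is plt.subplot(), which creates a single subplot within a grid. This command takes three integer arguments: the number of rows, the number of columns, and the index of the plot to be created in this scheme, which runs from the upper left to the bottom right (Figure 4-61).
Figure 4-61. A plt.subplot() example
The command plt.subplots_adjust can be used to adjust the spacing between these plots; the equivalent object-oriented command for adding a subplot is fig.add_subplot() (Figure 4-62).
Figure 4-62. plt.subplot() with adjusted margins
We’ve used the hspace and wspace arguments of plt.subplots_adjust, which specify the spacing along the height and width of the figure, in units of the subplot size.
plt.subplots: The Whole Grid in One Go
The approach just described can become quite tedious when you’re creating a large grid of subplots, especially if you’d like to hide the x- and y-axis labels on the inner plots. For this purpose, plt.subplots() is the easier tool to use (note the s at the end of subplots). Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array. The arguments are the number of rows and number of columns, along with optional keywords sharex and sharey, which allow you to specify the relationships between different axes.
Here we’ll create a 2×3 grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale (Figure 4-63).
Figure 4-63. Shared x and y axis in plt.subplots()
Note that by specifying sharex and sharey, we’ve automatically removed inner labels on the grid to make the plot cleaner. The resulting grid of axes instances is returned within a NumPy array, allowing for convenient specification of the desired axes using standard array indexing notation (Figure 4-64).
Figure 4-64. Identifying plots in a subplot grid
In comparison to plt.subplot(), plt.subplots() is more consistent with Python’s conventional 0-based indexing.
plt.GridSpec: More Complicated Arrangements
To go beyond a regular grid to subplots that span multiple rows and columns, plt.GridSpec() is the best tool. The plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is recognized by the plt.subplot() command.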
From this we can specify subplot locations and extents using the familiar Python slicing syntax (Figure 4-65).
Figure 4-65. Irregular subplots with plt.GridSpec
This type of flexible grid alignment has a wide range of uses. I most often use it when creating multi-axes histogram plots like the one shown here (Figure 4-66).
Figure 4-66. Visualizing multidimensional distributions with plt.GridSpec
This type of distribution plotted alongside its margins is common enough that it has its own plotting API in the Seaborn package; see “Visualization with Seaborn” for more details.
Text and Annotation
Creating a good visualization involves guiding the reader so that the figure tells a story. In some cases, this story can be told in an entirely visual manner, without the need for added text, but in others, small textual cues and labels are necessary. Perhaps the most basic types of annotations you will use are axes labels and titles, but the options go beyond this. Let’s take a look at some data and how we might visualize and annotate it to help convey interesting information. We’ll start by setting up the notebook for plotting and importing the functions we will use:
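The setup listing is missing, and the section's own example depends on an external births dataset. As a stand-in, the sketch below uses a synthetic cosine curve (an assumption) to show the two calls this section relies on most, ax.text and ax.annotate, along with the transform keyword:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis("equal")

# Text anchored in data coordinates (the default transform).
label = ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)

# Text anchored at a fraction of the axes, independent of the data limits.
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)

# annotate() draws text at xytext with an arrow pointing to xy.
note = ax.annotate("local maximum", xy=(6.28, 1), xytext=(10, 4),
                   arrowprops=dict(facecolor="black", shrink=0.05))
```

If the axes limits change, the data-anchored text moves with the data while the axes-anchored text stays put.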
Example: Effect of Holidays on US BirthsLet’s return to some data we worked with earlier in “Example: Birthrate Data”, where we generated a plot of average births over the course of the calendar year; as already mentioned, this data can be downloaded at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv. We’ll start with the same cleaning procedure we used there, and plot the results (Figure 4-67):
Figure 4-67. Average daily births by date When we’re communicating data like this, it is often useful to annotate certain features of the plot to draw the reader’s attention. This can be done manually with the Figure 4-68. Annotated average daily births by date The Transforms and Text PositionIn the previous example, we anchored our text annotations to data locations. Sometimes it’s preferable to anchor the text to a position on the axes or figure, independent of the data. In Matplotlib, we do this by modifying the transform. Any graphics display framework needs some scheme for translating between coordinate systems. For example, a data point at (x,y)=(1,1) needs to somehow be represented at a certain location on the figure, which in turn needs to be represented in pixels on the screen. Mathematically, such coordinate transformations are relatively straightforward, and
Matplotlib has a well-developed set of tools that it uses internally to perform them (the tools can be explored in the The average user rarely needs to worry about the details of these transforms, but it is helpful knowledge to have when considering the placement of text on a figure. There are three predefined transforms that can be useful in this situation: ax.transData Transform associated with data coordinates ax.transAxes Transform associated with the axes (in units of axes dimensions) fig.transFigure Transform associated with the figure (in units of figure dimensions) Here let’s look at an example of drawing text at various locations using these transforms (Figure 4-69): Figure 4-69. Comparing Matplotlib’s coordinate systems Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the beginning of each string will approximately mark the given coordinate location. The Notice now that if we change the axes limits, it is only the Figure 4-70. Comparing Matplotlib’s coordinate systems You can see this behavior more clearly by changing the axes limits
interactively; if you are executing this code in a notebook, you can make that happen by changing Arrows and AnnotationAlong with tick marks and text, another useful annotation mark is the simple arrow. Drawing arrows in Matplotlib is often much harder than you might hope. While there is a Here we’ll use Figure 4-71. Annotation examples The arrow style is controlled through the Figure 4-72. Annotated average birth rates by day You’ll notice that the specifications of the arrows and text boxes are very detailed: this gives you the power to create nearly any arrow style you wish. Unfortunately, it also means that these sorts of features often must be manually tweaked, a process that can be very time-consuming when one is producing publication-quality graphics! Finally, I’ll note that the preceding mix of styles is by no means best practice for presenting data, but rather included as a demonstration of some of the available options. More discussion and examples of available arrow and annotation styles can be found in the Matplotlib gallery, in particular http://matplotlib.org/examples/pylab_examples/annotation_demo2.html. Customizing TicksMatplotlib’s default tick locators and formatters are designed to be generally sufficient in many common situations, but are in no way optimal for every plot. This section will give several examples of adjusting the tick locations and formatting for the particular plot type you’re interested in. Before we go into examples, it will be best for us to understand further the object hierarchy of Matplotlib plots. Matplotlib
aims to have a Python object representing everything that appears on the plot: for example, recall that the The tick marks are no exception. Each Major and Minor TicksWithin each axis, there is the concept of a major tick mark and a minor tick mark. As the names would imply, major ticks are usually bigger or more pronounced, while minor ticks are usually smaller. By default, Matplotlib rarely makes use of minor ticks, but one place you can see them is within logarithmic plots (Figure 4-73):
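As a minimal sketch of where minor ticks appear, the snippet below builds log-log axes and inspects the tick locators attached to the x axis. The axis limits are arbitrary choices made for illustration, not values taken from the original figure.

```python
import matplotlib.pyplot as plt
from matplotlib import ticker

# Axes with logarithmic scaling on both axes; the limits are illustrative only.
fig, ax = plt.subplots()
ax.set(xscale="log", yscale="log", xlim=(1, 1e5), ylim=(1, 1e5))
ax.grid(True)

# Each axis carries separate locator objects for its major and minor ticks.
major_locator = ax.xaxis.get_major_locator()
minor_locator = ax.xaxis.get_minor_locator()
print(type(major_locator).__name__, type(minor_locator).__name__)

# On a log scale both are LogLocator instances, configured differently.
both_log = (isinstance(major_locator, ticker.LogLocator)
            and isinstance(minor_locator, ticker.LogLocator))
```

The same accessors exist for formatters (`get_major_formatter()` and `get_minor_formatter()`), which is how the labels on each tick are controlled.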
Figure 4-73. Example of logarithmic scales and labels
We see here that each major tick shows a large tick mark and a label, while each minor tick shows a smaller tick mark with no label. We can customize these tick properties (that is, locations and labels) by setting the locator and formatter objects of each axis.
We see that both major and minor tick labels have their locations specified by a We’ll now show a few examples of setting these locators and formatters for various plots. Hiding Ticks or LabelsPerhaps the most common tick/label formatting operation is the act of hiding ticks or labels. We can do this using Figure 4-74. Plot with hidden tick labels (x-axis) and hidden ticks (y-axis) Notice that we’ve removed the labels (but kept the ticks/gridlines) from the x axis, and removed the ticks (and thus the labels as well) from the y axis. Having no ticks at all can be useful in many situations—for example, when you want to show a grid of images. For instance, consider Figure 4-75, which includes images of different faces, an example often used in supervised machine learning problems (for more information, see “In-Depth: Support Vector Machines”): Figure 4-75. Hiding ticks within image plots Notice that each image has its own axes, and we’ve set the locators to null because the tick values (pixel number in this case) do not convey relevant information for this particular visualization. Reducing or Increasing the Number of TicksOne common problem with the default settings is that smaller subplots can end up with crowded labels. We can see this in the plot grid shown in Figure 4-76: Figure 4-76. A default plot with crowded ticks Particularly for the x ticks, the numbers nearly overlap, making them quite
difficult to decipher. We can fix this with the Figure 4-77. Customizing the number of ticks This makes things much cleaner. If you want even more control over the locations of regularly spaced ticks, you might also use Fancy Tick FormatsMatplotlib’s default tick formatting can leave a lot to be desired; it works well as a broad default, but sometimes you’d like to do something more. Consider the plot shown in Figure 4-78, a sine and a cosine: Figure 4-78. A default plot with integer ticks There are a couple changes we might like to make. First,
it’s more natural for this data to space the ticks and grid lines in multiples of π. We can do this by setting a Figure 4-79. Ticks at multiples of pi/2 But now these tick labels look a little bit silly: we can see that they are
multiples of π, but the decimal representation does not immediately convey this. To fix this, we can change the tick formatter. There’s no built-in formatter for what we want to do, so we’ll instead use Figure 4-80. Ticks with custom labels This is much better! Notice that we’ve made use of Matplotlib’s LaTeX support,
specified by enclosing the string within dollar signs. This is very convenient for display of mathematical symbols and formulae; in this case, The Summary of Formatters and LocatorsWe’ve mentioned a couple of the available formatters and
locators. We’ll conclude this section by briefly listing all the built-in locator and formatter options. For more information on any of these, refer to the docstrings or to the Matplotlib online documentation. Each of the following is available in the
We’ll see additional examples of these throughout the remainder of the book. Customizing Matplotlib: Configurations and StylesheetsMatplotlib’s default plot settings are often the subject of complaint among its users. While much is slated to change in the 2.0 Matplotlib release, the ability to customize default settings helps bring the package in line with your own aesthetic preferences. Here we’ll walk through some of Matplotlib’s runtime configuration ( Plot Customization by HandThroughout this chapter, we’ve seen how it is possible to tweak individual plot settings to end up with something that looks a little bit nicer than the default. It’s possible to do these customizations for each individual plot. For example, here is a fairly drab default histogram (Figure 4-81):
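A sketch of such a default histogram follows. The sample size and the random seed are assumptions made so the example is reproducible; any normally distributed sample would do.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility
x = rng.standard_normal(1000)       # 1,000 normally distributed values

fig, ax = plt.subplots()
# With no further arguments, hist() uses ten equal-width bins
# and whatever style is currently active.
counts, bin_edges, patches = ax.hist(x)
```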
Figure 4-81. A histogram in Matplotlib’s default style We can adjust this by hand to make it a much more visually pleasing plot, shown in Figure 4-82: Figure 4-82. A histogram with manual customizations This looks better, and you may recognize the look as inspired by the look of the R language’s Changing the Defaults: rcParamsEach time Matplotlib loads, it defines a runtime
configuration ( We’ll start by saving a copy of the current
Now we can use the
With these settings defined, we can now create a plot and see our settings in action (Figure 4-83): Figure 4-83. A customized histogram using rc settings Let’s see what simple line plots look like with these Figure 4-84. A line plot with customized styles I find this much more aesthetically pleasing than the default styling. If you disagree with my aesthetic sense, the good news is that you can adjust the StylesheetsThe version 1.4 release of Matplotlib in August 2014 added a
very convenient Even if you don’t create your own style, the stylesheets included by default are extremely useful. The available styles are listed in
Out[8]: ['fivethirtyeight', 'seaborn-pastel', 'seaborn-whitegrid', 'ggplot', 'grayscale'] The basic way to switch to a stylesheet is to call:
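As a short sketch, `plt.style.use` takes the name of a stylesheet. The `ggplot` name is one of the styles listed above; the exact contents of `plt.style.available` depend on the installed Matplotlib version.

```python
import matplotlib.pyplot as plt

print(plt.style.available[:5])      # names vary between Matplotlib versions

default_face = plt.rcParams["axes.facecolor"]
plt.style.use("ggplot")             # affects every figure created afterwards
ggplot_face = plt.rcParams["axes.facecolor"]
```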
But keep in mind that this will change the style for the rest of the session! Alternatively, you can use the style context manager, which sets a style temporarily:
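The sketch below uses `plt.style.context` with the built-in `grayscale` style and records one setting before, inside, and after the block to show that the change is temporary. The choice of `image.cmap` as the setting to watch is only for illustration.

```python
import matplotlib.pyplot as plt

before = plt.rcParams["image.cmap"]

with plt.style.context("grayscale"):
    # Inside the block the stylesheet is active.
    inside = plt.rcParams["image.cmap"]
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])

# On exit, the previous settings are restored.
after = plt.rcParams["image.cmap"]
```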
Let’s create a function that will make two basic types of plot:
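One possible version of such a helper is sketched here: a histogram on the left and three line plots on the right. The function name `hist_and_lines`, the figure size, and the sample sizes are assumptions, not details confirmed by the text.

```python
import numpy as np
import matplotlib.pyplot as plt

def hist_and_lines(seed=0):
    """Draw a histogram and three random lines side by side (illustrative helper)."""
    rng = np.random.default_rng(seed)
    fig, ax = plt.subplots(1, 2, figsize=(11, 4))
    ax[0].hist(rng.standard_normal(1000))
    for _ in range(3):
        ax[1].plot(rng.random(10))
    ax[1].legend(["a", "b", "c"], loc="lower left")
    return fig, ax

fig, ax = hist_and_lines()
```

Because the helper creates its figure at call time, wrapping the call in `with plt.style.context(...)` is enough to render it in a different style.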
We’ll use this to explore how these plots look using the various built-in styles. Default styleThe default style is what we’ve been seeing so far throughout the book; we’ll start with that. First, let’s reset our runtime configuration to the notebook default:
Now let’s see how it looks (Figure 4-85): Figure 4-85. Matplotlib’s default style FiveThirtyEight styleThe FiveThirtyEight style mimics the graphics found on the popular FiveThirtyEight website. As you can see in Figure 4-86, it is typified by bold colors, thick lines, and transparent axes. Figure 4-86. The FiveThirtyEight style ggplotThe Figure 4-87. The ggplot style Bayesian Methods for Hackers styleThere is a very nice short online book called Probabilistic Programming and Bayesian Methods for Hackers; it features
figures created with Matplotlib, and uses a nice set of Figure 4-88. The bmh style Dark backgroundFor figures used within presentations, it is often useful to have a dark rather than light background. The Figure 4-89. The dark_background style GrayscaleSometimes you might find yourself preparing figures for a print publication that does not accept color figures. For this, the Figure 4-90. The grayscale style Seaborn styleMatplotlib also has stylesheets inspired by the Seaborn library (discussed more fully in “Visualization with Seaborn”). As we will see, these styles are loaded automatically when Seaborn is imported into a notebook. I’ve found these settings to be very nice, and tend to use them as defaults in my own data exploration (see Figure 4-91): Figure 4-91. Seaborn’s plotting style With all of these built-in options for various plot styles, Matplotlib becomes much more useful for both interactive visualization and creation of figures for publication. Throughout this book, I will generally use one or more of these style conventions when creating plots. Three-Dimensional Plotting in MatplotlibMatplotlib was initially designed with only two-dimensional plotting in mind. Around the time of the 1.0 release, some three-dimensional plotting utilities were built on top of
Matplotlib’s two-dimensional display, and the result is a convenient (if somewhat limited) set of tools for three-dimensional data visualization. We enable three-dimensional plots by importing the
Once this submodule is imported, we can create a three-dimensional
axes by passing the keyword
Figure 4-92. An empty three-dimensional axes With this 3D axes enabled, we can now plot a variety of three-dimensional plot types. Three-dimensional plotting is one of the functionalities
that benefits immensely from viewing figures interactively rather than statically in the notebook; recall that to use interactive figures, you can use Three-Dimensional Points and LinesThe most basic
three-dimensional plot is a line or scatter plot created from sets of (x, y, z) triples. In analogy with the more common two-dimensional plots discussed earlier, we can create these using the Figure 4-93. Points and lines in three dimensions Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page. While the three-dimensional effect is sometimes difficult to see within a static image, an interactive view can lead to some nice intuition about the layout of the points. Three-Dimensional Contour PlotsAnalogous to the contour plots we explored in
“Density and Contour Plots”,
Figure 4-94. A three-dimensional contour plot Sometimes the default viewing
angle is not optimal, in which case we can use the Figure 4-95. Adjusting the view angle for a three-dimensional plot Again, note that we can accomplish this type of rotation interactively by clicking and dragging when using one of Matplotlib’s interactive backends. Wireframes and Surface PlotsTwo other types of three-dimensional plots that work on gridded data are wireframes and surface plots. These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualize. Here’s an example using a wireframe (Figure 4-96): Figure 4-96. A wireframe plot A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized (Figure 4-97): Figure 4-97. A three-dimensional surface plot Note that though the grid of values for a surface plot needs to be two-dimensional, it need not be rectilinear. Here is an example of creating a partial polar grid, which when used with the Figure 4-98. A polar surface plot Surface TriangulationsFor some applications, the evenly sampled grids required by the preceding routines are overly restrictive and inconvenient. In these situations, the triangulation-based plots can be very useful. What if rather than an even draw from a Cartesian or a polar grid, we instead have a set of random draws?
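A sketch of such random draws follows, sampling points uniformly in angle and radius on a disk. The surface function `f`, the radius of 6, and the sample count are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    # Assumed example surface: a radial sine ripple.
    return np.sin(np.sqrt(x ** 2 + y ** 2))

rng = np.random.default_rng(0)
theta = 2 * np.pi * rng.random(1000)   # random angles
r = 6 * rng.random(1000)               # random radii in [0, 6)
x = r * np.sin(theta)
y = r * np.cos(theta)
z = f(x, y)

# The draws form an unstructured point cloud, so a scatter plot is the
# only plot type from the earlier sections that applies directly.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(x, y, z, c=z, cmap="viridis", linewidth=0.5)
```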
We could create a scatter plot of the points to get an idea of the surface we’re sampling from (Figure 4-99): Figure 4-99. A three-dimensional sampled surface This leaves a lot to be desired. The function that will help us in this case is Figure 4-100. A triangulated surface plot The result is certainly not as clean as when it is plotted with a grid, but the flexibility of such a triangulation allows for some really interesting three-dimensional plots. For example, it is actually possible to plot a three-dimensional Möbius strip using this, as we’ll see next. Example: Visualizing a Möbius stripA Möbius strip is similar to a strip of paper glued into a loop with a half-twist. Topologically, it’s quite interesting because despite appearances it has only a single side! Here we will visualize such an object using Matplotlib’s three-dimensional tools. The key to creating the Möbius strip is to think about its parameterization: it’s a two-dimensional strip, so we need two intrinsic dimensions. Let’s call them θ, which ranges from 0 to 2π around the loop, and w which ranges from –1 to 1 across the width of the strip:
Now from this parameterization, we must determine the (x, y, z) positions of the embedded strip. Thinking about it, we might realize that there are two rotations happening: one is the position of the loop about its center (what we’ve called θ), while the other is the twisting of the strip about its axis (we’ll call this ϕ). For a Möbius strip, we must have the strip make half a twist during a full loop, or Δϕ=Δθ/2.
Now we use our recollection of trigonometry to derive the three-dimensional embedding. We’ll define r, the distance of each point from the center, and use this to find the embedded (x,y,z) coordinates:
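The embedding can be sketched in NumPy as follows. The grid sizes and the half-width of 0.25 are assumptions: a strip narrower than the full range of w keeps the radius r positive everywhere, so the surface does not pass through the center.

```python
import numpy as np

half_width = 0.25                            # assumed strip half-width
theta = np.linspace(0, 2 * np.pi, 30)        # position around the loop
w = np.linspace(-half_width, half_width, 8)  # position across the strip
w, theta = np.meshgrid(w, theta)             # rows follow theta, columns follow w

phi = 0.5 * theta                            # half a twist per full loop

r = 1 + w * np.cos(phi)                      # distance from the center in the x-y plane
x = r * np.cos(theta)
y = r * np.sin(theta)
z = w * np.sin(phi)

# After a full loop the strip meets itself flipped across its width:
# the edge at theta = 2*pi matches the edge at theta = 0 with w reversed.
closes_with_flip = (np.allclose(x[0], x[-1, ::-1])
                    and np.allclose(y[0], y[-1, ::-1])
                    and np.allclose(z[0], z[-1, ::-1]))
```

The final check is a quick way to confirm the half-twist numerically before plotting anything.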
Finally, to plot the object, we must make sure the triangulation is correct. The best way to do this is to define the triangulation within the underlying parameterization, and then let Matplotlib project this triangulation into the three-dimensional space of the Möbius strip. This can be accomplished as follows (Figure 4-101): Figure 4-101. Visualizing a Möbius strip Combining all of these techniques, it is possible to create and display a wide variety of three-dimensional objects and patterns in Matplotlib.
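A sketch of the triangulation step is given below. The coordinates are recomputed so the snippet runs on its own, and the grid sizes, half-width of 0.25, colormap, and axis limits are all assumptions. The key point is that `Triangulation` is built from the flat (w, θ) parameters, and only its triangle indices are passed to `plot_trisurf`.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation

# Parameter grid and embedding (assumed sizes and strip width).
theta = np.linspace(0, 2 * np.pi, 30)
w = np.linspace(-0.25, 0.25, 8)
w, theta = np.meshgrid(w, theta)
phi = 0.5 * theta
r = 1 + w * np.cos(phi)
x = np.ravel(r * np.cos(theta))
y = np.ravel(r * np.sin(theta))
z = np.ravel(w * np.sin(phi))

# Triangulate in the parameter plane, where the strip is a plain rectangle.
tri = Triangulation(np.ravel(w), np.ravel(theta))

# Reuse those triangles for the embedded three-dimensional points.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_trisurf(x, y, z, triangles=tri.triangles,
                cmap="viridis", linewidths=0.2)
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
ax.set_zlim(-1, 1)
```

Triangulating the (x, y) positions directly would connect points that are neighbors in projection but not on the strip, which is why the parameter-space triangulation is used instead.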
Geographic Data with BasemapOne common type of visualization in data science is that of geographic data. Matplotlib’s main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the Installation of Basemap is straightforward; if you’re using conda you can type this and the package will be downloaded: $ conda install basemap We add just a single new import to our standard boilerplate:
Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines away (the graphics in Figure 4-102 also require the Figure 4-102. A “bluemarble” projection of the Earth The meaning of the arguments to Basemap will be discussed momentarily. The useful thing is that the globe shown here is not a mere image; it is a fully functioning Matplotlib axes that understands spherical coordinates and allows us to easily over-plot data on the map! For example, we can use a different map projection, zoom in to North America, and plot the location of Seattle. We’ll use an etopo image (which shows topographical features both on land and under the ocean) as the map background (Figure 4-103): Figure 4-103. Plotting data and labels on the map This gives you a brief glimpse into the sort of geographic visualizations that are possible with just a few lines of Python. We’ll now discuss the features of Basemap in more depth, and provide several examples of visualizing map data. Using these brief examples as building blocks, you should be able to create nearly any map visualization that you desire. Map ProjectionsThe first thing to decide when you are using maps is which projection to use. You’re probably familiar with the fact that it is impossible to project a spherical map, such as that of the Earth, onto a flat surface without somehow distorting it or breaking its continuity. These projections have been developed over the course of human history, and there are a lot of choices! Depending on the intended use of the map projection, there are certain map features (e.g., direction, area, distance, shape, or other considerations) that are useful to maintain. The Basemap package implements several dozen such projections, all referenced by a short format code. Here we’ll briefly demonstrate some of the more common ones. We’ll start by defining a convenience routine to draw our world map along with the longitude and latitude lines:
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude are mapped to horizontal and vertical lines, respectively. This type of mapping represents equatorial regions quite well, but results in extreme distortions near the poles. The spacing of latitude lines varies between different cylindrical projections, leading to different conservation properties, and different distortion near the poles.
In Figure 4-104, we show an example of the equidistant cylindrical projection, which chooses a latitude scaling that preserves distances along meridians. Other cylindrical projections are the Mercator (
Figure 4-104. Equidistant cylindrical projection
The additional arguments to Basemap for this view specify the latitude (
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of
constant longitude) remain vertical; this can give better properties near the poles of the projection. The Mollweide projection (
Figure 4-105. The Mollweide projection
The extra arguments
to Perspective projectionsPerspective projections are constructed using a
particular choice of perspective point, similar to if you photographed the Earth from a particular point in space (a point which, for some projections, technically lies within the Earth!). One common example is the orthographic projection ( Here is an example of the orthographic projection (Figure 4-106): Figure 4-106. The orthographic projection Conic projectionsA conic projection projects the map onto a single cone, which is then unrolled. This can lead to very good local properties, but regions far from the focus point of the cone may become very distorted. One example of this is the
Lambert conformal conic projection (
Figure 4-107. The Lambert conformal conic projection
Other projections
If you’re going to do much with map-based visualizations, I encourage you to read up on other available projections, along with their properties, advantages, and disadvantages. Most likely, they are available in the Basemap package. If you dig deep enough into this topic, you’ll find an incredible subculture of geo-viz geeks who will be ready to argue fervently in support of their favorite projection for any given application!
Drawing a Map Background
Earlier we saw the
For
the boundary-based features, you must set the desired resolution when creating a Basemap image. The Here’s an example of drawing land/sea boundaries, and the effect of the resolution parameter. We’ll create both a low- and high-resolution map of Scotland’s beautiful Isle of Skye. It’s located at 57.3°N, 6.2°W, and a map of 90,000 × 120,000 meters (90 × 120 kilometers) shows it well (Figure 4-108):
Figure 4-108. Map boundaries at low and high resolution
Notice that the low-resolution coastlines are not suitable for this level of zoom, while the high-resolution coastlines work just fine. The low level would be adequate for a global view, however, and would be much faster than loading the high-resolution border data for the entire globe! It might require some experimentation to find the correct resolution parameter for a given view; the best route is to start with a fast, low-resolution plot and increase the resolution as needed.
Plotting Data on Maps
Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a map background. For simple plotting and text, any In addition to this, there are many map-specific functions available as methods of the Some of these map-specific methods are:
contour()/contourf(): Draw contour lines or filled contours
imshow(): Draw an image
pcolor()/pcolormesh(): Draw a pseudocolor plot for irregular/regular meshes
plot(): Draw lines and/or markers
scatter(): Draw points with markers
quiver(): Draw vectors
barbs(): Draw wind barbs
drawgreatcircle(): Draw a great circle
We’ll see examples of a few of these as we continue. For more information on these functions, including several example plots, see the online Basemap documentation.
Example: California Cities
Recall that in “Customizing Plot Legends”, we demonstrated the use of size and color in a scatter plot to convey information about the location, size, and population of California cities. Here, we’ll create this plot again, but using Basemap to put the data in context.
We start with loading the data, as we did before:
Next, we set up the map projection, scatter the data, and then create a colorbar and legend (Figure 4-109): Figure 4-109. Scatter plot over a map background This shows us roughly where larger populations of people have settled in California: they are clustered near the coast in the Los Angeles and San Francisco areas, stretched along the highways in the flat central valley, and avoiding almost completely the mountainous regions along the borders of the state. Example: Surface Temperature DataAs an example of visualizing some more continuous geographic data, let’s consider the “polar vortex” that hit the eastern half of the United States in January 2014. A great source for any sort of climatic data is NASA’s Goddard Institute for Space Studies. Here we’ll use the GIS 250 temperature data, which we can download using shell commands (these commands may have to be modified on Windows machines). The data used here was downloaded on 6/12/2016, and the file size is approximately 9 MB:
The data comes in NetCDF format, which can be read in Python by the netCDF4 library, installable with conda:
$ conda install netcdf4
We read the data as follows:
The file contains many global temperature readings on a variety of dates; we need to select the index of the date we’re interested in—in this case, January 15, 2014:
Now we can load the latitude and longitude data, as well as the temperature anomaly for this index:
Finally, we’ll use the
The data paints a picture of the localized, extreme temperature anomalies that happened during that month. The eastern half of the United States was much colder than normal, while the western half and Alaska were much warmer. Regions with no recorded temperature show the map background.
Figure 4-110. The temperature anomaly in January 2014
Visualization with Seaborn
Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. There are several valid complaints about Matplotlib that often come up:
An answer to these problems is Seaborn. Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the
functionality provided by Pandas To be fair, the Matplotlib team is addressing this: it has recently added the Seaborn Versus MatplotlibHere is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors. We start with the typical imports:
Now we create some random walk data:
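A sketch of one way to generate such data: each of six walks is the cumulative sum of independent normal steps. The seed, the number of steps, and the number of walks are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)                 # arbitrary seed
x = np.linspace(0, 10, 500)                    # common x grid
steps = rng.randn(500, 6)                      # 500 steps for each of 6 walks
y = np.cumsum(steps, axis=0)                   # running sum down each column
```

Each column of `y` is one walk, so `plt.plot(x, y)` draws all six lines in a single call.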
And do a simple plot (Figure 4-111): Figure 4-111. Data in Matplotlib’s default style Although the result contains all the information we’d like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit old-fashioned in the context of 21st-century data visualization. Now let’s take a look at how it works with Seaborn. As we will see, Seaborn has many of its own high-level plotting routines, but it can also overwrite Matplotlib’s default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn’s
Now let’s rerun the same two lines as before (Figure 4-112): Figure 4-112. Data in Seaborn’s default style Ah, much better! Exploring Seaborn PlotsThe main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting. Let’s take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood), but the Seaborn API is much more convenient. Histograms, KDE, and densitiesOften in statistical data visualization, all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib (Figure 4-113): Figure 4-113. Histograms for visualizing distributions Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with Figure 4-114. Kernel density estimates for visualizing distributions Histograms and KDE can be combined using Figure 4-115. Kernel density and histograms plotted together If we pass the full two-dimensional dataset to Figure 4-116. A two-dimensional kernel density plot We can see the joint distribution and the marginal distributions together using Figure 4-117. A joint distribution plot with a two-dimensional kernel density estimate There are other
parameters that can be passed to Figure 4-118. A joint distribution plot with a hexagonal bin representation Pair plotsWhen you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multidimensional data, when you’d like to plot all pairs of values against each other. We’ll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:
Out[12]: sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa Visualizing the multidimensional relationships among the samples is as
easy as calling Figure 4-119. A pair plot showing the relationships between four variables Faceted histogramsSometimes the best way to view data is via histograms of subsets. Seaborn’s
Out[14]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 Figure 4-120. An example of a faceted histogram Factor plotsFactor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a parameter within bins defined by any other parameter (Figure 4-121): Figure 4-121. An example of a factor plot, comparing distributions given various discrete factors Joint distributionsSimilar to the pair plot we saw earlier, we can use Figure 4-122. A joint distribution plot The joint plot can even do some automatic kernel density estimation and regression (Figure 4-123): Figure 4-123. A joint distribution plot with a regression fit Bar plotsTime
series can be plotted with
Out[19]: method number orbital_period mass distance year 0 Radial Velocity 1 269.300 7.10 77.40 2006 1 Radial Velocity 1 874.774 2.21 56.95 2008 2 Radial Velocity 1 763.000 2.60 19.84 2011 3 Radial Velocity 1 326.030 19.40 110.62 2007 4 Radial Velocity 1 516.220 10.50 119.47 2009 Figure 4-124. A histogram as a special case of a factor plot We can learn more by looking at the method of discovery of each of these planets, as illustrated in Figure 4-125: Figure 4-125. Number of planets discovered by year and type (see the online appendix for a full-scale figure) For more information on plotting with Seaborn, see the Seaborn documentation, a tutorial, and the Seaborn gallery. Example: Exploring Marathon Finishing TimesHere we’ll look at using Seaborn to help visualize and understand finishing results from a marathon. I’ve scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded (if you are interested in using Python for web scraping, I would recommend Web Scraping with Python by Ryan Mitchell). We will start by downloading the data from the Web, and loading it into Pandas:
Out[23]: age gender split final 0 33 M 01:05:38 02:08:51 1 32 M 01:06:26 02:09:28 2 31 M 01:06:49 02:10:42 3 38 M 01:06:16 02:13:45 4 31 M 01:06:32 02:13:59 By default, Pandas loaded the time columns as Python
strings (type
Out[24]: age int64 gender object split object final object dtype: object Let’s fix this by providing a converter for the times:
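As a hedged sketch, the conversion can be done with `pd.to_timedelta`. Note the swap: instead of passing a converter function to `read_csv`, this version converts the columns after loading. The in-memory CSV below holds the first rows of the table shown above and stands in for the downloaded file.

```python
import io
import pandas as pd

# First rows of the marathon table from the text, as a stand-in for the real file.
csv_text = """age,gender,split,final
33,M,01:05:38,02:08:51
32,M,01:06:26,02:09:28
31,M,01:06:49,02:10:42
"""

data = pd.read_csv(io.StringIO(csv_text))

# Parse the "HH:MM:SS" strings into timedelta values.
for col in ("split", "final"):
    data[col] = pd.to_timedelta(data[col])

print(data.dtypes)
```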
Out[25]: age gender split final 0 33 M 01:05:38 02:08:51 1 32 M 01:06:26 02:09:28 2 31 M 01:06:49 02:10:42 3 38 M 01:06:16 02:13:45 4 31 M 01:06:32 02:13:59
Out[26]: age int64 gender object split timedelta64[ns] final timedelta64[ns] dtype: object That looks much better. For the purpose of our Seaborn plotting utilities, let’s next add columns that give the times in seconds:
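One way to do this, sketched on two rows from the table above, is the `.dt.total_seconds()` accessor on a timedelta column. The column names `split_sec` and `final_sec` follow the output shown next.

```python
import pandas as pd

data = pd.DataFrame({
    "split": pd.to_timedelta(["01:05:38", "01:06:26"]),
    "final": pd.to_timedelta(["02:08:51", "02:09:28"]),
})

# Convert each timedelta to a plain float number of seconds.
data["split_sec"] = data["split"].dt.total_seconds()
data["final_sec"] = data["final"].dt.total_seconds()
```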
Out[27]: age gender split final split_sec final_sec 0 33 M 01:05:38 02:08:51 3938.0 7731.0 1 32 M 01:06:26 02:09:28 3986.0 7768.0 2 31 M 01:06:49 02:10:42 4009.0 7842.0 3 38 M 01:06:16 02:13:45 3976.0 8025.0 4 31 M 01:06:32 02:13:59 3992.0 8039.0 To get an idea of what the data looks like, we can plot a Figure 4-126. The relationship between the split for the first half-marathon and the finishing time for the full marathon The dotted line shows where someone’s time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon. If you have run competitively, you’ll know that those who do the opposite—run faster during the second half of the race—are said to have “negative-split” the race. Let’s create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
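A sketch of the split fraction, using the seconds columns from the rows shown above. The formula `1 - 2 * split_sec / final_sec` is inferred from the text and reproduces the `split_frac` values in the output that follows: it is zero for an even pace and negative when the second half is faster.

```python
import pandas as pd

data = pd.DataFrame({
    "split_sec": [3938.0, 3986.0, 4009.0, 3976.0, 3992.0],
    "final_sec": [7731.0, 7768.0, 7842.0, 8025.0, 8039.0],
})

# Fraction by which the first half differs from exactly half the final time.
data["split_frac"] = 1 - 2 * data["split_sec"] / data["final_sec"]

# Count the runners in this sample who negative-split the race.
n_negative = int((data["split_frac"] < 0).sum())
```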
Out[29]: age gender split final split_sec final_sec split_frac 0 33 M 01:05:38 02:08:51 3938.0 7731.0 -0.018756 1 32 M 01:06:26 02:09:28 3986.0 7768.0 -0.026262 2 31 M 01:06:49 02:10:42 4009.0 7842.0 -0.022443 3 38 M 01:06:16 02:13:45 3976.0 8025.0 0.009097 4 31 M 01:06:32 02:13:59 3992.0 8039.0 0.006842 Where this split difference is less than zero, the person negative-split the race by that fraction. Let’s do a distribution plot of this split fraction (Figure 4-127): Figure 4-127. The distribution of split fractions; 0.0 indicates a runner who completed the first and second halves in identical times
Out[31]: 251
Out of nearly 40,000 participants, there were only 251 people who negative-split their marathon. Let’s see whether there is any correlation between this split
fraction and other variables. We’ll do this using a Figure 4-128. The relationship between quantities within the marathon dataset It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. (We see here that Seaborn is no panacea for Matplotlib’s ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in “Customizing Ticks” can be used to adjust such things if desired.) The difference between men and women here is interesting. Let’s look at the histogram of split fractions for these two groups (Figure 4-129): Figure 4-129. The distribution of split fractions by gender The interesting thing here is that there are many more men than women who are running close to an even split! This almost looks like some kind of bimodal distribution among the men and women. Let’s see if we can suss out what’s going on by looking at the distributions as a function of age. A nice way to compare distributions is to use a violin plot (Figure 4-130): Figure 4-130. A violin plot showing the split fraction by gender This is yet another way to compare the distributions between men and women. Let’s look a little deeper, and compare these violin plots as a function of age. We’ll start by creating a new column in the array that specifies the decade of age that each person is in (Figure 4-131):
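The decade column can be sketched with integer division; the ages below are the ones from the rows shown earlier, and the column name `age_dec` follows the output that comes next.

```python
import pandas as pd

data = pd.DataFrame({"age": [33, 32, 31, 38, 31]})

# Floor each age to the start of its decade: 33 -> 30, 38 -> 30, and so on.
data["age_dec"] = 10 * (data["age"] // 10)
```

The resulting column is a small set of discrete values, which makes it suitable as the grouping variable of a violin plot.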
Out[35]: age gender split final split_sec final_sec split_frac age_dec 0 33 M 01:05:38 02:08:51 3938.0 7731.0 -0.018756 30 1 32 M 01:06:26 02:09:28 3986.0 7768.0 -0.026262 30 2 31 M 01:06:49 02:10:42 4009.0 7842.0 -0.022443 30 3 38 M 01:06:16 02:13:45 3976.0 8025.0 0.009097 30 4 31 M 01:06:32 02:13:59 3992.0 8039.0 0.006842 30 Figure 4-131. A violin plot showing the split fraction by gender and age Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter). Also surprisingly, the 80-year-old women seem to outperform everyone in terms of their split time. This is probably due to the fact that we’re estimating the distribution from small numbers, as there are only a handful of runners in that range:
Out[38]: 7
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We’ll use
Figure 4-132. Split fraction versus finishing time by gender
Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.
Further Resources
Matplotlib Resources
A single chapter in a book can never hope to cover all the features and plot types available in Matplotlib. As with other packages we’ve seen, liberal use of IPython’s tab-completion and help functions (see “Help and Documentation in IPython”) can be very helpful when you’re exploring Matplotlib’s API. In addition, Matplotlib’s online documentation can be a helpful reference. See in particular the Matplotlib gallery linked on that page: it shows thumbnails of hundreds of different plot types, each one linked to a page with the Python code snippet used to generate it. In this way, you can visually inspect and learn about a wide range of different plotting styles and visualization techniques.
For a book-length treatment of Matplotlib, I would recommend Interactive Applications Using Matplotlib, written by Matplotlib core developer Ben Root.
Other Python Graphics Libraries
Although Matplotlib is the most prominent Python visualization library, there are other more modern tools that are worth exploring as well. I’ll mention a few of them briefly here:
The visualization space in the Python community is very dynamic, and I fully expect this list to be out of date as soon as it is published. Keep an eye out for what’s coming in the future!
How do you plot a graph in Python?
Follow these steps:
1. Define the x-axis values and the corresponding y-axis values as lists.
2. Plot them on the canvas using the plot() function.
3. Name the x-axis and y-axis using the xlabel() and ylabel() functions.
4. Give the plot a title using the title() function.
5. Finally, to view the plot, call the show() function.
Which package is used to plot the graphs in Python?
Matplotlib provides the pyplot module, which is used to plot graphs of given data. matplotlib.pyplot is a collection of command-style functions that make Matplotlib work like MATLAB.
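The plotting steps listed above can be sketched with pyplot as follows. The data values and label texts are made up for illustration, and the Agg backend is selected only as an assumption so that the sketch also runs on a machine without a display; remove that line to get an interactive window.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; remove to open a window
import matplotlib.pyplot as plt

# Step 1: x-axis values and the corresponding y-axis values as lists
x = [1, 2, 3, 4]
y = [2, 4, 1, 3]

# Step 2: plot them on the canvas (plot() returns a list of line objects)
(line,) = plt.plot(x, y)

# Step 3: name the x-axis and the y-axis
plt.xlabel("x values")
plt.ylabel("y values")

# Step 4: give the plot a title
plt.title("A simple line plot")

# Keep a handle on the current axes so the result can be inspected later
ax = plt.gca()

# Step 5: view the plot (a non-interactive backend only emits a warning here)
plt.show()
```

The pyplot functions used here act on an implicit "current" figure and axes, which is what makes this style resemble MATLAB.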
What is the use of plot in Python?
The plot() function is used to draw points (markers) in a diagram. By default, the plot() function draws a line from point to point. The function takes parameters for specifying points in the diagram.
How do you plot results in Python?
Data can also be plotted by calling the Matplotlib plot function directly. The command is plt.plot(x, y). The color and style of the line and markers can be specified with an optional format-string argument; for example, b- is a solid blue line and g-- is a green dashed line.
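The format strings mentioned above can be tried out with a short sketch like the following. The data is made up, the variable names are hypothetical, and the Agg backend is again an assumption that lets the code run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; remove to open a window
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# "b-" requests a solid blue line
(blue_line,) = plt.plot(x, y, "b-")

# "g--" requests a green dashed line; the data is shifted up by one
# so the two lines do not overlap
(green_line,) = plt.plot(x, [value + 1 for value in y], "g--")

# Each returned line object records the color and line style it was given
print(blue_line.get_color(), blue_line.get_linestyle())
print(green_line.get_color(), green_line.get_linestyle())
```

The format string is a shorthand; the same settings can be passed as the keyword arguments color and linestyle when more control is needed.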