Playing with pandas in Jupyter

So, today I would like to talk with you about the fun time I had playing with pandas in Jupyter 🙂

What a nice and misleading title 😀 haha
Let me add some context.

TL;DR

I decided to spend some time playing with data using Python, just to get a feel for how easy it is, given that Python is the language of choice of many data scientists.

“Why is this guy talking about python in the first place? Isn’t this an Azure/.NET Blog?”

Mainly, yes, but Python has a special place in my heart <3 and it is my second language, I could say. So, whenever I am not learning Azure/.NET I am most likely learning Python 🙂

What did I do?

I found a data analysis tool called pandas, and a web application called Jupyter Notebooks that lets you visualize the data while you play with the code.

Let me make it clear that I am NOT an expert in any of the tools that I am going to list below and I was learning most of what I used while creating this post! So, if you see something terribly wrong, go easy on me and enlighten me, please! I would love to learn more from other people about this whole Data Science world.

The tools that I chose for this are:

pandas

an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Jupyter Notebooks

an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

matplotlib

a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

data.gov

The home of the U.S. Government’s open data

I was always told that this is an awesome place to get some nice datasets with data that you can use to generate visualizations, and now I can confirm this.

Before we begin

All the code that I wrote for this post can be found on my GitHub. I plan to add more code to that repository as I keep learning new things.

Also, I created a Twitter account for the blog, just to separate it from my personal account. You can find it here: @AzureCoderBlog and my personal account is: @lucas_lra

Let’s begin.

Setting up the Environment

The first thing we need to do is set up our environment with all the tools.

If you are on Windows, you can use the CreateEnvironment.bat script that is available as part of the source code. This script will create the entire environment for you. But if you don’t want to miss the fun, just follow the step-by-step.

    1. Install Python 3
      • If you don’t know anything about python, just download the installer from this page.
      • You are going to LOVE it.
    2. Clone the GitHub repository
    3. Navigate to the project folder
    4. Create a Python Virtual Environment
    5. Activate your Virtual Environment
    6. Install the required packages (this step may take a while, and needs internet connection)
    7. Finally, start Jupyter Notebooks!

You should now see a screen like this:

Jupyter Notebook

As you can see, this is a file explorer that shows everything in the folder you are running from. What we want to do now is open the notebook: World-Population-by-Continent-[1980-2010].ipynb

What you should be seeing now is a kind of in-browser text editor filled with text and Python code:

Population Notebook

I won’t go into the specifics of how to navigate a Jupyter Notebook in this post, but you can learn everything you are going to need in this documentation.

To execute each block of our notebook, we are going to use the shortcut SHIFT+ENTER. This shortcut will execute the current block and jump to the next.

While I tried to make the notebook as self-explanatory as possible, I would like to go over the blocks of code and try to explain what is happening.

We start by importing all the packages that we are going to need in the execution of our script.

As mentioned before, pandas is what we are going to use for the data analysis, matplotlib is responsible for the graph generation, and itertools is a standard-library Python module used to do lots of awesome stuff with iterable types.
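
A minimal sketch of such an import cell. The aliases pd and plt are the usual community conventions and are an assumption here, not something confirmed by the notebook:

```python
import itertools                  # standard library: helpers for working with iterables

import matplotlib.pyplot as plt   # plotting (alias "plt" is an assumed convention)
import pandas as pd               # data analysis (alias "pd" is an assumed convention)
```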

Next we are going to import our dataset.

Really simple, isn’t it? pandas has lots of functions like this for importing different data formats, such as pd.read_excel() or pd.read_json(). This CSV file, as I mentioned before, was obtained from the data.gov website.
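
As a rough illustration of the import step, here is a sketch that reads a tiny in-memory CSV with pd.read_csv(). The CSV text, its values and the variable name population are all made up for the example; with the real dataset you would pass the path of the downloaded file instead:

```python
import io

import pandas as pd

# Made-up stand-in for the real file: placeholder values, not real population
# figures. The first header cell is left empty on purpose.
csv_text = """,1980,1981
North America,320.0,325.0
Bermuda,0.05,--
"""

# pd.read_csv accepts a file path or any file-like object.
population = pd.read_csv(io.StringIO(csv_text))
print(population.head())
```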

The next step is to try to make the data a little better. I started by naming the column with the names of the places.

This was tricky for me at first sight, but what is happening here is that I am copying the titles of all the columns of the dataset to a separate list object. After that, I rename the first item of this list and then, finally, apply the entire list as the new set of column names for the pandas.DataFrame. Looks weird, but works like a charm.
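
The copy, rename and re-apply steps described above could look roughly like this. The toy DataFrame and the new column name "Region" are assumptions made for the sketch:

```python
import pandas as pd

# Toy frame; "Unnamed: 0" stands in for a first column that has no useful title.
population = pd.DataFrame({
    "Unnamed: 0": ["North America", "Bermuda"],
    "1980": ["320.0", "0.05"],
})

columns = population.columns.tolist()  # copy the column titles into a plain list
columns[0] = "Region"                  # rename the first item (hypothetical name)
population.columns = columns           # apply the whole list as the new column names

# A one-step alternative that pandas also offers:
# population = population.rename(columns={population.columns[0]: "Region"})
```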

The next problem we need to address is that the population data in the DataFrame is recognized by the script as str! We need this data to be numeric if we want to do any operations with it, so let’s do this.

So here we are basically iterating through the DataFrame and using the pandas.to_numeric() function to convert the values. Also, we are using the errors='coerce' option so that any value that cannot be parsed as a number becomes NaN instead of raising an error.
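
A small sketch of that conversion, assuming the first column holds the place names and every other column is a year. The toy values and the "--" placeholder for missing data are made up for the example:

```python
import pandas as pd

population = pd.DataFrame({
    "Region": ["North America", "Bermuda"],
    "1980": ["320.0", "0.05"],
    "1981": ["325.0", "--"],   # "--" stands in for a missing value
})

# Convert every year column; values that cannot be parsed become NaN.
for column in population.columns[1:]:
    population[column] = pd.to_numeric(population[column], errors="coerce")

print(population.dtypes)
```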

Great! Now all the data in the DataFrame is prepared. So I started thinking, what if I wanted to do some data analysis based on the type of place? (Like, is it a Country? A Continent?) I realized that I would need to add one extra piece of data to the DataFrame, and this is how I did it.

I decided that I just wanted to tag the Continents in the DataFrame, so any other record will be tagged with a simple - (dash), which we will ignore later. To be quite honest, I don’t like this approach but, so far, I don’t know any other.
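
One possible way to do this kind of tagging is with Series.isin(). This is only a sketch: the list of continent labels, the "Continent" tag value and the numbers are assumptions, and it is not necessarily how the notebook does it:

```python
import pandas as pd

# Toy frame with placeholder values, not real population figures.
population = pd.DataFrame({
    "Region": ["Africa", "Bermuda", "Europe"],
    "1980": [100.0, 1.0, 200.0],
})

# Hypothetical list of labels to treat as continents; the real file may name them differently.
continents = ["Africa", "Europe"]

# Default every row to "-" and then overwrite the rows whose Region is a continent.
population["Region Type"] = "-"
population.loc[population["Region"].isin(continents), "Region Type"] = "Continent"
```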

Next! Let’s actually filter the dataset down to just the continents data.
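
Filtering on the tag column can be done with a boolean mask. A minimal sketch, assuming the tag value is "Continent" and using made-up rows:

```python
import pandas as pd

# Toy frame with placeholder values; "Region Type" is the tag column.
population = pd.DataFrame({
    "Region": ["Africa", "Bermuda", "Europe"],
    "Region Type": ["Continent", "-", "Continent"],
    "1980": [100.0, 1.0, 200.0],
})

# Keep only the rows tagged as continents; the "-" rows are dropped.
continents_only = population[population["Region Type"] == "Continent"]
print(continents_only)
```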

Ok! Now let’s pause and have a look at the state of our DataFrame:

Split DataFrame

Looking good, isn’t it? We only have the five rows for the continents, our Region Type column is correctly filled, and the columns are all there. Now what? First, let’s set up two small lists of markers and colors that we are going to use in our graph.

Those are all codes for markers and colors that matplotlib can understand. You can find more documentation about this here. Also, we are using itertools.cycle() to generate these lists. Why? Well, the reason is that this object type allows us to iterate through it an arbitrary number of times: after reaching the last item, it always goes back to the first one. That allows us to have any number of data entries in our DataFrame and still have enough markers and colors.
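
To show the wrap-around behaviour of itertools.cycle(), here is a small sketch. The specific marker and color codes are just examples that matplotlib accepts, not necessarily the ones used in the notebook:

```python
import itertools

markers = itertools.cycle(["o", "s", "^", "D"])  # circle, square, triangle, diamond
colors = itertools.cycle(["b", "g", "r"])        # blue, green, red

# Ask for more pairs than either list holds: both cycles simply start over.
picked = [(next(markers), next(colors)) for _ in range(5)]
print(picked)
```

Because the two lists have different lengths, the marker and color combinations only start repeating after twelve picks.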

And with that, our preparations are done. Let’s start setting up our graph by configuring a Figure().

Here we are configuring our font size globally for matplotlib, which will allow us to use relative sizes later. We are also creating the Figure(), which will be the canvas for our plotting, and the actual subplot, which will contain our visualization.
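
A sketch of this kind of setup. The font size, the figure size and the title text are assumptions chosen for the example:

```python
import matplotlib.pyplot as plt

# Global base font size (the value 12 is an assumption).
plt.rcParams.update({"font.size": 12})

fig = plt.figure(figsize=(16, 8))  # the canvas; this size is an assumption
ax = fig.add_subplot(1, 1, 1)      # a single subplot that will hold the visualization

# With a global base size set, later sizes can be relative, e.g. "x-large".
ax.set_title("World population by continent", fontsize="x-large")
```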

Now, let’s actually plot our graph to the Figure.

I’ll try to explain everything in this not-so-awesome-looking code.

  • Lines 2-3
    • Here we are just converting our data to lists for easier handling
  • Lines 6;33 (Shame on me)
    • I couldn’t find a nice way of fitting all the y axis on the graph width so, my trick was to add two fake and empty axes, just to readjust the graph width and make it better.
    • I am REALLY sorry for this one, it doesn’t look good AT ALL. I’ll find a better solution next time 😛
  • Line 7
    • Our for will iterate through all the columns that represent a year on the DataFrame
  • Lines 8-14
    • This is where we are adding our data points to the graph. I should say that this is the most important part of the process.
    • The first two parameters of the call are the ones that define our data point; the rest is configuration, which you can learn more about here.
  • Lines 16-33
    • This is where we set some annotations (in this case, text) on our data points, ensuring that we can really understand the plotted data
  • Lines 36-37
    • Here we enable the Grid and the legend (upper right corner) for our Graph
  • Lines 40-41
    • Here we add some style to the labels around the Graph, even rotating them by 30 degrees.
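
For readers who want something runnable, here is a compact sketch of the same ingredients: data points, annotations, grid, legend and rotated labels. It is not the notebook’s code and does not follow the line numbers above. It loops over continents rather than year columns, uses made-up values, and swaps the fake-axes trick for ax.margins(), which is just one possible alternative:

```python
import itertools

import matplotlib.pyplot as plt
import pandas as pd

# Toy data with placeholder values, not real population figures.
continents_only = pd.DataFrame({
    "Region": ["Africa", "Europe"],
    "1980": [100.0, 200.0],
    "1990": [150.0, 210.0],
    "2000": [200.0, 220.0],
})
years = [c for c in continents_only.columns if c != "Region"]
year_numbers = [int(year) for year in years]  # numeric x values

markers = itertools.cycle(["o", "s", "^"])
colors = itertools.cycle(["b", "g", "r"])

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(1, 1, 1)

# One line per continent: x is the year, y is the population value.
for _, row in continents_only.iterrows():
    values = row[years].tolist()
    ax.plot(year_numbers, values, marker=next(markers),
            color=next(colors), label=row["Region"])
    # Write each value as text slightly above its data point.
    for x, y in zip(year_numbers, values):
        ax.annotate(f"{y:.0f}", xy=(x, y), xytext=(0, 8),
                    textcoords="offset points", ha="center", fontsize="small")

ax.margins(x=0.1)                           # horizontal padding around the data
ax.grid(True)                               # enable the grid
ax.legend(loc="upper right")                # legend in the upper right corner
ax.tick_params(axis="x", labelrotation=30)  # rotate the x labels by 30 degrees
```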

And that is it! That is the entire code. So where is our final result?

Final Graph

HA! It worked! Eureka!

And that is actually everything I have to show for my efforts on Data Science, so far 😀

I can’t stress enough the fact that I am NOT a specialist in any Data-Science related technology so, please, don’t take anything from my code as a best practice.

I also can’t stress enough that I do love Python, and I bet you are going to like it too, if I don’t ruin it for you with my ugly code.

And that is all for today! Till next time!

Sources