Chapter 6: Hello columns

In this chapter we’ll begin our analysis by learning how to inspect a column from a DataFrame.

Accessing a column

We’ll begin with the prop_name column, which stores the proposition each committee sought to influence.

To see the contents of a column separate from the rest of the DataFrame, type the DataFrame’s variable name, then a period, then the column’s name.

committee_list.prop_name

That will list the column out as a Series, just like the ones we created from scratch in chapter three.

And, just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.

In this case, the column is filled with text, so it doesn’t make sense to calculate statistics like the median and average, as we did before.

Note

You can also access columns a second way, like this:

committee_list['prop_name']

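This method isn’t as pretty, but it’s required if your column has a space in its name, which would break the simpler dot-based method. It’s also required when a column’s name matches one of the DataFrame’s own methods or attributes, such as count.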
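Here is a minimal sketch comparing the two access styles. The tiny DataFrame is a hypothetical stand-in for the real committee_list, which comes from read_csv, and the "committee name" column is invented purely to show a name that contains a space.

```python
import pandas as pd

# Hypothetical sample data standing in for the real committee_list
committee_list = pd.DataFrame({
    "prop_name": ["Prop A", "Prop B", "Prop A"],
    "committee name": ["Yes on A", "No on B", "No on A"],  # invented column
})

# Both styles return the same Series
dot_style = committee_list.prop_name
bracket_style = committee_list["prop_name"]
same_column = dot_style.equals(bracket_style)
print(same_column)

# A column name containing a space only works with the bracket style
spaced = committee_list["committee name"]
print(spaced)
```

If you try committee_list.committee name instead, Python will stop with a syntax error before pandas ever sees the request.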

Counting a column’s values

There’s another built-in pandas tool that will total up the frequency of values in a column. In this case that could be used to answer the question: Which proposition had the most committees?

The method is called value_counts and it’s just as easy to use as sum, min or max. All you need to do is add a period after the column name and chain the method onto the tail end of your cell.

committee_list.prop_name.value_counts()

Run the code and you should see the lengthy proposition names ranked by their number of committees.
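To see the ranking on a small scale, here is a sketch that runs value_counts on a handful of made-up rows. The proposition names are placeholders, not values from the real data.

```python
import pandas as pd

# Hypothetical sample data: one row per committee
committee_list = pd.DataFrame({
    "prop_name": ["Prop A", "Prop B", "Prop A", "Prop C", "Prop A", "Prop B"],
})

# Count how many times each value appears, most frequent first
counts = committee_list.prop_name.value_counts()
print(counts)

# The first entry is the proposition with the most committees
top_prop = counts.index[0]
top_count = counts.iloc[0]
print(top_prop, top_count)
```

Because value_counts sorts from most to least frequent by default, the answer to "Which proposition had the most committees?" sits at the top of the result.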

Resetting a DataFrame

You may have noticed that even though the result has two columns, pandas did not return a clean-looking table in the same way as head did for our DataFrame.

That’s because our column, a Series, acts a little bit differently than the DataFrame created by read_csv.

In most instances, if you have an ugly Series generated by a method like value_counts and you want to convert it into a pretty DataFrame, you can do so by tacking the reset_index method onto the tail end.

committee_list.prop_name.value_counts().reset_index()
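The sketch below, again using made-up rows in place of the real data, checks what kind of object you get before and after the reset. The labels pandas gives the two resulting columns can differ between pandas versions, so the code prints them rather than assuming what they are.

```python
import pandas as pd

# Hypothetical sample data standing in for the real committee_list
committee_list = pd.DataFrame({
    "prop_name": ["Prop A", "Prop B", "Prop A"],
})

# value_counts returns a Series with the proposition names as its index
counts = committee_list.prop_name.value_counts()
print(type(counts).__name__)

# reset_index moves that index into an ordinary column, yielding a DataFrame
counts_table = counts.reset_index()
print(type(counts_table).__name__)

# Column labels vary by pandas version, so inspect them instead of guessing
print(counts_table.columns.tolist())
```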

Why do Series and DataFrames behave differently? Why does reset_index have such a weird name?

Like so much in computer programming, the answer is simply “because the people who created the library said so.”

That’s not worth stressing about in this case, but it’s important to learn that all open-source programming tools have their quirks. Over time you’ll learn pandas has more than a few.

As a beginner, you should just accept the oddities and roll with them. As you get more advanced, if you find something about a library that could be improved, consider contributing to its source code.