Pandas and Python and Code … Oh My!
I try to keep abreast of new happenings in the Python and programming worlds in general. Lately, I have been seeing a number of Python articles having to do with Data Science or Machine Learning that involve a Python library called Pandas. I'd heard of it before, but never took the time to find out more about it. Recently, I learned about it, and I am glad I did!
Pandas, to quote from their web page, “is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
You can find out a lot more about it at https://pandas.pydata.org/. There is a wealth of information on the Internet about it. To install it, you can do a pip install pandas. There are, of course, some dependencies for Pandas. They are:
• Python 2.7 or higher (as of January 1, 2019, it will only work under Python 3.5 or higher)
• setuptools 24.2.0 or higher
• NumPy 1.9.0 or higher
• python-dateutil 2.5.0 or higher
• pytz
So, given that Pandas is such an important library for Data Science, I plan to spend a few articles on it. Now, I am not going to try to teach you a large amount about Pandas in this quick article. I'm only going to try to show you some of the neat things that Pandas can do. We'll go into Pandas in depth in later articles.
Pandas can deal with three different types of data structures:
• Series
• DataFrame
• Panel
A Series is a 1D labeled array that is size-immutable and contains homogeneous data (all of the same data type). A DataFrame is a 2D labeled tabular structure that is size-mutable, contains heterogeneous data (data of different types), and is a container for Series structures. A Panel is a 3D labeled, size-mutable array that contains heterogeneous data and is a container for DataFrame structures (Panel was deprecated in Pandas 0.20 and removed in 0.25, so current releases offer only Series and DataFrame). All Pandas data structures are value-mutable (their contents can be changed), but only DataFrame and Panel structures are size-mutable.
All of that dry information is nice, but to really appreciate how easy Pandas makes dealing with data, let's play a little bit. One of the best things about Pandas is that many times, you can do much of the work in a Python shell.
So, assuming you've gotten the Pandas library, open up a Python3 shell. The first thing you need to do is import the Pandas library.
import pandas as pd
For those of you who haven't been doing Python for very long: we use 'as pd' to create an alias for the library, so that instead of referring to every command as 'pandas.command', we can just type 'pd.command'.
Series Data Structures
Now let's create a simple list of ten random integers and name it 'data'.
data = [20,10,42,73,90,18,37,26,19,98]
Now we can create a Pandas Series data structure with the pd.Series() command.
sd = pd.Series(data)
That's all there is to it. Now let's see what it looks like…
print(sd)
0    20
1    10
2    42
3    73
4    90
5    18
6    37
7    26
8    19
9    98
Notice our list of integers is exactly as we entered it and that there is an index added for us as well. This is the DEFAULT index. We can make it different if we wish, which we'll see later.
Also, you might notice that at the end of the output from almost all of the Pandas code we will be doing, you will see something like dtype: int64. That is there to show you what the data type is. I’ve taken it out of the output listings to save space in the article.
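If you want to check the data type yourself instead of reading it off the end of the output, a Series has a dtype attribute. Here is a minimal sketch; the name 'mixed' and its values are made up purely for illustration.

```python
import pandas as pd

data = [20, 10, 42, 73, 90, 18, 37, 26, 19, 98]
sd = pd.Series(data)

# The dtype attribute reports the single type shared by every value.
print(sd.dtype)

# Adding one floating-point value makes the whole Series floating point,
# because a Series holds homogeneous data.
mixed = pd.Series([20, 10, 42.5])
print(mixed.dtype)
```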
Now, if we just want a quick peek at the data, we can use the .head() or .tail() command. Here is an example of the .head() command showing the first five items.
sd.head(5)
0    20
1    10
2    42
3    73
4    90
The .tail() command works the same way, showing the end of the data list.
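For completeness, here is a small sketch of .tail() on the same data. It asks for the last three items; when no number is given, both .head() and .tail() return five.

```python
import pandas as pd

data = [20, 10, 42, 73, 90, 18, 37, 26, 19, 98]
sd = pd.Series(data)

# The last three items keep their original index labels (7, 8 and 9).
last_three = sd.tail(3)
print(last_three)

# With no argument, head() and tail() each return five items.
print(len(sd.head()), len(sd.tail()))
```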
If we want to simply see one item out of our list, we can use the index…
print(sd[4])
90
Now assume we want to see items 4, 5 and 6. We do it this way…
print(sd.loc[4:6,])
4    90
5    18
6    37
While this command looks somewhat strange, I'll break it down so it makes more sense…
.loc[] is an indexer command. It works with both the Series and DataFrame data structures, and it can be very powerful. The command works like this…
.loc[rowslice,columnslice]
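Since we are using a Series structure, we have only one column, so we only work with the row indexer portion of the command; that is why nothing follows the comma, and sd.loc[4:6] without the comma gives the same result. Notice, too, that unlike a normal Python slice, a .loc slice includes both end labels, which is why item 6 is in the output. We'll look at the .loc command some more when we deal with DataFrames.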
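To make the slicing behaviour concrete, here is a short sketch comparing .loc with its position-based sibling .iloc, which this article does not otherwise cover. The variable names are just for illustration.

```python
import pandas as pd

sd = pd.Series([20, 10, 42, 73, 90, 18, 37, 26, 19, 98])

# .loc slices by label and includes BOTH endpoints.
by_label = sd.loc[4:6]
print(by_label.tolist())      # [90, 18, 37]

# .iloc slices by position, like a Python list, and excludes the stop value.
by_position = sd.iloc[4:6]
print(by_position.tolist())   # [90, 18]
```

With the default index, the labels and the positions happen to be the same numbers, so the only visible difference is whether the end of the slice is included.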
Now getting back to custom indexes. We can create the index as a second parameter to the Series command like this…
sd = pd.Series(data,index=['One','Two','Three','Four','Five','Six','Seven','Eight','Nine','Ten'])
print(sd)
One      20
Two      10
Three    42
Four     73
Five     90
Six      18
Seven    37
Eight    26
Nine     19
Ten      98
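Once the index holds our own labels, we can use those labels to get at the data. The following is a minimal sketch; the variable names 'labels' and 'middle' are only for illustration.

```python
import pandas as pd

data = [20, 10, 42, 73, 90, 18, 37, 26, 19, 98]
labels = ['One', 'Two', 'Three', 'Four', 'Five',
          'Six', 'Seven', 'Eight', 'Nine', 'Ten']
sd = pd.Series(data, index=labels)

# A single item, looked up by its label.
print(sd['Five'])             # 90

# A label slice with .loc runs from 'Four' through 'Six', both included.
middle = sd.loc['Four':'Six']
print(middle.tolist())        # [73, 90, 18]

# Position-based access still works through .iloc.
print(sd.iloc[4])             # 90
```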
One of the things that I really like about Pandas is its built-in data analysis helper functions. Here is a quick sample…
sd.sum()
433
sd.count()
10
sd.min()
10
sd.max()
98
sd.describe()
count    10.000000
mean     43.300000
std      32.107631
min      10.000000
25%      19.250000
50%      31.500000
75%      65.250000
max      98.000000
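Each number in the .describe() output can also be requested on its own. Here is a brief sketch using the same list with the default index; the quantile call is one way to get the 25% figure.

```python
import pandas as pd

sd = pd.Series([20, 10, 42, 73, 90, 18, 37, 26, 19, 98])

print(sd.mean())           # the 'mean' row of describe()
print(sd.median())         # the '50%' row
print(sd.std())            # the 'std' row (sample standard deviation)
print(sd.quantile(0.25))   # the '25%' row

# idxmax() returns the index label of the largest value rather than the value.
print(sd.idxmax())
```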
DataFrame Data Structures
Now that I’ve shown you some of the things that can be done with a simple Series data structure, let’s look at the DataFrame. I stated earlier that a DataFrame is a 2D tabular structure. Think of a spreadsheet or database table and that is pretty much what a DataFrame looks like. We can create a DataFrame from any of the following:
• Lists
• Dictionaries
• Series
• NumPy ndarrays
• Other DataFrames
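The easiest way to show you a DataFrame in action is to create one from a small dictionary named data2, with one key per column (Name, Age, Gender and Department) and a list of four values for each key.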
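The dictionary listing is not reproduced in the text, so the following is a sketch rebuilt from the DataFrame output shown later in this article. The values come from that output; the layout of the dictionary itself is an assumption.

```python
# Rebuilt from the DataFrame output in this article; the exact layout
# of the original listing is an assumption.
data2 = {
    'Name': ['Greg', 'Sam', 'Mary', 'Lois'],
    'Age': [65, 34, 41, 27],
    'Gender': ['M', 'M', 'F', 'F'],
    'Department': ['Management', 'Development',
                   'Human Resources', 'Development'],
}
```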
As you can see, there will be four rows and four columns. And you can also see that the data types are mixed. Just as we did when we created the Series data structure, we simply call the .DataFrame command with our data (there are other parameter options that we’ll discuss another time).
df = pd.DataFrame(data2)
Now to see what the structure looks like to Pandas, we just call the structure.
df
   Name  Age Gender       Department
0  Greg   65      M       Management
1   Sam   34      M      Development
2  Mary   41      F  Human Resources
3  Lois   27      F      Development
Like I said earlier, it resembles a spreadsheet. Pretty much everything we did with the Series structure, we can do with the DataFrame. Let’s do something useful with the data. We’ll create a Series structure based on the Age column.
age = df['Age']
age
0    65
1    34
2    41
3    27
Now that we have our age Series structure, let’s get the sum of the values…
age.sum()
167
It’s SO easy to deal with data this way.
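The other helper functions work on a column in just the same way. Here is a minimal sketch; the DataFrame is rebuilt from the values shown in the output above, so the dictionary literal is an assumption about how the data was entered.

```python
import pandas as pd

# Values taken from the DataFrame output in this article.
df = pd.DataFrame({
    'Name': ['Greg', 'Sam', 'Mary', 'Lois'],
    'Age': [65, 34, 41, 27],
    'Gender': ['M', 'M', 'F', 'F'],
    'Department': ['Management', 'Development',
                   'Human Resources', 'Development'],
})

# Selecting one column gives back a Series.
age = df['Age']
print(type(age))

# The Series helpers apply to that column directly.
print(age.mean())             # 41.75
print(age.min(), age.max())   # 27 65
```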
As we did with the Series structure, we can get the data for just one row by using the .loc command.
df.loc[0]
Name                Greg
Age                   65
Gender                 M
Department    Management
Name: 0, dtype: object
Notice we have to use the index that was created for us automatically. We can’t do something like df.loc['Greg'], since 'Greg' is not an index item. HOWEVER, there is a cool way to fix that. We can use the .set_index(ColumnName, inplace=True) command to remove the default index and replace it with a column of our choice. In this case, we’ll use the 'Name' column…
df.set_index('Name',inplace=True)
Now we can see our data structure after the change…
df
      Age Gender       Department
Name
Greg   65      M       Management
Sam    34      M      Development
Mary   41      F  Human Resources
Lois   27      F      Development
Now our index has been replaced by the Name column. NOW we can get the information on just Greg…
df.loc['Greg']
Age                   65
Gender                 M
Department    Management
Name: Greg, dtype: object
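With a DataFrame, the column half of .loc[rowslice,columnslice] comes into play as well. The sketch below also uses .set_index() without inplace=True, which returns a new DataFrame and leaves the original alone. The names 'by_name', 'subset' and 'restored' are only for illustration, and the data is rebuilt from the output shown above.

```python
import pandas as pd

# Values taken from the DataFrame output in this article.
df = pd.DataFrame({
    'Name': ['Greg', 'Sam', 'Mary', 'Lois'],
    'Age': [65, 34, 41, 27],
    'Gender': ['M', 'M', 'F', 'F'],
    'Department': ['Management', 'Development',
                   'Human Resources', 'Development'],
})

# Without inplace=True, set_index() returns a new DataFrame; df is unchanged.
by_name = df.set_index('Name')

# One row label plus one column label gives a single value.
print(by_name.loc['Greg', 'Age'])   # 65

# Lists of labels select several rows and columns at once.
subset = by_name.loc[['Sam', 'Lois'], ['Age', 'Department']]
print(subset)

# reset_index() turns 'Name' back into an ordinary column.
restored = by_name.reset_index()
print(list(restored.columns))
```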
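One more thing we can do with a DataFrame is get extended information about it using the .info() command. The listing here was captured from the DataFrame as we first created it, before we made 'Name' the index, which is why it still reports a RangeIndex and four columns.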
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
Name          4 non-null object
Age           4 non-null int64
Gender        4 non-null object
Department    4 non-null object
dtypes: int64(1), object(3)
memory usage: 208.0+ bytes
I hope that I have generated some interest about Pandas. Next time, we’ll look deeper at the DataFrames in Pandas. Until then, keep coding!
