
Pandas = Cuddly… Data?

This time, we will concentrate on the Pandas DataFrame and dealing with a semi-real-world scenario. You'll need to download a CSV file from kaggle.com. The link is https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery, and the file you need to get is 'BreadBasket_DMS.csv'. Once you have that downloaded, create a folder to hold the code and the CSV file. This could all be done within the Python shell, but creating a file will make it easier to deal with until you get familiar with the commands and concepts that we will be using. We'll be building the code file step by step as we go along. I've put the source code up on Pastebin at https://pastebin.com/uG1QSkmN to make it easy for you.

The data that the CSV file holds is really rather simple. There are just four columns…

• Date
• Time
• Transaction (number)
• Item

and 21,293 rows. To begin with, we will create a DataFrame by importing the data from the CSV file. You can also create DataFrames from database tables, but that's an article for another day. Here is a sample of what the base CSV file looks like…

Date,Time,Transaction,Item
2016-10-30,09:58:11,1,Bread
2016-10-30,10:05:34,2,Scandinavian
2016-10-30,10:05:34,2,Scandinavian
2016-10-30,10:07:57,3,Hot chocolate
2016-10-30,10:07:57,3,Jam
2016-10-30,10:07:57,3,Cookies
2016-10-30,10:08:41,4,Muffin

Of course, this is just the first 8 lines from the file.
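If you want to peek at those first lines yourself before pandas gets involved, a minimal sketch (assuming the CSV sits in the same folder as your script) is just plain Python file handling…

# Print the header plus the first 7 data rows of the raw CSV file
with open('BreadBasket_DMS.csv') as f:
    for _ in range(8):
        print(f.readline(), end='')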

To get started, we'll import Pandas (just like we did last month), define the filename of the CSV file, and create the DataFrame from the CSV file.

import pandas as pd

filename = 'BreadBasket_DMS.csv'
df = pd.read_csv(filename)
print(df)

What you will see is something like the data shown above. All of the data is really there, but Pandas only shows a portion of the DataFrame information. Now, to do any kind of work with the data, you will need to know the names of the columns. Many times, when we are working, we either don't have time or don't take the time to make notes carefully. This is especially true when we deal with really large data files with more than 50 columns. That can take more time than we have. Thankfully, Pandas has a simple command that we can use to get all of our column headers. The command is 'df.columns.values.tolist()'. We can use it like this…

# get and display a list of the column names (headers)
col_list = df.columns.values.tolist()
print(col_list)

This will give us…

['Date', 'Time', 'Transaction', 'Item']

We can also simply call df.count() and it will show us something like this…

print(df.count())

Date           21293
Time           21293
Transaction    21293
Item           21293
dtype: int64
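Another handy pair of calls (not part of the original listing, just a sketch using standard pandas methods) summarizes the columns, non-null counts and inferred data types, and shows the first few rows, without printing the whole DataFrame…

# Summarize column names, non-null counts and dtypes in one call
df.info()

# Peek at the first five rows to confirm the load looks right
print(df.head())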

So now, we have created and loaded our DataFrame, we know the column headers, and we know the number of data rows that we have. Now, let's see how many individual dates we are dealing with. To do that, we can create a list (almost like we did to get the column header list) by using the following command…

datelist = df['Date'].unique().tolist()

Then we can print the length of the list to know how many unique dates we are dealing with. We can also include the earliest and latest date that we have data for.

print(len(datelist), min(datelist), max(datelist))

which results in…

159 2016-10-30 2017-04-09

Keep the datelist variable in mind for later on.
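Note that the dates are plain 'YYYY-MM-DD' strings, so min() and max() happen to work by string comparison. If you want real date objects, a short sketch that goes slightly beyond the article's code is to let pandas parse them…

# Parse the unique date strings into real datetime objects so that
# date arithmetic (differences, resampling, etc.) becomes possible
dates = pd.to_datetime(datelist)
print(dates.min(), dates.max(), (dates.max() - dates.min()).days)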

We also know that we have the 'Item' column. We can do the same thing to see how many unique items we have.

itemlist = df['Item'].unique().tolist()
print(itemlist)
print(len(itemlist))

I won't print the entire item list here, but there are 95 unique items.

['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies', 'Muffin', 'Coffee', …, 'Mortimer', 'Raw bars', 'Tacos/Fajita']

Ok, so now we know that our DataFrame has 4 data columns, 159 unique dates between 2016-10-30 and 2017-04-09, and 95 unique items, and all in less than 20 lines of code and about 5 minutes of actual work.
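If all you need is the count itself, pandas also has a direct method for it; this one-line sketch is equivalent to len(itemlist)…

# Count distinct items without building the full list first
print(df['Item'].nunique())   # 95 for this dataset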

Now, before we go any further, it would be a good idea to think about some of the questions that would be asked about our data… probably by the boss. Some of them might be…

• By day, how many of each unique item was sold?
• By item, what were the top 10 sellers? What were the bottom 10?
• By day, what were the busiest times?

Before we can answer these questions, we have to come up with a plan for each. So, let's start with question #1… By day, how many of each unique item was sold? We know that our data is broken down by date, time of each sale (transaction), and each item sold. In addition, each sale has a unique transaction number that is duplicated if there were multiple items in that sale. For example, let's look at two sales.

Sale #1 (transaction 1954) was completed on 2016-11-24 at 10:18:24, and was for three items: two bread items and one coffee. Sale #2 (transaction 1955) was completed on the same day at 10:23:10, and was for two items. So how would we structure our research to accomplish the task? If I were to simply look at a single day, I would get all of the records for the day in question and sort the records by the Items sold. I would then count each unique item that was sold for that day. Using the five-record set above, it would look something like this…

Date       | Item      | Count
-----------|-----------|------
2016-11-24 | Bread     | 2
           | Coffee    | 2
           | Alfajores | 1
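As a quick sketch of that single-day count (using the same date as the example, and pandas' standard value_counts() method rather than anything from the article's listing)…

# Filter the DataFrame down to one date, then count how many times each
# item appears on that day
one_day = df[df['Date'] == '2016-11-24']
print(one_day['Item'].value_counts())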

Or, to put it another way, I'd group the records by Date, then by Item, and count (and record) each occurrence of each unique item. So, how would we get from the output of a simple set of records to a command set that gets us what we want for the full data set? The key is in the words 'group' and 'count'. There are helper functions in Pandas that accomplish each of the two phases of the task. The first is 'groupby', and we have already used the other, 'count'.

#1 - By Date, show how many of each item were sold…
# produces a Series data object
byDate = df.groupby(['Date','Item'])['Date'].count()
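Since byDate is a Series with a two-level index (Date, then Item), one day's counts can be pulled out with .loc; a small sketch (not in the original listing) for spot-checking the result…

# Show the per-item counts for a single date from the grouped Series
print(byDate.loc['2016-11-24'])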

So now, we know how to get the data for the boss for question #1. How about question #2…

#2 - By item, what were the top 10 sellers? What were the bottom 10?

Again, we want to find the top 10 sellers as well as the bottom 10. Here we want to group by Item, counting each Transaction number within each group. Then we want to make sure the items are sorted from high to low. The .head() and .tail() helper methods will give us the answers we need.

# By item, what were the top 10 sellers? What were the bottom 10?
sorteditemcount2 = df.groupby('Item')[['Transaction']].count().sort_values('Transaction', ascending=False)
print(sorteditemcount2)
print(sorteditemcount2.head(10))
print(sorteditemcount2.tail(10))
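An equivalent way to get just the two ends of that ranking, skipping the full sort, is a sketch using pandas' standard nlargest() and nsmallest() Series methods (not the article's code)…

# Count transactions per item, then ask directly for the extremes
itemcounts = df.groupby('Item')['Transaction'].count()
print(itemcounts.nlargest(10))    # top 10 sellers
print(itemcounts.nsmallest(10))   # bottom 10 sellers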

#3 - By day, what were the busiest times?

Once again, we can group by date and time, then count the number of transaction items.

df.groupby(['Date','Time'])[['Transaction']].count()

                     Transaction
Date       Time
2016-10-30 09:58:11            1
           10:05:34            2
           10:07:57            3
           10:08:41            1
           10:13:03            3
           10:16:55            3
           10:19:12            4
…                              …
2017-04-09 10:08:23            2
           10:36:11            1
           10:39:07            2
           10:52:02            3
           11:14:28            1
           11:17:43            2
           11:22:01            1

[9531 rows x 1 columns]
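That listing still has one row per individual timestamp, so spotting the busiest times takes a little squinting. One way to make it more readable is to bucket the transactions by hour; this is a sketch that goes beyond the article's code, and the 'Hour' column is something invented here for illustration…

# Take the first two characters of the Time string ('09', '10', …) as an
# hour bucket, then count distinct transactions per date and hour
df['Hour'] = df['Time'].str[:2]
print(df.groupby(['Date', 'Hour'])['Transaction'].nunique())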

So now we have answers for the boss, and all of the work could still have been done within the Python shell. I have created a simple program that contains all of the things that we did in one easy-to-see file. You can find it on Pastebin at https://pastebin.com/uG1QSkmN. Next month, we'll continue dealing with Pandas and Python, this time looking at a different dataset. Until then, have fun!
