Pandas = Cuddly... Data?

This time, we will concentrate on the Pandas DataFrame, and on dealing with a semi-real-world scenario.

You'll need to download a CSV file from kaggle.com. The link is https://… Once you have that downloaded, create a working folder for this project and put the CSV file into it.
The data that the CSV file holds is really rather simple. There are just four columns...
• Date
• Time
• Transaction (number)
• Item
and 21,293 rows.
To begin with, we will create a DataFrame by importing the data from the CSV file. You can also create DataFrames from database tables, but that's an article for another day. Here is a sample of what the base CSV file looks like...

Date,Time,Transaction,Item
2016-10-30,…
2016-10-30,…
2016-10-30,…
2016-10-30,…
2016-10-30,…
2016-10-30,…
2016-10-30,…

Of course, this is just the first 8 lines from the file.
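If you want to check that your downloaded copy matches this, a few lines of plain Python will show it. This snippet is my own addition rather than part of the article, and 'BreadBasket_DMS.csv' is a stand-in for whatever your file is called.

# Standard-library sanity check (not from the article): print the first
# 8 lines of the CSV. Replace the filename with your downloaded file.
with open('BreadBasket_DMS.csv') as f:
    for _ in range(8):
        print(f.readline().rstrip())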
To get started, we'll import Pandas (just like we did last month), define the filename of the CSV file, and create the DataFrame from the CSV file.

import pandas as pd

# use whatever name your downloaded copy of the CSV has
filename = 'BreadBasket_DMS.csv'

df = pd.read_csv(filename)

print(df)

What you will see is something like the data shown above.

All of the data is really there, but Pandas only shows a portion of the DataFrame information.
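As an aside that isn't in the original article: if you would rather see more of the DataFrame when you print it, pandas lets you raise its display limits with the standard pd.set_option() call. A minimal sketch, with arbitrary limit values:

# widen pandas' print limits; the numbers here are arbitrary examples
pd.set_option('display.max_rows', 30)       # show up to 30 rows
pd.set_option('display.max_columns', None)  # never hide columns
print(df)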
Now, to do any kind of work with the data, you will need to know the names of the columns. Many times, when we are working, we either don't have time or don't take the time to make notes carefully. This is especially true when we deal with really large data files with more than 50 columns. That can take more time than we have. Thankfully, Pandas has a simple command that we can use to get all of our column headers. The command is 'df.columns.values.tolist()'.
# get and display a list of the column names (headers)

col_list = df.columns.values.tolist()

print(col_list)

This will give us...

['Date', 'Time', 'Transaction', 'Item']

We can also simply call df.count() and it will show us something like this...

print(df.count())

Date           21293
Time           21293
Transaction    21293
Item           21293
dtype: int64
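One note worth adding here: df.count() reports non-null cells per column, so if the four numbers ever disagree, some rows have missing values. For a plain row count, the built-in len() is enough; this line is my addition, not the article's.

# total number of rows, whether or not any cells are empty
print(len(df))   # 21293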
So now, we have created and loaded our DataFrame, we know the column headers, and we know the number of data rows that we have. Now, let's see how many individual dates we are dealing with. To do that, we can create a list (almost like we did to get the column header list) by using the following command…

datelist = df['Date'].unique().tolist()

Then we can print the length of the list to know how many unique dates we are dealing with. We can also include the earliest and latest date that we have data for.

print(len(datelist), min(datelist), max(datelist))

which results in…

159 2016-10-30 2017-04-09

Keep the datelist variable in mind for later on.
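Incidentally, min() and max() work directly on these values because the dates are ISO-formatted strings, which sort chronologically. If you ever need real date arithmetic, the standard pd.to_datetime() will parse the column; this sketch is my addition and leaves the DataFrame itself untouched.

# parse the Date strings into real datetime values for date arithmetic
dates = pd.to_datetime(df['Date'])
print(dates.min(), dates.max(), (dates.max() - dates.min()).days)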
We also know that we have the 'Item' column. We can do the same thing to see how many unique items there are...

itemlist = df['Item'].unique().tolist()

print(itemlist)

print(len(itemlist))

I won't print the entire item list here, but there are 95 unique items.

Ok, so now we know that we have a DataFrame that has 4 data columns, the data has 159 unique dates between 2016-10-30 and 2017-04-09, and 95 unique items in the DataFrame, and all in less than 20 lines of code and about 5 minutes of actual work.
Now, before we go any further, it would be a good idea to think about some of the questions that would be asked about our data… probably by the boss. Some of them might be...
• By day, how many of each unique item was sold?
• By item, what were the top 10 sellers? What were the bottom 10?
• By day, what were the busiest times?

Before we can answer these questions, we have to come up with a plan for each. So, let's start with question #1...

By day, how many of each unique item was sold?

We know that our data is broken down by date, time of each sale (transaction), and each item sold. In addition, each sale has a unique transaction number that is duplicated if there were multiple items in that sale. For example, let's look at two sales from the data.
Sale #1 (transaction 1954) was completed on 2016-11-24 at 10:18:24, and was for three items: two bread items and one coffee.

Sale #2 (transaction 1955) was completed on the same day at 10:23:10, and was for two items.

So how would we structure our research to accomplish the task? If I were to simply look at a single day, I would get all of the records for the day in question and sort the records by the Items sold. I would then count each unique item that was sold for that day. Using the five-record set above, it would look something like this...

Date       | Item      | Count
-----------|-----------|------
2016-11-24 | Bread     | 2
           | Coffee    | 2
           | Alfajores | 1
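To see that single-day logic as actual code before we generalize it, here is a small sketch (my addition; value_counts() is a standard pandas method):

# filter the DataFrame to one date, then count each unique item sold
one_day = df[df['Date'] == '2016-11-24']
print(one_day['Item'].value_counts())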
Or, to put it another way, I'd group the records by Date, then by Item, and count (and record) each occurrence of the unique item.

So, how would we get from the output of a simple set of records to a command set that gets us what we want for the full data set? The key is in the phrases 'group by' and 'count', and Pandas obliges with its .groupby() and .count() methods.

#1 - By Date, show how many of each item were sold...

# produces a Series data object: the count of each unique item per date

byDate = df.groupby(['Date', 'Item'])['Item'].count()
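Since byDate is a Series indexed by Date and then by Item, a plain .loc lookup pulls out one day's counts. A small usage sketch of my own:

# show the per-item counts for a single date
print(byDate.loc['2016-11-24'])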
So now, we know how to get the data for the boss for question #1. How about question #2...

#2 - By item, what were the top 10 sellers? What were the bottom 10?
Again, we want to find the top 10 sellers as well as the bottom 10. Here, we want to group by Item, counting each Transaction number within each group. Then we want to make sure the items are sorted from high to low. The .head() and .tail() helper routines will give us the answers we need.

# By item, what were the top 10 sellers? What were the bottom 10?

# count the transactions per item, then sort the counts high to low
sorteditemcount2 = df.groupby('Item')['Transaction'].count().sort_values(ascending=False)

print(sorteditemcount2)

print(sorteditemcount2.head(10))

print(sorteditemcount2.tail(10))
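For what it's worth, pandas also has a one-step shortcut for this particular question, since value_counts() returns counts already sorted from high to low. This is my addition rather than the article's approach:

# value_counts() counts and sorts descending in a single call
print(df['Item'].value_counts().head(10))   # top 10 sellers
print(df['Item'].value_counts().tail(10))   # bottom 10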
#3 - By day, what were the busiest times?
Once again, we can group by date and time, then count the number of transaction items.

# count transactions for every Date and Time pair
print(df.groupby(['Date', 'Time'])[['Transaction']].count())

The output shows one transaction count per Date and Time pair, running from the morning of 2016-10-30 through 2017-04-09, and ends with:

[9531 rows x 1 columns]
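The grouped output above still leaves the ranking to the reader. If you want pandas to rank the busiest times directly, one approach (my addition, assuming the Time values are 'HH:MM:SS' strings as in the CSV sample) is to group by the hour and sort:

# take the hour from 'HH:MM:SS', count transactions per hour, and sort
df['Hour'] = df['Time'].str.slice(0, 2)
busiest = df.groupby('Hour')['Transaction'].count().sort_values(ascending=False)
print(busiest.head())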
So now we have answers for the boss, and still, all of the work could have been done within the Python shell. I have created a simple program that contains all of the things that we did in one easy-to-see file. You can find it on pastebin at https://…

Next month, we'll continue dealing with Pandas and Python, this time looking at a different dataset. Until then, have fun!