issue157:python

This is an old revision of the document!


We are all experiencing a new world due to COVID-19: stay-at-home orders, work-from-home orders, closed businesses, lost jobs, long lines at the grocery stores, shortages once you get into the stores, and social distancing. This is the new normal, at least for a while. Many “experts” are suggesting that we may never return to the “old” normal, and even more are suggesting that this will last for a year or longer.

We are presented with the number of confirmed cases, deaths and hospitalizations due to COVID-19 on every TV news show, radio show and on the Internet. Where are these numbers coming from, and how do we make sense of them? Luckily, those of us who know Python can, with just a little work, do some of the data analysis for ourselves and, with logic, see what the trends are really doing. The goal here is not to provide any answers, but to give you the ability to look at the data and see the trends for yourself. As the saying goes, “knowledge is power”. Way back in December 2018 (FCM#140), I talked about Pandas and Python. This month, we will use Pandas and Python to look at some of these numbers and make the graphs for ourselves. If you don’t have Pandas installed, please revisit Full Circle Magazine #140 to see the installation steps.

To get started, we need some data. I’m going to use a Comma Separated Values (CSV) data set available from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases . This data is compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and comes from many “trusted” sources. The site also has data sets for deaths and recoveries. When you get to the page, scroll down below the chart and find the first download “button” that has “time_series_covid19_confirmed_global.csv” to the left of it. This will download the CSV file to your machine. Now use LibreOffice Calc (or some other spreadsheet viewer) to open the file. In theory, you should be able to simply double-click on the downloaded file. Accept the import settings box.

PLEASE NOTE: I am using data that I downloaded on 5 May, 2020. Yours will be a little bit different, mainly in that it will have more data: more columns (one per day) and possibly more rows (since more regions can be added as new cases spread to more countries). The important thing to do here is to verify that the Country or Region that you are interested in is somewhere in Column B. It might be partially in Column B and partially in Column A. For example, if you are interested in Scotland, you would use the row marked “United Kingdom”, but if you want Greenland, you need to find “Denmark” in Column B, then “Greenland” in Column A. For the purposes of this article, we’ll use “US”, which is around row 227 (at least for now).

Create a convenient folder somewhere, copy the CSV file into the folder and open a terminal window in that folder. (I use “Open in Terminal” from the GUI file manager.) Now it’s time to do some coding. We’ll use the Python interpreter for this example. See box above. Look at the last line of the pandas head/tail dump. It says that there are 266 rows and 108 columns. We’ll get some of that information in a few moments. For right now, we’ll just grab the row that contains data for the US. If you wanted to find “Greenland”, use ‘Province/State’ instead of ‘Country/Region’ in the line above.
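The box itself is not reproduced on this page, so here is a minimal sketch of the row selection described above. It assumes the layout of the JHU file (Province/State, Country/Region, Lat, Long, then one column per date); the tiny sample_csv string and every number in it are invented purely for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the JHU file; all values are invented.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Norway,60.47,8.47,0,0,1
Greenland,Denmark,71.71,-42.60,0,0,0
,Denmark,56.26,9.50,0,0,0
,US,37.09,-95.71,1,1,2
"""
data = pd.read_csv(io.StringIO(sample_csv))

# Boolean mask on the Country/Region column keeps only the matching row(s).
s1 = data.loc[data["Country/Region"] == "US"]
print(s1)

# Territories are listed under Province/State, so filter on that column instead.
greenland = data.loc[data["Province/State"] == "Greenland"]
print(greenland)
```

The number printed at the left of each result is the row label, which is the value you would later pass to .iloc.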

Again, we want to verify a few things here. First, that the last data column is for the proper date (which is marked as “5/4/20”), that the row is for the proper country, and that there are (still) 108 columns. Secondly, the row index is shown here as 225. In the spreadsheet, however, the same row shows up as 227. That’s because the spreadsheet has a header row and starts counting at 1, while (remember) Python is ZERO BASED. Now we have some data that we can play with. BUT, we need to get a bit more information to make our programming easier.

>>> sh = data.shape
>>> print(sh)
(266, 108)
>>> lastcol = sh[1]
>>> print(lastcol)
108

Here we use data.shape (an attribute, not a method, so no parentheses) to get the number of rows and the number of columns in the dataframe. This comes back as a tuple, so we can assign its second element, sh[1] (108), to a variable “lastcol”. Now (top right) we will grab just the columns that contain the confirmed number of cases, from Column E (index 4, again zero based) through lastcol (108), for row 225. We’ll use the .iloc indexer to grab the row, start column and last column from the dataframe and assign the result to the variable s1a. So we now have data that we can use, but when the data was extracted, it came back as a data series, not a dataframe. So, we’ll convert it to a dataframe. See bottom right.
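As a rough sketch of the slicing and conversion just described (the original boxes are not shown on this page), the snippet below uses an invented three-row sample in the JHU layout. The row position us_row is specific to this sample, and reset_index() is only one plausible way to turn the Series into a dataframe; it is an assumption, chosen because it yields the “index” column mentioned in the text.

```python
import io

import pandas as pd

# Hypothetical miniature of the JHU file; all values are invented.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Norway,60.47,8.47,0,0,1
Greenland,Denmark,71.71,-42.60,0,0,0
,US,37.09,-95.71,1,1,2
"""
data = pd.read_csv(io.StringIO(sample_csv))

sh = data.shape          # (rows, columns) tuple
lastcol = sh[1]          # number of columns; slice end is exclusive

us_row = 2               # position of the US row in THIS sample (225 in the article's file)

# Row us_row, columns 4 up to (not including) lastcol: just the date columns.
s1a = data.iloc[us_row, 4:lastcol]
print(type(s1a))         # a pandas Series indexed by the date strings

# reset_index() moves the date labels into a regular column named "index".
df = s1a.reset_index()
print(df)
```

The second column of df is named after the row label (2 here, 225 in the article's data), which is why the article renames both columns afterwards.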

So now we’ve got data that we can almost play with. But first, let’s change the column headings to something proper and meaningful. We’ll change the “index” column header to “dtstring” and the 225 (numeric, not text) column header to ‘Cases’. See top right. It’s a very busy plot (bottom right), but you can definitely see the same kind of data you do from the news.

For the next part of our data examination, we need to calculate the number of new cases each day compared with the day before. This is SUPER easy with the .shift() method available in Pandas (below). Now let’s see the daily differences on a graph…

>>> df.plot(kind='line',x='dtstring',y='diff',color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x7f16e3ff3b90>
>>> plt.show()
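A minimal sketch of the renaming and the day-over-day calculation, assuming a dataframe with the “index” and 225 columns described above. The four data rows are invented, and the exact statements in the original boxes may differ.

```python
import pandas as pd

# Invented cumulative counts, laid out as the article's dataframe would be
# before renaming: a text "index" column and a numeric 225 column.
df = pd.DataFrame({
    "index": ["1/22/20", "1/23/20", "1/24/20", "1/25/20"],
    225: [1, 1, 2, 5],
})

# Give both columns meaningful names (note 225 is an int key, not a string).
df = df.rename(columns={"index": "dtstring", 225: "Cases"})

# shift(1) moves every value down one row, so each row is paired with the
# previous day's total; the first row has no predecessor and becomes NaN.
df["diff"] = df["Cases"] - df["Cases"].shift(1)
print(df)
```

Pandas also offers df["Cases"].diff(), which computes the same column in one call.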

Now that you have the basics of dealing with the data, you can go back to the beginning, where we pulled the data for the US, and change it to the country or region of your choice. For example, change the line…

s1 = data.loc[data['Country/Region']=='US']

To…

s1 = data.loc[data['Country/Region']=='Norway']

When you print the s1 data, you will see that the row for Norway is 175. So in the line where we got just the data columns for that row (from column 4 to the last column)…

s1a = data.iloc[225,4:lastcol]

You would change it to…

s1a = data.iloc[175,4:lastcol]

At this point, you would repeat all of the other steps to create and modify the dataframe so you can plot it.

What exactly should we take away from this data? That’s a very good question. The accuracy of this data is currently in somewhat of a (serious) question. There is speculation that the number of confirmed cases is low due to the lack, and the quality, of testing of the population in many areas. You can never be sure of the data unless you gather it yourself. In cases like this, you have no choice but to believe, with a grain of salt, that the data was taken with the best level of care.

With a little bit of creative web searching, you can find a lot more information on Pandas, various datasets, and the types of plots and options that you can use to show your data.

Until next month; stay safe, healthy, positive and creative!
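Since the row number changes with the country (and can change between downloads), one option is to look the row up instead of typing it in. The sketch below wraps the steps above in a hypothetical helper, cases_for_country, which is not part of the original article; the sample data is invented. It assumes the country has a national row with an empty Province/State cell, which is not true of every country in the JHU file.

```python
import io

import pandas as pd


def cases_for_country(data, country):
    """Hypothetical helper: return a dtstring/Cases dataframe for one country."""
    # Keep the national row only (Province/State is blank, read as NaN).
    match = data.loc[
        (data["Country/Region"] == country) & (data["Province/State"].isna())
    ]
    if match.empty:
        raise ValueError(f"No national row found for {country!r}")
    row_label = match.index[0]
    # Columns 0-3 are Province/State, Country/Region, Lat, Long; dates follow.
    series = data.loc[row_label].iloc[4:]
    df = series.reset_index()
    df.columns = ["dtstring", "Cases"]
    df["Cases"] = df["Cases"].astype(int)
    return df


# Hypothetical miniature of the JHU file; all values are invented.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Norway,60.47,8.47,0,0,1
Greenland,Denmark,71.71,-42.60,0,0,0
,Denmark,56.26,9.50,0,0,0
,US,37.09,-95.71,1,1,2
"""
data = pd.read_csv(io.StringIO(sample_csv))

norway = cases_for_country(data, "Norway")
print(norway)
```

Calling cases_for_country(data, "US") on the same dataframe returns the US series without any other line having to change.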

Black lines from the box on page 22

Now, we need to import two libraries, pandas and matplotlib.pyplot. Make sure that you alias them as shown…

Now, let’s set the filename of the .csv file into a variable…

Next, have Pandas read the spreadsheet into a dataframe…
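The code from the box is not included on this page; the three steps it describes might look roughly like the sketch below. So that it runs without the real download, it first writes an invented three-row sample to a temporary folder; with the real file you would simply set filename to the downloaded CSV. The alias pd is the conventional one (matplotlib.pyplot is conventionally aliased as plt when plotting later).

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the JHU file; all values are invented.
sample_csv = (
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",Norway,60.47,8.47,0,0,1\n"
    "Greenland,Denmark,71.71,-42.60,0,0,0\n"
    ",US,37.09,-95.71,1,1,2\n"
)

# Stand-in for the downloaded file, written to a temporary folder.
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "time_series_covid19_confirmed_global.csv")
with open(filename, "w") as f:
    f.write(sample_csv)

# Read the CSV into a dataframe; the first line becomes the column headers.
data = pd.read_csv(filename)
print(data)
print(data.shape)
```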

Black lines from the box on page 24

Now we can plot the data. Remember, there are 104 data points, so the date information on the X axis will be pretty squished together.
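A hedged sketch of such a plot, modelled on the df.plot call the article uses for the daily differences. The data is invented, and two choices here are assumptions rather than the author's: the figure is saved to a file instead of being shown with plt.show() (so the snippet also runs without a display), and the date labels are rotated to ease the crowding.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Invented cumulative counts in the dtstring/Cases layout.
df = pd.DataFrame({
    "dtstring": ["1/22/20", "1/23/20", "1/24/20", "1/25/20"],
    "Cases": [1, 1, 2, 5],
})

# Line plot of cumulative cases against the date strings.
ax = df.plot(kind="line", x="dtstring", y="Cases", color="blue")

# Rotating the labels helps when many dates share the X axis.
ax.tick_params(axis="x", labelrotation=45)
plt.tight_layout()

outfile = os.path.join(tempfile.gettempdir(), "cases.png")
plt.savefig(outfile)
plt.close()
```

In an interactive interpreter session you would call plt.show() in place of plt.savefig(outfile).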

issue157/python.1591085492.txt.gz · Last modified: 2020/06/02 10:11 by d52fr