A few months ago, I was asked to create a JSON database for integration into an Ubuntu Phone app. Up until then, I’d only ever used JSON files, never created them from real-world data or files. I set about using Python to list all the files that should appear in the app, and to create the relevant JSON data. Since then, I’ve begun a data management and analysis course, where we work with similar JSON files. So for this month’s C&C, I’m going to walk you through how I created a new Python script to build a JSON database of my previous C&C articles. This way, I’ll be able to easily check when I wrote an article, and what the topic was.

The Script

I’ve posted my copy of the script on Pastebin: http://pastebin.com/eLASuY1T

But… I Don’t Have Your Articles!

This script isn’t intended to be used as-is. You can easily adjust it to generate lists or databases of files based on their filenames. That is what I’ll be focusing on in this article.

What Will I Need?

The script was tested in Python 3.5. Older versions of Python 3 should work without issues. If you’re using Python 2, you’ll need to adjust the code according to the error messages you’re given.

I’ve linked my version of the script with Drive - the CLI tool for managing Google Drive. If you’re not interested in using it, you can safely ignore or delete lines 3-6, 26-40, and 107. If you do want it, you’ll need Drive itself (see Further Reading), plus the imports of subprocess.call and contextlib.contextmanager.

The rest of the imports are required, and all come from the standard library.

What Do I Need To Change?

These changes are mandatory - and need to be adjusted for your system. Optional changes can be found in the next section.

You’ll want to edit line 9 - this is the target file (txtFile) for the intermediary list of all articles.

You’ll also need to edit line 13 - and adjust the list to contain the paths you want to include in your database. It is expected to be a list, so even if you’re looking at a single path, make sure it’s wrapped in square brackets.

If your filenames don’t separate their parts with “ - ”, then you’ll need to change line 70.

What Can I Change?

Line 10 indicates the file path to the JSON file. By default it’s fcm-database.json in the current folder.

Line 16 contains a dictionary that stores the date of issue 100 (August, 2015). This is important for the calculations in dateFind. It’s kept outside the function so that it can be changed more easily.

Line 17 contains the empty dictionary which is used per entry. It could be integrated into a method, but I left it in to illustrate how it works (and to ensure that there’s a -1 element to delete).

How Does It Work?

The first 25 lines are mainly variables that need to be set up for later functions. Most should be self-explanatory, and so I will skip them. Lines 26-40 are explained in the section “Drive Functions”.

Function: dateFind

This function calculates the month and year that correspond to an article. This is important for me, as some topics can become outdated - so if a topic was covered in 2012, it may be time for a refresh. It works from one known date (#100 was August, 2015) and the difference between issue 100 and the issue currently being processed. So issue 98 gives a difference of 2, and 107 gives -7. The difference (times 365/12 - roughly one month) is then subtracted from the startPoint date (2015-08), which results in a new year and month. The day of the month drifts, but since FCM is a monthly release, that’s unimportant (the day is dropped entirely by the strftime call).

One could also just multiply 32 by the difference - though having that many extra days for some months may cause more problems. It’s by no means perfect - as it’s cumulative, there may be problems with February issues (for example).
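
As a rough illustration of the calculation described above, here is a minimal sketch rather than the script’s actual code. The name estimate_issue_date and the choice of the 15th as the reference day are assumptions made for this example; anchoring mid-month keeps the accumulated drift from tipping into a neighbouring month.

```python
from datetime import datetime, timedelta

# Known reference point from the article: issue 100 was August 2015.
# The day of the month (15) is an assumption made for this sketch.
START_POINT = {"issue": 100, "date": datetime(2015, 8, 15)}


def estimate_issue_date(issue, start=START_POINT):
    """Return the estimated 'YYYY-MM' publication month for an issue number."""
    difference = start["issue"] - issue              # 98 -> 2, 107 -> -7
    offset = timedelta(days=difference * 365 / 12)   # roughly one month per issue
    # strftime keeps only the year and month, so the drifting day is dropped.
    return (start["date"] - offset).strftime("%Y-%m")


print(estimate_issue_date(98))
print(estimate_issue_date(107))
```

Under these assumptions, issue 98 maps to 2015-06 and issue 107 to 2016-03.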

Function: createArticleList

This function uses os.listdir and some regular-expression searches to ensure that only my articles are picked up. Since my files are in the format FCM100 - C&C - Title, I simply make sure the filename starts with FCM[0-9]+ (FCM followed by one or more digits). I also make sure it doesn’t contain .desktop (this is important for Drive - as every file has a .desktop companion). Once the listing of each path has been added to output, the result isn’t flat (it contains one sublist per path). That’s what line 54 fixes.

Line 55 removes any .odt extension listed (as Google Drive files have no extension, and .odt was what I originally used). Lastly, I use the same sort of replace function to shorten any Command & Conquer into C&C - not strictly necessary, but ensures things are uniform.

Line 57 ensures there are no duplicates - as a Python set can contain only unique values. This essentially drops any duplicate filenames.

The list is then printed into a temporary file (which is practical for debugging, or if you just want to see quickly how many files there are). The file is then closed, and the function returns True.
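
The filtering and clean-up steps above can be sketched as follows. This is an illustrative approximation, not the code from the Pastebin script: the function name clean_article_names is hypothetical, it takes ready-made directory listings instead of calling os.listdir, and the final sorted() call is an addition to make the output predictable.

```python
import re
from itertools import chain

# Filenames must start with FCM followed by one or more digits.
ISSUE_PATTERN = re.compile(r"^FCM[0-9]+")


def clean_article_names(listings):
    """Flatten per-directory listings and return unique, normalised names."""
    flat = chain.from_iterable(listings)  # one sublist per path -> flat sequence
    kept = [
        name for name in flat
        if ISSUE_PATTERN.match(name) and ".desktop" not in name
    ]
    cleaned = [
        name.replace(".odt", "").replace("Command & Conquer", "C&C")
        for name in kept
    ]
    # A set drops duplicates; sorting is an extra step for stable output.
    return sorted(set(cleaned))


listings = [
    ["FCM100 - C&C - Title.odt", "FCM100 - C&C - Title.odt.desktop", "notes.txt"],
    ["FCM100 - Command & Conquer - Title"],
]
print(clean_article_names(listings))
```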

Function: update_database

This is the crux of the file. It opens the article-list.txt file, and reads each line of it. For each non-empty line, it then does the following steps:
• Splits the line on the “ - ” separator (so you get a list like this: [‘FCM100’, ‘C&C’, ‘Title’]).
• Removes the FCM part - to get just the issue number.
• Creates an empty key/value pair in the entry dict.
• Fills the information into the entry dict (if the title is empty - as some of my files were poorly named in the past - it just inserts the string “Unknown”). It also removes any newline characters.
• Uses dateFind to calculate the estimated date for that issue.
• Uses database.update to insert (or update) the information for the current issue.
Once the for loop is completed, the file is closed, the -1 entry of the database (the original entryTemplate) is deleted, and the database is returned.
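
The per-line steps described above can be sketched roughly like this. It is a simplified stand-in, not the original function: parse_article_lines and the field names "column" and "title" are hypothetical, it works on a list of lines instead of an open file, and it leaves out the date lookup and the -1 template entry.

```python
def parse_article_lines(lines, separator=" - "):
    """Build a dict keyed by issue number from 'FCM100 - C&C - Title' lines."""
    database = {}
    for line in lines:
        line = line.strip()              # removes the trailing newline
        if not line:
            continue                     # skip empty lines
        parts = line.split(separator)    # e.g. ['FCM100', 'C&C', 'Title']
        issue = int(parts[0].replace("FCM", ""))
        column = parts[1] if len(parts) > 1 else "Unknown"
        # Re-join in case the title itself contains the separator.
        title = separator.join(parts[2:]).strip() if len(parts) > 2 else ""
        entry = {"column": column, "title": title or "Unknown"}
        database.update({issue: entry})  # insert or overwrite this issue
    return database


print(parse_article_lines(["FCM100 - C&C - Title\n", "\n", "FCM99 - C&C\n"]))
```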

Function: write_database

This is a quick function that uses json.dumps to simply write the dict to a JSON file. It also indents it nicely.
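
A minimal sketch of that idea, assuming a database dict keyed by issue number; the names to_json_text and write_json, the field names, and the four-space indent are choices made for this example and not taken from the original script.

```python
import json


def to_json_text(database, indent=4):
    """Serialise the database dict to an indented JSON string."""
    return json.dumps(database, indent=indent, sort_keys=True)


def write_json(database, path):
    """Write the indented JSON text to the given file path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_json_text(database))
    return True


print(to_json_text({100: {"column": "C&C", "title": "Title"}}))
```

Note that JSON object keys are always strings, so integer issue numbers come back as strings when the file is loaded again.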

Function: write_csv_database

This function uses csv.writer to create a valid CSV file. Line 95 dumps the keys of an entry (specifically, entry 100) into a list. The list is then used to create the CSV header, and also to make sure the order of the values is the same as the order of the headers - so that everything matches up.
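
The header-then-rows approach can be sketched as below. This is an assumption-laden illustration: write_csv_rows, the leading "issue" column, and the field names are hypothetical, and the function writes to any file-like object instead of a fixed path.

```python
import csv


def write_csv_rows(database, handle, sample_key=100):
    """Write the database as CSV, using one entry's keys as the header."""
    fields = list(database[sample_key].keys())  # keys of a known entry
    writer = csv.writer(handle)
    writer.writerow(["issue"] + fields)         # header row
    for issue in sorted(database):
        entry = database[issue]
        # Reading values by header name keeps columns and headers aligned.
        writer.writerow([issue] + [entry[field] for field in fields])
    return True
```

To write a real file, open it with newline="" (as the csv module documentation recommends) and pass the handle in.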

Function: main

This is just a function where I call the rest of the helper functions (and where I debugged issues). You could just as easily paste its contents under if __name__ == "__main__", but the recommendation is to use a main function, as it makes the script easier to import.
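
For readers who haven’t met this pattern, a bare-bones sketch follows. The two lambda placeholders are hypothetical stand-ins for the real helper functions, each returning True on success.

```python
def main():
    """Run each helper step in order and report whether all succeeded."""
    # Hypothetical stand-ins for the helper functions described above.
    steps = [lambda: True, lambda: True]
    return all(step() for step in steps)


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```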

Drive Functions

Function: cd

This just recreates the cd function from Bash, but also reverts to the original directory - so that the drive pull command can be executed in the correct directory, without messing up later writes to JSON or CSV files. Originally found on StackOverflow (see Further Reading).

Function: update_drive

This calls the cd command inside a with (to ensure the directory reverts), and calls drive pull.
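
Both Drive helpers can be sketched together as follows. The cd context manager follows the general shape of the Stack Overflow answer listed in Further Reading; the update_drive signature, its command parameter, and the True/False return value are assumptions for this example, and it presumes the drive binary is installed and on your PATH.

```python
import os
from contextlib import contextmanager
from subprocess import call


@contextmanager
def cd(new_dir):
    """Change directory for the duration of a with block, then change back."""
    previous_dir = os.getcwd()
    os.chdir(os.path.expanduser(new_dir))
    try:
        yield
    finally:
        os.chdir(previous_dir)  # always revert, even if the body raised


def update_drive(drive_dir, command=("drive", "pull")):
    """Run the pull command inside drive_dir; True if it exited with 0."""
    with cd(drive_dir):
        return call(list(command)) == 0
```

Because the directory is restored in the finally clause, later writes to the JSON or CSV files still land in the folder the script was started from.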

Why Do Some Functions Return True?

This is mainly to indicate that the function completed correctly. It’s also a useful step if you want to support error handling.

What Do I Do With The Database?

There are a few things you can do. Using something like OpenRefine, you can clean up your database (such as find out which files are lacking titles). Or else you can export it to a CSV and import it into Google Sheets or similar.

Lastly, you can open it in something like Python’s Pandas and analyse it however you’d like.

Can I Search?

You can either open the JSON file and search by hand, open it in some form of data analysis tool, or write a new function to search the nested dictionary structure in Python itself.
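
As one possible starting point for the last option, here is a small sketch of a case-insensitive title search. The function name search_titles and the "title" field are assumptions about how your entries are laid out.

```python
def search_titles(database, term):
    """Return the entries whose title contains term, ignoring case."""
    term = term.lower()
    return {
        issue: entry
        for issue, entry in database.items()
        if term in entry.get("title", "").lower()
    }


sample = {
    "100": {"title": "Python and JSON"},
    "99": {"title": "Vim tips"},
}
print(search_titles(sample, "json"))
```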

I hope this article proves interesting to at least a few readers. Or, barring that, gives you some inspiration for new projects of your own. If you have any comments, suggestions or extensions, feel free to email me at lswest34+fcm@gmail.com.

Further Reading

https://github.com/odeke-em/drive Drive CLI tool.

http://stackoverflow.com/questions/431684/how-do-i-cd-in-python/24176022#24176022 Stack Overflow cd command in Linux.

issue108/c_c.1462032741.txt.gz · Last modified: 2016/04/30 18:12 by auntiee