Right at the start of September, I received an email from a long-time reader with whom I've had previous contact. The gist of the email was that he had written a script that searches a PDF document, takes each page containing a match, and creates a new PDF of only those pages. The original scenario was that of a law student who had to hunt through PDFs that were thousands of pages long, but I can foresee it being useful for others too (students making a study guide on a particular topic, readers picking interesting articles out of PDFs, etc.). As such, this month I will be giving a quick run-through of how the program works, and the technologies/commands it is based on.

The Requirements

• grep – in the grep package (should be pre-installed in Ubuntu)
• pdfinfo – in poppler-utils (should be pre-installed in Ubuntu)
• pdfunite – in poppler-utils (should be pre-installed in Ubuntu)
• pdftotext – in poppler-utils (should be pre-installed in Ubuntu)
• pdfjam – in the pdfjam package in Ubuntu, or in texlive-extra-utils

Most of these commands are fairly self-explanatory. The most cryptic ones are grep (a search command for the command line) and pdfjam (a shell script for merging and splitting PDFs).

The Script

The most recent copy of the script can be found here: http://homepages.dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep (the “download it” link is under ‘Installation’). I will be referencing line numbers, so it may be useful to download a copy and open it in a text editor that displays line numbers, in order to follow along.

If you don't want to type the full path every time you search a PDF, you can either symbolically link the script into /usr/bin with:

sudo ln -s /path/to/script /usr/bin/pdf-page-grep

or create a scripts folder in your home directory, and then add that folder to your PATH variable.
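
For the PATH option, a minimal sketch might look like the following. The folder name ~/scripts and the helper name path_prepend are my own illustrative choices, not anything the script requires. The helper only adds the folder if it is not already listed, so running it twice does no harm.

```sh
# Hypothetical helper: prepend a directory to PATH unless it is already there.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already listed, nothing to do
        *) PATH="$1:$PATH" ;;       # otherwise put it at the front
    esac
}

# Assumed folder name; any directory you own would work just as well.
path_prepend "$HOME/scripts"
export PATH
```

Placing these lines in a shell start-up file such as ~/.profile, and copying the script into the folder under the name pdf-page-grep, should let you call it by name from any directory.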

How to use it

• Install the requirements.
• Make the script executable:

chmod +x /path/to/file

Make sure to change the path to the location where you saved the script.
• Run the script:
- without arguments, to see the usage information;
- with arguments based on the usage information, such as:

/path/to/pdf-page-grep -i issue*.pdf

pattern: command & conquer

OR

pattern: (empty to stop)

This example would search for Command & Conquer (ignoring upper and lower case) in all PDFs whose names start with ‘issue’ and end with ‘.pdf’ (which should cover all copies of FCM, unless you have renamed them). This way, you'll end up with a PDF of all C&C articles in the issues you've downloaded.

Naturally, there are other options you can use (-E for extended regular expressions, -F for fixed strings, -P for Perl regular expressions, -w for matching only whole words, and -x for matching only whole lines).

How does it work?

If you open the script in your favourite text editor, you'll notice that it's formatted nicely with indentation, comments, spacing, and a uniform approach to loops. The first section of the file (lines 1-7) falls under what I class as “preamble”: it contains information on the author, sets the environment at the top for Linux, gives information on the license, and then sets up the variables used later in the file. In this case, the only variable used is SUFFIX, which, as you might imagine, is the suffix added to the name of the new PDF file that contains the matches (default value: -matches).
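
Going back to the grep options listed above: if you'd like to get a feel for them before pointing the script at a real PDF, you can try them on a few lines of plain text. The sketch below uses three made-up sample lines standing in for page text, and grep's -c option to print the number of matching lines. I've left out -P, since not every grep build includes Perl regular expression support.

```sh
# Three sample lines standing in for the text of a PDF page.
sample() {
    printf '%s\n' 'Command & Conquer' 'commander' 'command'
}

sample | grep -c 'command'                # case-sensitive: prints 2
sample | grep -ci 'command'               # -i ignores case: prints 3
sample | grep -ciw 'command'              # -w whole words only: prints 2
sample | grep -cix 'command'              # -x whole line must match: prints 1
sample | grep -ciE 'conquer|commander'    # -E allows alternation: prints 2
sample | grep -cF '& C'                   # -F treats the pattern literally: prints 1
```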

Lines 9-25 contain an if-statement which checks whether any arguments were given; if not, it prints out the usage information. When I write scripts such as these, I tend to also include a check to see if the argument is “-h”, and/or compare it to a list of accepted arguments. In this case I would skip the list of accepted arguments, as you're looking for file names, and can hardly have a complete list to compare against.

Lines 27-28 create a temporary location for storing the files while they're being processed by the script (it converts each PDF to text with pdftotext so that grep can be run on the result). This is an accepted practice for keeping the script's results clean (i.e. not leaving files all over your home folder).

Lines 29-30 use the trap command to empty the temporary folder when the script exits (including when the script is interrupted by the user or the system, i.e. when you hit Ctrl+C).
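
To make the argument check and the temporary folder clean-up more concrete, here is a minimal sketch of both ideas. It is not the author's code: the function names require_args and with_tmpdir, and the wording of the usage message, are my own inventions for illustration.

```sh
# Hypothetical helper: print a usage message and fail if there are no
# arguments, or if the first one is -h. Succeeds silently otherwise.
require_args() {
    if [ "$#" -eq 0 ] || [ "$1" = "-h" ]; then
        echo "Usage: demo [grep options] file.pdf ..." >&2
        return 1
    fi
}

# Hypothetical helper: run a command with a private temporary directory
# appended as its last argument. The body runs in a subshell, so the EXIT
# trap fires as soon as the command is done and removes the directory.
with_tmpdir() (
    tmp=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmp"' EXIT       # clean up on any exit
    trap 'exit 1' HUP INT TERM      # turn signals (e.g. Ctrl+C) into an exit
    "$@" "$tmp"
)
```

For example, with_tmpdir ls -a should list a freshly created, empty directory, which is gone again by the time the prompt returns.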

Lines 31-44 are a while loop that repeatedly asks the user for search terms until they enter an empty string. Once an empty string is entered, the program moves on. Each pattern can also be a basic regular expression.

Lines 46-54 are a for loop that checks the passed arguments for any beginning with a hyphen, since a leading hyphen is assumed to indicate an option. If I were the one authoring this script, I would have opted for an array of acceptable options, and searched for them instead. If your filename starts with a hyphen, I would imagine the script would fail. However, it is pretty uncommon for a file to be named in such a way.
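
The two loops described above could be sketched roughly as follows. This is my own simplified take rather than the script's actual code, and the function names read_patterns and classify_args are hypothetical.

```sh
# Ask for patterns until an empty line (or end of input) is entered,
# printing each accepted pattern on its own line.
read_patterns() {
    while printf 'pattern: ' >&2 && IFS= read -r line && [ -n "$line" ]; do
        printf '%s\n' "$line"
    done
}

# Label each argument as an option (leading hyphen) or a file name.
classify_args() {
    for arg in "$@"; do
        case "$arg" in
            -*) printf 'option %s\n' "$arg" ;;
            *)  printf 'file %s\n' "$arg" ;;
        esac
    done
}
```

A call such as patterns=$(read_patterns) collects the patterns separated by newlines, and grep accepts such a newline-separated list as a single pattern argument, matching any one of them. As for file names that start with a hyphen, a common workaround is to write them as ./-name.pdf, so that the argument no longer begins with a hyphen.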

Lines 56-93 are a for loop that basically reverses the check from lines 46-54 (it looks for any argument not beginning with a hyphen), and assumes each such argument is a filename. It then starts a new line and prints “matching pages in <filename>:<list of pages>”. In the end you should have a list of every PDF searched, as well as a list of every page number that matched one of your search terms. The last two lines of output will tell you how many matching PDFs were found, and where the results were saved.

The actual search is done by converting each page of the PDF to text (using pdftotext), and then piping it through grep to find the results. If there is a match, the script adds the page number to the variable $sel, and continues to the next page. After the loop over the pages is complete, it increments the number of matched PDFs (if there was a match), extracts the matched pages into a temporary file, resets the list of matched pages, and then remembers the original name of the last matched PDF.

Lines 96-101 check whether any PDF files matched at all. If not, there were no matches, and the program exits.
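
The page-by-page search can be sketched along these lines. Treat it as an approximation under my own assumptions rather than a copy of the script: the function names are hypothetical, and the real script may call the tools differently. It relies on pdfinfo printing a “Pages:” line, on pdftotext's -f and -l options for selecting the first and last page to convert, and on pdfjam accepting a page list after the file name.

```sh
# Number of pages, taken from the "Pages:" line that pdfinfo prints.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ {print $2}'
}

# Text of page $2 of file $1, written to standard output.
page_text() {
    pdftotext -f "$2" -l "$2" "$1" -
}

# Print a comma-separated list of the pages of file $1 that match
# pattern $2. Further arguments are handed to grep as options (e.g. -i).
matching_pages() {
    file=$1 pattern=$2
    shift 2
    sel=''
    page=1
    total=$(page_count "$file")
    while [ "$page" -le "${total:-0}" ]; do
        if page_text "$file" "$page" | grep -q "$@" -e "$pattern"; then
            sel="${sel}${sel:+,}${page}"    # append, comma-separated
        fi
        page=$((page + 1))
    done
    printf '%s\n' "$sel"
}

# Copy the pages listed in $2 (e.g. "3,7,12") from $1 into a new file $3.
extract_pages() {
    pdfjam "$1" "$2" --outfile "$3"
}
```

Calling pdftotext once per page is slow on very long documents, but it keeps the link between a match and its page number trivial.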

Lines 102-112 cover the case of one matching file (the script outputs “1 matching PDF file”, and then moves the temporary file to the final PDF of results, which avoids issues with pdfunite expecting more than one file), as well as the case of multiple matches. When multiple matched PDFs exist, it uses pdfunite to merge the temporary files into the -matches PDF.

Line 113 simply prints out the name of the resulting file, so the user can find it.

I've skimmed over certain specifics of the script for two reasons: one is brevity, and the other is that figuring out exactly how a script works simply by reading it and running it is a good skill to have, especially if you plan on writing your own scripts or programs. If anyone has particular questions about a certain segment of the script, you're welcome to send me a quick email. If you have any other questions, suggestions, or requests, you're welcome to send me an email at lswest34+fcm@gmail.com.
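
As a rough illustration of the three cases (no matches, one matching file, several matching files), here is a small sketch. The function name combine_results and the error message are hypothetical, and the real script is organised differently. It uses pdfunite in its usual form: input files first, output file last.

```sh
# Combine per-file result PDFs into a single output file.
# Usage: combine_results OUTPUT PART...
combine_results() {
    out=$1
    shift
    case "$#" in
        0) echo "no matching PDF file" >&2; return 1 ;;
        1) mv -- "$1" "$out" ;;         # a single part: just rename it
        *) pdfunite "$@" "$out" ;;      # several parts: merge them in order
    esac
}
```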
