
Right at the start of September, I received an email from a long-time reader with whom I've been in contact before. In short, he had written a script that searches a PDF document, takes each page containing a match, and creates a new PDF made up of only those pages. The original scenario was that of a law student who had to hunt through PDFs that were thousands of pages long, but I can foresee it being useful for others too (students making a study guide on a particular topic, readers picking interesting articles out of PDFs, etc.). As such, this month I will give a quick run-through of how the program works, and of the technologies and commands it is based on.

The Requirements

• grep – in the grep package (should be pre-installed in Ubuntu)
• pdfinfo – in poppler-utils (should be pre-installed in Ubuntu)
• pdfunite – in poppler-utils (should be pre-installed in Ubuntu)
• pdftotext – in poppler-utils (should be pre-installed in Ubuntu)
• pdfjam – in the pdfjam package in Ubuntu, or in texlive-extra-utils
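
To see quickly whether anything from the list above is missing on your machine, a small check along these lines can help. It is only a sketch: the function name missing_tools is made up for this example, and it relies on nothing more than the shell's built-in command -v.

```shell
#!/bin/sh
# Print the name of every listed command that cannot be found in PATH.
# (missing_tools is a hypothetical helper, not part of pdf-page-grep.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools grep pdfinfo pdfunite pdftotext pdfjam)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not installed yet:" $missing
else
    echo "All requirements are installed."
fi
```

Anything it reports can then be installed from the packages named in the list.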


Most of these commands are fairly self-explanatory. The most cryptic ones are grep (a search command for the command line) and pdfjam (a shell script for merging and splitting PDFs).

The Script

The most recent copy of the script can be found here: http://homepages.dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep (the “download it” link is under ‘Installation’). I will be referencing line numbers, so it may be useful to download a copy and open it in a text editor that shows line numbers, in order to follow along.

If you don't want to type a full path every time you search a PDF, you can either symbolically link the script into /usr/bin with:

sudo ln -s /path/to/script /usr/bin/pdf-page-grep

or create a scripts folder in your user's home, and then add it to your PATH variable.
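
For the second option, something like the following would do. Treat it as a sketch: the folder name ~/scripts and the helper add_to_path are my own choices for illustration, and any folder you own works just as well.

```shell
#!/bin/sh
# Prepend a directory to PATH unless it is already listed there.
# (add_to_path is a hypothetical helper name.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

# Assumed location of your personal scripts folder.
scripts_dir="$HOME/scripts"
add_to_path "$scripts_dir"
export PATH
```

Create the folder with mkdir -p ~/scripts and copy the script into it first. To make the change permanent, put the lines in a shell start-up file such as ~/.profile; typed at a prompt, they last only for the current session.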


How to use it

• Install the requirements.
• Make the script executable:

chmod +x /path/to/file

Make sure to change the path to the location where you saved the script.
• Run the script:
  - without arguments, to see the usage information
  - with arguments based on the usage information, such as:

/path/to/pdf-page-grep -i issue*.pdf
pattern: command & conquer
OR pattern: (empty to stop)


This example searches for Command & Conquer (ignoring upper and lower case) in all PDFs whose names start with ‘issue’ and end with ‘.pdf’ (which should cover all copies of FCM, unless you rename them). This way, you'll end up with a PDF of all the C&C articles in the issues you've downloaded. Naturally, there are other options you can use (-E for extended regular expressions, -F for fixed strings, -P for Perl regular expressions, -w for matching only whole words, and -x for matching only whole lines).

How does it work?

If you open the script in your favourite text editor, you'll notice that it's nicely formatted, with indentation, comments, spacing, and a uniform approach to loops.

The first section of the file (lines 1-7) falls under what I class as “preamble”: it contains information on the author, sets the environment at the top for Linux, gives information on the license, and then sets up the variables used later in the file. In this case, the only variable used is SUFFIX which, as you might imagine, is the suffix added to the name of the new PDF file that contains the matches (default value: -matches).
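
Before going further into the script, the grep options listed earlier are easy to try on their own, without any PDF involved. The sketch below counts matching lines in four made-up lines of text; the helper count_matches and the sample text are mine, and -P is left out because not every grep build supports it.

```shell
#!/bin/sh
# Four made-up lines of text to search through.
sample='Command & Conquer
command and conquer
Commander
C&C'

# count_matches OPTIONS... PATTERN: number of lines in $sample that match.
count_matches() {
    printf '%s\n' "$sample" | grep -c "$@"
}

echo "-i  command          : $(count_matches -i command)"
echo "-iw command          : $(count_matches -i -w command)"
echo "-ix commander        : $(count_matches -i -x commander)"
echo "-iE 'commander|c&c'  : $(count_matches -i -E 'commander|c&c')"
echo "-F  'C&C'            : $(count_matches -F 'C&C')"
```

The counts come out as 3, 2, 1, 2 and 1: -w drops ‘Commander’ because ‘command’ is not a whole word there, -x keeps only the line that is exactly ‘commander’, and -E allows the ‘|’ alternation.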


Lines 9-25 contain an if-statement that checks whether any arguments were given; if not, it prints the usage information. When I write scripts such as these, I tend to also check whether the argument is “-h”, and/or compare it against a list of accepted arguments. In this case I would skip the list of accepted arguments, as you're looking for file names, and can hardly have a complete list to compare against.

Lines 27-28 create a temporary location for storing the PDFs while they're being processed by the script (it converts each PDF with pdftotext so that grep can be used on the text). This is an accepted practice for keeping the script's results clean (i.e. not leaving files all over your home folder).

Lines 29-30 use the trap command to empty the temporary folder when the script exits (including when the script is interrupted by the user or the system, e.g. when you hit Ctrl+C).
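
The three steps just described follow a common shell pattern, which can be sketched as below. This is not the script's own code: the function names, the wording of the usage message, and the -h check (my habit, as mentioned above) are all assumptions made for the example.

```shell
#!/bin/sh
# Succeeds when the usage text should be shown: no arguments at all,
# or -h as the first argument (the -h check is not in the original script).
wants_usage() {
    [ "$#" -eq 0 ] || [ "$1" = "-h" ]
}

# Placeholder wording, not the script's real usage text.
usage() {
    echo "usage: pdf-page-grep [grep options] file.pdf ..." >&2
}

# Private scratch directory, removed again whenever the shell exits.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
# Turn the usual interruption signals into a normal exit, so that the
# EXIT trap above also runs after Ctrl+C.
trap 'exit 1' HUP INT TERM
```

In a real script, the check would sit right at the top, along the lines of: if wants_usage "$@"; then usage; exit 1; fi.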


Lines 31-44 are a while loop that repeatedly asks the user for search terms until an empty string is entered, at which point the program moves on. Each pattern can also be a basic regular expression.

Lines 46-54 are a for loop that checks the passed arguments for any beginning with a hyphen, on the assumption that these are options. If I were the one authoring this script, I would have opted for an array of acceptable options, and searched for them instead. If your filename starts with a hyphen, I would imagine the script would fail. However, it is pretty uncommon for a file to be named in such a way.
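
Both loops can be sketched in a few lines of shell. Again, this is my own illustration rather than the script's code: the names read_patterns and collect_options are invented, and joining the patterns with newlines (which grep treats as “match any of these”) is an assumption about how the OR behaviour could be achieved.

```shell
#!/bin/sh
# Ask for patterns on standard input until an empty line (or end of input).
# The result is left in $patterns, one pattern per line.
read_patterns() {
    patterns=""
    prompt="pattern: "
    while printf '%s' "$prompt" >&2; IFS= read -r line && [ -n "$line" ]; do
        if [ -z "$patterns" ]; then
            patterns=$line
        else
            patterns="$patterns
$line"
        fi
        prompt="OR pattern: (empty to stop) "
    done
}

# Print every argument that starts with a hyphen, separated by spaces.
collect_options() {
    opts=""
    for arg in "$@"; do
        case "$arg" in
            -*) opts="$opts $arg" ;;
        esac
    done
    printf '%s\n' "${opts# }"
}
```

After read_patterns has run, grep -e "$patterns" matches any line containing at least one of the entered patterns. Note that collect_options shares the weakness mentioned above: a file whose name starts with a hyphen would be mistaken for an option.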


Lines 56-93 are a for loop that basically reverses the check from lines 46-54: it looks for any argument not beginning with a hyphen, and assumes it is a filename. It then starts a new line and prints “matching pages in <filename>:<list of pages>”. In the end you should have a list of every PDF searched, as well as a list of every page number that matched one of your search terms. The last two lines will tell you where the results were saved, and how many matching PDFs were found.

The actual search is done by converting each page of the PDF to text (using pdftotext), and then piping it through grep to find the results. If there is a match, the script remembers the page number in the variable $sel, and continues to the next page. After the loop over the pages is complete, it increments the number of matched PDFs (if there was a match), extracts the matched pages into a temporary file, resets the list of matched pages, and then remembers the original name of the last matched PDF.

Lines 96-101 check whether a count of matched PDF files exists. If not, there were no matches, and the program exits.
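
To make the page-by-page search more concrete, here is a rough sketch of the idea. It is not the author's code: the function names are invented, the page count is taken from the “Pages:” line that pdfinfo prints, and using pdfjam with a comma-separated page list for the extraction step is my assumption about how that part could be done.

```shell
#!/bin/sh
# matching_pages FILE PATTERNS [GREP OPTIONS...]
# Print the page numbers of FILE whose text matches, separated by spaces.
matching_pages() {
    file=$1
    patterns=$2
    shift 2
    pages=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    pages=${pages:-0}
    sel=""
    page=1
    while [ "$page" -le "$pages" ]; do
        # -f and -l limit pdftotext to a single page; "-" sends text to stdout.
        if pdftotext -f "$page" -l "$page" "$file" - | grep -q "$@" -e "$patterns"; then
            sel="$sel $page"
        fi
        page=$((page + 1))
    done
    printf '%s\n' "${sel# }"
}

# extract_pages FILE "PAGE PAGE ..." OUTPUT
# Copy the listed pages of FILE into OUTPUT (pdfjam wants commas, not spaces).
extract_pages() {
    pdfjam "$1" "$(printf '%s' "$2" | tr ' ' ',')" --outfile "$3"
}
```

With these in place, sel=$(matching_pages issue01.pdf 'command & conquer' -i) would collect the page numbers, and extract_pages issue01.pdf "$sel" part.pdf would write them to a new file (issue01.pdf and part.pdf being example names only).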


Lines 102-112 cover the case of one matching file (the script outputs “1 matching PDF file”, and then moves the temporary file into the final PDF of results, which avoids issues with pdfunite expecting more than one file), as well as the case of multiple matches. When multiple matched PDFs exist, it uses pdfunite to merge the files into the -matches PDF.

Line 113 simply prints out the name of the resulting file, so the user can find it.

I've skimmed over certain specifics of the script for two reasons: one being brevity, and the other being that figuring out exactly how a script works simply by reading it and running it is a good skill to have, especially if you plan on writing your own scripts or programs. If anyone has particular questions about a certain segment of the script, you're welcome to send me a quick email. If you have any other questions, suggestions, or requests, you're welcome to send me an email at lswest34+fcm@gmail.com.