How to use parallel to speed up your work

GNU Parallel is a tool for running multiple commands at once. In its most basic usage, you give it a list of commands, one per line, and it runs several of them at a time.

It gives the most benefit with commands that cannot use every CPU core on their own, such as single-threaded tools. Almost every laptop, desktop and single board computer now has multiple CPU cores available, so you are probably missing out if you frequently perform batch operations without it.

Installation

On Debian or Ubuntu:

sudo apt-get install parallel
parallel --cite

On Fedora the package name is the same:

sudo dnf install parallel
parallel --cite

Example 1: Convert loops to pipes

Using the ImageMagick tool to convert a folder of GIF images to PNG format can be done in a loop:

for i in *.gif; do convert "$i" -scale 200% "${i%.*}.png"; done

Or, you could print each command in a loop then pass them to parallel.

for i in *.gif; do echo convert $i -scale 200% ${i%.*}.png; done | parallel

On a multi-core computer, the second command can be several times faster, because the conversions run on all cores at once instead of one after another.
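Parallel can also build the commands itself, which avoids the echo loop and copes with spaces in file names. The following is a minimal sketch, assuming GNU Parallel is installed: ::: passes the file names as arguments, {} stands for each input, and {.} is the input with its extension removed. The function name preview_gif_to_png is made up for this example, and --dry-run only prints the commands rather than running them.

```shell
# Hypothetical helper: print the convert command that parallel would run
# for each file name given. Remove --dry-run to actually run the conversions.
preview_gif_to_png() {
    parallel --dry-run convert {} -scale 200% {.}.png ::: "$@"
}

# Usage (in a folder of GIF images):
#   preview_gif_to_png *.gif
```

Checking the printed commands first is a cheap way to confirm the replacement strings expand the way you expect before any files are written.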

Example 2: Replace xargs with parallel

This command executes a single “pngcrush” command on each PNG file in a directory, one at a time.

find . -type f -name '*.png' -print0 | xargs -0 -n1 -r pngcrush -q -ow -brute

To convert this to use parallel, you would use the following command-line:

find . -type f -name '*.png' | parallel "pngcrush -q -ow -brute {}"
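The xargs version used null-separated file names, and parallel can do the same with its -0 option, which keeps the pipeline safe even for file names that contain newlines. This is a sketch, assuming GNU Parallel is installed; the helper name each_png is hypothetical and simply wraps the find pipeline so that any command can be plugged in.

```shell
# Hypothetical helper: run a command once for every PNG file under a directory.
# Usage: each_png DIR COMMAND [ARGS...], with {} where the file name should go.
each_png() {
    dir=$1
    shift
    # -print0 and -0 separate file names with NUL bytes instead of newlines.
    find "$dir" -type f -name '*.png' -print0 | parallel -0 "$@"
}

# Usage, equivalent to the pngcrush command above:
#   each_png . pngcrush -q -ow -brute {}
```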

Don’t use xargs in parallel mode

Expert command line users will also know about xargs -P, which seems to do the same thing at a glance.

xargs is good at building really long command lines, and not so good at executing multiple commands at once. With -P it can interleave the output of the commands, and it requires you to specify the number of jobs to run.
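The difference is easy to see with a job that prints a line, waits, then prints another. This sketch assumes an xargs that supports -P (GNU and BSD versions do) and uses getconf _NPROCESSORS_ONLN to count cores; the function names with_xargs and with_parallel are invented for the comparison.

```shell
# xargs: -P must be set explicitly; without it, jobs run one at a time.
# Lines from different jobs may interleave in the output.
with_xargs() {
    printf '%s\n' "$@" |
        xargs -n1 -P "$(getconf _NPROCESSORS_ONLN)" \
            sh -c 'echo "start $0"; sleep 1; echo "end $0"'
}

# parallel: no job count needed; each job's output is printed as one group.
with_parallel() {
    parallel 'echo "start {}"; sleep 1; echo "end {}"' ::: "$@"
}

# Usage:
#   with_xargs 1 2 3 4
#   with_parallel 1 2 3 4
```

On a multi-core machine the xargs run tends to print all the start lines before the end lines, while parallel keeps each start and end pair together.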

Parallel is designed to do lots of things at once, and it does it well. By default it runs one job per CPU core, which is a sensible choice for most batch work, and it adds an insane collection of features that you need for large batches. To name just a few:

  • Control spawning of new jobs based on things like available memory, system load, or an absolute number of jobs to keep running
  • Distribute jobs to remote computers
  • Show progress
  • Control when to terminate the jobs
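Each item in this list maps to a command-line option. The sketch below assumes a reasonably recent GNU Parallel, since the --memfree option and the --halt soon,fail=1 syntax are not in very old releases. The function names and the user@build-host login are hypothetical, and the thresholds are arbitrary examples rather than recommendations.

```shell
# Keep at most 2 jobs running, and stop starting new jobs after the first
# failure (jobs that are already running are allowed to finish).
run_limited() {
    parallel --jobs 2 --halt soon,fail=1 "$@"
}

# Start new jobs only while the load average is below 8 and at least 1 GB
# of memory is free; --bar draws a progress bar while the jobs run.
run_when_idle() {
    parallel --load 8 --memfree 1G --bar "$@"
}

# Send jobs to another machine over SSH. This needs passwordless SSH and
# parallel installed on the remote side; "user@build-host" is made up.
run_remote() {
    parallel -S user@build-host "$@"
}

# Usage:
#   run_limited pngcrush -q -ow -brute {} ::: *.png
```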
