August 12, 2021

Quick Tricks: Downloading Files in Parallel with Curl and Parallel

Recently, I've been doing a lot of downloading of data. The Kinetics dataset from DeepMind is about 600GB, and as hosted by the Common Visual Data Foundation, it consists of about 1300 links to files, each around 1.5GB, that need to be downloaded. It's quite a task. In this post, I'll talk a bit about how I got it done.

What didn't work - Aria2

Originally, I looked online for a good way to download large numbers of files from the internet. Aria2 seemed like the best option, and appeared to fit the bill perfectly. The first time I went about this download, I used the following command:

$ aria2c -x8 -j32 -i k600_train_path.txt

This allowed me to use 8 connections per file, with 32 parallel downloads pulling from the list of S3 links in the k600_train_path.txt file. While common courtesy on the internet is to make no more than 4 connections per file and 16 parallel downloads, I figured that Amazon, the great S3 provider, could handle the additional traffic, which would let me saturate my 1Gbps connection.
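
For reference, the input file is just a plain list of URLs, one per line, which is the format both aria2c -i and the parallel approach later in this post expect. The paths below are illustrative rather than copied from the real list:

$ head -2 k600_train_path.txt
https://s3.amazonaws.com/kinetics/600/train/part_0.tar.gz
https://s3.amazonaws.com/kinetics/600/train/part_1.tar.gz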

While the files appeared to download fine (and in remarkable time, I might add), I had an issue: many of the files were incomplete, and thus contained only a truncated portion of the training data. Because a tar/gzip archive is a linear format that can recover from some corruption, I didn't notice until I realized that some of my training classes had no examples! On top of that, Aria2 didn't provide any warnings or errors, so I had figured everything was OK. I guess not.
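
In hindsight, a quick integrity check on the archives would have caught the problem immediately. Something along these lines (a rough sketch, assuming the downloaded archives are sitting in the current directory) tests each gzip stream and flags anything truncated:

$ for f in *.tar.gz; do gzip -t "$f" || echo "corrupt: $f"; done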

When you're working with data at research scale, you often run into issues that nobody thought much about in the initial design process: files are too large, there are too many of them, and so on.

A better/old-school option: Curl + GNU Parallel

With Aria2 off the table, I went back to my bash roots and dusted off two of my favorite command line programs: curl and parallel. Both of these programs are remarkably powerful, and I figured it would be easy enough to stitch the two together to create a parallel downloader of my own. Not only that, but I probably could have done it faster than typing apt install aria2, the command to install the (somewhat flawed) downloader above.

First, I had to think about how to download a single file. Easy, right?

$ curl -O https://s3.amazonaws.com/kinetics/600/test/part_59.tar.gz
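
As a side note, if the connection is flaky, curl has a couple of flags worth adding (a variation I'd consider, not something the download above strictly needed): -f makes curl fail on HTTP errors instead of saving the error page to disk, and --retry re-attempts transient failures.

$ curl -fO --retry 3 https://s3.amazonaws.com/kinetics/600/test/part_59.tar.gz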

Then, we just need to let parallel do its magic. Since I have a pretty large number of files, I don't have to worry too much about the number of connections per file.

$ cat k600_train_path.txt | parallel --bar -j32 curl -sO {}

Let's break down the command above. First, cat takes the list of URLs in the k600_train_path.txt file and writes them to stdout. I then pipe these URLs into GNU parallel with two options: --bar to give me a nice download progress bar, and -j32 for 32 parallel jobs. What follows is the command that I want to run for each URL: curl -sO {}. The -s tells curl to be silent (so it doesn't mess with my beautiful progress bar), and the -O tells it to save the download to a file named after the remote file. The {} is where each URL coming in from stdin should go.
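
Given how silently my Aria2 run failed, one optional hardening step (not something I used in the original run) is to have parallel keep a job log, so any curl invocations that exit non-zero can be retried afterwards. The -f flag is added so curl reports HTTP errors as failures:

$ cat k600_train_path.txt | parallel --bar -j32 --joblog dl.log curl -fsO {}
$ parallel --retry-failed --joblog dl.log

The second command rereads dl.log and re-runs only the jobs that failed.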

This worked, and in fact it was much cleaner, with tools I already had installed on my system. I probably should have just gone with this from the start, but I am, really, a sucker for new tools. You can use this in your own work as well! Whenever you need to download a large number of files, or need to do any batch operation, look into the wonders of GNU parallel and what it can do. There's so much more to this program than meets the eye, and it's well worth your time to take a look. Check out the original docs here, and get it with sudo apt install parallel on Ubuntu, or check here for any other system.