I've been tasked with exporting files in bulk from S3 for customers who want to get their hard-earned data before a service migration.
Archiving files from S3 is not the most complex process once you figure it all out, but I've found that it's usually rather inefficient.
Take, for example, a data store with 5 million files, most of which are quite small. The process might go as follows:
Copy the files with the AWS CLI tool. It might be a good idea to use "sync" rather than "cp" so the process can be "resumed" if it's interrupted (probably mandatory if you are dealing with much larger data stores).
Zip the resulting files, and possibly split the archives if they grow too large. Zipping is easy. The splitting might need some extra creativity (see the sketch below).
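For the splitting, one simple approach is to start a new archive whenever the current one has swallowed enough data. A minimal sketch - the directory name, size threshold, and naming scheme are placeholders, and it splits on input size rather than compressed size:

```python
import os
import zipfile

# Hypothetical example values - adjust for your own export.
SOURCE_DIR = "export"        # directory produced by the copy step
MAX_ARCHIVE_BYTES = 1 << 30  # start a new zip after roughly 1 GB of input

def split_into_archives(source_dir, max_bytes):
    part, written, archive = 0, 0, None
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Start a new archive when the current one has taken enough input.
            if archive is None or written >= max_bytes:
                if archive is not None:
                    archive.close()
                part += 1
                written = 0
                archive = zipfile.ZipFile(f"export-part{part:03d}.zip", "w",
                                          compression=zipfile.ZIP_DEFLATED)
            archive.write(path, arcname=os.path.relpath(path, source_dir))
            written += os.path.getsize(path)
    if archive is not None:
        archive.close()

if __name__ == "__main__":
    split_into_archives(SOURCE_DIR, MAX_ARCHIVE_BYTES)
```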
I was curious how much of this could be automated. I'm certain my work will probably not provide much value to anyone else, but it's an interesting exercise to get more familiar with AWS S3 and its APIs.
For this exercise, I used the Boto3 library for Python. Documentation gaps aside, I found it rather convenient for this purpose. There were a few hurdles to get over, mainly things that are general problems in Python, like the lagging support for asynchronous programming (I would imagine a similar library in JavaScript would have a much better time with threaded downloading).
Hurdles aside, I put together a small app with these goals (a simplified sketch of the core pipeline appears below):
- Threaded downloading for fast transfers of TONS of small files.
- Minimal disk space usage.
- Zipping occurs at the same time as the downloading, so the intermediate raw files are kept to a minimum - each is removed right after it is zipped.
- Orderly packing - files are packed in the archives in the same order as the S3 listing.
- Automatic splitting of archives.
- Resumable operation - if the process crashes, it can pick up near where it left off rather than starting all over. It's safe to run the same "job" again, even when it's complete, without worrying about it redoing work.
Sounds like a rant from a pedantic coder, but a little bit of extra thought goes a long way in a product's lifecycle. (If you are building a popular product, anyway.)
Here are the results: http://gh.mukunda.com/zips3
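For a rough idea of what the core of that pipeline looks like, here is a simplified sketch. It is not the actual zips3 code - the bucket name, prefix, and thread count are placeholders, and the archive-splitting and resume logic are left out:

```python
import concurrent.futures
import os
import tempfile
import zipfile

import boto3

# Hypothetical example values - not from the real zips3 project.
BUCKET = "my-export-bucket"
PREFIX = "customer-data/"
THREADS = 16
OUTPUT_ZIP = "export.zip"

s3 = boto3.client("s3")

def download_to_temp(key):
    """Download one object into a temp file and return (key, temp_path)."""
    fd, temp_path = tempfile.mkstemp(prefix="s3dl-")
    os.close(fd)
    s3.download_file(BUCKET, key, temp_path)
    return key, temp_path

def export():
    paginator = s3.get_paginator("list_objects_v2")
    with zipfile.ZipFile(OUTPUT_ZIP, "w", zipfile.ZIP_DEFLATED) as archive, \
         concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            keys = [obj["Key"] for obj in page.get("Contents", [])]
            # Download one listing page concurrently; map() yields results
            # in submission order, so the zip keeps the S3 listing order.
            for key, temp_path in pool.map(download_to_temp, keys):
                archive.write(temp_path, arcname=key)
                os.remove(temp_path)  # raw files never pile up on disk
```

The sketch batches by listing page, which is what bounds the temporary disk usage; the actual app does the equivalent with Python queues so downloading and zipping overlap more aggressively, but the idea is the same.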
Now, while testing this, I came across an interesting issue: The Python zipfile library seems pretty slow.
I figured it delegates the actual compression to a C library, so I'm not sure where the performance impact comes from. Maybe the performance is fairly normal? I don't really know what I should expect. Some things I've read recommend using the 7-Zip standalone binary for adding files to archives, but that isn't too feasible when we are not buffering the entire export at once. Each time 7za.exe adds a file to an existing archive, it copies the ENTIRE zip file, which can take quite a bit of time once the archive grows past even 1GB of data.
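If you want to measure zipfile yourself, something along these lines will do it (the input glob is a placeholder, and the compresslevel argument needs Python 3.7 or newer):

```python
import glob
import os
import time
import zipfile

FILES = sorted(glob.glob("videos/*.mp4"))  # placeholder input set

def benchmark(compression, level):
    out = "bench.zip"
    start = time.perf_counter()
    with zipfile.ZipFile(out, "w", compression=compression,
                         compresslevel=level) as archive:
        for path in FILES:
            archive.write(path, arcname=os.path.basename(path))
    elapsed = time.perf_counter() - start
    size = os.path.getsize(out)
    print(f"level={level} time={elapsed:.2f}s size={size:,}")

benchmark(zipfile.ZIP_STORED, None)  # the "none" row
for level in range(1, 10):           # deflate levels 1-9
    benchmark(zipfile.ZIP_DEFLATED, level)
```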
Here are the benchmarks I've recorded:
The test set: 119 mp4 files - 9,640,651,628 bytes of video data.
This data obviously won't compress very well, but it is what I used to test the efficiency of the archiving process.
| Library | Compression | Level | Time (seconds) | Size (bytes) | Ratio |
|---------|-------------|-------|----------------|--------------|-------|
| zipfile | none | 0 | 20.796875 | 9,642,136,212 | 100.02% |
| zipfile | deflate | 1 | 233.015625 | 9,467,764,981 | 98.21% |
| zipfile | deflate | 2 | 236.984375 | 9,465,914,971 | 98.19% |
| zipfile | deflate | 3 | 241.28125 | 9,464,205,126 | 98.17% |
| zipfile | deflate | 4 | 252.21875 | 9,461,009,313 | 98.14% |
| zipfile | deflate | 5 | 252.375 | 9,457,943,164 | 98.10% |
| zipfile | deflate | 6 | 254.4375 | 9,456,448,738 | 98.09% |
| zipfile | deflate | 7 | 258.34375 | 9,456,083,368 | 98.09% |
| zipfile | deflate | 8 | 266.453125 | 9,455,605,937 | 98.08% |
| zipfile | deflate | 9 | 288.640625 | 9,455,419,969 | 98.08% |
| tarfile | none | 0 | 15.734375 | 9,640,867,840 | 100.00% |
| tarfile | tar.gz | 9 | 281.921875 | 9,455,427,755 | 98.08% |
| 7za.exe | zip (each file added separately) | - | 586.478610 | 9,460,267,331 | 98.13% |
| 7za.exe | zip (all files at once) | - | 100.711286 | 9,460,267,331 | 98.13% |
We can see that the 7za all-at-once method is pretty fast - more than double, almost triple, the speed of zipfile's deflate - but there is no way to "stream" files into the process, so everything needs to be available up front. Using it to add files one at a time takes more than twice as long as zipfile (simply from the overhead of copying the whole archive every time).
My thoughts overall? I'm a fan of keeping it simple, so what I would recommend is to "just use some extra disk space to buffer everything" and then use the AWS CLI and 7zip. Temporary disk space is cheap these days, right?
There is some pretty crazy stuff (read: difficult to test/maintain) in my Python code to get the same effect - much of the effort went into writing the pipelined operation around the multi-threaded downloader. At least I'm more familiar with Python queues now.
Recommended approach to exporting S3 data in bulk (sketched as a script below):
(1) Use the AWS CLI to perform the export. You can configure things like the number of concurrent downloads in the local AWS config file to boost the operation speed on good networks.
(2) AWS "sync" the items. Why "sync" and not "cp"? Sync can be resumed. The slight loss of efficiency is worth it to avoid complications when something goes wrong.
(3) Use 7zip to compress everything. It seems to be a well-optimized program these days.
(4) Delete the temporary files.
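Here is the whole flow as a rough script, in case it helps. It just shells out to the AWS CLI and 7za, which are assumed to be on the PATH; the bucket, paths, and concurrency value are placeholders:

```python
import shutil
import subprocess

# Hypothetical example values.
BUCKET_URI = "s3://my-export-bucket/customer-data/"
STAGING_DIR = "export-staging"
ARCHIVE = "export.zip"

# (1) Allow more concurrent S3 transfers; this writes to the local AWS config.
subprocess.run(["aws", "configure", "set",
                "default.s3.max_concurrent_requests", "20"], check=True)

# (2) Sync rather than cp, so the transfer can be re-run and resumed.
subprocess.run(["aws", "s3", "sync", BUCKET_URI, STAGING_DIR], check=True)

# (3) Compress everything in one pass with the 7-Zip standalone binary.
subprocess.run(["7za", "a", ARCHIVE, STAGING_DIR], check=True)

# (4) Delete the temporary files once the archive exists.
shutil.rmtree(STAGING_DIR)
```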