s3cmd-What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?

Emerson Farrugia 2021-01-02 02:18:28

Say you uploaded a 14MB file to a bucket without server-side encryption, and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. MD5 checksums are often printed as hex representations of binary data, so make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.

Here are the commands to do it on Mac OS X from the console:

$ dd bs=1m count=5 skip=0 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019611 secs (267345449 bytes/sec)
$ dd bs=1m count=5 skip=5 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019182 secs (273323380 bytes/sec)
$ dd bs=1m count=5 skip=10 if=someFile | md5 >>checksums.txt
2+1 records in
2+1 records out
2599812 bytes transferred in 0.011112 secs (233964895 bytes/sec)

At this point all the checksums are in checksums.txt. To concatenate them and decode the hex and get the MD5 checksum of the lot, just use

$ xxd -r -p checksums.txt | md5

And now append "-3" to get the ETag, since there were 3 parts.
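The same steps can be sketched in Python for systems where `md5` and `xxd` are not available. This is a minimal sketch, not S3's own code: the function name `multipart_etag` and its `part_size` parameter are made up for illustration, and it assumes the object was uploaded as a multipart upload with a fixed part size and without server-side encryption.

```python
import hashlib


def multipart_etag(path, part_size=5 * 1024 * 1024):
    """Compute the ETag expected for a multipart upload of `path`.

    Assumption: every part except the last is exactly `part_size` bytes.
    """
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            # Keep the raw 16-byte digest, not its hex representation.
            part_digests.append(hashlib.md5(chunk).digest())

    # MD5 of the concatenated binary digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Calling `multipart_etag("someFile")` with the default 5MB part size should give the same result as the console commands above, with the "-3" suffix already appended.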

Notes

If you uploaded with aws-cli via aws s3 cp, then you most likely used an 8MB chunk size. According to the docs, that is the default.
If the bucket has server-side encryption (SSE) turned on, the ETag won't be the MD5 checksum (see the API documentation). But if you're just trying to verify that an uploaded part matches what you sent, you can use the Content-MD5 header and S3 will compare it for you.
md5 on macOS just writes out the checksum, but md5sum on Linux/brew also outputs the filename. You'll need to strip that, for example by piping the output through awk '{print $1}'. You don't need to worry about whitespace because xxd will ignore it.
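For the Content-MD5 note, the header value is the base64 encoding of the raw 128-bit MD5 digest, not the hex string that `md5` prints. A minimal sketch of computing it, where the helper name `content_md5` is hypothetical and how the header gets attached to the upload request depends on the client you use:

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the Content-MD5 header value for `data`.

    The value is base64 of the raw 16-byte digest, not of the hex text.
    """
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode("ascii")
```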

Code Links

A Gist I wrote with a working script for macOS.
The project at s3md5.

sanyi 2013-11-11 10:52:10

Interesting finding. Hopefully Amazon won't change it, since it's an undocumented feature.

Emerson Farrugia 2013-11-11 11:12:32

Good point. According to the HTTP spec, the ETag is completely up to their discretion; the only guarantee is that they can't return the same ETag for a changed resource. I'm guessing there's not much advantage to changing the algorithm, though.

DavidG 2014-08-05 22:59:32

Is there a way to compute the "part size" out of the etag?

Emerson Farrugia 2014-08-06 08:45:07

"Compute" no, "guess" maybe. If the ETag ends in "-4", you know that there are four parts, but the last part can be anywhere from 1 byte up to the full part size. So dividing the file size by the number of parts gives you an estimate (strictly, a lower bound on the part size), but when the number of parts is small, e.g. -2, it gets harder to guess. If you have multiple files that were uploaded using the same part size, you could also look for adjacent part counts, e.g. -4 and -5, and narrow down what the part size can be. For example, 1.9MB at -2 and 2.1MB at -3 means the part size is 1MB plus or minus 50KB.
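The narrowing-down idea can be sketched as interval arithmetic. With file size S and part count n, the part size p must satisfy S/n <= p < S/(n-1), and files uploaded with the same part size can have their ranges intersected. The function names and the example file sizes below are hypothetical, and the sketch assumes every part except the last has the same size.

```python
def part_size_bounds(file_size, part_count):
    """Inclusive (low, high) byte range of part sizes that split
    `file_size` bytes into exactly `part_count` parts.

    `high` is None when there is a single part (no upper bound).
    """
    low = -(-file_size // part_count)  # ceil(file_size / part_count)
    if part_count == 1:
        return low, None
    # Largest integer strictly below file_size / (part_count - 1).
    high = -(-file_size // (part_count - 1)) - 1
    return low, high


def narrow_part_size(observations):
    """Intersect the ranges of several (file_size, part_count) pairs
    that are assumed to share one part size."""
    low, high = 1, None
    for file_size, part_count in observations:
        lo, hi = part_size_bounds(file_size, part_count)
        low = max(low, lo)
        if hi is not None:
            high = hi if high is None else min(high, hi)
    return low, high


# Hypothetical files: 18,000,000 bytes at "-4" and 22,000,000 bytes at "-5".
print(narrow_part_size([(18_000_000, 4), (22_000_000, 5)]))
# prints (4500000, 5499999)
```

If the resulting range contains a round value such as 5MB, that is a plausible candidate for the part size, but it remains a guess.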

iman 2018-03-13 17:08:39

I don't think it would be wise to rely on AWS's internal implementation as long as they don't expose their hashing algorithm as a contract, especially if it impacts application correctness, which is usually the case when you are verifying the integrity of data.
