I found a fun question in my inbox today: If an application downloads a ZIP file with an update, what is the probability of the ZIP being corrupted? And should the update's hash (e.g. SHA256) always be attached as well? Let's take a look at the details.
There are basically two parts to the answer - the probability itself, and best practices.
Starting with the first one, let's consider a typical "download stack", i.e. HTTP over SSL over TCP over IP, and the ZIP file format itself. There are 3-4 mechanisms in play that need to be considered here:
- TCP packet's checksum: It's a 16-bit value, meaning (and slightly simplifying the problem to a rule of thumb) that if the transmitted data gets corrupted, there is a 1/2^16 (or ~0.0015%) chance of the corruption not getting detected by the checksum. In practice, if you're transmitting a lot of data (e.g. 1 GB) through a noisy medium (e.g. some form of radio), you're basically guaranteed to run into this problem (see the checksum sketch after this list for one way a corruption can slip by).
- SSL/TLS (H)MAC/AEAD: Long story short, SSL/TLS does its best to protect the payload from being corrupted on purpose by a third party. Depending on the version, this was done either by calculating a 128/256-bit MAC of the data (i.e. a hash-based Message Authentication Code) or by using AEAD (Authenticated Encryption with Associated Data). In general, it can be assumed that either approach will detect accidentally corrupted data, i.e. the probability of a corruption accidentally colliding with the MAC is basically non-existent, or 0.00000000000000000000000000000000000029% for 128-bit MACs, 0.00000000000000000000000000000000000000000000000000000000000000000000000000086% for 256-bit MACs, and so on (see the HMAC sketch after this list).
- ZIP's CRC32: A 32-bit value. Not cryptographically safe (actually quite unsafe in fun and interesting ways, but that's a story for another time), but it should still be able to detect most corruptions that happen to file data in the ZIP archive (but NOT to ZIP headers and e.g. file names; even though each file name is stored in two places in a ZIP file, almost no ZIP extractors compare the two against each other). See the ZIP CRC sketch after this list.
- In addition, every protocol parser (and the ZIP parser) along the way might detect corruption in the headers (though this isn't guaranteed). Lower-level protocols (e.g. Ethernet's FCS) might also detect some corruptions - they usually also use 16- or 32-bit checksums.
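To make the TCP point concrete, here's a minimal Python sketch of the RFC 1071-style 16-bit ones'-complement checksum (simplified - the real TCP checksum also covers a pseudo-header). Since summing is commutative, swapping two 16-bit words corrupts the data without changing the checksum at all:

```python
# A minimal sketch of the 16-bit ones'-complement checksum used by TCP/IP
# (RFC 1071), simplified for illustration.

def inet_checksum(data: bytes) -> int:
    if len(data) % 2:          # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

original = b"\x12\x34\x56\x78"
swapped  = b"\x56\x78\x12\x34"   # same 16-bit words, different order
assert original != swapped
assert inet_checksum(original) == inet_checksum(swapped)  # corruption undetected!
```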
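For the MAC part, here's a rough sketch using Python's standard hmac module - a simplification of what MAC-based TLS cipher suites do per record (real TLS also covers sequence numbers, record headers, etc.; the key and payload below are made up for the example):

```python
# A simplified sketch of MAC-based integrity checking: any corruption of
# the payload makes the 256-bit tag mismatch with overwhelming probability.
import hashlib
import hmac
import os

key = os.urandom(32)                      # stand-in for a key from the handshake
payload = b"some application data"
tag = hmac.new(key, payload, hashlib.sha256).digest()

# Receiver side: recompute the tag and compare in constant time.
assert hmac.compare_digest(
    tag, hmac.new(key, payload, hashlib.sha256).digest())

corrupted = b"some application dat4"      # a single flipped character
assert not hmac.compare_digest(
    tag, hmac.new(key, corrupted, hashlib.sha256).digest())
```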
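And for the CRC32 part, Python's standard zipfile module can recompute and compare the stored per-file CRC32s for us (the archive name below is just a placeholder):

```python
# Checking the per-file CRC32s stored in a ZIP archive. testzip() reads
# every member, recomputes its CRC32, and returns the name of the first
# bad member, or None if everything matches. Note this says little about
# the ZIP headers and file names themselves.
import zipfile

with zipfile.ZipFile("update.zip") as zf:   # "update.zip" is a placeholder
    bad = zf.testzip()
    if bad is not None:
        raise IOError(f"CRC mismatch in archive member: {bad}")
```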
So in the end, if we use HTTPS we should be safe, at least from corruptions introduced during transit (which is where most of them happen). However, if a corruption were introduced e.g. while the data is still being handled on the sender's side (say, a cosmic ray hitting the CPU in the right place), then by the time the data is safely transmitted through SSL it's already too late. This is where an additional update hash would save the day - see the sketch below.
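As a concrete illustration, here's a minimal sketch of verifying such an update hash after download; the file name and the expected digest are placeholders for this example:

```python
# A minimal sketch of the "additional update hash" idea: compare the
# downloaded file's SHA-256 against a value obtained out of band
# (e.g. published on the project's website).
import hashlib

EXPECTED_SHA256 = "0" * 64   # placeholder for the published digest

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # 64 KiB chunks
            h.update(chunk)
    return h.hexdigest()

if file_sha256("update.zip") != EXPECTED_SHA256:
    raise ValueError("update file is corrupted (or tampered with)")
```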
By the way...
If you'd like to learn SSH in depth, in the second half of January'25 we're running a 6h course - you can find the details at hexarcana.ch/workshops/ssh-course
What about current best practices?
Basically, it's recommended that downloadable updates be cryptographically signed with a private key, and that after the download is complete, the application checks whether the signature is correct using a public key (that's hardcoded in the application). This way, apart from detecting accidental corruptions, we're also stopping a potential attacker from supplying their own update package (e.g. after hacking the update server). Of course, this means we now have to protect the private key and somehow safely incorporate signing into our build process, but at the end of the day it's probably worth it. A minimal verification sketch follows below.
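Here's one hedged sketch of that client-side check, assuming the update was signed with Ed25519 and that the third-party `cryptography` package (pip install cryptography) is available; the key bytes and function name are made up for this example, not taken from any particular updater:

```python
# A sketch of update-signature verification on the client side.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# 32 raw public-key bytes, hardcoded into the application at build time.
PUBLIC_KEY_BYTES = bytes.fromhex("00" * 32)   # placeholder, not a real key

def verify_update(data: bytes, signature: bytes) -> bool:
    public_key = Ed25519PublicKey.from_public_bytes(PUBLIC_KEY_BYTES)
    try:
        public_key.verify(signature, data)    # raises on any mismatch
        return True
    except InvalidSignature:
        return False
```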
Gynvael