Tuesday, November 11, 2014

Three way MD5 collision

Previously I explained how I created two images one of James Brown the other of Barry White with the same MD5 hash. At the end of the post I said I was going to try and create a three way collision where three images have the same MD5 hash. Neil K made a suggestion about the image

So I set to work.

After a couple of false starts where I started with the wrong image file I managed to achieve a three way collision. Here are the images.

If you want to check
$ curl -s http://www.fishtrap.co.uk/black.jpg.coll | md5
$ curl -s http://www.fishtrap.co.uk/brown.jpg.coll | md5
$ curl -s http://www.fishtrap.co.uk/white.jpg.coll | md5

A new hash value

This isn't the same hash as before instead the 3 images now collide with a new hash value b69dd1fd1254868b6e0bb8ed9fe7ecad . This is because I had to add near collision blocks to all three images. In the case of the first two the blocks added are the same. This is probably best illustrated with a diagram.


Again I created the files with HashClash. As inputs I used white.jpg and black.jpg images. To make brown.jpg.coll I just had to append the extra collision blocks to brown.jpg which was already a collision with white.jpg. 

I could go on adding more and more files in a tree structure to get many documents to collide. The number of collisions needed is n-1 where n is the number of files. It was this tree of collisions that allowed Marc Stevens to predict the 2008 US presidential election.

A word about file sizes

The files started out different sizes to each other, however, before each collision was generated between two files padding had to be added to one of the files to make it the same as the other. Without this step it would be impossible to extend a collision in the unpadded version to the full MD5 algorithm. This is because the padding includes the size of data processed. 

Wednesday, November 5, 2014

Colliding MD5 Images go crazy

I nearly chocked on my cornflakes while looking at blog stats yesterday morning to see several thousand page views in the morning. Looking at the referring URLs it seemed a link had been posted on Hacker News. I browse that site most days. It was on the front page at number five, anything on the front page is BIG. Later on it got posted to a couple of reddit.com subreddits. I was getting a lot of traffic, all quite unexpected.

Originally I wrote the blog post for a couple of reasons, firstly to aid my own memory about what I had done and how to repeat it and secondly to explain to a few of my friends how I made the images which I had already put on twitter. I didn't really see it as Hacker News material, anyone who has an interest in hash functions will have done this I thought. 

Here are my answers to a couple of things people have said

This isn't new

Agreed, it certainly is not. All the papers and software I referenced and used are at least four years old. The original block buster attack by Wang is ten years old. Even before that in 1996 just four years after its release MD5 was known to be weak at not recommended. 

My guess at what caught peoples attention were two things:
  1. Images are much easier to comprehend as collisions than random byte strings.
  2. The fact I was clearly a rank amateur with nothing more than an AWS account and basic knowledge.

Can you do the same thing with SHA1 and SHA2?

No not really, in the case of SHA1 there are theoretical attacks which work the same way that mean it should not be used either. Marc Stevens has published an attack which he estimates to have a complexity of 261 compression functions for same prefix and 277 for chosen prefix collisions. In fact HashClash contains some code for finding differential paths and near collision blocks for SHA1. However to come up with a collision would take vast computing effort and (unless you control a botnet) expense. I'm sceptical of the exactness of the numbers in here but certainly that order of magnitude. As if to prove this fact there are no published collisions in the full SHA1. 

So with those sort of numbers I'm not got to stick it on an AWS instance backed by my credit card. 

SHA1 seems to be surprisingly resistant to differential analysis attacks this is one reason pretty much nobody has moved to using SHA3 yet. I like SHA3 or Keccak as it is also known.

SHA2 looks like even more of a challenge than SHA1 for differential analysis.

Is this practical?

Hopefully not, no one should be using MD5 for anything. However, old habits die hard and once upon a time MD5 seemed like a fast and secure hash function. You can still find it in use in package managers, download pages and is probably used server side in many applications to verify file uniqueness. I haven't found anywhere still using it in SSL certs following flame.

Can you do a third image collision?

Currently a work in progress, watch this space.