Hash collision

In computer science, a collision or clash is a situation that occurs when two distinct pieces of data have the same hash value, checksum, fingerprint, or cryptographic digest.[1]

Because hash functions are widely used in data management and computer security (in particular, cryptographic hash functions), collision avoidance has become a fundamental topic in computer science.

Collisions are unavoidable whenever members of a very large set (such as all possible person names, or all possible computer files) are mapped to a relatively short bit string. This is merely an instance of the pigeonhole principle.[1]
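The pigeonhole argument can be illustrated with a deliberately tiny hash. The sketch below is only an illustration: `tiny_hash` and `first_collision` are hypothetical names, and the 8-bit "sum of bytes" hash is an assumption chosen to keep the output space small. With only 256 possible hash values, any 257 distinct inputs must contain at least one collision.

```python
from itertools import count


def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the sum of the bytes, modulo 256."""
    return sum(data) % 256


def first_collision():
    """Hash b"0", b"1", b"2", ... until two distinct inputs share a hash."""
    seen = {}  # hash value -> first input that produced it
    for i in count():
        data = str(i).encode()
        h = tiny_hash(data)
        if h in seen:
            # Return both colliding inputs, the shared hash,
            # and how many inputs were hashed in total.
            return seen[h], data, h, i + 1
        seen[h] = data


a, b, h, tried = first_collision()
print(a, b, h, tried)
```

The search is guaranteed to stop after at most 257 inputs, because by then every one of the 256 possible hash values would already be taken.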

The impact of collisions depends on the application. When hash functions and fingerprints are used to identify similar data, such as homologous DNA sequences or similar audio files, the functions are designed so as to maximize the probability of collision between distinct but similar data, using techniques like locality-sensitive hashing.[2] Checksums, on the other hand, are designed to minimize the probability of collisions between similar inputs, without regard for collisions between very different inputs.[3]
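As a rough illustration of the checksum case, the sketch below uses CRC-32 from Python's standard `zlib` module. CRC-32 is chosen here only as a familiar checksum; the text above does not name a specific one. Two inputs that differ in a single character are similar, and a checksum is expected to tell them apart.

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog"
# Same length, one character changed ("dog" -> "cog").
altered = b"The quick brown fox jumps over the lazy cog"

crc_original = zlib.crc32(original)
crc_altered = zlib.crc32(altered)

# A small, localized change yields a different checksum.
print(f"{crc_original:08x} {crc_altered:08x}")
print(crc_original != crc_altered)
```

Because a CRC-32 value has only 32 bits, unrelated inputs can still collide; the design goal illustrated here concerns only small differences between similar inputs.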

Computer security

Because hash functions can map different data to the same hash (by virtue of the pigeonhole principle), malicious users can take advantage of this to mimic data.[4]

For example, consider a hash function that hashes data by returning the first three characters of the string it is given (i.e. "Password12345" goes to "Pas"). An attacker who does not know the user's password could instead enter "Pass", which generates the same hash value, "Pas". Even though the attacker does not know the correct password, they have a password that produces the same hash, which would give them access. This type of attack is called a preimage attack.
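The three-character example can be sketched in a few lines. The names `toy_hash`, `stored_hash`, and `login` are hypothetical, and the login check is an assumed, simplified model of a system that compares hashes rather than passwords.

```python
def toy_hash(text: str) -> str:
    """Deliberately weak hash: keep only the first three characters."""
    return text[:3]


# The system stores the hash of the real password, not the password itself.
stored_hash = toy_hash("Password12345")


def login(attempt: str) -> bool:
    """Accept any attempt whose hash matches the stored hash."""
    return toy_hash(attempt) == stored_hash


print(login("Pass"))     # True: "Pass" also hashes to "Pas"
print(login("hunter2"))  # False: hashes to "hun"
```

Any string beginning with "Pas" is accepted, so the attacker never needs the real password.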

In practice, security-related applications use cryptographic hash algorithms, which are designed to produce digests long enough that random matches are unlikely, to be fast enough to be used anywhere, and to make finding collisions extremely hard.[3]
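For contrast with a weak hash, the sketch below computes digests with SHA-256 from Python's standard `hashlib` module. SHA-256 is used only as one example of a cryptographic hash algorithm; the helper name `sha256_hex` and the sample inputs are assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


digest_a = sha256_hex(b"Password12345")
digest_b = sha256_hex(b"Pass")

# Each hex character encodes 4 bits, so 64 characters is a 256-bit digest.
print(len(digest_a) * 4)
# Inputs that share a prefix still produce unrelated-looking digests.
print(digest_a == digest_b)
```

Collisions still exist in principle, since the digest has a fixed length, but the digest is long enough that stumbling on one by chance is not a practical concern.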


  1. ^ a b Jered Floyd (2008-07-18). "What do Hash Collisions Really Mean?". permabit.wordpress.com: Permabits and Petabytes. Retrieved 2011-03-24. For the long explanation on cryptographic hashes and hash collisions, I wrote a column a bit back for SNW Online, "What you need to know about cryptographic hashes and enterprise storage". The short version is that deduplicating systems that use cryptographic hashes use those hashes to generate shorter "fingerprints" to uniquely identify each piece of data, and determine if that data already exists in the system. The trouble is, by a mathematical rule called the "pigeonhole principle", you can’t uniquely map any possible files or file chunk to a shorter fingerprint. Statistically, there are multiple possible files that have the same hash.
  2. ^ Rajaraman, A.; Ullman, J. (2010). "Mining of Massive Datasets, Ch. 3".
  3. ^ a b Al-Kuwari, Saif; Davenport, James H.; Bradford, Russell J. (2011). Cryptographic Hash Functions: Recent Design Trends and Security Notions. Inscrypt '10.
  4. ^ Schneier, Bruce. "Cryptanalysis of MD5 and SHA: Time for a New Standard". Computerworld. Archived from the original on 2016-03-16. Retrieved 2016-04-20. Much more than encryption algorithms, one-way hash functions are the workhorses of modern cryptography.