docs/MACHOC_HASH.md
# Machoc hash
The machoc hash is a fuzzy hash mechanism based on the *Call Flow Graph* _(CFG)_ of a function.
This documentation will provide the algorithm specification and some use cases.
## Algorithm specification
1. For each function, the basic blocks composing the CFG should be given ordinal numbers ordered by address.
2. A `call` instruction should split the basic block (to respect the original implementation with `metasm`).
3. Each basic block must be translated in a string of the form `NUMBER:[c,][DST, ...];`
* `NUMBER` is the basic block number
* `c,` must added if the basic block contains a call.
* `DST` is the next block's number. Each next block numbers must be added and separated with a comma.
* The line must be terminated by a semicolon.
4. Each block translation should be hashed with a fuzzy hashing function to produce a 32 bits output.
The reference implementation uses the `Murmurhash3` function.
5. The machoc hash of a binary should be the concatenation of all the machoc hashes of the different functions in the binary.
6. The output for a full binary could include the function address with the corresponding machoc hash.
## Example
To illustrate the algorithm, we will use the call flow graph of a function
composed of 10 basic blocks, with two of them containing a `call` instruction.
![Example Call Flow Graph](screenshots/cfg.png)
### Basic block numbering
Numbering the basic blocks ordered by address give the second graph.
![Numbered graph](screenshots/cfg_numbered.png)
### Basic block translation
Each translated block of the example graph gives the following output.
This output must be concatenated to give a single line.
```
1:2,3;
2:;
3:4,10;
4:6;
5:6;
6:c,7;
7:c,8;
8:5,9;
9:10;
10:;
```
### Final calculation
Hashing the output line with `Murmurhash3` gives the following hash :``0x1014997f``.
## Usage
Machoc hashes are used for several things in the Polichombr platform.
### Binary comparison
Machoc hashes can be used to calculate a similarity match between two samples.
Empiric experiments shows that a 80% Jaccard distance indicated a good match between two binaries.
### Clustering
One can use the machoc matches to create clusters of samples.
### Function matching
Machoc hashes can also be used to diff two samples and rename automatically similar function.
The function matching can be direct (same machoc hash), or based on heuristics.
For example, given two functions in two binaries, if the surrounding functions
in both binaries are direct matched, the two functions are probably similar despite producing a different hash.
## Implementations
Implementations of the *Machoc* algorithm (with some variations) are available for several tools and languages.
### Ruby with ```metasm```
This is the base implementation available in the ```MachocHash``` class in [AnalyzeIt.rb](https://github.com/ANSSI-FR/polichombr/blob/dev/analysis_tools/AnalyzeIt.rb#L174)
### IDAPython
This variant using MD5 instead of Murmurhash3 is available in the [0x00/idadiff](https://github.com/0x00ach/idadiff) repository.
### Radare2
A `Python` and `radare2` implementation is available at [https://github.com/conix-security/machoke/](https://github.com/conix-security/machoke/)