Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Golang Compatibility #46

Closed
bbengfort opened this issue Apr 3, 2023 · 2 comments
Closed

Golang Compatibility #46

bbengfort opened this issue Apr 3, 2023 · 2 comments

Comments

@bbengfort
Copy link

bbengfort commented Apr 3, 2023

Hello @hajimes thank you so much for providing this murmur3 implementation in Python and for all the work you do in open source; we really appreciate being able to use your library!

I was recently investigating some compatibility issues with the output produced by mmh3 and a Go library we were using internally. I asked the question on Stackoverflow and got a response back:

https://stackoverflow.com/questions/75921577/murmur3-hash-compatibility-between-go-and-python

It looks like the order of the two uint64s returned by the 128-bit algorithm is reversed between the two libraries; but it's simple enough to modify the returned results in either Go or Python to produce compatible hashes.

I was wondering; would you like me to open a PR to update the README with the compatibility information? Is there any other docs I should update in the PR?

Additionally, if there is any way to reverse the order order of the uint64s returned by murmur3 (e.g. with an argument to hash128 or hash_bytes) I'd be happy to open a PR for that as well. Let me know how you'd like to proceed!

@hajimes
Copy link
Owner

hajimes commented Apr 4, 2023

Hello @bbengfort!

Thank you very much for giving me a lot of examples (I checked your Stack Overflow question), which provided important clues for me to locate the problem.

Short answer

On little-endian machines, murmur3 in Go returns a 128-bit byte sequence that is incompatible with the original MurmurHash3 code and Python mmh3.

Specifically, the Sum function in https://github.com/spaolacci/murmur3/blob/master/murmur128.go seems to be the source of incompatibility.

To solve the issue, I wrote a Go gist here, whose input is the result of Sum128 (or Sum128WithSeed) and output is a 128-bit byte sequence that should be compatible with the original and mmh3.
https://gist.github.com/hajimes/b174e3cd7b0d0c14be97aa39010f2932

This is the first time I ever wrote a Go code, so feel free to modify it so that it suits your code style!

Long answer

The original MurmurHash3 code written by Austin Appleby is endian-sensitive.

The MurmurHash3_x64_128 function in C++ calculates two unsigned 64-bit integers and store them on memory.

Let's assume the result was [0o00112233, 0o44556677] (0o is an octal notation).

On little-endian machines, least significant bits (LSB) are written first, so their actual representations on memory is

<-- low address .. high address -->
         3322110077665544

This is what MurmurHash3_x64_128 (in C++) and mmh3.hash_bytes (in Python) returns (on little-endian machines).
(Note that mmh3.hash128 returns 0o4455667700112233 for the above case)

However, Sum in Go murmur3 stores two unsigned 64-bit integers in big-endian.
So given two integers [0o00112233, 0o44556677], it returns the byte sequence

<-- low address .. high address -->
         0011223344556677

which differs from the result in the original C++ and Python on little-endian platforms.

My gist in the above is basically the little-endian version of Go murmur3 Sum.


Edit: forgive me for "Short answer" is not that short haha

@bbengfort
Copy link
Author

Thank you so much for taking the time to respond to my question. Your Go snippet is well written; thank you very much! I will contact the authors of the Go Library to see their thoughts on updating their package with your suggestion. Thank you also for posting the temporary notice about compatibility -- I think that will be very helpful to other Go/Python users!

I also wanted to add for future readers of this issue that the Python workaround similarly flips the low and high address:

hash = mmh3.hash128(name.encode('utf-8'), signed=False).to_bytes(16, byteorder='big', signed=False)
hash = hash[8:] + hash[:8]

They can use the Go snippet or the Python snippet depending on which direction they want their hash compatibility to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants