-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Description
Hey,
I'm starting to use the property function StableStringHash $(MSBuildProjectName)-$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("x8")) for hashing a project filepath for the purpose that when you want to output to an intermediate folder that is shared (in a different obj/* location), you need to have a unique filename for the project folder.
But while looking at the generated hashes, I'm seeing a weird pattern in some numbers, where a single character change was not causing enough shuffle of the bits, which indicates that the hash algorithm is weak:
LibChild3_30-5e56f133
LibChild3_31-66f1f133
LibChild3_32-5220f133
LibChild3_33-5abbf133
LibChild3_34-7c62f133
LibChild3_35-84fdf133
LibChild3_36-6f8cf133
LibChild3_4-a48ac640
LibChild3_5-f243a77b
LibChild3_6-091903ca
LibChild3_7-56d1e505
LibChild3_8-5e70f9dc
LibChild3_9-ac29db17
You could see that LibChild3_30 and all up to LibChild3_36 are sharing the same 16 bits f133, while the other below from LibChild3_4 to LibChild3_9 are changing as you would expect.
So first, I found that I made a mistake, and I should have used MSBuildProjectFullPath instead of MSBuildProjectFile... as it was only hashing the filename but not the fullpath (and if we have a same project filename in different folders, we still want this to hash properly the folder)
But still, it doesn't look good at all...
So looking at the code:
msbuild/src/Shared/CommunicationsUtilities.cs
Lines 655 to 682 in cd95fa9
| internal static int GetHashCode(string fileVersion) | |
| { | |
| unsafe | |
| { | |
| fixed (char* src = fileVersion) | |
| { | |
| int hash1 = (5381 << 16) + 5381; | |
| int hash2 = hash1; | |
| int* pint = (int*)src; | |
| int len = fileVersion.Length; | |
| while (len > 0) | |
| { | |
| hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; | |
| if (len <= 2) | |
| { | |
| break; | |
| } | |
| hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1]; | |
| pint += 2; | |
| len -= 4; | |
| } | |
| return hash1 + (hash2 * 1566083941); | |
| } | |
| } | |
| } |
and hashing simple strings like this:
$([MSBuild]::StableStringHash('10')) => cdbab78f
$([MSBuild]::StableStringHash('11')) => cdb9b78f
$([MSBuild]::StableStringHash('100')) => b7435729
$([MSBuild]::StableStringHash('101')) => 59eacbc4
$([MSBuild]::StableStringHash('1000')) => 00f35729
$([MSBuild]::StableStringHash('1001')) => 758e5729
And we can see that the algorithm is messing with string with a length that is even here. Hashing 10 vs 11 is generating cdbab78f and cdb9b78f, and just changed by a few bits (!)
I don't know where this algorithm comes from, but it doesn't look good as it is bit shifting+xor, instead of better approaches like xor+multiply_by_prime_number for simple hash. For example it could use FNV-1A and it would provide a much better hash.
But, I assume that now that StableStringHash is out, we cannot really change it right? (as programs are relying on it being stable... 🤔 )
So one possible way to workaround it is to hash(str + hash(str)) and that's probably what I'm gonna do...