The property function StableStringHash is relatively weak on some input

Hey,
I'm starting to use the property function `StableStringHash`  `$(MSBuildProjectName)-$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("x8"))` for hashing a project filepath for the purpose that when you want to output to an intermediate folder that is shared (in a different `obj/*` location), you need to have a unique filename for the project folder.

But while looking at the generated hashes, I'm seeing a weird pattern in some numbers, where a single character change was not causing enough shuffle of the bits, which indicates that the hash algorithm is weak:

```
LibChild3_30-5e56f133
LibChild3_31-66f1f133
LibChild3_32-5220f133
LibChild3_33-5abbf133
LibChild3_34-7c62f133
LibChild3_35-84fdf133
LibChild3_36-6f8cf133
LibChild3_4-a48ac640 
LibChild3_5-f243a77b 
LibChild3_6-091903ca 
LibChild3_7-56d1e505 
LibChild3_8-5e70f9dc 
LibChild3_9-ac29db17 
```

You could see that `LibChild3_30` and all up to `LibChild3_36` are sharing the same 16 bits `f133`, while the other below from `LibChild3_4` to `LibChild3_9` are changing as you would expect.

So first, I found that I made a mistake, and I should have used `MSBuildProjectFullPath` instead of `MSBuildProjectFile`... as it was only hashing the filename but not the fullpath (and if we have a same project filename in different folders, we still want this to hash properly the folder)

But still, it doesn't look good at all...

So looking at the code:

https://github.com/dotnet/msbuild/blob/cd95fa922d0466b53163f93a32895f53df202f4b/src/Shared/CommunicationsUtilities.cs#L655-L682

and hashing simple strings like this:

```
$([MSBuild]::StableStringHash('10'))   => cdbab78f
$([MSBuild]::StableStringHash('11'))   => cdb9b78f
$([MSBuild]::StableStringHash('100'))  => b7435729
$([MSBuild]::StableStringHash('101'))  => 59eacbc4
$([MSBuild]::StableStringHash('1000')) => 00f35729
$([MSBuild]::StableStringHash('1001')) => 758e5729
```

And we can see that the algorithm is messing with string with a length that is even here. Hashing `10` vs `11` is generating `cdbab78f` and `cdb9b78f`, and just changed by a few bits (!)

I don't know where this algorithm comes from, but it doesn't look good as it is bit shifting+xor, instead of better approaches like xor+multiply_by_prime_number for simple hash. For example it could use [FNV-1A](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash) and it would provide a much better hash.

But, I assume that now that `StableStringHash` is out, we cannot really change it right? (as programs are relying on it being stable... 🤔 )

So one possible way to workaround it is to `hash(str + hash(str))` and that's probably what I'm gonna do... 

	internal static int GetHashCode(string fileVersion)
	{
	unsafe
	{
	fixed (char* src = fileVersion)
	{
	int hash1 = (5381 << 16) + 5381;
	int hash2 = hash1;

	int* pint = (int*)src;
	int len = fileVersion.Length;
	while (len > 0)
	{
	hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
	if (len <= 2)
	{
	break;
	}

	hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
	pint += 2;
	len -= 4;
	}

	return hash1 + (hash2 * 1566083941);
	}
	}
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

The property function StableStringHash is relatively weak on some input #7131

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The property function StableStringHash is relatively weak on some input #7131

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions