No even with step=1 the transform of h is wrong. As I said (h<<5)+(h>>2) is equivalent to (h<<7 + h) (with preservation of the two highest bits) followed by a rotation of 2 bits to the right. and (h<<7+h) is equivalent to (h*129) so the effect is that the transform multiplies by a non-prime (129 which is divisible by 3), followed by a rotation of 2 bits (which is also wrong because the word size is a multiple of 2, so you have divided the hash collision search space by half the wordsize).
This should immediately be changed to (h<<4)-(h>>3): this is equivalent to (h*127)>>>3, where 127 is prime (but not optimal as it is a simple Mersenne that still does correctly distributes all bits of the input)
As well you use a multiplication before just adding a new byte, so hashes of strings with equal lenght 'a','b','c' are always successive integers.
You should really first add the new byte to h, then perform the multiplication (by 127, not 129 if you still want to use a Mersenne number where the multiplication is just an addition or difference of the input and the shifted input!) and then the rotation (by 3 bits, not 2!)
This won't make any different in terms of local time to compute a single hash. But the hash will be MUCH stronger (because you used a prime for the multiplication factor and used a rotation by a odd prime number to that all bits are equally distributed across loops).
And as I said, Mersenne primes are not the best primes for the constant factor: it is only a workaround solution for very small devices that don't have a fast multiplication of two integers. you should use a prime that has alternates the highest number of sequences of 1 and sequences of 0 (ideally a binary number of the form "101010...101" and only if it is a prime.
It appears that the hex number 0x5555 = 0b0101_0101_0101_0101 (for a 16-bit multiplier, suitable for a 32-bit wordsize for the hash) = 21845 is not a prime as it is divisible by a small factor 5.
So you need to allow two successive bits being set to 1:
*
0b1101_0101_0101_0101 = 0xD555 = 54613 : still not a prime
* 0b1011_0101_0101_0101 = 0xB555 = 46421 : still not a prime
* 0b1010_1101_0101_0101 = 0xAD55 = 44373 : still not a prime
* 0b1010_1011_0101_0101 = 0xAB55 = 44861 : still not a prime
* 0b1010_1010_1101_0101 = 0xAAD5 = 43733 : still not a prime
* 0b1010_1010_1011_0101 = 0xAAB5 = 43701 : still not a prime
* 0b1010_1010_1010_1101 = 0xAAAD = 43693 : still not a prime
But of course NOT a Mersenne (so cannot be computed with a single addition and two shifts like in the existing code which uses a non-prime Mersenne of the form 2^p+1; with the factor 127, where you've just replaced the "+" by a "-", you use a prime Mersenne of the form 2^p-1, but it's not optimal in terms of propagation of bits of the input in the hash value).
So the optimum solution (for 32-bit wordsize) for fast hashing and the lowest level of collisions is finally this line of code replacing the existing one:
h ^= ((h + cast_byte(ptr[l])) * 0xAAAB)>>>3
Note that even if the multiplication of (h+b) by 0xAAAB cannot be reduced by two shifts and an addition it can be realized that this is like multiplying (h+b) by 0xAA, keeping that result, then shifting by 8 bit and adding the kept result and finally adding the initial h. And to multiply (h+b) by 0xAA, you can split it in two parts as well by shifting by 4 bits so that you need just multiply by 0xA (which is simply the sum of the input shifted by 1 or 3 bits) So in fine, it can be implemented as a few additions and shifts on platforms that don't have fast multiplication (but I bet that even in this case, a single multiplication operation will be faster than using these multiple shifts and additions; the C compiler will generate the code itself to support these multiplications of integers)
You can make similar search for 16-bit or 64-bit wordsize, it won't take long to find a suitable prime that maximimizes the number of alternating sequences of bits set to 1 and 0. The same line of code above will be used, only the constant 0xAAAB will be different.