lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


No, you've reduced the number of additions/substractions if you accept more groups of two bits. My solution with only groups of 1 bit has no solution for 32-bit words, but it has a solution for only 1 group of 2 bits.

Think about multiplication by 0x011, it is the same as a multiplication by 0x100 and a subtraction of the input: 3*x = 4*x - x = (x<<2)-(x<<0).

The solution that maximum the shuffling of input bits must maximize the number of alternating sequences of 1 and 0. But if all your groups of 1 have 2 bits, and all your groups of 0 have one bit, every three bits you have only 2 alternances, so for a 16-bit multiplier you can only get at most ceil(16/3*2)=12 alternances; my solution has found 15 alternances by minimizing the number of long groups.

Think about how you compute multiplications with the Egyptian algorithm (which is optimal in base 2 where you only need additions or substractions and shifts and is faster than the classic multiplications using only additions and shift). The Egyptian algorithm for multiplication of integers is used in hardware ALU of every modern CPU because it divides the number of cycles by 2 (if additions must be sequenced and there's a single adder), or because it reduces the number of addition stages (that need to handle carries that can propagate to the left: with the Egyptian algorithm carries never propagate to more than one position to the left, so the multiplication can be parallelized and realized in a single clock cycle). For a 16-bit * 16-bit multiplication this algo requires at most 16 parallelizable adder stages (and shifts have no cost as they are hardwired in the multiplier and don't need any cycle). You can easily realize 128-bit * 128*bit multiplications without lot of hardware and still in a single cycle (you just need 128 parallel adders), something you can't do in classic programming, except if you use vector-instructions to parallelize the additions. the number of parallel adders is proportional to the bit length of the operand but does not increase the computing time which remaing in O(1) complecity instead of O(n) with classic multiplications you learn today in primary schools (Egyptians in the antiquity already used the better algorithm which can alternate additions and substraction, so they were abel to compute multiplications faster than us with our conventional "simple" method).

It is important to understand that because a multiplication is not more complex in time (just in hardware space) than a single addition or substraction. This algorithm can be used as well in FPU for computing multiplication of floatting points (separately multiply the integer mantissas, and add the exponents, adjust the exponent by shifting (and rounding) the correct number of bits to renormalize the mantissa and this requires only a single addition of a small value. So floatting points can be multiplied in 2 cycles independantly of their precision!








Le mar. 26 mai 2020 à 08:51, 云风 Cloud Wu <cloudwu@gmail.com> a écrit :
Philippe Verdy <verdyp@gmail.com> 于2020年5月26日周二 上午1:18写道:

> And as I said, Mersenne primes are not the best primes for the constant factor: it is only a workaround solution for very small devices that don't have a fast multiplication of two integers. you should use a prime that has alternates the highest number of sequences of 1 and sequences of 0 (ideally a binary number of the form "101010...101" and only if it is a prime.

> * 0b1010_1010_1010_1011 = 0xAAAB = 43691 : IT IS A PRIME ! (see https://primes.utm.edu/lists/small/10000.txt)

Is the form "...011011" a better choice ? It has more "1" in the
binary number, and the longest number of sequences of 1 is two. I
think the form "011" means 2 times of addition.

0b1011011011011011 = 0xB6Db = 46811 : It is a prime , too.