lua-users home
lua-l archive



And a similar approach is used to implement fast Fourier transform algorithms, or to evaluate a polynomial in about two cycles (such as a convergent Taylor series, or the truncated series expansions used for trigonometric functions, exponentials, logarithms and square roots, without needing any costly loop). Hardware FPUs use many parallel adders to support these classic functions (this requires a lot of gates, but not as many as you would imagine: only a number proportional to the bit size of the input operands), and the same technique is used in vectorized instructions, which also support the same functions and arithmetic operations in one or two cycles.
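As a sketch of my own (not from the original post): the "polynomial in two cycles" idea is what Estrin's scheme does, pairing terms so that independent multiply-adds can run in parallel, giving log2(n) sequential stages instead of the n stages of Horner's rule.

```python
import math

def estrin_degree7(c, x):
    """Evaluate c[0] + c[1]*x + ... + c[7]*x**7 with a parallel-friendly tree."""
    x2 = x * x          # these powers are computed once, reusable by all stages
    x4 = x2 * x2
    # Stage 1: four independent multiply-adds (parallel in hardware)
    p01 = c[0] + c[1] * x
    p23 = c[2] + c[3] * x
    p45 = c[4] + c[5] * x
    p67 = c[6] + c[7] * x
    # Stage 2: two independent combines
    p03 = p01 + p23 * x2
    p47 = p45 + p67 * x2
    # Stage 3: final combine
    return p03 + p47 * x4

# Truncated Taylor series for exp(x): sum of x**k / k! for k = 0..7
coeffs = [1.0 / math.factorial(k) for k in range(8)]
print(estrin_degree7(coeffs, 0.5))   # close to math.exp(0.5)
```

Three dependent stages instead of seven, which is why a wide FPU can close such a series evaluation in very few cycles.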

This has consequences for security algorithms: things that seem complex to do in classic software can be parallelized a lot, except when individual stages are not fully independent of each other. And this applies to hashing algorithms as well: Knuth's multiplicative algorithm is good for fast hashing, but does not resist cryptanalysis at all, as it can be reversed in a single cycle given modest hardware to parallelize the stages. So it is not secure at all, but it is fast, and can even be extremely fast (costing no more than 1 clock cycle if you have 32 parallel machines for reversing a 32-bit multiplication: this is exactly what is done to implement integer division, also with 32 parallel adders for 32-bit operands!).
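A small illustration of my own of why this hash is reversible: a multiplicative hash over 32-bit words is a bijection whenever the constant is odd, so it can be undone with one multiplication by the modular inverse (fast, but worthless cryptographically).

```python
MASK = 0xFFFFFFFF
A = 2654435761            # Knuth's multiplicative constant (~2**32 / golden ratio), odd

def knuth_hash(x):
    """Multiplicative hash of a 32-bit word."""
    return (x * A) & MASK

# Modular inverse of A mod 2**32 exists because A is odd (Python 3.8+)
A_INV = pow(A, -1, 1 << 32)

def unhash(h):
    """Invert the hash with a single multiplication."""
    return (h * A_INV) & MASK

x = 123456789
assert unhash(knuth_hash(x)) == x
```

One multiply to hash, one multiply to unhash: there is nothing for a cryptanalyst to break.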

More complex problems (which resist cryptanalysis and are not easily parallelizable) do not use multiplication of integers but multiplication of polynomials, though there are known strategies to reduce the number of stages by exploiting some of their properties. Their time complexity is not O(1) like hardware integer multiplication, but O(n/2) for polynomial multiplications.
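Assuming "multiplication of polynomials" here means carry-less multiplication over GF(2) (as used in CRCs and GHASH), a minimal sketch: each set bit of one operand contributes a shifted copy of the other, combined with XOR instead of addition, so no carries propagate between stages.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two integers viewed as bit-polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a       # XOR replaces addition: no carry chain at all
        a <<= 1               # next power of x
        b >>= 1
    return result

# (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1 over GF(2)
assert clmul(0b101, 0b011) == 0b1111
```

Modern CPUs expose exactly this as a single instruction (e.g. x86 PCLMULQDQ), which is what makes polynomial-based hashes practical.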



On Tue, May 26, 2020 at 12:28 PM, Philippe Verdy <verdyp@gmail.com> wrote:
No, you've reduced the number of additions/subtractions by accepting more groups of two bits. With only one-bit groups there is no solution for 32-bit words, but there is a solution allowing a single group of 2 bits.

Think about multiplication by 0b011: it is the same as a multiplication by 0b100 followed by a subtraction of the input: 3*x = 4*x - x = (x<<2) - (x<<0).
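This identity generalizes: a run of k one-bits in the multiplier can be replaced by one shift and one subtraction, since (2**k - 1) * x == (x << k) - x. A quick check of my own:

```python
x = 1717
assert 3 * x == (x << 2) - x          # 0b011: two additions become shift minus x
assert 7 * x == (x << 3) - x          # 0b111: three additions become shift minus x
# Two runs of ones handled the same way: 0b011011 = 32 - 8 + 4 - 1 = 27
assert 0b011011 * x == (x << 5) - (x << 3) + (x << 2) - x
```

So the cost of a run of ones is independent of its length: one add and one subtract per run.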

The solution that maximizes the shuffling of input bits must maximize the number of alternating runs of 1s and 0s. But if all your groups of 1s have two bits and all your groups of 0s have one bit, every three bits give only 2 alternations, so for a 16-bit multiplier you can get at most ceil(16/3)*2 = 12 alternations; my solution found 15 alternations by minimizing the number of long groups.
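A small helper of my own (not from the thread) to compare the two candidate constants by counting how many adjacent bit pairs differ; more alternations means more shifted terms get mixed into the product:

```python
def alternations(n, width=16):
    """Count positions where adjacent bits of an n-bit word differ."""
    bits = [(n >> i) & 1 for i in range(width)]
    return sum(bits[i] != bits[i + 1] for i in range(width - 1))

print(alternations(0xAAAB))   # 0b1010101010101011: alternates at nearly every bit
print(alternations(0xB6DB))   # 0b1011011011011011: groups of two ones, fewer alternations
```

By this count the 0xAAAB form wins, which matches the argument above (exact totals depend on the counting convention used).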

Think about how you compute multiplications with the Egyptian algorithm (which is optimal in base 2, where you only need additions or subtractions and shifts, and is faster than the classic multiplication using only additions and shifts). The Egyptian algorithm for multiplication of integers is used in the hardware ALU of every modern CPU because it divides the number of cycles by 2 (if additions must be sequenced and there's a single adder), or because it reduces the number of addition stages (which need to handle carries that can propagate to the left: with the Egyptian algorithm carries never propagate more than one position to the left, so the multiplication can be parallelized and realized in a single clock cycle).

For a 16-bit * 16-bit multiplication this algorithm requires at most 16 parallelizable adder stages (and shifts have no cost, as they are hardwired in the multiplier and don't need any cycle). You can easily realize 128-bit * 128-bit multiplications without a lot of hardware and still in a single cycle (you just need 128 parallel adders), something you can't do in classic programming, except if you use vector instructions to parallelize the additions. The number of parallel adders is proportional to the bit length of the operands but does not increase the computing time, which remains O(1) complexity instead of the O(n) of the classic multiplication you learn today in primary school (the ancient Egyptians already used the better algorithm, which can alternate additions and subtractions, so they were able to compute multiplications faster than us with our conventional "simple" method).
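A software sketch of my own of the binary Egyptian (peasant) method: the product is built from one shifted addend per set bit of the multiplier. In software the loop is sequential; in hardware each addend feeds its own adder stage, which is why the whole product can close in one cycle.

```python
def egyptian_mul(a, b):
    """Binary Egyptian (peasant) multiplication: shift-and-add."""
    product = 0
    while b:
        if b & 1:
            product += a   # one adder stage per set bit of b
        a <<= 1            # doubling: free hardwired shift in hardware
        b >>= 1            # halving: examine the next multiplier bit
    return product

assert egyptian_mul(12345, 6789) == 12345 * 6789
```

At most one addend per multiplier bit, so a 16-bit multiplier needs at most 16 adder stages, exactly as described above.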

It is important to understand this, because a multiplication is then no more complex in time (just in hardware space) than a single addition or subtraction. This algorithm can be used as well in an FPU for multiplying floating-point numbers: separately multiply the integer mantissas and add the exponents, then adjust the exponent by shifting (and rounding) the mantissa by the correct number of bits to renormalize it, which requires only a single addition of a small value. So floating-point numbers can be multiplied in 2 cycles independently of their precision!
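The mantissa/exponent decomposition described above can be sketched with Python's math.frexp/ldexp standing in for the FPU's field extraction: multiply the fractional mantissas, add the exponents, then renormalize.

```python
import math

def fmul(a, b):
    """Multiply two floats via mantissa product and exponent sum."""
    ma, ea = math.frexp(a)     # a == ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    m = ma * mb                # mantissa product lands in [0.25, 1)
    e = ea + eb                # exponents simply add
    return math.ldexp(m, e)    # renormalization is an exponent adjustment

assert fmul(3.5, 2.25) == 3.5 * 2.25
```

The two independent steps (mantissa multiply, exponent add) are what allow the hardware to pipeline the whole operation in a fixed small number of cycles.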


On Tue, May 26, 2020 at 8:51 AM, 云风 Cloud Wu <cloudwu@gmail.com> wrote:
Philippe Verdy <verdyp@gmail.com> wrote on Tue, May 26, 2020 at 1:18 AM:

> And as I said, Mersenne primes are not the best primes for the constant factor: they are only a workaround for very small devices that don't have a fast multiplication of two integers. You should use a prime that alternates the highest number of runs of 1s and 0s (ideally a binary number of the form "101010...101"), and only if it is a prime.

> * 0b1010_1010_1010_1011 = 0xAAAB = 43691 : IT IS A PRIME ! (see https://primes.utm.edu/lists/small/10000.txt)

Is the form "...011011" a better choice? It has more 1 bits in the
binary number, and the longest run of 1s is two. I think the form
"011" means two additions.

0b1011011011011011 = 0xB6DB = 46811 : It is a prime, too.
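Both constants from the thread can be checked with simple trial division (fine at this size), confirming the primality claims and the stated bit patterns:

```python
def is_prime(n):
    """Trial-division primality test, adequate for 16-bit constants."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

assert 0xAAAB == 0b1010101010101011 == 43691 and is_prime(0xAAAB)
assert 0xB6DB == 0b1011011011011011 == 46811 and is_prime(0xB6DB)
```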