lua-l archive


I found this explanation on this webpage:

idiv r64 is 56 uops for the front-end, with latency from 41 to 95
cycles (from divisor to quotient, which is the relevant case here, I
think). div r64 is 33 uops for the front-end, with latency from 35 to 87
cycles (for that same latency path).

So DIV is probably faster than IDIV, since it needs fewer uops and its
latency is shorter.

I don't know if that has anything to do with this, though...

On Fri, Jan 20, 2023 at 23:42, bil til <> wrote:
> Oh... yes, sorry, stupid me for not reading this before; this is
> a nice comment above this function.
> Yes, this might be the reason... it possibly depends quite a bit on
> the CPU you are using. I do not really know typical PC CPUs very well
> at the assembler level, please excuse me. You would have to check the
> assembly code generated by the C compiler to find the exact reason.
>
> I am just working with a Cortex-M4 microcontroller (32-bit, ARMCC/Keil
> C++ compiler), where I can see the assembly code very straightforwardly
> in my debugger. But on my Cortex-M4 this comment is NOT correct: in
> the generated assembly, the unsigned modulo comes out like this:
>   UDIV    r3,r1,r2
>   MLS     r1,r2,r3,r1
>   ADD     r0,r0,r1,LSL #4
> and the signed modulo gives the following assembly listing:
>   SDIV    r3,r1,r2
>   MLS     r1,r2,r3,r1
>   ADD     r0,r0,r1,LSL #4
> ... so the speed is exactly the same. I can only assume that there is
> some slight difference in the modulo result depending on whether you
> do it the signed or the unsigned way... modulo of negative ints is
> somehow neither in C nor in CPUs a "perfectly defined math
> operation". As far as I know, some C compilers will calculate
> "-4 % 3" as 2 and some as -1, or have similar problems.