## Feature #15166

open### 2.5 times faster implementation than current gcd implmentation

**Description**

This is to be more explicit (and accurate) than https://bugs.ruby-lang.org/issues/15161

This is my modified gcd benchmarks code, originally presented by Daniel Lemire (see 15161).

https://gist.github.com/jzakiya/44eae4feeda8f6b048e19ff41a0c6566

Ruby's current implementation of Stein's gcd algorithm is only slightly faster than the

code posted on the wikepedia page, and over 2.5 times slower than the fastest implementation

in the benchmarks.

```
[jzakiya@localhost ~]$ ./gcdbenchmarks
gcd between numbers in [1 and 2000]
gcdwikipedia7fast32 : time = 99
gcdwikipedia4fast : time = 121
gcdFranke : time = 126
gcdwikipedia3fast : time = 134
gcdwikipedia2fastswap : time = 136
gcdwikipedia5fast : time = 139
gcdwikipedia7fast : time = 138
gcdwikipedia2fast : time = 136
gcdwikipedia6fastxchg : time = 144
gcdwikipedia2fastxchg : time = 156
gcd_iterative_mod : time = 210
gcd_recursive : time = 215
basicgcd : time = 211
rubygcd : time = 267
gcdwikipedia2 : time = 321
gcd between numbers in [1000000001 and 1000002000]
gcdwikipedia7fast32 : time = 100
gcdwikipedia4fast : time = 121
gcdFranke : time = 126
gcdwikipedia3fast : time = 134
gcdwikipedia2fastswap : time = 136
gcdwikipedia5fast : time = 138
gcdwikipedia7fast : time = 138
gcdwikipedia2fast : time = 136
gcdwikipedia6fastxchg : time = 144
gcdwikipedia2fastxchg : time = 156
gcd_iterative_mod : time = 210
gcd_recursive : time = 215
basicgcd : time = 211
rubygcd : time = 269
gcdwikipedia2 : time = 323
```

This is Ruby's code per: https://github.com/ruby/ruby/blob/3abbaab1a7a97d18f481164c7dc48749b86d7f39/rational.c#L285-L307

which is basically the wikepedia implementation.

```
inline static long
i_gcd(long x, long y)
{
unsigned long u, v, t;
int shift;
if (x < 0)
x = -x;
if (y < 0)
y = -y;
if (x == 0)
return y;
if (y == 0)
return x;
u = (unsigned long)x;
v = (unsigned long)y;
for (shift = 0; ((u | v) & 1) == 0; ++shift) {
u >>= 1;
v >>= 1;
}
while ((u & 1) == 0)
u >>= 1;
do {
while ((v & 1) == 0)
v >>= 1;
if (u > v) {
t = v;
v = u;
u = t;
}
v = v - u;
} while (v != 0);
return (long)(u << shift);
}
```

This is the fastest implementation from the benchmarks. (I originally, wrongly, cited

the implementation in the article, which is 4|5th fastest in benchmarks, but

still almost 2x faster than the Ruby implementation.)

```
// based on wikipedia's article,
// fixed by D. Lemire, K. Willets
unsigned int gcdwikipedia7fast32(unsigned int u, unsigned int v)
{
int shift, uz, vz;
if ( u == 0) return v;
if ( v == 0) return u;
uz = __builtin_ctz(u);
vz = __builtin_ctz(v);
shift = uz > vz ? vz : uz;
u >>= uz;
do {
v >>= vz;
int diff = v;
diff -= u;
if ( diff == 0 ) break;
vz = __builtin_ctz(diff);
if ( v < u ) u = v;
v = abs(diff);
} while( 1 );
return u << shift;
}
```

The key to speeding up all the algorithms is using the `__builtin_ctz(x)`

directive

to determine the number of trailing binary '0's.

**Files**