Goldschmidt division

Feed

date: 2026-05-21 19:55:36

categories: programming

firstPublishDate: 2026-05-21 19:55:36

The Goldschmidt division algorithm is a fast division using the Newton-Raphson approximation.

Goldschmidt division on wikipedia

It is an algorithm computing the quotient `c` of `n` (numerator) divided by `d` (divider).

The Newton-Raphson approximation estimates `1/d` and the quotient is then:

`1/d` is estimated by iterating this:

The quotient `c` is estimated with:

Goldschmidt division for 16bit signed integers

We replace `d` with a value `a` so that `a` is in the range [0.5, 1[ and `1/a` is in the range ]1, 2]. We choose the range [0.5, 1[ for `a` because the iteration `x * (2 - a * x)` converges fast for this range.

`a` is `d` divided by a power of 2 number so that `a` is in the expected range:

x is an approximation of `1/a`, we get the code:

The computations are done using fixed point integers with 15 bit for the fractional number to fit `x * (2 - x * a)` into a 32 bit integer.

d is a 16bit signed integer in the range 0 to 32767, we compute `a` like this:

clz is a function (or the `lzcnt` instruction) counting the leading 0s in `d` (from 1 to 15), bit 15 is 0 and bit 14 is 1 so `a` in range 16384 to 32768.

The initial value for `x` is 1.5 which in fixed point is: `0xc000`. We get the code:

`n * 1/a` needs to be adjusted to become `n * 1/d` with shift right by 15-shift positions and the 15bit fractional number is removed with shift right by 15 positions. The fractional number in the result is an estimation of the remainder and is often not accurate.

Signed division is implemented by computing the sign of the result and dividing abs(n) with abs(d).

We can choose an initial value for x to make x converge faster.

The `clz` function is from the book `Henry S. Warren - Hacker's Delight (2nd Edition)-Addison-Wesley Professional (2012)`:

Implementation in C16 assembly

C16 CPU

The clz function is implemented like this:

The nrdiv function is the implementation of the newton-raphson method for the goldschmidt division approximation:

It is not very efficient to use the clz function because it takes 50 clock cycles to run and the `lzcnt` instruction gives the result in 1 clock cycle.

Long division

To compare the performance of the goldschmidt division, I implemented the traditional long division in assembly, it is the slowest:

Division by 10

There is an efficient divide by 10 function in the book `Henry S. Warren - Hacker's Delight (2nd Edition)-Addison-Wesley Professional (2012)`:

This divide by 10 function is useful for displaying a binary number is base 10.

I implemented it in C16 assembly:

Hardware implementation in verilog

The divide by 10 function implemented in verilog is:

I added a `div10` instruction to my C16 cpu and it takes 1 clock cycle.

I implemented a 32 bit goldschmidt division in verilog, the iterations need to be computed with 66 bit values.

For the goldschmidt division, we need to count the number of leading zero bits in the dividend to shift the dividend and compute an approximation of 1/dividend.

In verilog, leading zero are counted like this:

`sdivd` is the dividend, it is never equal to 0 because divide by 0 is invalid. It sets bit 4 in the result, then bit 3, 2, 1 and 0. For bit 0, it doesn't need to test if the bit is 1 because we know the dividend is not 0.

`diva` is the shifted dividend in the range 2^31 to 2^32-1. `divx` initial value is in the range 2^32 to 2^33, it selected by using 4 bits (bit 27 to 30, bit 31 is always 1) in `diva`.

The execution stage of the division takes 1 clock cycle at 100Mhz, I don't see any timing violation in the logs from vivado (vivado is the tool from amd/xilinx for compiling designs for FPGAs) so it should be ok.

I have been running 50 million divisions with random numbers in a testbench and the division approximation is correct.

I first tried with 32 initial values and 3 iterations, it didn't work. So I increased to 64 and 128 initial values and the approximation was still incorrect sometimes. Maybe it works with 256 initial values and 3 iterations but it takes less ressources to have 16 initial values and 4 iterations.

Performance

Values are clock cycles.

The software implementation of the goldschmidt division is 4 times faster than the long division and the divide by 10 function is ten times faster than the long division. In the hardware column, I wrote 3 clock cycle for the hw division and the hw divide by 10 because the C16 CPU is currently not pipelined and the execution stage takes 1 clock cycle.

Feed

Guestbook

Proxied content from gemini://gmi.noulin.net/2026-05-21-goldschmidt-division.gmi (external content)

Gemini request details:

Original URL
gemini://gmi.noulin.net/2026-05-21-goldschmidt-division.gmi
Status code
Success
Meta
text/gemini
Proxied by
kineto

Be advised that no attempt was made to verify the remote SSL certificate.