Montgomery modular multiplication

The computation of xymodpxy \; \text{mod} \; p is much slower than usual multiplication, since it requires a division to know how many times pp has to be subtracted from the product.

Montgomery modular multiplication (opens new window) is a way to do this computation faster. The trick is to move the computations to the Montgomery space which contains the integers smaller than p2mp 2^m for some positive integer mm. We move the element aa to the Montgomery space by multiplying it by R=2mpR = 2^m - p.

Let’s say we need to compute

(a+b)modp(a + b) \; \text{mod} \; p

We move the computation to the Montgomery space:

(aR+bR)modp(a R + b R) \; \text{mod} \; p

We compute aR+bR,a R + b R, then we need to divide it by RR, and finally we need to subtract pp if the result is bigger than pp (we know that the result is smaller than 2p2p because a<p,b<pa < p, b < p).

Well, this is more complicated than simply computing a+ba + b and then subtracting pp if the result is bigger than pp. However, it is not much more complicated if we do it the right way. Instead of computing (aR+bR)/R(a R + b R) / R simply as integers and in the last step computing modulo pp (subtracting pp if necessary), we can much more efficiently compute:

(aR+bR)/Rmodp(a R + b R) / R \; \text{mod} \; p

Let’s see how.

Efficient division by R modulo p

Note that

R=2mmodpR = 2^m \; \text{mod} \; p

That means

(aR+bR)/R=(aR+bR)/2mmodp(a R + b R) / R = (aR + bR) / 2^m \; \text{mod} \; p

For computers, dividing by 2m2^m means almost no work; it’s just a shift operation. Given that xx is divisible by 2m2^m, division by 2m2^m means shifting by mm positions:

x/2m=x>>mx / 2^m = x >> m

Of course, aR+bRaR + bR might not be divisible by 2m2^m. But in Zp\mathbb{Z}_p we have some liberty to make an arbitrary xx divisible by 2m2^m.

We just need to find kZk \in \mathbb{Z} such that x+kpx + k p is divisible by 2m2^m.

Computing x/2mx/2^m in Zp\mathbb{Z_p} means computing the inverse of 2m2^m in Zp\mathbb{Z_p} (which is the same as the inverse of RR in Zp\mathbb{Z_p}) – let’s denote this inverse by R1R^{-1} – and then computing xR1x R^{-1}:

x/2m=xR1modpx / 2^m = x R^{-1} \; \text{mod} \; p

For any xx it holds:

xR1=(x+kp)R1modpx R^{-1} = (x + k p) R^{-1} \; \text{mod} \; p

But once we have kZk \in \mathbb{Z} such that x+kpx + k p is divisible by 2m2^m, we can compute xR1x R^{-1} mod pp simply in integers as (x+kp)/2m(x + kp) / 2^m. So we can compute xR1x R^{-1} mod pp very efficiently simply by shifting x+kpx + kp by mm positions:

xR1modp=(x+kp)>>mx R^{-1} \; \text{mod} \; p = (x + kp) >> m

We will see in the example section how to compute kk (and why (x+kp)>>m(x + kp) >> m is smaller than 2p2p, which makes it easy to make it smaller than pp).

Multiplication

As we saw above, addition by moving to the Montgomery space and back is not much more expensive than without moving it there and back, but the real reason we are moving to the Montgomery space is multiplication.

The computation of abab mod pp is not nearly as simple as the computation of a+ba + b mod pp where in the worst-case scenario we need to do an additional subtraction of pp. This is why we move a,ba, b to the Montgomery space. We get x,yZpx, y \in \mathbb{Z}_p such that:

x=aRmodpx = aR \; \text{mod} \; p

y=bRmodpy = bR \; \text{mod} \; p

We compute xyxy and reduce it back to the Montgomery space by dividing it by RR. Note that:

xy=aRbRmodpxy = aR bR \; \text{mod} \; p

So by dividing xyxy by RR we get:

abRmodpabR \; \text{mod} \; p

So we have a Montgomery representation of abab. To compute the final value abab, we again use the efficient division by RR.

Note that this pays off when we have multiple operations. We move the elements to the Montgomery space, execute all the operations in the Montgomery space, and as the last step move the elements back from the Montgomery space.

Example

For demonstration purposes, let’s have: m=264,p=264257,R=257.m = 2^{64}, p = 2^{64} - 257, R = 257.

The division of aa by RR would go as follows.

First, we compute:

μ=p1mod264\mu = -p^{-1} \; \text{mod} \; 2^{64}

Note that for some lZ:l \in \mathbb{Z}:

μp=(l264+1)\mu p = -(l 2^{64} + 1)

We then compute:

(μ(amod264)mod264)p+a(\mu (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64}) \cdot p + a

Let’s observe the expression above modulo 2642^{64}:

(μp(amod264)mod264)+amod264=(\mu p (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64}) + a \; \text{mod} \; 2^{64} =

=(l264+1)(amod264)+amod264== -(l 2^{64} + 1) (a \; \text{mod} \; 2^{64}) + a \; \text{mod} \; 2^{64} =

=1(amod264)+amod264=0= -1 (a \; \text{mod} \; 2^{64}) + a \; \text{mod} \; 2^{64} = 0

That means we can divide

(μ(amod264)mod264)p+a(\mu (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64}) \cdot p + a

by 2642^{64} using the shift operation.

So we found

k=(μ(amod264)mod264)k = (\mu (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64})

that we were looking for in the Efficient division section above.

We can compute

x=((μ(amod264)mod264)p+a)/264x = ((\mu (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64}) \cdot p + a) / 2^{64}

Because a<p264a < p2^{64} (this is our Montgomery space) and

(μ(amod264)mod264)p<p264(\mu (a \; \text{mod} \; 2^{64}) \; \text{mod} \; 2^{64}) \cdot p < p 2^{64}

we have

x<2px < 2p

If xpx \geq p, then we compute

x=xpx = x - p

Finaly, we have xx such that:

x=aRmodpx = aR \; \text{mod} \; p