Performance compared with native numpy

To compare the performance of galois and native numpy, we’ll use a prime field GF(p). This is because it is the simplest field. Namely, addition, subtraction, and multiplication are modulo p, which can be simply computed with numpy arrays (x + y) % p. For extension fields GF(p^m), the arithmetic is computed using polynomials over GF(p) and can’t be so tersely expressed in numpy.

Lookup performance

For fields with order less than or equal to 2^20, galois uses lookup tables for efficiency. Here is an example of multiplying two arrays in GF(31) using native numpy and galois with ufunc_mode="jit-lookup".

In [1]: import numpy as np

In [2]: import galois

In [3]: GF = galois.GF(31); print(GF.properties)
GF(31):
  characteristic: 31
  degree: 1
  order: 31
  irreducible_poly: Poly(x + 28, GF(31))
  is_primitive_poly: True
  primitive_element: GF(3, order=31)
  dtypes: ['uint8', 'uint16', 'uint32', 'int8', 'int16', 'int32', 'int64']
  ufunc_mode: 'jit-lookup'
  ufunc_target: 'cpu'

In [4]: def construct_arrays(GF, N):
   ...:     a = np.random.randint(1, GF.order, N, dtype=int)
   ...:     b = np.random.randint(1, GF.order, N, dtype=int)
   ...:     ga = a.view(GF)
   ...:     gb = b.view(GF)
   ...:     return a, b, ga, gb
   ...:

In [5]: N = int(10e3)

In [6]: a, b, ga, gb = construct_arrays(GF, N)

In [7]: a
Out[7]: array([29, 20, 29, ..., 29, 22, 24])

In [8]: ga
Out[8]: GF([29, 20, 29, ..., 29, 22, 24], order=31)

In [9]: %timeit (a * b) % GF.order
88.2 µs ± 931 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [10]: %timeit ga * gb
67.9 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The galois ufunc runtime has a floor, however. This is due to a requirement to view the output array and convert its dtype with astype(). For example, for small array sizes numpy is faster than galois because it doesn’t need to do these conversions.

In [15]: N = 10

In [16]: a, b, ga, gb = construct_arrays(GF, N)

In [17]: a
Out[17]: array([17, 22,  9, 11,  7, 14, 27, 16, 21, 30])

In [18]: ga
Out[18]: GF([17, 22,  9, 11,  7, 14, 27, 16, 21, 30], order=31)

In [19]: %timeit (a * b) % GF.order
1.32 µs ± 22.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [20]: %timeit ga * gb
35.1 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This runtime discrepancy can be explained by the time numpy takes to perform the type conversion and view.

In [21]: %timeit a.astype(np.uint8).view(GF)
31.2 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

However, for large N galois is strictly faster than numpy.

In [22]: N = int(10e6)

In [23]: a, b, ga, gb = construct_arrays(GF, N)

In [24]: a
Out[24]: array([29,  9, 16, ..., 15, 24,  9])

In [25]: ga
Out[25]: GF([29,  9, 16, ..., 15, 24,  9], order=31)

In [26]: %timeit (a * b) % GF.order
109 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: %timeit ga * gb
55.2 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Calculation performance

For fields with order greater than 2^20, galois will use explicit arithmetic calculation rather than lookup tables. Even in these cases, galois is faster than numpy!

Here is an example multiplying two arrays in GF(2097169) using numpy and galois with ufunc_mode="jit-calculate".

In [1]: import numpy as np

In [2]: import galois

In [3]: prime = galois.next_prime(2**21); prime
Out[3]: 2097169

In [4]: GF = galois.GF(prime); print(GF.properties)
GF(2097169):
  characteristic: 2097169
  degree: 1
  order: 2097169
  irreducible_poly: Poly(x + 2097122, GF(2097169))
  is_primitive_poly: True
  primitive_element: GF(47, order=2097169)
  dtypes: ['uint32', 'int32', 'int64']
  ufunc_mode: 'jit-calculate'
  ufunc_target: 'cpu'

In [5]: def construct_arrays(GF, N):
   ...:     a = np.random.randint(1, GF.order, N, dtype=int)
   ...:     b = np.random.randint(1, GF.order, N, dtype=int)
   ...:     ga = a.view(GF)
   ...:     gb = b.view(GF)
   ...:     return a, b, ga, gb
   ...:

In [6]: N = int(10e3)

In [7]: a, b, ga, gb = construct_arrays(GF, N)

In [8]: a
Out[8]: array([331469, 337477, 453485, ..., 186502, 794636, 535201])

In [9]: ga
Out[9]: GF([331469, 337477, 453485, ..., 186502, 794636, 535201], order=2097169)

In [10]: %timeit (a * b) % GF.order
88.3 µs ± 557 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit ga * gb
57.2 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And again, the runtime comparison with numpy improves with large N because the time of viewing and type converting the output is small compared to the computation time. galois achieves better performance than numpy because the multiplication and modulo operations are compiled together into one ufunc rather than two.

In [12]: N = int(10e6)

In [13]: a, b, ga, gb = construct_arrays(GF, N)

In [14]: a
Out[14]: array([2090232, 2071169, 1463892, ..., 1382279, 1067677, 1901668])

In [15]: ga
Out[15]: GF([2090232, 2071169, 1463892, ..., 1382279, 1067677, 1901668], order=2097169)

In [16]: %timeit (a * b) % GF.order
109 ms ± 781 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit ga * gb
50.3 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)