M4RI-20110601 Release Notes

M4RI-20110601 was released on 30 May 2011. It is available at:

http://m4ri.sagemath.org/downloads/

About M4RI

M4RI is a library for fast arithmetic with dense matrices over F2. The name M4RI comes from the first implemented algorithm: the "Method of the Four Russians" inversion algorithm published by Gregory Bard. This algorithm is in turn named after the "Method of the Four Russians" multiplication algorithm, which is probably better referred to as Kronrod's method. M4RI implements asymptotically fast matrix multiplication, linear system solving, reduced row echelon forms, PLE decomposition and basic arithmetic. M4RI is used by the Sage mathematics software and the PolyBoRi library. M4RI is available under the GNU General Public License version 2 or later (GPLv2+).
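As a rough illustration, the sketch below shows how the library might be used for multiplication and echelon form computation. It is only a sketch: the function names used here (mzd_init, mzd_randomize, mzd_mul, mzd_echelonize, mzd_free) should be checked against the headers of the installed version.

#include <stdio.h>
#include <m4ri/m4ri.h>

int main(void) {
  /* Two random 1000 x 1000 matrices over GF(2). */
  mzd_t *A = mzd_init(1000, 1000);
  mzd_t *B = mzd_init(1000, 1000);
  mzd_randomize(A);
  mzd_randomize(B);

  /* Asymptotically fast product C = A*B; 0 selects the default cutoff. */
  mzd_t *C = mzd_mul(NULL, A, B, 0);

  /* Reduced row echelon form; the return value is the rank. */
  int rank = (int)mzd_echelonize(C, 1);
  printf("rank: %d\n", rank);

  mzd_free(A);
  mzd_free(B);
  mzd_free(C);
  return 0;
}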

Changes in 20110601

Note: This version breaks compatibility with previous versions. In most cases simply recompiling against this new version should be sufficient. However, if third party code reaches into the internals of the library, it might require adaptation.

Faster Column Swaps

by Carlo Wood

Column swaps are now considerably faster.

Old Code

./bench_packedmatrix mzd_col_swap 64 64 1 17
Total running time:  3.646 seconds.
Sample size: 136; mean: 456.2; standard deviation: 20.3
99% confidence interval: +/- 4.5 (1.0%): [451.7..460.8]
function: mzd_col_swap, count: 156250, m: 64, n: 64, cola: 1, colb: 17, 
wall time: 0.171514 us, cpu cycles: 456, cc/m: 7.128807

./bench_packedmatrix mzd_col_swap 10000 10000 1 7182
Total running time: 28.364 seconds.
Sample size: 198; mean: 264278; standard deviation: 14315
99% confidence interval: +/- 2637 (1.0%): [261641..266916]
function: mzd_col_swap, count: 1000, m: 10000, n: 10000, cola: 1, colb: 7182, 
wall time: 0.099348 ms, cpu cycles: 264278, cc/m: 26.427831

20110601

./bench_packedmatrix mzd_col_swap 64 64 1 17
Total running time:  6.916 seconds.
Sample size: 447; mean: 263.2; standard deviation: 21.6
99% confidence interval: +/- 2.6 (1.0%): [260.6..265.9]
function: mzd_col_swap, count: 156250, m: 64, n: 64, cola: 1, colb: 17, 
wall time: 0.098963 us, cpu cycles: 263, cc/m: 4.113206

./bench_packedmatrix mzd_col_swap 10000 10000 1 7182
Total running time: 16.878 seconds.
Sample size: 141; mean: 197454; standard deviation: 8929
99% confidence interval: +/- 1961 (1.0%): [195493..199416]
function: mzd_col_swap, count: 1000, m: 10000, n: 10000, cola: 1, colb: 7182, 
wall time: 0.074228 ms, cpu cycles: 197454, cc/m: 19.745422
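For reference, here is a minimal sketch of the call being benchmarked above, assuming the usual mzd_init/mzd_randomize/mzd_col_swap/mzd_free interface:

#include <m4ri/m4ri.h>

int main(void) {
  /* Swap columns 1 and 7182 of a 10000 x 10000 matrix, as in the benchmark. */
  mzd_t *M = mzd_init(10000, 10000);
  mzd_randomize(M);
  mzd_col_swap(M, 1, 7182);
  mzd_free(M);
  return 0;
}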

Faster Transpose

by Carlo Wood

Transposing a matrix is now much faster than in previous versions of M4RI. Considerable effort went into this; the code provides a good case study of how much care must be taken to achieve good performance.

Old Code

function: mzd_transpose, count: 9765, m: 32, 
wall time: 1.583922 us, cpu cycles: 4213, cc/m^2: 4.114724

function: mzd_transpose, count: 9182, m: 33, 
wall time: 1.654324 us, cpu cycles: 4401, cc/m^2: 4.041117

function: mzd_transpose, count: 2441, m: 64, 
wall time: 5.282671 us, cpu cycles: 14052, cc/m^2: 3.430777

function: mzd_transpose, count: 2366, m: 65, 
wall time: 5.813187 us, cpu cycles: 15464, cc/m^2: 3.660125

function: mzd_transpose, count: 610, m: 128, 
wall time: 2.327869 us, cpu cycles: 6193, cc/m^2: 0.378011

function: mzd_transpose, count: 271, m: 192, 
wall time: 0.030325 ms, cpu cycles: 80661, cc/m^2: 2.188073

function: mzd_transpose, count: 252, m: 199, 
wall time: 0.059583 ms, cpu cycles: 158497, cc/m^2: 4.002360

function: mzd_transpose, count: 152, m: 256, 
wall time: 0.011204 ms, cpu cycles: 29799, cc/m^2: 0.454693

20110601

function: mzd_transpose, count: 9765, m: 32, n: 32, 
wall time: 0.163236 us, cpu cycles: 434, cc/mn: 0.423901

function: mzd_transpose, count: 9182, m: 33, n: 33, 
wall time: 0.538445 us, cpu cycles: 1432, cc/mn: 1.315113

function: mzd_transpose, count: 2441, m: 64, n: 64, 
wall time: 0.490782 us, cpu cycles: 1305, cc/mn: 0.318607

function: mzd_transpose, count: 2366, m: 65, n: 65, 
wall time: 0.593829 us, cpu cycles: 1579, cc/mn: 0.373811

function: mzd_transpose, count: 610, m: 128, n: 128, 
wall time: 1.621311 us, cpu cycles: 4314, cc/mn: 0.263277

function: mzd_transpose, count: 271, m: 192, n: 192, 
wall time: 3.712177 us, cpu cycles: 9871, cc/mn: 0.267780

function: mzd_transpose, count: 252, m: 199, n: 199, 
wall time: 4.345238 us, cpu cycles: 11557, cc/mn: 0.291831

function: mzd_transpose, count: 152, m: 256, n: 256, 
wall time: 6.440789 us, cpu cycles: 17124, cc/mn: 0.261287
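A minimal sketch of calling the routine benchmarked above; passing NULL as the destination is assumed to let mzd_transpose allocate the result, but verify this against the installed headers.

#include <m4ri/m4ri.h>

int main(void) {
  mzd_t *A = mzd_init(199, 199);      /* a non-multiple-of-64 case from the table */
  mzd_randomize(A);
  mzd_t *T = mzd_transpose(NULL, A);  /* NULL: allocate the destination */
  mzd_free(A);
  mzd_free(T);
  return 0;
}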

Faster Addition

by Martin Albrecht

The previous addition code was poorly written, which made addition very slow for small matrices. This version fixes that.
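A minimal sketch of matrix addition over GF(2) (which amounts to a word-wise XOR), assuming the mzd_add interface:

#include <m4ri/m4ri.h>

int main(void) {
  /* Small matrices -- the case the rewritten addition code speeds up. */
  mzd_t *A = mzd_init(64, 64);
  mzd_t *B = mzd_init(64, 64);
  mzd_randomize(A);
  mzd_randomize(B);
  mzd_t *C = mzd_add(NULL, A, B);  /* NULL: allocate the result */
  mzd_free(A);
  mzd_free(B);
  mzd_free(C);
  return 0;
}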

Improved Benchmarking Tools

by Carlo Wood

All benchmarking programs now provide much more accurate data, using appropriate statistical modelling to give a confidence interval.

Usage

Example usage:

./bench_elimination -s 0 -m 4 -c 90 -a 0.005 -d -t 30 -n 1000 1000 1000

This runs for at most 30 seconds (-t) or 1000 iterations (-n), whichever comes first, or stops as soon as the true average of the wall time (-s 0) falls with 90% certainty (-c) within a range of +/- 0.005 times the observed mean (-a: accuracy), but not before at least 4 (-m: minimum) measurements have been made. It also prints (-d: dump) each measurement (0: microseconds, 1: CPU clocks).

Example

./bench_elimination -t 180 -c 95 10000 10000 m4ri 5000 # max running time: 180 secs, confidence: 95%
.......
Total running time: 18.483 seconds.
Sample size: 7; mean: 1916541669; standard deviation: 19560954
95% confidence interval: +/- 18091517 (0.9%): [1898450152..1934633186]
m: 10000, n: 10000, last r:  5000, cpu cycles: 1916541669, wall time: 0.720465
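The stopping rule described above can be sketched as follows. This is only an illustration of the idea (a running mean and standard deviation plus a confidence-interval check), not the actual bench_* implementation; the names stats_t, stats_add and stats_converged are made up for this example.

#include <math.h>
#include <stddef.h>

/* Initialize with stats_t s = {0, 0, 0}; then call stats_add() per sample. */
typedef struct {
  size_t n;      /* number of samples so far */
  double mean;   /* running mean (Welford's algorithm) */
  double m2;     /* running sum of squared deviations */
} stats_t;

static void stats_add(stats_t *s, double x) {
  s->n++;
  double delta = x - s->mean;
  s->mean += delta / (double)s->n;
  s->m2 += delta * (x - s->mean);
}

/* Stop once the confidence interval half-width is within accuracy * mean,
 * but never before min_samples measurements; z is the normal quantile for
 * the desired confidence level (e.g. about 1.645 for 90%). */
static int stats_converged(const stats_t *s, size_t min_samples,
                           double z, double accuracy) {
  if (s->n < 2 || s->n < min_samples)
    return 0;
  double stddev = sqrt(s->m2 / (double)(s->n - 1));
  double halfwidth = z * stddev / sqrt((double)s->n);
  return halfwidth <= accuracy * fabs(s->mean);
}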

Furthermore, it is now possible to gather benchmarking information such as cache misses. For this feature the PAPI library is required.
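As an illustration of how cache misses can be counted with PAPI, independently of M4RI's own benchmark harness, here is a sketch using PAPI's classic high-level counter interface; consult the PAPI documentation installed on your system for the exact API available there.

#include <stdio.h>
#include <papi.h>

int main(void) {
  int events[1] = { PAPI_L1_DCM };  /* level-1 data cache misses */
  long long misses[1];

  if (PAPI_start_counters(events, 1) != PAPI_OK)
    return 1;

  /* ... run the code to be measured here ... */

  if (PAPI_stop_counters(misses, 1) != PAPI_OK)
    return 1;

  printf("L1 data cache misses: %lld\n", misses[0]);
  return 0;
}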

Swapped Internal Bit-representation

by Carlo Wood

Bit zero is now at index zero instead of at index RADIX-1.

This breaks compatibility with previous versions.
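Schematically, and independent of M4RI's actual internals, the change in bit addressing within a machine word looks like this (RADIX denotes the word size in bits):

#include <stdint.h>

#define RADIX 64

/* Old convention: column j of a word lived at bit index RADIX-1-j. */
static inline uint64_t get_bit_old(uint64_t w, int j) {
  return (w >> (RADIX - 1 - j)) & UINT64_C(1);
}

/* New convention (20110601): column j lives at bit index j. */
static inline uint64_t get_bit_new(uint64_t w, int j) {
  return (w >> j) & UINT64_C(1);
}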

New Matrix Layout

by Carlo Wood

Our matrix layout was improved to allow looping over matrix rows without address lookups for each row.

This breaks compatibility with previous versions.
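The following sketch illustrates the general idea with a made-up struct (it is not M4RI's mzd_t): when all rows are stored contiguously at a fixed stride, iterating over them needs only pointer arithmetic instead of a per-row address lookup.

#include <stdint.h>
#include <stddef.h>

typedef struct {
  size_t nrows;
  size_t rowstride;   /* words per row, including padding */
  uint64_t *data;     /* nrows * rowstride words, contiguous */
} flat_matrix_t;

/* Walk every row by bumping one pointer -- no row address table needed. */
static void clear_all_rows(flat_matrix_t *M) {
  uint64_t *row = M->data;
  for (size_t i = 0; i < M->nrows; i++, row += M->rowstride)
    for (size_t j = 0; j < M->rowstride; j++)
      row[j] = 0;
}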

TRSM Upper Left using Greasing

by Martin Albrecht

Improved the TRSM upper left code by explicitly using greasing, which yields a considerable speed-up; cf. http://martinralbrecht.wordpress.com/2010/11/19/trsm-with-greasing-trsm-reduced-to-matrix-multiplication/.
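As a generic illustration of greasing over GF(2) (not M4RI's actual TRSM code): precompute all 2^k XOR combinations of k rows once, so that adding any subset of those rows to another row later costs a single table lookup instead of up to k XORs.

#include <stdint.h>
#include <stddef.h>

enum { K = 8, TABLE_SIZE = 1 << K };

/* Fill table[s] with the XOR of the rows selected by the bits of s. */
static void build_grease_table(const uint64_t rows[K], uint64_t table[TABLE_SIZE]) {
  table[0] = 0;
  for (int i = 0; i < K; i++)
    for (int j = 0; j < (1 << i); j++)
      table[(1 << i) + j] = rows[i] ^ table[j];
}

/* Add to dst[r] the combination of precomputed rows selected by sel[r]. */
static void apply_grease(uint64_t *dst, const uint8_t *sel, size_t n,
                         const uint64_t table[TABLE_SIZE]) {
  for (size_t r = 0; r < n; r++)
    dst[r] ^= table[sel[r]];
}

int main(void) {
  uint64_t rows[K] = {1, 2, 4, 8, 16, 32, 64, 128};
  uint64_t table[TABLE_SIZE];
  uint64_t dst[2] = {0, 0};
  uint8_t sel[2] = {0x05, 0xff};  /* select rows {0,2}, then all eight rows */

  build_grease_table(rows, table);
  apply_grease(dst, sel, 2, table);
  return (dst[0] == (1u ^ 4u) && dst[1] == 0xff) ? 0 : 1;
}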

New Configure Options --enable-debug-mzd and --enable-debug-dump

by Carlo Wood

  • debug-mzd runs a series of consistency checks on matrices at the end of function calls.
  • debug-dump prints a hash of all matrices at the end of each function call. This makes it easier to spot regressions (by comparing hashes).

Improved Configure Scripts

by Carlo Wood

Our configure scripts were overhauled, which should make them more standards-conforming.

Supported Platforms

make check passes on the following platforms:

  • x86_64 Linux (sage.math, redhawk);
  • x86 OSX (Xeon, bsd);
  • x86 Linux (VirtualBox);
  • UltraSPARC T2 Solaris using GCC (t2.math.washington.edu);
  • ia64 Linux (iras).

The code also builds in Visual Studio 10 under Windows.
