Optimizations

If you are like us, you want to get the fastest possible version of your numerical code in order to run as many samples as possible and solve the largest systems possible. To this end the xerus library already provides a number of possible optimizations. The following list expands on the most relevant ones, roughly in order of effectiveness.

Disabling Runtime Checks

The library contains many runtime checks for out-of-bounds access, other invalid inputs (like illegal contractions), consistency, and even the correct behaviour of internal structures. Depending on the complexity of your code and the time spent inside xerus (and not one of the libraries it uses), you can expect a large performance gain by disabling these checks in the config.mk file during compilation of xerus.

It is not advisable to do this while developing, as it will be much more difficult to detect errors in your calls to xerus functions, but once you have established that your code works as expected, you might want to try replacing the libxerus.so object used by your project with one compiled with the -D XERUS_DISABLE_RUNTIME_CHECKS flag.
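As a rough sketch, the change in config.mk could look like the following; the name of the variable to which the flag is appended (DEBUG) is an assumption about a typical config.mk and may differ in your version.

# in config.mk, before rebuilding libxerus.so (variable name is an assumption)
DEBUG += -D XERUS_DISABLE_RUNTIME_CHECKS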

Use C++ Instead of Python

The interface between Python and C++ makes it necessary to perform operations whose sole purpose is compatibility between the two otherwise incompatible languages. Often this includes copies of vectors of integers (whenever dimensions are specified or queried) and sometimes even deep copies of whole tensors (.from_ndarray() and .to_ndarray()). The only way to get rid of this overhead is to write your application in C++ instead of Python. Most instructions that xerus offers for Python look very similar in C++, so a transition might be simpler than you think. Simply check out the rest of the tutorials to compare the code snippets.
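To illustrate how similar the two interfaces look, here is a minimal C++ sketch of the usual indexed-expression syntax (the dimensions are made up for illustration); since it works directly on xerus::Tensor objects, no .from_ndarray()/.to_ndarray() copies are involved.

#include <xerus.h>
using namespace xerus;

int main() {
	Index i, j, k;                        // indices for Einstein-like notation
	Tensor B = Tensor::random({64, 64});  // created directly, no ndarray conversion
	Tensor C = Tensor::random({64, 64});
	Tensor A;
	A(i, k) = B(i, j) * C(j, k);          // reads just like the Python version
	return 0;
}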

This transition is particularly useful if you wrote your own numerical algorithms in Python. As an example consider the simple ALS implementation in the example section (ALS), where the C++ implementation is faster by about a factor of two. If most of the runtime is spent inside one of xerus’s own algorithms like the ALS, however, the transition is likely not worth much.

Compiling Xerus with High Optimizations

By default the library already compiles with high optimization settings (corresponding essentially to -O3), as there is rarely any reason to use lower settings for numerical code. If you want to spend a significant amount of CPU hours in numerical code using the xerus library though, it might be worthwhile to go even further.

At this point, the most significant runtime gains from compiler settings will come from link-time optimization (for C++ projects using xerus). To make use of it you will need sufficiently recent versions of the g++ compiler and the ar archiver. After compiling the libxerus.so object with the USE_LTO = TRUE flag you can then enable -flto in your own compilation process. The optimizations used then extend across compilation units and might thus use significant system resources during compilation.
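As a sketch, the two places involved are xerus’ config.mk and your own build flags:

# in xerus' config.mk, before rebuilding libxerus.so
USE_LTO = TRUE
# then add -flto to the compile and link flags of your own project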

If link-time optimization is not an option (or not sufficient), it is also possible to replace the high optimization flag in your config.mk file with the DANGEROUS_OPTIMIZATION = TRUE flag. This will enable non-IEEE-conforming optimizations that should typically only change floating point results in the least significant bit but might lead to undefined behaviour in case a NaN or overflow is encountered during runtime. (It is rumored that there is an even higher optimization setting available for xerus for those who know how to find it and want to get even the last 1% of speedup…)
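The corresponding config.mk entry would then simply be:

# in config.mk, instead of the default high optimization setting
DANGEROUS_OPTIMIZATION = TRUE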

Avoiding Indexed Expressions

The comfort of being able to write Einstein-notation-like equations of the form A(i,k) = B(i,j)*C(j,k); in the source code comes at the price of a certain overhead during runtime. It is in the low single-digit percent range for typical applications but can become significant when very small tensors are used and the time for the actual contraction thus becomes negligible.

In such cases it can be useful to replace such equations (especially ones as simple as the above) with explicit calls for contractions and reshuffles. For the above equation that would simply be

// equivalent to A(i,k) = B(i,j^2)*C(j^2,k)
contract(A, B, false, C, false, 2);

i.e. read as: contract two tensors and store the result in A; left-hand side B, not transposed; right-hand side C, not transposed; contract two modes.

If it is necessary to reshuffle a tensor before it can be contracted in such a way, this can be done with the reshuffle function.

// equivalent to: A(i,j,k) = B(i,k,j)
reshuffle(A, B, {0,2,1});

Decompositions similarly have their low(er)-level calls. They require properly reshuffled tensors, and you have to provide a splitPosition, i.e. the number of modes that will be represented by the left-hand side of the result.

// equivalent to: (Q(i,j,r), R(r, k)) = xerus.QR(A(i,j,k))
calculate_qr(Q, R, A, 2);
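
Putting the three calls together, a self-contained sketch could look as follows (dimensions and tensor names are chosen arbitrarily for illustration):

#include <xerus.h>
using namespace xerus;

int main() {
	Tensor B = Tensor::random({8, 4, 4});   // modes (i, j1, j2)
	Tensor C = Tensor::random({4, 4, 8});   // modes (j1, j2, k)
	Tensor A, Bperm, Q, R;

	// A(i,k) = B(i,j^2)*C(j^2,k): contract the last two modes of B with the first two of C
	contract(A, B, false, C, false, 2);

	// Bperm(i,j,k) = B(i,k,j): swap the last two modes of B
	reshuffle(Bperm, B, {0, 2, 1});

	// (Q(i,j,r), R(r,k)) = QR of B, with the first two modes going to Q (splitPosition 2)
	calculate_qr(Q, R, B, 2);

	return 0;
}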

It is our opinion that code written with these functions instead of indexed expressions is often much harder to understand and the speedup is typically small… but just in case you really want to, you now have the option to use them.