Tuesday, April 12, 2016

Error handling in BLAS libraries

It is very common to use a BLAS library to perform linear algebra operations such as dense matrix times dense matrix multiplication, which can be performed using the dgemm function. The advantages of BLAS are:

  • it is a well defined standard.
  • hardware vendors such as Intel supply tuned versions.

Now at MOSEK, my employer, we use the Intel MKL library, which includes a BLAS implementation. It really helps us deliver good floating point performance. Indeed, we use the sequential version of Intel MKL but call it from potentially many threads using Cilk Plus. This works well due to the well designed BLAS interface. However, there is one rotten apple in the basket and that is error handling.

Here I will summarize why the error handling in the BLAS standard is awful from my perspective.

First of all, why can errors occur in the dgemm operation at all, if we assume the dimensions of the matrices are correct and ignore issues with NaNs and the like? Well, in order to obtain good performance the dgemm function may allocate additional memory to store smallish blocks of the matrices that fit into the cache. I.e. the library uses a blocked algorithm to improve performance.
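To see where the allocation comes from, here is a rough sketch of a blocked multiply; it is not the MKL implementation, and the block sizes and the name blocked_dgemm are made up for illustration:

#include <stdlib.h>

/* Rough sketch of a blocked C = A*B (not the MKL implementation).
   The packed copies of cache-sized blocks of A and B are what require
   extra memory, and that allocation can fail. */
enum { MC = 256, KC = 256, NC = 1024 };  /* made-up block sizes */

int blocked_dgemm(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
  double *Apack = malloc(sizeof(double) * MC * KC);
  double *Bpack = malloc(sizeof(double) * KC * NC);

  if (Apack == NULL || Bpack == NULL) {
    /* Out of memory: returning an error code would be the sane
       reaction, but the BLAS interface has no room for one. */
    free(Apack);
    free(Bpack);
    return -1;
  }

  /* ... pack blocks of A and B into Apack/Bpack and run a small
     cache-friendly kernel over them, looping over m, n and k ... */
  (void)m; (void)n; (void)k; (void)A; (void)B; (void)C;

  free(Apack);
  free(Bpack);
  return 0;
}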

Oh wait, that means it can run out of memory, and then what? The BLAS standard error handling is to print a message to stderr, or something along those lines.

Recall that dgemm is embedded deep inside MOSEK, which might be embedded deep inside a third party program. This implies an error message printed to stderr does not make sense to the user. Also, the user would NOT like us to terminate the application with a fatal error. Rather we want to know that an out of memory situation happened and terminate gracefully, or do something to lower the space requirement, e.g. use fewer threads.

What is the solution to this problem? The only solution offered is to replace a function named xerbla that gets called when an error happens. The idea is that this function can set a global flag indicating an error happened. This might be a reasonable solution if the program is single threaded. Now instead assume you use a single threaded dgemm (from, say, Intel MKL) but call it from many threads. Then first of all you have to introduce a lock (a mutex) around the global error flag, leading to performance issues. Next, it is hard to figure out which of all the dgemm calls failed. Hence, you have to fail them all. What a pain.
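For illustration, here is a minimal sketch of that workaround. It assumes the common Fortran-style symbol xerbla_ and that the linker picks up our definition instead of the library's; the exact symbol name and calling convention depend on the BLAS implementation.

#include <pthread.h>

/* Sketch: override the BLAS error handler and record, under a mutex,
   that some BLAS call failed.  The hidden string-length argument is a
   common Fortran calling convention; details vary between compilers. */

static pthread_mutex_t blas_err_lock = PTHREAD_MUTEX_INITIALIZER;
static int             blas_err_flag = 0;   /* 1 if any BLAS call failed */

void xerbla_(const char *srname, const int *info, int srname_len)
{
  (void)srname; (void)info; (void)srname_len;
  pthread_mutex_lock(&blas_err_lock);   /* serializes all failing threads */
  blas_err_flag = 1;                    /* we cannot tell which dgemm call
                                           the error belongs to           */
  pthread_mutex_unlock(&blas_err_lock);
}

int blas_error_occurred(void)           /* if set, you must fail them all */
{
  int flag;
  pthread_mutex_lock(&blas_err_lock);
  flag = blas_err_flag;
  pthread_mutex_unlock(&blas_err_lock);
  return flag;
}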

Why is the error handling so primitive in BLAS libraries? I think the reasons are:

  • BLAS is an old Fortran-based standard.
  • For many years BLAS routines would not allocate storage. Hence, dgemm would never fail unless the dimensions were wrong.
  • BLAS was proposed by academics, who do not care so much about error handling. I mean, if you run out of memory you just buy a bigger supercomputer and rerun your computations.

If BLAS had been invented today, it would most likely have been designed in C, and then all functions would have returned an error code. I know dealing with error codes is a pain too, but that would have made error reporting much easier for those who want to do it properly.



Friday, May 22, 2015

Nonsymmetric semidefinite optimization problems.

In semidefinite optimization we optimize over a matrix variable that must be symmetric and positive semidefinite.

Assume we want to relax the assumption about symmetry, i.e. we only require that y'Xy >= 0 holds for all y. Is that an important generalization? The answer is no, for the following reason. Observe that

   X = (X+X')/2 + (X-X')/2

and that (X-X')' = -(X-X'), i.e. X-X' is skew symmetric. For a skew symmetric matrix

   y'((X-X')/2)y = 0 for all y,

and hence

   y'Xy = y'((X+X')/2)y for all y.

Therefore X satisfies y'Xy >= 0 for all y if and only if the symmetric part (X+X')/2 is positive semidefinite.
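As a concrete illustration, take

\[
X = \left [ \begin{array}{cc} 1 & 2 \\ 0 & 1 \end{array} \right ], \quad
\frac{X+X'}{2} = \left [ \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right ], \quad
y'Xy = y_1^2 + 2 y_1 y_2 + y_2^2 = (y_1+y_2)^2 \geq 0.
\]

The symmetric part has eigenvalues 0 and 2, so it is positive semidefinite, and indeed y'Xy >= 0 for all y even though X itself is not symmetric.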

Hence, any nonsymmetric semidefinite optimization problem can easily be posed as a standard symmetric semidefinite optimization problem.

Wednesday, February 18, 2015

Multi threaded programming

I found the talk "Plain Threads are the GOTO of todays computing" by Hartmut Kaiser very interesting, because I have been working on improving the multithreaded code in MOSEK recently and am also thinking about how MOSEK should deal with all the cores in future CPUs. I agree with Hartmut that something other than plain threads is needed.

Here are some potential replacements:
Previously I have used OpenMP, but I really dislike it. In my opinion it is somewhat ugly and you can feel the Fortran origins. Recently I have played with Cilk, which is very simple.
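To give a feel for how simple, here is a minimal fork-join sketch using the Cilk Plus keywords from <cilk/cilk.h> (assuming a compiler that supports them); it is illustration only, not MOSEK code.

#include <cilk/cilk.h>
#include <stdio.h>

/* Classic fork-join example: the two recursive calls may run in
   parallel; cilk_sync waits for the spawned one to finish. */
static long fib(int n)
{
  long a, b;
  if (n < 2)
    return n;
  a = cilk_spawn fib(n - 1);  /* may execute on another worker */
  b = fib(n - 2);             /* executes in the current worker */
  cilk_sync;                  /* join before using a            */
  return a + b;
}

int main(void)
{
  printf("fib(30) = %ld\n", fib(30));
  return 0;
}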

Checkedthreads seems like a good option if a simple C-only tool can do the job. I plan to try it at MOSEK.

Pfunc seems very powerful and there is a Ph.D. thesis about its design. The project might be semi-dead though.

Wool also seems very interesting. It is plain C and the author claims the spawn overhead is very low. There is also an older comparison with other task libraries.

Btw, I am biased towards tools that do not require C++ because currently MOSEK is plain C code. Adding a dependency on a C++ runtime library adds headaches.


Wednesday, November 6, 2013

Complexity of solving conic quadratic problems

First a clarification: conic quadratic optimization and second order cone optimization are the same thing. I prefer the name conic quadratic optimization though.

Frequently it is asked on the internet what the computational complexity of solving conic quadratic problems is, or the related question: what is the complexity of the algorithms implemented in MOSEK, SeDuMi or SDPT3?

To the best of my knowledge almost all open source and commercial software employs a primal-dual interior-point algorithm using, for instance, the so-called Nesterov-Todd scaling.

A conic quadratic problem can be stated in the form
\[
\begin{array}{lccl}
\mbox{min} & \sum_{j=1}^d (c^j)^T x^j & \\
\mbox{st} & \sum_{j=1}^d A^j x^j & = & b \\
& x^j \in K^j & \\
\end{array}
\]
where \(K^j\) is an \(n^j\) dimensional quadratic cone. Moreover, I will use \(A = [A^1,\ldots, A^d ]\) and \(n=\sum_j n^j\). Note that \(d \leq n\). First observe that the problem cannot be solved exactly on a computer using floating point numbers, since the solution might be irrational. This is in contrast to linear problems, which always have a rational solution if the data is rational.

Using, for instance, a primal-dual interior-point algorithm the problem can be solved to \(\varepsilon\) accuracy in \(O(\sqrt{d} \ln(\varepsilon^{-1}))\) interior-point iterations, where \(\varepsilon\) is the accepted duality gap. The most famous variant having that iteration complexity is based on Nesterov and Todd's beautiful work on symmetric cones.

Each iteration requires the solution of a linear system with the coefficient matrix
\[ \label{neweq} \left [ \begin{array}{cc} H & A^T \\ A & 0 \\ \end{array} \right ] \mbox{ (*)} \]
This is the most expensive operation, and it can be done in \(O(n^3)\) operations using Gaussian elimination, so we end up with the overall complexity \(O(n^{3.5}\ln(\varepsilon^{-1}))\).
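Spelling out the arithmetic, using \(d \leq n\):
\[
\underbrace{O(\sqrt{d} \ln(\varepsilon^{-1}))}_{\mbox{iterations}} \cdot \underbrace{O(n^3)}_{\mbox{cost per iteration}} \leq O(\sqrt{n} \ln(\varepsilon^{-1})) \cdot O(n^3) = O(n^{3.5} \ln(\varepsilon^{-1})).
\]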
That is the theoretical result. In practice the algorithms usually work much better, because they normally finish in something like 10 to 100 iterations and rarely employ more than 200 iterations. In fact, if the algorithm requires more than 200 iterations then typically numerical issues prevent the software from solving the problem.

Finally, the typical conic quadratic problem is sparse, which implies the linear system mentioned above can be solved much faster when the sparsity is exploited. Figuring out how to solve the linear equation system (*) with the lowest complexity when exploiting sparsity is NP hard, and therefore optimization software only employs various heuristics, such as the minimum degree ordering, that help cut the cost per iteration. If you want to know more then read my Mathematical Programming publication mentioned below. One important fact is that it is impossible to predict the per-iteration cost without knowing the problem structure and then doing a complicated analysis of it. I.e. the cost per iteration is not a simple function of the number of constraints and variables unless A is completely dense.

To summarize, in practice primal-dual interior-point algorithms solve a conic quadratic problem at less than 200 times the cost of solving the linear equation system (*).

So can the best proven polynomial complexity bound be proven for software like MOSEK? In general the answer is no, because the software employs a bunch of tricks that speed up the practical performance but unfortunately destroy the theoretical complexity proof. In fact, it is commonly accepted that if the algorithm is implemented strictly as theory suggests then it will be hopelessly slow.

I have spent a lot of time on implementing interior-point methods, as documented by the Mathematical Programming publication, and my view is that the practical implementations are very close to theory.

Why I hate OpenMP

At MOSEK we use OpenMP to parallelize our code. Well, in fact we used to use OpenMP exclusively. However, we have now moved to using pthreads or native Windows threads.

This may seem like a very bad idea, because the normal recommendation is to use OpenMP, Cilk, Intel Threading Building Blocks or the like. But I will now explain why I ignore this recommendation and in fact think it is a bit naive.

At MOSEK we build a shared library (a DLL on Windows) that we sell to our customers. This library is called from the user's program or from programs like MATLAB, R, etc. We have absolutely no control over the calling application. The calling application may be linked with OpenMP or another threading library. Our users expect our library to work almost independently of what they do.

However, OpenMP is built on the implicit assumption that you are an academic who has all the source code when building the application, and that if an application is linked with OpenMP then the same version of OpenMP is used throughout the application. This definitely does not hold for us.

We have to ask ourselves what happens if the user builds their application with gcc's OpenMP and links it with a MOSEK library we built with Intel's OpenMP. I think the answer is that the final application crashes.

I recently reported a crash in the Intel MKL BLAS library when called from MATLAB that seems to be caused by an OpenMP issue. See the report. The answer from Intel was: are you sure you use the same OpenMP as MATLAB? This illustrates very well what I hate about OpenMP. It does not work unconditionally. I mean, if you link with pthreads then it works, because pthreads is integrated into the operating system.

So my conclusion is: if you are a commercial library vendor then do not use OpenMP, unless you want to get into trouble.

Wednesday, June 26, 2013

Slow convergence in conic quadratic optimization

The recent paper by Gould, Orban, and Robinson in Mathematical Programming Computation studies the QP
\[
\begin{array}{lc}
\mbox{min}  & x_1^2    \\
\mbox{st}     &  x_1 \geq 0
\end{array}
\]
because the problem does not have a strictly complementary solution. This fact leads interior-point methods to converge slowly.

An interesting observation is that it can be formulated as the following conic quadratic optimization problem:
\[
\begin{array}{lc}
\mbox{min} &  x_2        \\
\mbox{st}    & x_2 \geq ||x_1|| \\
                    &  x_1  \geq 0 
\end{array}
\]
An optimal primal solution is \(x_1 = 0\) and \(x_2=0\). Observe that Vanderbei studies the same problem in his working paper (Section 2.2) on solving SOCPs with LOQO.

If I solve the problem with MOSEK I obtain

ITE PFEAS    DFEAS    GFEAS    PRSTATUS   POBJ              DOBJ              MU       TIME  
0   1.0e+000 1.0e+000 2.0e+000 0.00e+000  1.000000000e+000  0.000000000e+000  1.0e+000 0.00  
1   1.6e-001 1.6e-001 3.2e-001 1.00e+000  2.404340377e-001  0.000000000e+000  1.6e-001 0.00  
2   8.2e-003 8.2e-003 1.6e-002 1.12e+000  1.181687831e-002  0.000000000e+000  8.2e-003 0.00  
3   4.1e-004 4.1e-004 8.2e-004 1.01e+000  5.891864452e-004  0.000000000e+000  4.1e-004 0.00  
4   2.1e-005 2.1e-005 4.1e-005 1.00e+000  2.945495827e-005  0.000000000e+000  2.1e-005 0.00  
5   2.6e-010 2.6e-010 5.2e-010 1.00e+000  4.144019664e-010  0.000000000e+000  2.6e-010 0.00  

As can be seen from the mu column (the average complementarity gap) MOSEK converges quickly. In fact there is quadratic convergence in the last iteration (mu drops from 2.1e-005 to 2.6e-010, roughly its square). Now the dual problem is
\[
\begin{array}{lc}
\mbox{max} & 0        \\
\mbox{st}    & s_3  = 1 \\
                     & s_2 \geq ||s_1|| \\
                    &  s_1  \geq 0 
\end{array}
\]
and an optimal dual solution is \(s_1=0.0\), \(s_2=1.0\) and \(s_3=1\). In this case the dual solution is in the interior and hence the solutions are strictly complementary. This is also indicated by the interior-point optimizer in MOSEK, which converges quickly.

An equivalent conic quadratic problem using a rotated quadratic cone can be stated as follows
\[
\begin{array}{lc}
\mbox{min} &  x_2        \\
\mbox{st}    & 2 x_3 x_2 \geq ||x_1||^2 \\
                    & x_3  =1 \\
                    &  x_1  \geq 0 
\end{array}
\]
Now solving that problem with MOSEK produces
ITE PFEAS    DFEAS    GFEAS    PRSTATUS   POBJ              DOBJ              MU       TIME  
0   1.0e+000 1.0e+000 1.7e+000 0.00e+000  7.071067812e-001  0.000000000e+000  1.0e+000 0.00  
1   1.4e-001 1.4e-001 2.4e-001 6.09e-001  1.517397994e-001  -4.749764644e-002 1.4e-001 0.00  
2   2.2e-002 2.2e-002 3.7e-002 1.21e+000  2.120093451e-002  -6.971441263e-003 2.2e-002 0.00  
3   5.3e-003 5.3e-003 9.0e-003 1.14e+000  4.674957163e-003  -1.445168902e-003 5.3e-003 0.00  
4   1.5e-003 1.5e-003 2.6e-003 1.07e+000  1.234899144e-003  -3.403939914e-004 1.5e-003 0.00  
5   4.4e-004 4.4e-004 7.4e-004 1.04e+000  3.331026029e-004  -7.962820209e-005 4.4e-004 0.00  
6   1.2e-004 1.2e-004 2.1e-004 1.02e+000  8.925963682e-005  -1.860468131e-005 1.2e-004 0.00  
7   3.4e-005 3.4e-005 5.8e-005 1.01e+000  2.373850359e-005  -4.389268847e-006 3.4e-005 0.00  
8   9.2e-006 9.2e-006 1.6e-005 1.01e+000  6.273175074e-006  -1.049389687e-006 9.2e-006 0.00  
9   2.4e-006 2.4e-006 4.2e-006 1.00e+000  1.649538759e-006  -2.545236451e-007 2.4e-006 0.00  
10  6.5e-007 6.5e-007 1.1e-006 1.00e+000  4.321347177e-007  -6.259486109e-008 6.5e-007 0.00  
11  1.7e-007 1.7e-007 2.9e-007 1.00e+000  1.128991865e-007  -1.558439042e-008 1.7e-007 0.00  
12  4.5e-008 4.5e-008 7.7e-008 1.00e+000  2.943872208e-008  -3.919657806e-009 4.5e-008 0.00  
13  1.2e-008 1.2e-008 2.0e-008 1.00e+000  7.665414689e-009  -9.952574739e-010 1.2e-008 0.00  

It is seen that the interior-point optimizer now converges at a much slower rate and takes many more iterations. It is easy to verify that there is no strictly complementary solution in this case.

This gives rise to some comments. First, there might be several ways of representing a set, and they are not all equally effective. Secondly, maybe we should think about improving the code for problems without a strictly complementary solution, as Gould et al. suggest.




Friday, May 25, 2012

An example of a badly scaled model with an easy fix

A scaled down version of a problem we just received at MOSEK is

min s^2
st.   Ax - b = s   (P)
       x >= 0

Assume for simplicity that A is a matrix with 1 row. Now in this case the elements in A are very large, i.e. of the order of magnitude 10^7. This implies the problem (P) is badly scaled, because the row contains a lot of elements of size 10^7 and one (the coefficient of s) of size 1.

Now an equally good model is

min s^2
st.   Ax - b = (||A||_inf) s   (P')
       x >= 0

which is nicely scaled. An optimal x for (P') is clearly also optimal for (P); only s has to be rescaled by ||A||_inf.
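As a rough sketch of the fix (the helper row_inf_norm is hypothetical, not MOSEK code): compute the infinity norm of the penalty row and use it as the coefficient of s, as in (P'), so all entries in the row end up on a comparable scale.

#include <math.h>
#include <stddef.h>

/* Hypothetical helper: return ||a||_inf for a single constraint row.
   Using the returned value as the coefficient of the penalty variable s,
   as in (P'), keeps the entries of the row on a comparable scale. */
static double row_inf_norm(const double *a, size_t n)
{
  double norm = 0.0;
  for (size_t j = 0; j < n; ++j)
    norm = fmax(norm, fabs(a[j]));
  return norm;
}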

Note this line of thinking could be relevant anywhere where penalty variables are employed. Many optimization problems contain penalty variables.