Tuesday, April 12, 2016

Error handling in BLAS libraries

It is very common to use a BLAS library to perform linear algebra operations such as dense matrix times dense matrix multiplication, which can be performed using the dgemm function. The advantages of BLAS are:

  • it is a well defined standard.
  • hardware vendors such as Intel supply tuned versions.

Now at MOSEK, my employer, we use the Intel MKL library, which includes a BLAS implementation. It really helps us deliver good floating point performance. Indeed, we use the sequential version of Intel MKL but call it from potentially many threads using Cilk Plus. This works well due to the well designed BLAS interface. However, there is one rotten apple in the basket and that is error handling.

Here I will summarize why the error handling in the BLAS standard is awful from my perspective.

First of all, why can errors occur in the dgemm operation at all, if we assume the dimensions of the matrices are correct and ignore issues with NaNs and the like? Well, in order to obtain good performance the dgemm function may allocate additional memory to store smallish blocks of the matrices that fit into the cache. I.e. the library uses a blocked algorithm to improve performance.
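To see where the allocation comes from, here is a rough sketch of a blocked multiply; it is not the MKL implementation, and the block sizes and the name blocked_dgemm are made up for illustration:

#include <stdlib.h>

/* Rough sketch of a blocked C = A*B (not the MKL implementation).
   The packed copies of cache-sized blocks of A and B are what require
   extra memory, and that allocation can fail. */
enum { MC = 256, KC = 256, NC = 1024 };  /* made-up block sizes */

int blocked_dgemm(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
  double *Apack = malloc(sizeof(double) * MC * KC);
  double *Bpack = malloc(sizeof(double) * KC * NC);

  if (Apack == NULL || Bpack == NULL) {
    /* Out of memory: returning an error code would be the sane
       reaction, but the BLAS interface has no room for one. */
    free(Apack);
    free(Bpack);
    return -1;
  }

  /* ... pack blocks of A and B into Apack/Bpack and run a small
     cache-friendly kernel over them, looping over m, n and k ... */
  (void)m; (void)n; (void)k; (void)A; (void)B; (void)C;

  free(Apack);
  free(Bpack);
  return 0;
}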

Oh wait, that means it can run out of memory, and then what? The BLAS standard error handling is to print a message to stderr, or something along those lines.

Recall that dgemm is embedded deep inside MOSEK, which might be embedded deep inside a third party program. This implies an error message printed to stderr does not make sense to the user. Also, the user would NOT like us to terminate the application with a fatal error. Rather we want to know that an out of memory situation happened and terminate gracefully, or do something to lower the space requirement, e.g. use fewer threads.

What is the solution to this problem? The only solution offered is to replace a function named xerbla that gets called when an error happens. The idea is that this function can set a global flag indicating an error happened. This might be a reasonable solution if the program is single threaded. Now instead assume you use a single threaded dgemm (from, say, Intel MKL) but call it from many threads. Then first of all you have to introduce a lock (a mutex) around the global error flag, leading to performance issues. Next, it is hard to figure out which of all the dgemm calls failed. Hence, you have to fail them all. What a pain.
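For illustration, here is a minimal sketch of that workaround. It assumes the common Fortran-style symbol xerbla_ and that the linker picks up our definition instead of the library's; the exact symbol name and calling convention depend on the BLAS implementation.

#include <pthread.h>

/* Sketch: override the BLAS error handler and record, under a mutex,
   that some BLAS call failed.  The hidden string-length argument is a
   common Fortran calling convention; details vary between compilers. */

static pthread_mutex_t blas_err_lock = PTHREAD_MUTEX_INITIALIZER;
static int             blas_err_flag = 0;   /* 1 if any BLAS call failed */

void xerbla_(const char *srname, const int *info, int srname_len)
{
  (void)srname; (void)info; (void)srname_len;
  pthread_mutex_lock(&blas_err_lock);   /* serializes all failing threads */
  blas_err_flag = 1;                    /* we cannot tell which dgemm call
                                           the error belongs to           */
  pthread_mutex_unlock(&blas_err_lock);
}

int blas_error_occurred(void)           /* if set, you must fail them all */
{
  int flag;
  pthread_mutex_lock(&blas_err_lock);
  flag = blas_err_flag;
  pthread_mutex_unlock(&blas_err_lock);
  return flag;
}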

Why is the error handling so primitive in BLAS libraries? I think the reasons are:

  • BLAS is an old Fortran-based standard.
  • For many years BLAS routines would not allocate storage. Hence, dgemm would never fail unless the dimensions were wrong.
  • BLAS was proposed by academics, who do not care so much about error handling. I mean, if you run out of memory you just buy a bigger supercomputer and rerun your computations.

If BLAS had been invented today, it would most likely have been designed in C, and then all functions would have returned an error code. I know dealing with error codes is a pain too, but that would have made error reporting much easier for those who want to do it properly.



Friday, May 22, 2015

Nonsymmetric semidefinite optimization problems.

In semidefinite optimization we optimize over a matrix variable that must be symmetric and positive semidefinite.

Assume we want to relax the assumption about symmetry, i.e. we only require that y'Xy >= 0 holds for all y. Is that an important generalization? The answer is no, for the following reason. Observe that

   X = (X+X')/2 + (X-X')/2

and that (X-X')' = -(X-X'), i.e. X-X' is skew symmetric. For a skew symmetric matrix

   y'((X-X')/2)y = 0 for all y,

and hence

   y'Xy = y'((X+X')/2)y for all y.

Therefore X satisfies y'Xy >= 0 for all y if and only if the symmetric part (X+X')/2 is positive semidefinite.
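As a concrete illustration, take

\[
X = \left [ \begin{array}{cc} 1 & 2 \\ 0 & 1 \end{array} \right ], \quad
\frac{X+X'}{2} = \left [ \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right ], \quad
y'Xy = y_1^2 + 2 y_1 y_2 + y_2^2 = (y_1+y_2)^2 \geq 0.
\]

The symmetric part has eigenvalues 0 and 2, so it is positive semidefinite, and indeed y'Xy >= 0 for all y even though X itself is not symmetric.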

Hence, any nonsymmetric semidefinite optimization problem can easily be posed as a standard symmetric semidefinite optimization problem.

Wednesday, February 18, 2015

Multi threaded programming

I found the talk "Plain Threads are the GOTO of todays computing" by Hartmut Kaiser very interesting, because I have been working on improving the multithreaded code in MOSEK recently and am also thinking about how MOSEK should deal with all the cores in future CPUs. I agree with Hartmut that something other than plain threads is needed.

Here are some potential replacements:
Previously I have used OpenMP, but I really dislike it. In my opinion it is somewhat ugly and you can feel the Fortran origins. Recently I have played with Cilk, which is very simple.
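To give a feel for how simple, here is a minimal fork-join sketch using the Cilk Plus keywords from <cilk/cilk.h> (assuming a compiler that supports them); it is illustration only, not MOSEK code.

#include <cilk/cilk.h>
#include <stdio.h>

/* Classic fork-join example: the two recursive calls may run in
   parallel; cilk_sync waits for the spawned one to finish. */
static long fib(int n)
{
  long a, b;
  if (n < 2)
    return n;
  a = cilk_spawn fib(n - 1);  /* may execute on another worker */
  b = fib(n - 2);             /* executes in the current worker */
  cilk_sync;                  /* join before using a            */
  return a + b;
}

int main(void)
{
  printf("fib(30) = %ld\n", fib(30));
  return 0;
}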

Checkedthreads seems like a good option if a simple C-only tool can do the job. I plan to try it at MOSEK.

Pfunc seems very powerful and there is a Ph.D. thesis about its design. The project might be semi-dead though.

Wool also seems very interesting. It is plain C and the author claims the spawn overhead is very low. There is also an older comparison with other task libraries.

Btw, I am biased towards tools that do not require C++ because currently MOSEK is plain C code. Adding a dependency on a C++ runtime library adds headaches.


Wednesday, November 6, 2013

Complexity of solving conic quadratic problems

First a clarification: conic quadratic optimization and second order cone optimization are the same thing. I prefer the name conic quadratic optimization though.

Frequently it is asked on the internet what the computational complexity of solving conic quadratic problems is, or the related question: what is the complexity of the algorithms implemented in MOSEK, SeDuMi or SDPT3?

To the best of my knowledge almost all open source and commercial software employs a primal-dual interior-point algorithm using, for instance, the so-called Nesterov-Todd scaling.

A conic quadratic problem can be stated in the form
\[
\begin{array}{lccl}
\mbox{min} & \sum_{j=1}^d (c^j)^T x^j & \\
\mbox{st} & \sum_{j=1}^d A^j x^j & = & b \\
& x^j \in K^j & \\
\end{array}
\]
where \(K^j\) is an \(n^j\) dimensional quadratic cone. Moreover, I will use \(A = [A^1,\ldots, A^d ]\) and \(n=\sum_j n^j\). Note that \(d \leq n\). First observe that the problem cannot be solved exactly on a computer using floating point numbers, since the solution might be irrational. This is in contrast to linear problems, which always have a rational solution if the data is rational.

Using, for instance, a primal-dual interior-point algorithm the problem can be solved to \(\varepsilon\) accuracy in \(O(\sqrt{d} \ln(\varepsilon^{-1}))\) interior-point iterations, where \(\varepsilon\) is the accepted duality gap. The most famous variant having that iteration complexity is based on Nesterov and Todd's beautiful work on symmetric cones.

Each iteration requires the solution of a linear system with the coefficient matrix
\[ \label{neweq} \left [ \begin{array}{cc} H & A^T \\ A & 0 \\ \end{array} \right ] \mbox{ (*)} \]
This is the most expensive operation, and it can be done in \(O(n^3)\) operations using Gaussian elimination, so we end up with the overall complexity \(O(n^{3.5}\ln(\varepsilon^{-1}))\).
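Spelling out the arithmetic, using \(d \leq n\):
\[
\underbrace{O(\sqrt{d} \ln(\varepsilon^{-1}))}_{\mbox{iterations}} \cdot \underbrace{O(n^3)}_{\mbox{cost per iteration}} \leq O(\sqrt{n} \ln(\varepsilon^{-1})) \cdot O(n^3) = O(n^{3.5} \ln(\varepsilon^{-1})).
\]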
That is the theoretical result. In practice the algorithms usually work much better, because they normally finish in something like 10 to 100 iterations and rarely employ more than 200 iterations. In fact, if the algorithm requires more than 200 iterations then typically numerical issues prevent the software from solving the problem.

Finally, the typical conic quadratic problem is sparse, which implies the linear system mentioned above can be solved much faster when the sparsity is exploited. Figuring out how to solve the linear equation system (*) with the lowest complexity when exploiting sparsity is NP hard, and therefore optimization software only employs various heuristics, such as the minimum degree ordering, that help cut the cost per iteration. If you want to know more then read my Mathematical Programming publication mentioned below. One important fact is that it is impossible to predict the per-iteration cost without knowing the problem structure and then doing a complicated analysis of it. I.e. the cost per iteration is not a simple function of the number of constraints and variables unless A is completely dense.

To summarize, in practice primal-dual interior-point algorithms solve a conic quadratic problem at less than 200 times the cost of solving the linear equation system (*).

So can the best proven polynomial complexity bound be proven for software like MOSEK? In general the answer is no, because the software employs a bunch of tricks that speed up the practical performance but unfortunately destroy the theoretical complexity proof. In fact, it is commonly accepted that if the algorithm is implemented strictly as theory suggests then it will be hopelessly slow.

I have spent a lot of time on implementing interior-point methods, as documented by the Mathematical Programming publication, and my view is that the practical implementations are very close to theory.

Why I hate OpenMP

At MOSEK we use OpenMP to parallelize our code. Well, in fact we used to use OpenMP exclusively. However, we have now moved to using pthreads or native Windows threads.

This may seem like a very bad idea, because the normal recommendation is to use OpenMP, Cilk, Intel Threading Building Blocks or the like. But I will now explain why I ignore this recommendation and in fact think it is a bit naive.

At MOSEK we build a shared library (a DLL on Windows) that we sell to our customers. This library is called from the user's program or from programs like MATLAB, R, etc. We have absolutely no control over the calling application. The calling application may be linked with OpenMP or another threading library. Our users expect our library to work almost independently of what they do.

However, OpenMP is built on the implicit assumption that you are an academic who has all the source code when building the application, and that if an application is linked with OpenMP then the same version of OpenMP is used throughout the application. This definitely does not hold for us.

We have to ask ourselves what happens if the user builds their application with gcc's OpenMP and links it with a MOSEK library we built with Intel's OpenMP. I think the answer is that the final application crashes.

I recently reported a crash in the Intel MKL BLAS library when called from MATLAB that seems to be caused by an OpenMP issue. See the report. The answer from Intel was: are you sure you use the same OpenMP as MATLAB? This illustrates very well what I hate about OpenMP. It does not work unconditionally. I mean, if you link with pthreads then it works, because pthreads is integrated into the operating system.

So my conclusion is: if you are a commercial library vendor then do not use OpenMP, unless you want to get into trouble.

Wednesday, June 26, 2013

Slow convergence in conic quadratic optimization

The recent paper by Gould, Orban, and Robinson in Mathematical Programming Computation studies the QP
\[
\begin{array}{lc}
\mbox{min}  & x_1^2    \\
\mbox{st}     &  x_1 \geq 0
\end{array}
\]
because the problem does not have a strictly complementary solution. This fact leads interior-point methods to converge slowly.

An interesting observation is that it can be formulated as the following conic quadratic optimization problem:
\[
\begin{array}{lc}
\mbox{min} &  x_2        \\
\mbox{st}    & x_2 \geq ||x_1|| \\
                    &  x_1  \geq 0 
\end{array}
\]
An optimal primal solution is \(x_1 = 0\) and \(x_2=0\). Observe that Vanderbei studies the same problem in his working paper (Section 2.2) on solving SOCPs with LOQO.

If I solve the problem with MOSEK I obtain

ITE PFEAS    DFEAS    GFEAS    PRSTATUS   POBJ              DOBJ              MU       TIME  
0   1.0e+000 1.0e+000 2.0e+000 0.00e+000  1.000000000e+000  0.000000000e+000  1.0e+000 0.00  
1   1.6e-001 1.6e-001 3.2e-001 1.00e+000  2.404340377e-001  0.000000000e+000  1.6e-001 0.00  
2   8.2e-003 8.2e-003 1.6e-002 1.12e+000  1.181687831e-002  0.000000000e+000  8.2e-003 0.00  
3   4.1e-004 4.1e-004 8.2e-004 1.01e+000  5.891864452e-004  0.000000000e+000  4.1e-004 0.00  
4   2.1e-005 2.1e-005 4.1e-005 1.00e+000  2.945495827e-005  0.000000000e+000  2.1e-005 0.00  
5   2.6e-010 2.6e-010 5.2e-010 1.00e+000  4.144019664e-010  0.000000000e+000  2.6e-010 0.00  

As can be seen from the mu column (the average complementarity gap) MOSEK converges quickly. In fact there is quadratic convergence in the last iteration (mu drops from 2.1e-005 to 2.6e-010, roughly its square). Now the dual problem is
\[
\begin{array}{lc}
\mbox{max} & 0        \\
\mbox{st}    & s_3  = 1 \\
                     & s_2 \geq ||s_1|| \\
                    &  s_1  \geq 0 
\end{array}
\]
and an optimal dual solution is \(s_1=0.0\), \(s_2=1.0\) and \(s_3=1\). In this case the dual solution is in the interior and hence the solutions are strictly complementary. This is also indicated by the interior-point optimizer in MOSEK, which converges quickly.

An equivalent conic quadratic problem using a rotated quadratic cone can be stated as follows
\[
\begin{array}{lc}
\mbox{min} &  x_2        \\
\mbox{st}    & 2 x_3 x_2 \geq ||x_1||^2 \\
                    & x_3  =1 \\
                    &  x_1  \geq 0 
\end{array}
\]
Now solving that problem with MOSEK produces
ITE PFEAS    DFEAS    GFEAS    PRSTATUS   POBJ              DOBJ              MU       TIME  
0   1.0e+000 1.0e+000 1.7e+000 0.00e+000  7.071067812e-001  0.000000000e+000  1.0e+000 0.00  
1   1.4e-001 1.4e-001 2.4e-001 6.09e-001  1.517397994e-001  -4.749764644e-002 1.4e-001 0.00  
2   2.2e-002 2.2e-002 3.7e-002 1.21e+000  2.120093451e-002  -6.971441263e-003 2.2e-002 0.00  
3   5.3e-003 5.3e-003 9.0e-003 1.14e+000  4.674957163e-003  -1.445168902e-003 5.3e-003 0.00  
4   1.5e-003 1.5e-003 2.6e-003 1.07e+000  1.234899144e-003  -3.403939914e-004 1.5e-003 0.00  
5   4.4e-004 4.4e-004 7.4e-004 1.04e+000  3.331026029e-004  -7.962820209e-005 4.4e-004 0.00  
6   1.2e-004 1.2e-004 2.1e-004 1.02e+000  8.925963682e-005  -1.860468131e-005 1.2e-004 0.00  
7   3.4e-005 3.4e-005 5.8e-005 1.01e+000  2.373850359e-005  -4.389268847e-006 3.4e-005 0.00  
8   9.2e-006 9.2e-006 1.6e-005 1.01e+000  6.273175074e-006  -1.049389687e-006 9.2e-006 0.00  
9   2.4e-006 2.4e-006 4.2e-006 1.00e+000  1.649538759e-006  -2.545236451e-007 2.4e-006 0.00  
10  6.5e-007 6.5e-007 1.1e-006 1.00e+000  4.321347177e-007  -6.259486109e-008 6.5e-007 0.00  
11  1.7e-007 1.7e-007 2.9e-007 1.00e+000  1.128991865e-007  -1.558439042e-008 1.7e-007 0.00  
12  4.5e-008 4.5e-008 7.7e-008 1.00e+000  2.943872208e-008  -3.919657806e-009 4.5e-008 0.00  
13  1.2e-008 1.2e-008 2.0e-008 1.00e+000  7.665414689e-009  -9.952574739e-010 1.2e-008 0.00  

It is seen that the interior-point optimizer now converges at a much slower rate and takes many more iterations. It is easy to verify that there is no strictly complementary solution in this case.

This gives rise to some comments. First, there might be several ways of representing a set, and they are not all equally effective. Secondly, maybe we should think about improving the code for problems without a strictly complementary solution, as Gould et al. suggest.




Friday, May 25, 2012

An example of a badly scaled model with an easy fix

A scaled down version of a problem we just received at MOSEK is

min s^2
st.   Ax - b = s   (P)
       x >= 0

Assume for simplicity that A is a matrix with 1 row. Now in this case the elements in A are very large, i.e. of the order of magnitude 10^7. This implies the problem (P) is badly scaled, because the row contains a lot of elements of size 10^7 and one (the coefficient of s) of size 1.

Now an equally good model is

min s^2
st.   Ax - b = (||A||_inf) s   (P')
       x >= 0

which is nicely scaled. An optimal x for (P') is clearly also optimal for (P); only s has to be rescaled by ||A||_inf.
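As a rough sketch of the fix (the helper row_inf_norm is hypothetical, not MOSEK code): compute the infinity norm of the penalty row and use it as the coefficient of s, as in (P'), so all entries in the row end up on a comparable scale.

#include <math.h>
#include <stddef.h>

/* Hypothetical helper: return ||a||_inf for a single constraint row.
   Using the returned value as the coefficient of the penalty variable s,
   as in (P'), keeps the entries of the row on a comparable scale. */
static double row_inf_norm(const double *a, size_t n)
{
  double norm = 0.0;
  for (size_t j = 0; j < n; ++j)
    norm = fmax(norm, fabs(a[j]));
  return norm;
}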

Note this line of thinking could be relevant anywhere where penalty variables are employed. Many optimization problems contain penalty variables.