Broken L2 cache reporting on crostini
With Google's recent announcement of support for running real Linux apps on Chrome OS, I picked up a Pixelbook, since I've been long awaiting the viability of Chromebooks as development machines.
After setting up a dev VM and experimenting with various projects, I found that one Tensorflow application I was playing with would lock up, hard, inside the Crostini VM on my Chromebook.
After adding some debug prints, I discovered that virtually any calls into numpy.linalg.inv were hanging. I could reproduce as simply as:
python3 -c 'import numpy as np; np.linalg.inv(np.identity(3))'
Googling found https://github.com/numpy/numpy/issues/11041, which was similar, but my bug was far worse, and the workaround on that issue didn't solve my problem. It did, however, point me at the OPENBLAS_NUM_THREADS=1 environment variable, which limited openblas to a single thread; This would prove helpful later.
I suspected a Crostini bug at this point, since this was pretty basic functionality to be broken, but I reported a numpy bug in the interim while I debugged. (numpy ended up (correctly!) also guessing this was a crostini bug, but were also able to provide some helpful debugging pointers)
I stopped a hung process (with OPENBLAS_NUM_THREADS=1 to simplify the situation) in gdb and got a stack trace:
(gdb) bt #0 0x00007ffff445a5b8 in dtrsm_oltucopy_PRESCOTT () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so #1 0x00007ffff426aad3 in dtrsm_LNLU () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so #2 0x00007ffff43b0a24 in dgetrs_N_single () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so #3 0x00007ffff4191965 in dgesv_ () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so #4 0x00007ffff1e5a103 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so #5 0x00007ffff3aaed24 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #6 0x00007ffff3aaf538 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #7 0x00007ffff3ab0ddf in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #8 0x000055555575d647 in PyObject_Call () #9 0x00005555556d4ee1 in PyEval_EvalFrameEx () #10 0x00005555556d493f in PyEval_EvalFrameEx () #11 0x00005555556d9286 in ?? () #12 0x00005555556d9f9f in PyEval_EvalCode () #13 0x00005555556b89bf in PyRun_StringFlags () #14 0x00005555557a9f3c in PyRun_SimpleStringFlags () #15 0x00005555557d8602 in Py_Main () #16 0x0000555555668c01 in main ()
This confirmed we're hung in OpenBLAS, and in particular tells us that numpy ships its own OpenBLAS. Debug symbols would almost certainly help here, so I installed Debian's OpenBLAS, and the corresponding debug symbols:
$ sudo apt install libopenblas-base libopenblas-base-dbgsym
Now we can force load that version, and get better symbols:
$ env LD_PRELOAD=/usr/lib/libopenblasp-r0.2.19.so OPENBLAS_NUM_THREADS=1 gdb --args python3 -c 'import numpy as np; np.linalg.inv(np.identity(3))' ... (gdb) bt #0 dtrsm_oltucopy_PRESCOTT (m=3, n=0, a=<optimized out>, lda=3, offset=<optimized out>, b=<optimized out>) at generic/trsm_ltcopy_4.c:346 #1 0x00007ffff5e5e8b4 in dtrsm_LNLU (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, dummy=dummy@entry=0) at trsm_L.c:153 #2 0x00007ffff5fa7765 in dgetrs_N_single (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, mypos=mypos@entry=0) at getrs_single.c:51 #3 0x00007ffff5d7d978 in dgesv_ (N=<optimized out>, NRHS=0x7fffffffbdcc, a=<optimized out>, ldA=<optimized out>, ipiv=<optimized out>, b=<optimized out>, ldB=0x7fffffffbdd4, Info=0x7fffffffbda0) at lapack/gesv.c:127 #4 0x00007fffef583103 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so #5 0x00007ffff11d7d24 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #6 0x00007ffff11d8538 in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #7 0x00007ffff11d9ddf in ?? () from /home/nelhage/.local/lib/python3.5/site-packages/numpy/core/umath.cpython-35m-x86_64-linux-gnu.so #8 0x000055555575d647 in PyObject_Call () #9 0x00005555556d4ee1 in PyEval_EvalFrameEx () #10 0x00005555556d493f in PyEval_EvalFrameEx () #11 0x00005555556d9286 in ?? () #12 0x00005555556d9f9f in PyEval_EvalCode () #13 0x00005555556b89bf in PyRun_StringFlags () #14 0x00005555557a9f3c in PyRun_SimpleStringFlags () #15 0x00005555557d8602 in Py_Main () #16 0x0000555555668c01 in main ()
Line numbers! Variable names! Oh my!
After taking a few stack traces to see where we're stuck, I became pretty convinced we were stuck in this loop:
for(is = ls + min_i; is < ls + min_l; is += GEMM_P){ #ifndef TRANSA TRSM_ILTCOPY(min_l, min_i, a + (is + ls * lda) * COMPSIZE, lda, is - ls, sa); #else TRSM_IUNCOPY(min_l, min_i, a + (ls + is * lda) * COMPSIZE, lda, is - ls, sa); #endif TRSM_KERNEL(min_i, min_j, min_l, dm1, #ifdef COMPLEX ZERO, #endif sa, sb, b + (is + js * ldb) * COMPSIZE, ldb, is - ls); }
Looking at the assembly, we find the loop ends with a
0x00007ffff5e5e8f8 : movslq 0x280(%rax),%r10 0x00007ffff5e5e8ff : add %r10,%rbp 0x00007ffff5e5e902 : cmp %r15,%rbp 0x00007ffff5e5e905 : jl 0x7ffff5e5e870 <dtrsm_lnlu>
I set a breakpoint on 0x00007ffff5e5e8ff and inspected %r10, which I was pretty sure was the GEMM_P increment in the for loop above:
gdb) b *0x00007ffff5e5e8ff Breakpoint 1 at 0x7ffff5e5e8ff: file trsm_L.c, line 148. (gdb) c Continuing. Breakpoint 1, 0x00007ffff5e5e8ff in dtrsm_LNLU (args=args@entry=0x7fffffffbc10, range_m=range_m@entry=0x0, range_n=range_n@entry=0x0, sa=sa@entry=0x7fffecaa5000, sb=sb@entry=0x7fffecaa5100, dummy=dummy@entry=0) at trsm_L.c:148 148 in trsm_L.c (gdb) p $r10 $1 = 0
So we're looping forever because we're looping over something by an increment of 0. Hm. Now, where does that come from? We can ask the debugger if that address points to a symbol:
(gdb) x/lx (void*)$rax + 0x280 0x7ffff7dadf40 <gotoblas_prescott>: 0x00000000
so GEMM_P appears to be a macro(?) that expands into an offset after a symbol named gotoblas_PRESCOTT. Github suggests that symbol is a gotoblas_t: https://github.com/xianyi/OpenBLAS/blob/26e1cfb65314a5579cc74aa8d7d30660db3e9ee1/driver/others/dynamic.c#L58
gdb tells us that struct is in fact larger than 0x280 bytes, so we're pulling a field from inside. Scanning the fields (via ptype) we find several named things like gemm_p, and in fact find one at offset 0x280:
(gdb) p sizeof(gotoblas_t) $3 = 3992 (gdb) ptype gotoblas_t type = struct { .... } (gdb) p &((gotoblas_t*)0)->dgemm_p $4 = (int *) 0x280
I tried messing with watchpoints to catch the initialization of dgemm_p, but to no avail. Code search, however, revealed a whole bunch of assignments in a nested maze of #ifdef conditionals, all in one function in setparam-ref.c (I would later learn that OpenBLAS compiles this file once per architecture with a different mix of #defines, but that's mostly just fun trivia...).
We note that it's pulling the size, in most of these branches, based on l2, which is the result of:
int l2 = get_l2_size();
Is it possible the custom hypervisor used by crostini is incorrectly reporting a size-0 L2 cache? We read get_l2_size; the crux of it is:
cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
Some quick googling confirms that cpuid query 0x80000006 returns information about the L2 cache. With the help of the cpuid(1) command-line tool, we can see what our virtual CPU returns:
$ cpuid -1 -r -l 0x80000006 CPU: 0x80000006 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
All zeros! So, the hypervisor that manages the VMs on Google's new Crostini environment is failing to configure the CPUID values for L2 cache size, resulting in OpenBLAS seeing a 0-size L2 cache, and looping forever as it tries to loop over data in L2-sized chunks!
I filed two bugs, one against Google about the cpuid issue, and one against OpenBLAS asking for greater robustness in this particular edge case. The latter is already fixed, and the former has been confirmed and will hopefully be resolved soon.












