[srslte-users] srsUE Segmentation fault after long runtimes?

Peter Coder denvercoder_9 at outlook.com
Sat Dec 10 03:25:16 UTC 2016


Quoting Patrick Cutno : "(I am using a core i7-6770HQ if that matters)"

Yes, it does matter, actually!

I just ran into this same bug this week [1], also on an i7-6770HQ using:

   Ubuntu 16.04.1 LTS AMD64
   Ubuntu GLIBC 2.23-0ubuntu5
   srsUE changeset c7779a97930f9f20f2d40cd018dbbfdd60fc65d8
   srsLTE changeset 4c5b3700f335dfb330ba44fd003dc4b04a794449

The issue is in fact unlocking an already unlocked mutex, which is a pthreads API rule violation. 
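
For anyone who wants to see the violation reported instead of crashing, here is a minimal sketch, assuming a POSIX system. A mutex created with the PTHREAD_MUTEX_ERRORCHECK type returns EPERM when a thread unlocks it without holding it, rather than invoking undefined behaviour. The helper names init_errorcheck_mutex and double_unlock_result are hypothetical and not part of srsUE; this is a debugging aid, not a fix.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not srsUE code): initialise a mutex whose misuse is
// reported through return codes instead of being undefined behaviour.
int init_errorcheck_mutex(pthread_mutex_t *m) {
  pthread_mutexattr_t attr;
  int rc = pthread_mutexattr_init(&attr);
  if (rc != 0) return rc;
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  if (rc == 0) rc = pthread_mutex_init(m, &attr);
  pthread_mutexattr_destroy(&attr);
  return rc;
}

// Lock once, unlock twice, and return the result of the second unlock.
// Returns -1 if the mutex could not be set up.
int double_unlock_result() {
  pthread_mutex_t m;
  if (init_errorcheck_mutex(&m) != 0) return -1;

  int rc = pthread_mutex_lock(&m);
  assert(rc == 0);
  rc = pthread_mutex_unlock(&m);   // valid: we hold the lock
  assert(rc == 0);

  rc = pthread_mutex_unlock(&m);   // invalid: the mutex is already unlocked
  pthread_mutex_destroy(&m);
  return rc;
}
```

Calling double_unlock_result() should yield EPERM for the second unlock, which makes it possible to log the offending call site instead of faulting.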

Previously, this violation was not enforced (e.g. via assert) by pthreads (or glibc). Until fairly recently, the underlying mutex implementations were all 'forgiving' of this violation, allowing such bugs to go unnoticed. However, with the introduction of TSX and HLE (Hardware Lock Elision) and glibc's use of them for low-level locks, things changed: TSX/HLE in hardware is completely unforgiving of this violation [2]. Compounding the issue, there were bugs in the first processors implementing these features (e.g. Haswell and early Broadwell CPUs), so the glibc maintainers blacklisted the use of TSX/HLE for those processors [3]. This allowed hidden violation bugs to persist, undiscovered. Fast-forward to newer processors coming onto the market whose implementations don't have these issues (e.g. Skylakes like our i7-6770HQs) and which are therefore not blacklisted from using TSX/HLE in glibc, and voilà, the bugs surface. (See [4] and [5] for more discussion.)

The first issue is that when phch_common::reset_ul is called, it blindly unlocks all mutexes regardless of their current state. A second is that phch_common::worker_end conditionally locks the current mutex but unconditionally unlocks the next mutex in the worker set.

In my initial testing, the higher the bandwidth and the faster you send data, the quicker you trigger this issue. At 20 MHz and full-tilt communication, it happens for me somewhere between immediately and a few minutes. I am using volk (avx2_64_mmx_orc) and I do have CPU frequency scaling disabled.

As for workarounds, I didn't see any simple options initially. You could blacklist the processor in glibc, but that would require rebuilding glibc from source, which isn't all that friendly to do or maintain. Virtualizing the CPUID instruction and masking out the HLE bit is also heavy-handed.

The state machine for the phch workers isn't too complex, so hopefully with a little massaging the invalid unlocking operations can be avoided entirely.
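
As one possible direction, and only a sketch under my own assumptions, the "mutex used as a turn token" could be replaced by a small gate built from a mutex and a condition variable, where releasing an already released gate is a harmless no-op. The class turn_gate and the function reset_then_take are hypothetical names for illustration; nothing here is taken from the srsUE sources.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a per-worker "turn" mutex (not srsUE code).
// Unlike a raw mutex, release() may be called in any state and by any thread.
class turn_gate {
public:
  explicit turn_gate(bool open = false) : open_(open) {}

  // Block until the gate is open, then close it again (take the turn).
  void wait_and_take() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return open_; });
    open_ = false;
  }

  // Open the gate. Calling this on an already open gate changes nothing.
  void release() {
    {
      std::lock_guard<std::mutex> lk(m_);
      open_ = true;
    }
    cv_.notify_one();
  }

  bool is_open() {
    std::lock_guard<std::mutex> lk(m_);
    return open_;
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  bool open_;
};

// Usage sketch: a reset path can release a gate without knowing its state,
// and a worker can still take its turn afterwards.
bool reset_then_take(turn_gate &g) {
  g.release();
  g.release();            // second release is a no-op, not a violation
  assert(g.is_open());
  g.wait_and_take();      // returns immediately because the gate is open
  return !g.is_open();
}
```

The internal mutex is only ever locked and unlocked by the same thread within a single call, so the pthreads ownership rule holds no matter how the callers interleave.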


[1]
   Program terminated with signal SIGSEGV, Segmentation fault.
   #0  0x00007fc82ef78970 in _xend () at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
   33    ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
   [Current thread is 1 (Thread 0x7fc8067fc700 (LWP 23018))]
   
   
   #0  0x00007fc82ef78970 in _xend () at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
   #1  __lll_unlock_elision (lock=0xc22208, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
   #2  0x000000000045cfaf in srsue::phch_common::reset_ul() ()
   #3  0x000000000045a6f7 in srsue::phch_recv::run_thread() ()
   #4  0x00000000004293e7 in thread::thread_function_entry(void*) ()
   #5  0x00007fc82ef6d70a in start_thread (arg=0x7fc8067fc700) at pthread_create.c:333
   #6  0x00007fc82d5bb82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   
   info symbol 0x00007fc82ef78970
   __lll_unlock_elision + 48 in section .text of /lib/x86_64-linux-gnu/libpthread.so.0

[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=1cdbe579482c07e9f4bb3baa4864da2d3e7eb837 
[3] https://github.com/OpenMandrivaAssociation/glibc/blob/master/glibc-2.22-blacklist-CPUs-from-lock-elision.patch
[4] https://software.intel.com/en-us/forums/intel-isa-extensions/topic/675036
[5] https://lists.debian.org/debian-devel/2016/11/msg00219.html



