[srslte-users] srsUE Segmentation fault after long runtimes?
Ismael Gomez
ismael.gomez at softwareradiosystems.com
Sat Dec 10 10:31:41 UTC 2016
Hi all,
I just pushed a fix for this to the next branch. Could anyone try it, please? Thanks.
Ismael
On Sat, 10 Dec 2016 at 11:16 Ismael Gomez <
ismael.gomez at softwareradiosystems.com> wrote:
> Hi Peter,
>
> Thanks for that explanation!
>
> This happens at higher speeds and bandwidths because reset_ul() is called
> when there is a signal synchronization problem, which is typically caused
> by an RF frontend buffer overrun.
>
> One solution could be to use pthread_mutex_trylock() to get the status of
> the mutex before unlocking. Do you think it could work?
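A minimal sketch of the trylock idea, for illustration only. The helper name force_unlock is hypothetical and is not taken from srsUE; the only real API used is pthread_mutex_trylock()/pthread_mutex_unlock(). It assumes the calling thread is the one entitled to release the mutex.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not srsUE code): leave *m unlocked no matter whether
// it was locked on entry, without ever unlocking an already unlocked mutex.
// Returns true if the mutex was already held before the call.
bool force_unlock(pthread_mutex_t* m) {
  int rc = pthread_mutex_trylock(m);
  // rc == 0     : the mutex was free and we have just taken it ourselves.
  // rc == EBUSY : the mutex was already held.
  // Any other code means the mutex itself is invalid.
  assert(rc == 0 || rc == EBUSY);
  // In both accepted cases the mutex is locked at this point, so the
  // unlock below is always paired with a lock.
  pthread_mutex_unlock(m);
  return rc == EBUSY;
}
```

Two caveats with this approach: there is a window between the EBUSY result and the unlock in which the real owner could release the mutex first, and POSIX still leaves unlocking a default mutex from a thread that does not own it undefined. So this avoids the double unlock only when the caller is the sole thread that may release the lock.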
>
> Regards and thanks for using and testing srsUE
>
> Ismael
>
>
> On Sat, 10 Dec 2016 at 07:18 Peter Coder <denvercoder_9 at outlook.com>
> wrote:
>
> Quoting Patrick Cutno: "(I am using a core i7-6770HQ if that matters)"
>
> Yes, it does matter, actually!
>
> I just ran into this same bug this week [1], also on an i7-6770HQ using:
>
> Ubuntu 16.04.1 LTS AMD64
> Ubuntu GLIBC 2.23-0ubuntu5
> srsUE changeset c7779a97930f9f20f2d40cd018dbbfdd60fc65d8
> srsLTE changeset 4c5b3700f335dfb330ba44fd003dc4b04a794449
>
> The issue is in fact unlocking an already unlocked mutex, which is a
> pthreads API rule violation.
>
> Previously, this violation was not enforced (e.g. via assert) by pthreads
> (or glibc). Until fairly recently, the underlying mutex implementations
> were all 'forgiving' of this violation, thus allowing such bugs to go
> unnoticed. However, with the introduction of TSX and HLE (Hardware Lock
> Elision) and glibc's utilization of them for low-level locks, things
> changed: TSX/HLE in hardware is completely unforgiving of this
> violation[2]. Compounding the issue, there were bugs in the first
> processors implementing these features (e.g. Haswell and early Broadwell
> CPUs), so the glibc maintainers blacklisted the use of TSX/HLE for those
> processors[3]. This allowed hidden violation bugs to persist,
> undiscovered. Fast-forward to newer processors coming onto the market
> that don't have issues with their implementations (e.g. Skylakes like our
> i7-6770HQs) and thus are not blacklisted from utilizing TSX/HLE in glibc,
> and voilà, the bugs surface. (See [4] and [5] for more discussion.)
>
> The initial issue is that when phch_common::reset_ul is called, it will
> blindly unlock all mutexes regardless of their present state. Another is
> that phch_common::worker_end conditionally locks the current mutex but
> unconditionally unlocks the next mutex in the worker set.
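One way to avoid the blind unlock is to remember which mutexes are actually held and release only those. The sketch below is an assumption about how that could look; tracked_mutex and reset_all are hypothetical names and do not mirror the real phch_common members.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical wrapper (not an srsUE class): a pthread mutex plus a flag
// recording whether it is currently held.
class tracked_mutex {
public:
  tracked_mutex() {
    int rc = pthread_mutex_init(&m_, nullptr);
    assert(rc == 0);
    (void)rc;  // silence unused warning when NDEBUG is defined
  }
  ~tracked_mutex() { pthread_mutex_destroy(&m_); }
  tracked_mutex(const tracked_mutex&) = delete;
  tracked_mutex& operator=(const tracked_mutex&) = delete;

  void lock() {
    pthread_mutex_lock(&m_);
    held_.store(true);
  }

  // Unlock only if the flag says the mutex is held.
  // Returns true if an unlock was actually performed.
  bool unlock_if_held() {
    if (!held_.exchange(false)) return false;
    pthread_mutex_unlock(&m_);
    return true;
  }

  bool held() const { return held_.load(); }

private:
  pthread_mutex_t m_;
  std::atomic<bool> held_{false};
};

// Rough analogue of a reset: release whatever is held and skip the rest.
// Returns the number of mutexes that were released.
int reset_all(tracked_mutex* workers, std::size_t n) {
  int released = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (workers[i].unlock_if_held()) ++released;
  }
  return released;
}
```

With three workers of which only one is locked, reset_all releases exactly that one and a second call releases nothing. Note that the flag only prevents the double unlock; POSIX still expects the unlocking thread to be the one that locked, so if the locks are really used to hand work from one thread to the next, a semaphore or condition variable may be a closer fit.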
>
> In my initial testing, the higher the bandwidth and the faster you send
> data, the quicker you trigger this issue. At 20 MHz and full tilt
> communication, it happens for me somewhere between immediately and a few
> minutes. I am using volk (avx2_64_mmx_orc) and I do have CPU frequency
> scaling disabled.
>
> As for workarounds, I didn't see any simple options initially. You could
> blacklist the processor in glibc, but that would require rebuilding glibc
> from source, which isn't all that friendly to do or maintain. Virtualizing
> the CPUID instruction and masking out the HLE bit is also heavy-handed.
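To check whether a given machine is exposed, it can help to see what the CPU advertises. The sketch below reads CPUID leaf 7, subleaf 0, where EBX bit 4 is HLE and bit 11 is RTM. It assumes GCC or Clang on x86 (for <cpuid.h>); the struct and function names are made up for illustration. The _xend frame in the backtrace is the RTM commit instruction, so both bits are reported.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical result type for this sketch.
struct tsx_features {
  bool hle;  // Hardware Lock Elision
  bool rtm;  // Restricted Transactional Memory
};

// Decode EBX as returned by CPUID leaf 7, subleaf 0:
// bit 4 = HLE, bit 11 = RTM.
tsx_features decode_leaf7_ebx(std::uint32_t ebx) {
  return {(ebx & (1u << 4)) != 0, (ebx & (1u << 11)) != 0};
}

// Query the running CPU. Reports both features as absent when the
// compiler or architecture does not provide <cpuid.h>.
tsx_features query_tsx() {
#ifdef SKETCH_HAVE_CPUID
  if (__get_cpuid_max(0, nullptr) >= 7) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return decode_leaf7_ebx(ebx);
  }
#endif
  return {false, false};
}
```

This only reports what the hardware claims to support. Whether glibc actually elides locks also depends on how glibc was built and configured, so a positive result here means the crash is possible, not that elision is in use.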
>
> The state machine for the phch workers isn't too complex, so hopefully
> with a little massaging the invalid unlocking operations can be avoided
> entirely.
>
>
> [1]
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00007fc82ef78970 in _xend () at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
> 33 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or
> directory.
> [Current thread is 1 (Thread 0x7fc8067fc700 (LWP 23018))]
>
>
> #0 0x00007fc82ef78970 in _xend () at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:33
> #1 __lll_unlock_elision (lock=0xc22208, private=0) at
> ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #2 0x000000000045cfaf in srsue::phch_common::reset_ul() ()
> #3 0x000000000045a6f7 in srsue::phch_recv::run_thread() ()
> #4 0x00000000004293e7 in thread::thread_function_entry(void*) ()
> #5 0x00007fc82ef6d70a in start_thread (arg=0x7fc8067fc700) at
> pthread_create.c:333
> #6 0x00007fc82d5bb82d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> info symbol 0x00007fc82ef78970
> __lll_unlock_elision + 48 in section .text of
> /lib/x86_64-linux-gnu/libpthread.so.0
>
> [2]
> https://sourceware.org/git/?p=glibc.git;a=commit;h=1cdbe579482c07e9f4bb3baa4864da2d3e7eb837
> [3]
> https://github.com/OpenMandrivaAssociation/glibc/blob/master/glibc-2.22-blacklist-CPUs-from-lock-elision.patch
> [4]
> https://software.intel.com/en-us/forums/intel-isa-extensions/topic/675036
> [5] https://lists.debian.org/debian-devel/2016/11/msg00219.html
>
>
> _______________________________________________
> srslte-users mailing list
> srslte-users at lists.softwareradiosystems.com
> http://www.softwareradiosystems.com/mailman/listinfo/srslte-users
>
>