[srslte-users] srsUE Segmentation fault after long runtimes?
Patrick Cutno
PCutno at girdsystems.com
Thu Dec 15 14:56:21 UTC 2016
Thank you for putting time into investigating the issue, but to be perfectly honest, I am not 100% familiar with LTE and how the code implements it just yet. However, if a mutex isn’t needed to protect a critical section, the most straightforward answer would be to have a variable or function that checks and triggers the next packet to be sent, like you mentioned. At the moment I don’t have a more elegant solution, but in the meantime the segfault happens a lot less often now, and I think I can continue my extended iperf test with little to no interruption from it. If I run into any other segfaults or think of a practical solution, I will be sure to let you know.
Thanks again for your time
Patrick
From: Ismael Gomez [mailto:ismael.gomez at softwareradiosystems.com]
Sent: Thursday, December 15, 2016 5:45 AM
To: Patrick Cutno <PCutno at girdsystems.com>; srslte-users at lists.softwareradiosystems.com
Subject: Re: [srslte-users] srsUE Segmentation fault after long runtimes?
Hi again Patrick. I'm afraid this won't have an easy fix and we'll have to think about it more. I don't really see why the mutex is unlocked twice; looking at the code, it should not happen. On the other hand, I think this bit of code should be redesigned. The purpose of the mutex here is just to order the transmission of packets to the radio, so that TTI N is sent right after TTI N-1 and not before, since the processing of each TTI takes an arbitrary amount of time. In that context, mutex elision should not happen, because we are not using the mutex to protect a critical section but to force an order on a sequence of events. I think we should use some other tool for that, maybe a condition variable.
Any suggestion?
Thanks
On Wed, 14 Dec 2016 at 21:01 Patrick Cutno <PCutno at girdsystems.com> wrote:
It looks like the segfault still occurs in the new test_lock branch.
Thread 20 "ue" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd51ab700 (LWP 6254)]
__lll_unlock_elision (lock=0x10a9ae0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt full
#0 __lll_unlock_elision (lock=0x10a9ae0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
No locals.
#1 0x00000000004aba9a in srsue::phch_common::worker_end(unsigned int, bool, floatcomplex *, unsigned int, srslte_timestamp_t) ()
No symbol table info available.
#2 0x00000000004a5d0a in srsue::phch_worker::work_imp() ()
No symbol table info available.
#3 0x00000000004d4a11 in srslte::thread_pool::worker::run_thread() [clone .localalias.78] ()
No symbol table info available.
#4 0x00000000004738e9 in thread::thread_function_entry(void*) ()
No symbol table info available.
#5 0x00007ffff7bc16fa in start_thread (arg=0x7fffd51ab700)
at pthread_create.c:333
__res = <optimized out>
pd = 0x7fffd51ab700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736768685824,
-1253843483270665236, 1, 140737488346335, 140736768686528,
140737093259776, 1253929015978666988, 1253861648071699436},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#6 0x00007ffff39f0b5d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
________________________________
From: Ismael Gomez [ismael.gomez at softwareradiosystems.com]
Sent: Wednesday, December 14, 2016 12:21 PM
To: Patrick Cutno; srslte-users at lists.softwareradiosystems.com
Subject: Re: [srslte-users] srsUE Segmentation fault after long runtimes?
I guess it happens at full rate as well. It's probably caused by the same buffer overflow. I think it might be possible to disable reset_ul() completely. I created a new branch called test_lock. Can you please pull it and test whether that solves the issue? Thanks
On Wed, 14 Dec 2016 at 18:13 Patrick Cutno <PCutno at girdsystems.com> wrote:
Speaking of... it just happened again.
I ran a 'bt' and a 'bt full', it looks like it might be in a different place.
(gdb) bt
#0 __lll_unlock_elision (lock=0x10a8ae0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00000000004aae1a in srsue::phch_common::worker_end(unsigned int, bool, floatcomplex *, unsigned int, srslte_timestamp_t) ()
#2 0x00000000004a5157 in srsue::phch_worker::work_imp() ()
#3 0x00000000004d3d51 in srslte::thread_pool::worker::run_thread() [clone .localalias.78] ()
#4 0x00000000004738e9 in thread::thread_function_entry(void*) ()
#5 0x00007ffff7bc16fa in start_thread (arg=0x7fffd51ab700)
at pthread_create.c:333
#6 0x00007ffff39f0b5d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) bt full
#0 __lll_unlock_elision (lock=0x10a8ae0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
No locals.
#1 0x00000000004aae1a in srsue::phch_common::worker_end(unsigned int, bool, floatcomplex *, unsigned int, srslte_timestamp_t) ()
No symbol table info available.
#2 0x00000000004a5157 in srsue::phch_worker::work_imp() ()
No symbol table info available.
#3 0x00000000004d3d51 in srslte::thread_pool::worker::run_thread() [clone .localalias.78] ()
No symbol table info available.
#4 0x00000000004738e9 in thread::thread_function_entry(void*) ()
No symbol table info available.
#5 0x00007ffff7bc16fa in start_thread (arg=0x7fffd51ab700)
at pthread_create.c:333
__res = <optimized out>
pd = 0x7fffd51ab700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736768685824,
-7000932799496599927, 1, 140737488346303, 140736768686528,
140737093259776, 7000874771778881161, 7000950959935702665},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#6 0x00007ffff39f0b5d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
________________________________
From: Patrick Cutno
Sent: Wednesday, December 14, 2016 12:08 PM
To: Ismael Gomez; srslte-users at lists.softwareradiosystems.com
Subject: RE: [srslte-users] srsUE Segmentation fault after long runtimes?
Apologies, I forgot to keep a copy of the backtrace. If I run into it again, I will reply with the full backtrace.
From: Ismael Gomez [mailto:ismael.gomez at softwareradiosystems.com]
Sent: Wednesday, December 14, 2016 12:03 PM
To: Patrick Cutno <PCutno at girdsystems.com>; srslte-users at lists.softwareradiosystems.com
Subject: Re: [srslte-users] srsUE Segmentation fault after long runtimes?
Do you know if it is in the same place?
On Wed, 14 Dec 2016 at 17:47 Patrick Cutno <PCutno at girdsystems.com> wrote:
I may have spoken too soon. While not as frequent, I occasionally get the elision lock segfault.
From: Patrick Cutno
Sent: Tuesday, December 13, 2016 9:41 AM
To: Ismael Gomez <ismael.gomez at softwareradiosystems.com>; srslte-users at lists.softwareradiosystems.com
Subject: RE: [srslte-users] srsUE Segmentation fault after long runtimes?
I have also run a few tests with no issues in elision-unlock.c or ue_dl.c.
Thanks a lot!
Patrick
________________________________
From: Patrick Cutno
Sent: Friday, December 09, 2016 2:07 PM
To: Ismael Gomez; srslte-users at lists.softwareradiosystems.com
Subject: RE: [srslte-users] srsUE Segmentation fault after long runtimes?
Ok, I managed to run a few 5 minute tests and a 1 hour test without the segfault caused by 'ue_dl.c'.
However, I am running into a new segfault caused by elision-unlock.c on the Linux system, where it seems like srsUE is attempting to unlock something that's not locked. A quick search on Google showed that some people resolved the issue by using a different version of libc6 or glibc; others say it's a processor compatibility issue (I am using a Core i7-6770HQ, if that matters). I occasionally got this segfault before switching to the 'next' branch as well, but it was overshadowed by the number of times the ue_dl.c segfault happened.
Is this an issue with my computer, srsUE, or both? Any thoughts on the matter? I pasted my backtrace with the segfault below.
Thanks again
Patrick
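One way to test the "unlocking something that's not locked" suspicion is to create the mutex with the POSIX error-checking type, so that a bad unlock is reported through a return value instead of being undefined behaviour. The sketch below is a debugging aid under stated assumptions, not what srsUE does: the helper names init_errorcheck_mutex and checked_unlock are hypothetical, and only the pthread calls themselves are standard.

```cpp
#include <pthread.h>

// Hypothetical helper: initialise `m` as an error-checking mutex.
// Returns 0 on success or the pthread error code on failure.
inline int init_errorcheck_mutex(pthread_mutex_t* m) {
  pthread_mutexattr_t attr;
  int ret = pthread_mutexattr_init(&attr);
  if (ret != 0) {
    return ret;
  }
  ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  if (ret == 0) {
    ret = pthread_mutex_init(m, &attr);
  }
  pthread_mutexattr_destroy(&attr);
  return ret;
}

// Hypothetical helper: returns true if the unlock was legitimate, false if
// the calling thread did not hold the mutex (pthread_mutex_unlock then
// returns an error code instead of succeeding).
inline bool checked_unlock(pthread_mutex_t* m) {
  return pthread_mutex_unlock(m) == 0;
}
```

Logging the thread and call site whenever checked_unlock() returns false would show which code path performs the unmatched unlock.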
.
.
.
RRC Connection released.
Random Access Transmission: seq=1, ra-rnti=0xa
Random Access Complete. c-rnti=0x4d, ta=10
RRC Connected
Sync error.
Thread 22 "ue" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd67fc700 (LWP 3716)]
__lll_unlock_elision (lock=0x801ca8, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
1481306531:297860 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt full
#0 __lll_unlock_elision (lock=0x801ca8, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
No locals.
#1 0x00000000004abbdf in srsue::phch_common::reset_ul() ()
No symbol table info available.
#2 0x00000000004a9634 in srsue::phch_recv::run_thread() ()
No symbol table info available.
#3 0x000000000045e159 in thread::thread_function_entry(void*) ()
No symbol table info available.
#4 0x00007ffff7bc16fa in start_thread (arg=0x7fffd67fc700)
at pthread_create.c:333
__res = <optimized out>
pd = 0x7fffd67fc700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736792086272,
1035420685597574661, 1, 140737488346367, 140736792086976,
140737201365504, -1035509743916775931, -1035437757262428667},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#5 0x00007ffff3c36b5d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
________________________________
From: Patrick Cutno
Sent: Friday, December 09, 2016 8:51 AM
To: Ismael Gomez; srslte-users at lists.softwareradiosystems.com
Subject: RE: [srslte-users] srsUE Segmentation fault after long runtimes?
Wow, thanks for the fast and thorough response! I will do my best to try it out today or early next week.
Patrick
From: Ismael Gomez [mailto:ismael.gomez at softwareradiosystems.com]
Sent: Friday, December 09, 2016 5:13 AM
To: Patrick Cutno <PCutno at girdsystems.com>; srslte-users at lists.softwareradiosystems.com
Subject: Re: [srslte-users] srsUE Segmentation fault after long runtimes?
Hi Patrick,
Thanks for testing srsUE. The "Invalid Frequency Hopping parameters" messages are just warnings and are not really important. We'll probably remove them in future releases. During such a long run, it is likely that the UE finds a DCI message in the PDCCH whose CRC matches the C-RNTI by coincidence. The grant will very likely be invalid, and that's the message the UE is printing.
The segfault could be caused by an incorrectly decoded CFI or by a bug in some other part of the code. I've added a check in the function to skip decoding if the CFI is not valid, and just committed it to the next branch in srsLTE. Can you check again when you get a chance?
Thanks again for using and testing srsUE.
Best regards,
Ismael
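For illustration, a CFI validity check of the kind described above might look like the sketch below. It is not the actual commit: the names cfi_is_valid and find_ul_dci_guarded are hypothetical, and search_fn merely stands in for the real DCI search routine. The only fact relied on is that a valid LTE CFI is 1, 2 or 3, whereas the backtrace in the original report shows cfi=0.

```cpp
#include <cstdint>

// A valid LTE Control Format Indicator is 1, 2 or 3.
inline bool cfi_is_valid(uint32_t cfi) {
  return cfi >= 1 && cfi <= 3;
}

// Hypothetical guard: `search_fn` is a placeholder for the real DCI search.
// With an invalid CFI the search is skipped and 0 ("nothing found") is
// returned, so no search space is built from a bad value.
template <typename SearchFn>
int find_ul_dci_guarded(uint32_t cfi, SearchFn search_fn) {
  if (!cfi_is_valid(cfi)) {
    return 0;
  }
  return search_fn(cfi);
}
```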
On Wed, 7 Dec 2016 at 16:02 Patrick Cutno <PCutno at girdsystems.com> wrote:
Hello world,
I’m new to srsLTE and this mailing list type forum (please bear with me if I do or say something silly).
I am currently trying to perform long iperf3 tests to measure bandwidth between a B210 running the srsLTE UE and a PicoLTE with Amarisoft. Every now and again, the UE side of my system segfaults, seemingly at random. Sometimes I see the segfault after 30 minutes, and other times I can run the system for 10 hours without a problem. (Nothing else is running on the computers aside from srsLTE/Amarisoft and iperf3.)
According to gdb, the fault happens at ../srsLTE/srslte/lib/ue/ue_dl.c:399, ‘current_ss->format = SRSLTE_DCI_FORMAT0;’. When I try to print current_ss->format in gdb, it reports that the memory is not accessible. I have pasted my backtrace below in case anyone can give me some insight into why this randomly happens and how to fix it. If you need any other info, just let me know.
Thanks
Patrick
Starting program: /home/gird/srsUE/build/ue/src/ue ue_custom_1_4.conf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_003.010.001.000-release
[New Thread 0x7fffe03ec700 (LWP 5395)]
[New Thread 0x7fffdfbeb700 (LWP 5396)]
--- Software Radio Systems LTE UE ---
Reading configuration file ue_custom_1_4.conf...
Using srsLTE version 001.004.000
[New Thread 0x7fffddaed700 (LWP 5397)]
[New Thread 0x7fffdd2ec700 (LWP 5398)]
[New Thread 0x7fffdcaeb700 (LWP 5399)]
[Thread 0x7fffdcaeb700 (LWP 5399) exited]
[Thread 0x7fffdd2ec700 (LWP 5398) exited]
[New Thread 0x7fffdd2ec700 (LWP 5400)]
[New Thread 0x7fffdcaeb700 (LWP 5401)]
[Thread 0x7fffdcaeb700 (LWP 5401) exited]
[Thread 0x7fffdd2ec700 (LWP 5400) exited]
[New Thread 0x7fffdd2ec700 (LWP 5402)]
[New Thread 0x7fffdcaeb700 (LWP 5403)]
[Thread 0x7fffdcaeb700 (LWP 5403) exited]
[Thread 0x7fffdd2ec700 (LWP 5402) exited]
[New Thread 0x7fffdd2ec700 (LWP 5404)]
[New Thread 0x7fffdcaeb700 (LWP 5405)]
[Thread 0x7fffdcaeb700 (LWP 5405) exited]
[Thread 0x7fffdd2ec700 (LWP 5404) exited]
Opening USRP with args: type=b200
[New Thread 0x7fffdd2ec700 (LWP 5406)]
[New Thread 0x7fffdcaeb700 (LWP 5407)]
[Thread 0x7fffdcaeb700 (LWP 5407) exited]
[Thread 0x7fffdd2ec700 (LWP 5406) exited]
[New Thread 0x7fffdd2ec700 (LWP 5408)]
[New Thread 0x7fffdcaeb700 (LWP 5409)]
[Thread 0x7fffdcaeb700 (LWP 5409) exited]
[Thread 0x7fffdd2ec700 (LWP 5408) exited]
[New Thread 0x7fffdd2ec700 (LWP 5410)]
[New Thread 0x7fffdcaeb700 (LWP 5411)]
-- Detected Device: B210
-- Operating over USB 3.
[New Thread 0x7fffd7fff700 (LWP 5412)]
-- Initialize CODEC control...
-- Initialize Radio control...
-- Performing register loopback test... pass
-- Performing register loopback test... pass
-- Performing CODEC loopback test... pass
-- Performing CODEC loopback test... pass
-- Setting master clock rate selection to 'automatic'.
-- Asking for clock rate 16.000000 MHz...
-- Actually got clock rate 16.000000 MHz.
-- Performing timer loopback test... pass
-- Performing timer loopback test... pass
-- Asking for clock rate 32.000000 MHz...
-- Actually got clock rate 32.000000 MHz.
-- Performing timer loopback test... pass
-- Performing timer loopback test... pass
[New Thread 0x7fffd77fe700 (LWP 5413)]
[New Thread 0x7fffd6ffd700 (LWP 5414)]
[New Thread 0x7fffd67fc700 (LWP 5415)]
Setting frequency: DL=375.0 Mhz, UL=325.0 MHz
[New Thread 0x7fffd5693700 (LWP 5416)]
[New Thread 0x7fffd4e92700 (LWP 5417)]
[New Thread 0x7fffcbfff700 (LWP 5418)]
Searching for cell...
Found CELL ID: 1 CP: Normal , CFO: 0.3 KHz.
Trying to decode MIB...
- Cell ID: 1
- Nof ports: 1
- CP: Normal
- PRB: 6
- PHICH Length: Normal
- PHICH Resources: 1
- SFN: 0
MIB received BW=1.4 MHz
[New Thread 0x7fffcb7fe700 (LWP 5419)]
Initializating cell configuration...
Setting Sampling frequency 1.92 MHz
Setting TX/RX offset 54 samples, 28.12 us
SIB1 received, CellID=257, PLMN Id: MCC 1 MNC 1
SIB2 received
[Thread 0x7fffcb7fe700 (LWP 5419) exited]
Random Access Transmission: seq=1, ra-rnti=10
Random Access Complete. c-rnti=257, ta=10
RRC Connected
Network attach successful. IP: 192.168.2.2
[New Thread 0x7fffcaffd700 (LWP 5421)]
RRC Connection released.
Random Access Transmission: seq=1, ra-rnti=10
Random Access Complete. c-rnti=258, ta=9
RRC Connected
RRC Connection released.
Random Access Transmission: seq=1, ra-rnti=10
Random Access Complete. c-rnti=259, ta=9
RRC Connected
Invalid Frequency Hopping parameters. Offset: 2, n_prb_1: 0
Invalid Frequency Hopping parameters. Offset: 2, n_prb_1: 0
Invalid Frequency Hopping parameters. Offset: 2, n_prb_1: 0
RRC Connection released.
Thread 21 "ue" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd6ffd700 (LWP 5414)]
srslte_ue_dl_find_ul_dci (q=0x7ffff7f47240, cfi=0, sf_idx=<optimized out>,
rnti=<optimized out>, dci_msg=0x7fffd6ffc8a0)
at /home/gird/srsLTE/srslte/lib/ue/ue_dl.c:399
399 current_ss->format = SRSLTE_DCI_FORMAT0;
(gdb) bt full
#0 srslte_ue_dl_find_ul_dci (q=0x7ffff7f47240, cfi=0, sf_idx=<optimized out>,
rnti=<optimized out>, dci_msg=0x7fffd6ffc8a0)
at /home/gird/srsLTE/srslte/lib/ue/ue_dl.c:399
search_space = {format = 3607086992, loc = {{L = 32767,
ncce = 3607086832}, {L = 32767, ncce = 4764800}, {L = 0,
ncce = 4007973720}, {L = 32767, ncce = 3607086992}, {L = 32767,
ncce = 3607086848}, {L = 32767, ncce = 4764800}, {L = 0,
ncce = 4979944}, {L = 0, ncce = 4294967295}, {L = 4294967295,
ncce = 3099525120}, {L = 32767, ncce = 3607086824}, {L = 32767,
ncce = 3607086864}, {L = 32767, ncce = 42}, {L = 0, ncce = 53}, {
L = 0, ncce = 2952822816}, {L = 32767, ncce = 42}, {L = 0,
ncce = 42}, {L = 0, ncce = 3220809265}, {L = 1041867344,
ncce = 2952818992}, {L = 32767, ncce = 53}, {L = 0, ncce = 53}, {
L = 0, ncce = 4058310079}, {L = 32767, ncce = 1686670400}},
nof_locations = 3214023806}
current_ss = 0x872ff7f54a54
#1 0x00000000004b003c in srsue::phch_worker::decode_pdcch_ul(srsue::mac_interface_phy::mac_grant_t*) ()
No symbol table info available.
#2 0x00000000004b591a in srsue::phch_worker::work_imp() ()
No symbol table info available.
#3 0x00000000004e5031 in srslte::thread_pool::worker::run_thread() [clone .localalias.78] ()
No symbol table info available.
#4 0x000000000046bdc9 in thread::thread_function_entry(void*) ()
No symbol table info available.
#5 0x00007ffff7bc16fa in start_thread (arg=0x7fffd6ffd700)
at pthread_create.c:333
__res = <optimized out>
pd = 0x7fffd6ffd700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736800478976,
7185695482018004286, 1, 140737488346351, 140736800479680,
140737201361408, -7185746060066685634, -7185677373655352002},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#6 0x00007ffff35fab5d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)