[LITMUS^RT] Hit crash in C-EDF (configured as partitioned)
Glenn Elliott
gelliott at cs.unc.edu
Tue Jan 29 03:44:54 CET 2013
With the ECRTS crunch I don't expect anyone to take a look at this right away, but I think I may have hit a bug in C-EDF when the cluster size is 1 (i.e., configured as partitioned). It's a pretty strange crash: waking from sys_wait_for_ts_release() leads to a crash in unlink(), where the code is trying to remove the waking task's bheap node from the ready queue. The is_queued() check succeeds, but the remove() call then dies with a NULL pointer dereference.
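For context, here is the path in question as I understand it (paraphrased from memory from sched_cedf.c and lightly simplified, so don't treat it as a verbatim quote):

    static noinline void unlink(struct task_struct *t)
    {
            cpu_entry_t *entry;

            if (t->rt_param.linked_on != NO_CPU) {
                    /* t is linked to a CPU: sever that link */
                    entry = &per_cpu(cedf_cpu_entries, t->rt_param.linked_on);
                    t->rt_param.linked_on = NO_CPU;
                    link_task_to_cpu(NULL, entry);
            } else if (is_queued(t)) {
                    /* t claims to sit in its cluster's ready queue:
                     * pull its bheap node out. This is the branch
                     * that blows up. */
                    remove(&task_cpu_cluster(t)->domain, t);
            }
    }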
Here's a stack trace:
[ 334.861983] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 334.862740] IP: [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[ 334.863040] PGD 98d291067 PUD 9979b8067 PMD 0
[ 334.863408] Oops: 0000 [#1] PREEMPT SMP
[ 334.863771] CPU 2
[ 334.863814] Modules linked in: nvidia(P) kvm parport_pc parport
[ 334.864499]
[ 334.864747] Pid: 5263, comm: klt_tracker Tainted: P W 3.0.0-litmus-2012.3-gpu-dbg #104 Tyan FT72-B7015/S7015
[ 334.865337] RIP: 0010:[<ffffffff811cc417>] [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[ 334.865847] RSP: 0018:ffff881277e2bc18 EFLAGS: 00010007
[ 334.866108] RAX: 0000000000000000 RBX: ffff880997ab6bd0 RCX: ffff8809bfc80000
[ 334.866372] RDX: 0000000000000000 RSI: ffff88098b6d10f8 RDI: ffffffff811c9bce
[ 334.866637] RBP: ffff881277e2bc28 R08: ffffffff81820730 R09: 000000000000006a
[ 334.866902] R10: 0000000000000000 R11: ffff8809bfc90d20 R12: ffff881272cb39d0
[ 334.867166] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 334.867431] FS: 00007f3fce9a0980(0000) GS:ffff8809bfc80000(0000) knlGS:0000000000000000
[ 334.867903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334.868163] CR2: 0000000000000008 CR3: 0000000996f2e000 CR4: 00000000000006e0
[ 334.868427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 334.868692] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 334.868957] Process klt_tracker (pid: 5263, threadinfo ffff881277e2a000, task ffff881272cb39d0)
[ 334.869429] Stack:
[ 334.869677] ffffffff81820730 ffff881272cb39d0 ffff881277e2bc38 ffffffff811d50ed
[ 334.870294] ffff881277e2bc78 ffffffff811d6ba5 ffff881272cb3d00 000000000000148f
[ 334.870911] ffff880900000002 0000004e176c6a6e ffff881272cb39d0 ffff8809bfc90cc0
[ 334.871529] Call Trace:
[ 334.871781] [<ffffffff811d50ed>] unlink+0x75/0x77
[ 334.872040] [<ffffffff811d6ba5>] job_completion.isra.9+0xaa/0xd9
[ 334.872303] [<ffffffff811d7107>] cedf_schedule+0x533/0x750
[ 334.872567] [<ffffffff81026ca6>] pick_next_task_litmus+0x3c/0x4b3
[ 334.872831] [<ffffffff810079bd>] ? native_sched_clock+0x3c/0x69
[ 334.873093] [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b
[ 334.873354] [<ffffffff81049490>] ? sched_clock_cpu+0x43/0xcf
[ 334.873618] [<ffffffff8141155d>] schedule+0x410/0x913
[ 334.873878] [<ffffffff81027dc8>] ? get_parent_ip+0xf/0x40
[ 334.874139] [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b
[ 334.874398] [<ffffffff81027e77>] ? sub_preempt_count+0x7e/0xa7
[ 334.874661] [<ffffffff81413248>] ? _raw_spin_unlock_irq+0x11/0x2c
[ 334.874924] [<ffffffff81410ec6>] ? wait_for_common+0x136/0x148
[ 334.875186] [<ffffffff8102ab5e>] ? try_to_wake_up+0x36e/0x36e
[ 334.875448] [<ffffffff811c8dea>] complete_job+0x19/0x1f
[ 334.875709] [<ffffffff811c8efd>] sys_wait_for_ts_release+0xd1/0x132
[ 334.875974] [<ffffffff81413ebb>] system_call_fastpath+0x16/0x1b
[ 334.876234] Code: 8b 53 28 48 89 02 48 8b 50 28 48 8b 4b 28 48 89 48 28 48 89 53 28 48 89 c3 48 8b 00 48 85 c0 75 c7 48 8b 06 31 d2 eb 07 48 89 c2
[ 334.878092] 8b 40 08 48 39 d8 75 f4 48 85 d2 48 8b 43 08 74 06 48 89 42
[ 334.879362] RIP [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[ 334.879660] RSP <ffff881277e2bc18>
[ 334.879912] CR2: 0000000000000008
Looking at the RIP address in GDB:
(gdb) l *0xffffffff811cc417
0xffffffff811cc417 is in bheap_delete (litmus/bheap.c:279).
274 * first find prev */
275 prev = NULL;
276 pos = heap->head;
277 while (pos != node) {
278 prev = pos;
279 pos = pos->next;
280 }
281 /* we have prev, now remove node */
282 if (prev)
283 prev->next = node->next;
(gdb)
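Unless I'm misreading that loop, the only way to fault at address 0x8 there is for the node to not be on the heap's head list at all: pos walks off the end of the list, becomes NULL, and pos->next then reads offset 8 of a NULL pointer (next is the second pointer in struct bheap_node, which matches CR2 = 0x8 above). Here's a quick user-space sketch of that failure mode, with the node layout simplified from litmus/bheap.h and a NULL guard added so it reports the inconsistency instead of segfaulting:

    #include <stdio.h>

    /* Layout simplified from litmus/bheap.h: parent at offset 0, next at
     * offset 8 on x86-64 -- exactly the faulting address in the oops. */
    struct bheap_node {
            struct bheap_node *parent;
            struct bheap_node *next;
            struct bheap_node *child;
    };

    struct bheap {
            struct bheap_node *head;
    };

    int main(void)
    {
            struct bheap_node on_heap = { NULL, NULL, NULL };
            struct bheap_node stale   = { NULL, NULL, NULL };
            struct bheap heap = { .head = &on_heap };

            /* 'stale' believes it is queued but is not on this heap's
             * list, which is the state the crash implies. */
            struct bheap_node *node = &stale;
            struct bheap_node *prev = NULL;
            struct bheap_node *pos  = heap.head;

            /* Same scan as bheap.c:277-279, plus a NULL guard. */
            while (pos && pos != node) {
                    prev = pos;
                    pos = pos->next;
            }
            if (!pos)
                    printf("node not found: the kernel loop would now "
                           "dereference NULL->next, i.e. address 0x8\n");
            (void) prev;
            return 0;
    }

So my guess is that either the task's heap_node "queued" state is stale, or the node is actually sitting in some other heap (the release queue, or another cluster's ready queue) than the one unlink() is scanning.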
The task is waking up from a synchronous release, so I don't see why the is_queued() check in sched_cedf.c's unlink() would evaluate to true. The task should already have been unlink()'ed when it blocked on the Linux completion in do_wait_for_ts_release(), right? I haven't run into this bug with a cluster size of six, and I haven't spotted any code that should break when the cluster size is one.
-Glenn