[LITMUS^RT] Hit crash in C-EDF (configured as partitioned)

Glenn Elliott gelliott at cs.unc.edu
Tue Jan 29 03:44:54 CET 2013


With the ECRTS crunch, I don't expect anyone to take a look at this right away, but I think I may have hit a bug in C-EDF when cluster size = 1 (partitioned).  It is a pretty strange crash.  Waking from sys_wait_for_ts_release() seems to lead to a crash in unlink(): the code is trying to remove the waiting task's bheap node from the ready queue, the is_queued() check succeeds, but the remove() call then dies on a NULL pointer dereference.

Here's a stack trace:
[  334.861983] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  334.862740] IP: [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[  334.863040] PGD 98d291067 PUD 9979b8067 PMD 0 
[  334.863408] Oops: 0000 [#1] PREEMPT SMP 
[  334.863771] CPU 2 
[  334.863814] Modules linked in: nvidia(P) kvm parport_pc parport
[  334.864499] 
[  334.864747] Pid: 5263, comm: klt_tracker Tainted: P        W   3.0.0-litmus-2012.3-gpu-dbg #104 Tyan FT72-B7015/S7015
[  334.865337] RIP: 0010:[<ffffffff811cc417>]  [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[  334.865847] RSP: 0018:ffff881277e2bc18  EFLAGS: 00010007
[  334.866108] RAX: 0000000000000000 RBX: ffff880997ab6bd0 RCX: ffff8809bfc80000
[  334.866372] RDX: 0000000000000000 RSI: ffff88098b6d10f8 RDI: ffffffff811c9bce
[  334.866637] RBP: ffff881277e2bc28 R08: ffffffff81820730 R09: 000000000000006a
[  334.866902] R10: 0000000000000000 R11: ffff8809bfc90d20 R12: ffff881272cb39d0
[  334.867166] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  334.867431] FS:  00007f3fce9a0980(0000) GS:ffff8809bfc80000(0000) knlGS:0000000000000000
[  334.867903] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  334.868163] CR2: 0000000000000008 CR3: 0000000996f2e000 CR4: 00000000000006e0
[  334.868427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  334.868692] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  334.868957] Process klt_tracker (pid: 5263, threadinfo ffff881277e2a000, task ffff881272cb39d0)
[  334.869429] Stack:
[  334.869677]  ffffffff81820730 ffff881272cb39d0 ffff881277e2bc38 ffffffff811d50ed
[  334.870294]  ffff881277e2bc78 ffffffff811d6ba5 ffff881272cb3d00 000000000000148f
[  334.870911]  ffff880900000002 0000004e176c6a6e ffff881272cb39d0 ffff8809bfc90cc0
[  334.871529] Call Trace:
[  334.871781]  [<ffffffff811d50ed>] unlink+0x75/0x77
[  334.872040]  [<ffffffff811d6ba5>] job_completion.isra.9+0xaa/0xd9
[  334.872303]  [<ffffffff811d7107>] cedf_schedule+0x533/0x750
[  334.872567]  [<ffffffff81026ca6>] pick_next_task_litmus+0x3c/0x4b3
[  334.872831]  [<ffffffff810079bd>] ? native_sched_clock+0x3c/0x69
[  334.873093]  [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b
[  334.873354]  [<ffffffff81049490>] ? sched_clock_cpu+0x43/0xcf
[  334.873618]  [<ffffffff8141155d>] schedule+0x410/0x913
[  334.873878]  [<ffffffff81027dc8>] ? get_parent_ip+0xf/0x40
[  334.874139]  [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b
[  334.874398]  [<ffffffff81027e77>] ? sub_preempt_count+0x7e/0xa7
[  334.874661]  [<ffffffff81413248>] ? _raw_spin_unlock_irq+0x11/0x2c
[  334.874924]  [<ffffffff81410ec6>] ? wait_for_common+0x136/0x148
[  334.875186]  [<ffffffff8102ab5e>] ? try_to_wake_up+0x36e/0x36e
[  334.875448]  [<ffffffff811c8dea>] complete_job+0x19/0x1f
[  334.875709]  [<ffffffff811c8efd>] sys_wait_for_ts_release+0xd1/0x132
[  334.875974]  [<ffffffff81413ebb>] system_call_fastpath+0x16/0x1b
[  334.876234] Code: 8b 53 28 48 89 02 48 8b 50 28 48 8b 4b 28 48 89 48 28 48 89 53 28 48 89 c3 48 8b 00 48 85 c0 75 c7 48 8b 06 31 d2 eb 07 48 89 c2 
[  334.878092]  8b 40 08 48 39 d8 75 f4 48 85 d2 48 8b 43 08 74 06 48 89 42 
[  334.879362] RIP  [<ffffffff811cc417>] bheap_delete+0x5c/0xbf
[  334.879660]  RSP <ffff881277e2bc18>
[  334.879912] CR2: 0000000000000008

Looking at the RIP address in GDB:
(gdb) l *0xffffffff811cc417
0xffffffff811cc417 is in bheap_delete (litmus/bheap.c:279).
274			 * first find prev */
275			prev = NULL;
276			pos  = heap->head;
277			while (pos != node) {
278				prev = pos;
279				pos  = pos->next;
280			}
281			/* we have prev, now remove node */
282			if (prev)
283				prev->next = node->next;
(gdb) 
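
The faulting address actually lines up with that loop, if I've got the bheap_node layout right (I'm going from memory on litmus/include/litmus/bheap.h, so treat the field order as an assumption): next is the second pointer in the struct, so on x86-64 it sits at offset 0x8.  If the node isn't actually on the chain, pos walks past the last element to NULL, and the next iteration's pos->next read is a load from address 0x8 -- exactly the CR2 in the oops.  A throwaway userspace check:

#include <stddef.h>
#include <stdio.h>

/* My recollection of struct bheap_node from litmus/include/litmus/bheap.h;
 * the field order is an assumption, not a verified quote. */
struct bheap_node {
	struct bheap_node *parent;	/* offset 0x00 on x86-64 */
	struct bheap_node *next;	/* offset 0x08 */
	struct bheap_node *child;	/* offset 0x10 */
	unsigned int degree;
	void *value;
	struct bheap_node **ref;
};

int main(void)
{
	/* With pos == NULL, 'pos = pos->next' loads from NULL + 0x8,
	 * matching the CR2 value reported in the oops above. */
	printf("offsetof(struct bheap_node, next) = %#zx\n",
	       offsetof(struct bheap_node, next));
	return 0;
}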

The task is waking up from a synchronous release, so I don't know why sched_cedf.c::unlink()::is_queued() would evaluate to 'true'.  The task should have been unlink()'ed when it blocked on the Linux completion in do_wait_for_release(), right?  I haven't run into this bug with a cluster size of six, and I haven't spotted any code that obviously breaks when we assume a cluster size of one.
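
For anyone who looks at this later, here is the shape of the inconsistency I suspect, reduced to a userspace toy.  The helpers are paraphrased from memory from litmus/bheap.h (names and details are approximate): is_queued() only tests whether the node's degree field has been reset to NOT_IN_HEAP.  If the synchronous-release path ever leaves degree set after the task blocks on the completion, unlink() believes the task is still queued and remove() goes looking for a node that isn't on the chain.

#include <limits.h>
#include <stdio.h>

/* Paraphrased from memory: I believe bheap.h defines NOT_IN_HEAP as
 * UINT_MAX and bheap_node_in_heap() just compares against it. */
#define NOT_IN_HEAP UINT_MAX

struct bheap_node {
	struct bheap_node *next;	/* reduced: the real struct has more fields */
	unsigned int degree;
};

static int bheap_node_in_heap(struct bheap_node *h)
{
	return h->degree != NOT_IN_HEAP;	/* all that is_queued() checks */
}

int main(void)
{
	/* Suppose the node was dequeued (it is on no list) but degree was
	 * never reset -- the stale state I suspect: */
	struct bheap_node n = { .next = NULL, .degree = 0 };

	/* This reports 1, so unlink() would call remove(), and
	 * bheap_delete()'s linear search then walks a chain that does not
	 * contain the node, ending in the NULL deref above. */
	printf("looks queued? %d\n", bheap_node_in_heap(&n));
	return 0;
}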

-Glenn