<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">With the ECRTS crunch, I don't expect anyone to take a look at this right away, but I think I've hit a bug in C-EDF when cluster size = 1 (partitioned). It is a pretty strange crash: waking from sys_wait_for_ts_release() seems to lead to a crash in unlink(), where the code tries to remove the waiting task's bheap node from the ready queue. The is_queued() check succeeds, but the subsequent remove() fails with a NULL pointer dereference.<div><br></div><div>Here's a stack trace:</div><div><div>[ 334.861983] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008</div><div>[ 334.862740] IP: [<ffffffff811cc417>] bheap_delete+0x5c/0xbf</div><div>[ 334.863040] PGD 98d291067 PUD 9979b8067 PMD 0 </div><div>[ 334.863408] Oops: 0000 [#1] PREEMPT SMP </div><div>[ 334.863771] CPU 2 </div><div>[ 334.863814] Modules linked in: nvidia(P) kvm parport_pc parport</div><div>[ 334.864499] </div><div>[ 334.864747] Pid: 5263, comm: klt_tracker Tainted: P W 3.0.0-litmus-2012.3-gpu-dbg #104 Tyan FT72-B7015/S7015</div><div>[ 334.865337] RIP: 0010:[<<b>ffffffff811cc417</b>>] [<ffffffff811cc417>] <b>bheap_delete</b>+0x5c/0xbf</div><div>[ 334.865847] RSP: 0018:ffff881277e2bc18 EFLAGS: 00010007</div><div>[ 334.866108] RAX: 0000000000000000 RBX: ffff880997ab6bd0 RCX: ffff8809bfc80000</div><div>[ 334.866372] RDX: 0000000000000000 RSI: ffff88098b6d10f8 RDI: ffffffff811c9bce</div><div>[ 334.866637] RBP: ffff881277e2bc28 R08: ffffffff81820730 R09: 000000000000006a</div><div>[ 334.866902] R10: 0000000000000000 R11: ffff8809bfc90d20 R12: ffff881272cb39d0</div><div>[ 334.867166] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001</div><div>[ 334.867431] FS: 00007f3fce9a0980(0000) GS:ffff8809bfc80000(0000) knlGS:0000000000000000</div><div>[ 334.867903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033</div><div>[ 334.868163] CR2: 0000000000000008 CR3: 
0000000996f2e000 CR4: 00000000000006e0</div><div>[ 334.868427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000</div><div>[ 334.868692] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400</div><div>[ 334.868957] Process klt_tracker (pid: 5263, threadinfo ffff881277e2a000, task ffff881272cb39d0)</div><div>[ 334.869429] Stack:</div><div>[ 334.869677] ffffffff81820730 ffff881272cb39d0 ffff881277e2bc38 ffffffff811d50ed</div><div>[ 334.870294] ffff881277e2bc78 ffffffff811d6ba5 ffff881272cb3d00 000000000000148f</div><div>[ 334.870911] ffff880900000002 0000004e176c6a6e ffff881272cb39d0 ffff8809bfc90cc0</div><div>[ 334.871529] Call Trace:</div><div>[ 334.871781] [<ffffffff811d50ed>] unlink+0x75/0x77</div><div>[ 334.872040] [<ffffffff811d6ba5>] job_completion.isra.9+0xaa/0xd9</div><div>[ 334.872303] [<ffffffff811d7107>] cedf_schedule+0x533/0x750</div><div>[ 334.872567] [<ffffffff81026ca6>] pick_next_task_litmus+0x3c/0x4b3</div><div>[ 334.872831] [<ffffffff810079bd>] ? native_sched_clock+0x3c/0x69</div><div>[ 334.873093] [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b</div><div>[ 334.873354] [<ffffffff81049490>] ? sched_clock_cpu+0x43/0xcf</div><div>[ 334.873618] [<ffffffff8141155d>] schedule+0x410/0x913</div><div>[ 334.873878] [<ffffffff81027dc8>] ? get_parent_ip+0xf/0x40</div><div>[ 334.874139] [<ffffffff8102c412>] ? __mmdrop+0x47/0x4b</div><div>[ 334.874398] [<ffffffff81027e77>] ? sub_preempt_count+0x7e/0xa7</div><div>[ 334.874661] [<ffffffff81413248>] ? _raw_spin_unlock_irq+0x11/0x2c</div><div>[ 334.874924] [<ffffffff81410ec6>] ? wait_for_common+0x136/0x148</div><div>[ 334.875186] [<ffffffff8102ab5e>] ? 
try_to_wake_up+0x36e/0x36e</div><div>[ 334.875448] [<ffffffff811c8dea>] complete_job+0x19/0x1f</div><div>[ 334.875709] [<ffffffff811c8efd>] sys_wait_for_ts_release+0xd1/0x132</div><div>[ 334.875974] [<ffffffff81413ebb>] system_call_fastpath+0x16/0x1b</div><div>[ 334.876234] Code: 8b 53 28 48 89 02 48 8b 50 28 48 8b 4b 28 48 89 48 28 48 89 53 28 48 89 c3 48 8b 00 48 85 c0 75 c7 48 8b 06 31 d2 eb 07 48 89 c2 </div><div>[ 334.878092] 8b 40 08 48 39 d8 75 f4 48 85 d2 48 8b 43 08 74 06 48 89 42 </div><div>[ 334.879362] RIP [<ffffffff811cc417>] bheap_delete+0x5c/0xbf</div><div>[ 334.879660] RSP <ffff881277e2bc18></div><div>[ 334.879912] CR2: 0000000000000008</div></div><div><br></div><div>Looking at the RIP address in GDB:</div><div><div>(gdb) l *<b>0xffffffff811cc417</b></div><div>0xffffffff811cc417 is in bheap_delete (litmus/bheap.c:279).</div><div>274<span class="Apple-tab-span" style="white-space:pre"> </span> * first find prev */</div><div>275<span class="Apple-tab-span" style="white-space:pre"> </span>prev = NULL;</div><div>276<span class="Apple-tab-span" style="white-space:pre"> </span>pos = heap->head;</div><div>277<span class="Apple-tab-span" style="white-space:pre"> </span>while (pos != node) {</div><div>278<span class="Apple-tab-span" style="white-space:pre"> </span>prev = pos;</div><div>279<span class="Apple-tab-span" style="white-space:pre"> </span>pos = pos->next;</div><div>280<span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div>281<span class="Apple-tab-span" style="white-space:pre"> </span>/* we have prev, now remove node */</div><div>282<span class="Apple-tab-span" style="white-space:pre"> </span>if (prev)</div><div>283<span class="Apple-tab-span" style="white-space:pre"> </span>prev->next = node->next;</div><div>(gdb) </div><div><br></div></div><div>The task is waking up from a synchronous release, so I don't know why sched_cedf.c::unlink()::is_queued() would evaluate to 'true'. 
The task should have been unlink()'ed when it blocked on the Linux completion in do_wait_for_release(), right? I haven't run into this bug with a cluster size of six, and so far I haven't found any code that obviously breaks when the cluster size is one.</div><div><br></div><div>-Glenn</div></body></html>