[LITMUS^RT] Hit crash in C-EDF (configured as partitioned)

Tue Jan 29 12:24:37 CET 2013

On Jan 29, 2013, at 12:17 PM, Björn Brandenburg <bbb at mpi-sws.org> wrote:

> On Jan 29, 2013, at 3:44 AM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
> 
>> With the ECRTS crunch, I don't expect anyone to take a look at this, but I think I may have hit a bug with C-EDF when cluster size = 1 (partitioned).  It is a pretty strange crash.  Waking from sys_wait_for_ts_release() seems to lead to a crash in unlink().  The code is trying to remove the bheap node of the waiting task from the ready queue.  The check on is_queued() is successful, but the remove() call fails in a NULL pointer dereference.
> 
> I've hit the same bug.  It's triggered by arming the timer for a job twice. I have a fix, but it doesn't apply cleanly to the mainline plugin (I've ripped out the rt_domain code for my version). I'll see if I can quickly extract an experimental patch.

I pushed a patch with the gist of the fix to wip-cedf-timer-bug.

NB: completely untested, but it seems to compile.

- Björn
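
To make the intent of the patch below easier to follow, here is a minimal userspace sketch of the queueing decisions after the change. It is not LITMUS^RT code. All model_* names are hypothetical, and the sketch assumes that is_released() means "release time <= now" and that cedf_job_arrival() ends up calling requeue(). It only models which queue a job would be handed to, not the timers or the rt_domain itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for lt_t. */
typedef uint64_t model_time_t;

/* What the caller would do with the job (hypothetical names). */
enum model_action {
    MODEL_SKIP,        /* leave the job alone */
    MODEL_ADD_READY,   /* stands in for __add_ready() */
    MODEL_ADD_RELEASE  /* stands in for add_release(), which arms the release timer */
};

/* Hypothetical, stripped-down view of a task's current job. */
struct model_job {
    model_time_t release; /* release time of the job */
    bool running;         /* stands in for is_running(t): task is not blocked */
};

/* Assumption: a job counts as released once its release time has passed. */
bool model_is_released(const struct model_job *j, model_time_t now)
{
    return j->release <= now;
}

/* requeue() after the patch: it never touches the release queue. */
enum model_action model_requeue(const struct model_job *j, model_time_t now)
{
    assert(j != NULL);
    return model_is_released(j, now) ? MODEL_ADD_READY : MODEL_SKIP;
}

/* job_completion() after the patch: the only place in this model that
 * hands a not-yet-released job to the release queue. */
enum model_action model_job_completion(const struct model_job *j, model_time_t now)
{
    assert(j != NULL); /* mirrors BUG_ON(!t) */
    if (!j->running)
        return MODEL_SKIP;        /* don't requeue a blocking task */
    if (!model_is_released(j, now))
        return MODEL_ADD_RELEASE; /* wait for the release */
    return model_requeue(j, now); /* modeled cedf_job_arrival() path */
}
```

In this model a job whose release lies in the future reaches the release queue from exactly one call site, which is the property the commit message describes as avoiding double arming of timers.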

commit 4b4959670aafe17685d484b65ab2dc683dfb4740
Author: Bjoern Brandenburg <bbb at mpi-sws.org>
Date:   Tue Jan 29 12:21:22 2013 +0100

    C-EDF: don't arm timers from requeue
    
    to avoid races / double arming of timers
    
    Conflicts:
    
        litmus/sched_cedf.c

diff --git a/litmus/sched_cedf.c b/litmus/sched_cedf.c
index b45b46f..89996cd 100644
--- a/litmus/sched_cedf.c
+++ b/litmus/sched_cedf.c
@@ -256,10 +256,8 @@ static noinline void requeue(struct task_struct* task)
 
        if (is_released(task, litmus_clock()))
                __add_ready(&cluster->domain, task);
-       else {
-               /* it has got to wait */
-               add_release(&cluster->domain, task);
-       }
+       else
+               TRACE_TASK(task, "not requeueing not-yet-released job\n");
 }
 
 #ifdef CONFIG_SCHED_CPU_AFFINITY
@@ -343,6 +341,9 @@ static void cedf_release_jobs(rt_domain_t* rt, struct bheap* tasks)
 /* caller holds cedf_lock */
 static noinline void job_completion(struct task_struct *t, int forced)
 {
+       cedf_domain_t *cluster = task_cpu_cluster(t);
+       lt_t now;
+
        BUG_ON(!t);
 
        sched_trace_task_completion(t, forced);
@@ -353,14 +354,20 @@ static noinline void job_completion(struct task_struct *t, int forced)
        tsk_rt(t)->completed = 1;
        /* prepare for next period */
        prepare_for_next_period(t);
-       if (is_released(t, litmus_clock()))
+       now = litmus_clock();
+       if (is_released(t, now))
                sched_trace_task_release(t);
        /* unlink */
        unlink(t);
        /* requeue
         * But don't requeue a blocking task. */
-       if (is_running(t))
-               cedf_job_arrival(t);
+       tsk_rt(t)->completed = 0;
+       if (is_running(t)) {
+               if (!is_released(t, now))
+                       add_release(&cluster->domain, t);
+               else
+                       cedf_job_arrival(t);
+       }
 }
 
 /* cedf_tick - this function is called for every local timer