[LITMUS^RT] Hit crash in C-EDF (configured as partitioned)

Tue Jan 29 14:02:06 CET 2013

On Jan 29, 2013, at 6:24 AM, Björn Brandenburg <bbb at mpi-sws.org> wrote:

> 
> On Jan 29, 2013, at 12:17 PM, Björn Brandenburg <bbb at mpi-sws.org> wrote:
> 
>> On Jan 29, 2013, at 3:44 AM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
>> 
>>> With the ECRTS crunch, I don't expect anyone to take a look at this, but I think I may have hit a bug with C-EDF when cluster size = 1 (partitioned).  It is a pretty strange crash.  Waking from sys_wait_for_ts_release() seems to lead to a crash in unlink().  The code is trying to remove the bheap node of the waiting task from the ready queue.  The check on is_queued() is successful, but the remove() call fails in a NULL pointer dereference.
>> 
>> I've hit the same bug.  It's triggered by arming the timer for a job twice. I have a fix, but it doesn't apply cleanly to the mainline plugin (I've ripped out the rt_domain code for my version). I'll see if I can quickly extract an experimental patch.
> 
> I pushed a patch with the gist of the fix to wip-cedf-timer-bug.
> 
> NB: completely untested, but seems to compile…
> 
> - Björn
> 
> commit 4b4959670aafe17685d484b65ab2dc683dfb4740
> Author: Bjoern Brandenburg <bbb at mpi-sws.org>
> Date:   Tue Jan 29 12:21:22 2013 +0100
> 
>    C-EDF: don't arm timers from requeue
> 
>    to avoid races / double arming of timers
> 
>    Conflicts:
> 
>        litmus/sched_cedf.c
> 
> diff --git a/litmus/sched_cedf.c b/litmus/sched_cedf.c
> index b45b46f..89996cd 100644
> --- a/litmus/sched_cedf.c
> +++ b/litmus/sched_cedf.c
> @@ -256,10 +256,8 @@ static noinline void requeue(struct task_struct* task)
> 
>        if (is_released(task, litmus_clock()))
>                __add_ready(&cluster->domain, task);
> -       else {
> -               /* it has got to wait */
> -               add_release(&cluster->domain, task);
> -       }
> +       else
> +               TRACE_TASK(task, "not requeueing not-yet-released job\n");
> }
> 
> #ifdef CONFIG_SCHED_CPU_AFFINITY
> @@ -343,6 +341,9 @@ static void cedf_release_jobs(rt_domain_t* rt, struct bheap* tasks)
> /* caller holds cedf_lock */
> static noinline void job_completion(struct task_struct *t, int forced)
> {
> +       cedf_domain_t *cluster = task_cpu_cluster(t);
> +       lt_t now;
> +
>        BUG_ON(!t);
> 
>        sched_trace_task_completion(t, forced);
> @@ -353,14 +354,20 @@ static noinline void job_completion(struct task_struct *t, int forced)
>        tsk_rt(t)->completed = 1;
>        /* prepare for next period */
>        prepare_for_next_period(t);
> -       if (is_released(t, litmus_clock()))
> +       now = litmus_clock();
> +       if (is_released(t, now))
>                sched_trace_task_release(t);
>        /* unlink */
>        unlink(t);
>        /* requeue
>         * But don't requeue a blocking task. */
> -       if (is_running(t))
> -               cedf_job_arrival(t);
> +       tsk_rt(t)->completed = 0;
> +       if (is_running(t)) {
> +               if (!is_released(t, now))
> +                       add_release(&cluster->domain, t);
> +               else
> +                       cedf_job_arrival(t);
> +       }
> }
> 
> /* cedf_tick - this function is called for every local timer

Thanks Björn!  I'll give it a try this morning.  Can you say why I was only hitting this bug in a C-EDF partitioned case?  Did it only make this race condition more likely?

-Glenn