[LITMUS^RT] Hit crash in C-EDF (configured as partitioned)
Glenn Elliott
gelliott at cs.unc.edu
Tue Jan 29 14:02:06 CET 2013
On Jan 29, 2013, at 6:24 AM, Björn Brandenburg <bbb at mpi-sws.org> wrote:
>
> On Jan 29, 2013, at 12:17 PM, Björn Brandenburg <bbb at mpi-sws.org> wrote:
>
>> On Jan 29, 2013, at 3:44 AM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
>>
>>> With the ECRTS crunch, I don't expect anyone to take a look at this, but I think I may have hit a bug with C-EDF when cluster size = 1 (partitioned). It is a pretty strange crash. Waking from sys_wait_for_ts_release() seems to lead to a crash in unlink(). The code is trying to remove the bheap node of the waiting task from the ready queue. The check on is_queued() is successful, but the remove() call fails in a NULL pointer dereference.
>>
>> I've hit the same bug. It's triggered by arming the timer for a job twice. I have a fix, but it doesn't apply cleanly to the mainline plugin (I've ripped out the rt_domain code for my version). I'll see if I can quickly extract an experimental patch.
>
> I pushed a patch with the gist of the fix to wip-cedf-timer-bug.
>
> NB: completely untested, but seems to compile…
>
> - Björn
>
> commit 4b4959670aafe17685d484b65ab2dc683dfb4740
> Author: Bjoern Brandenburg <bbb at mpi-sws.org>
> Date: Tue Jan 29 12:21:22 2013 +0100
>
> C-EDF: don't arm timers from requeue
>
> to avoid races / double arming of timers
>
> Conflicts:
>
> litmus/sched_cedf.c
>
> diff --git a/litmus/sched_cedf.c b/litmus/sched_cedf.c
> index b45b46f..89996cd 100644
> --- a/litmus/sched_cedf.c
> +++ b/litmus/sched_cedf.c
> @@ -256,10 +256,8 @@ static noinline void requeue(struct task_struct* task)
>
> if (is_released(task, litmus_clock()))
> __add_ready(&cluster->domain, task);
> - else {
> - /* it has got to wait */
> - add_release(&cluster->domain, task);
> - }
> + else
> + TRACE_TASK(task, "not requeueing not-yet-released job\n");
> }
>
> #ifdef CONFIG_SCHED_CPU_AFFINITY
> @@ -343,6 +341,9 @@ static void cedf_release_jobs(rt_domain_t* rt, struct bheap* tasks)
> /* caller holds cedf_lock */
> static noinline void job_completion(struct task_struct *t, int forced)
> {
> + cedf_domain_t *cluster = task_cpu_cluster(t);
> + lt_t now;
> +
> BUG_ON(!t);
>
> sched_trace_task_completion(t, forced);
> @@ -353,14 +354,20 @@ static noinline void job_completion(struct task_struct *t, int forced)
> tsk_rt(t)->completed = 1;
> /* prepare for next period */
> prepare_for_next_period(t);
> - if (is_released(t, litmus_clock()))
> + now = litmus_clock();
> + if (is_released(t, now))
> sched_trace_task_release(t);
> /* unlink */
> unlink(t);
> /* requeue
> * But don't requeue a blocking task. */
> - if (is_running(t))
> - cedf_job_arrival(t);
> + tsk_rt(t)->completed = 0;
> + if (is_running(t)) {
> + if (!is_released(t, now))
> + add_release(&cluster->domain, t);
> + else
> + cedf_job_arrival(t);
> + }
> }
>
> /* cedf_tick - this function is called for every local timer
Thanks Björn! I'll give it a try this morning. Can you say why I was only hitting this bug in a C-EDF partitioned case? Did it only make this race condition more likely?
-Glenn
More information about the litmus-dev
mailing list