[LITMUS^RT] A task_exit()/litmus_schedule() race condition.

Wed Apr 24 07:53:51 CEST 2013

On Apr 24, 2013, at 5:50 AM, Glenn Elliott <gelliott at cs.unc.edu> wrote:

> CPU 0: sched_litmus.c::litmus_schedule: Task A is selected to run by litmus->schedule().
> CPU 0: sched_litmus.c::litmus_schedule: Task A must migrate from CPU 2 to CPU 0, so it starts a song and dance with Linux rq locks.
> CPU 0: sched_litmus.c::litmus_schedule: The rq lock for CPU 0 is dropped so that the double_rq_lock() can be requested.
> 
> CPU 1: A different thread forces Task A to exit Litmus by calling "sched_setscheduler_nocheck(Task A, SCHED_NORMAL)."
> CPU 1: cedf_task_exit() is called on Task A.  (NOTE: Task A did not make this call itself.)

This is unfortunately yet another untested case. In the past, we didn't force tasks out of the real-time class, so this case wasn't fully fleshed out.

> CPU 1: Task A exits Litmus.
> 
> CPU 0: The double_rq_lock is acquired.
> CPU 0: The state of Task A has changed unexpectedly.  It is running, but it's no longer real-time.  This triggers the "Bad: migration invariant FAILED: rt = 0, running = 1" message.

This warning exists to indicate that the migration code doesn't know how to handle this properly. You'll need to patch up the task state to deal with the unexpected exit.

> CPU 0: litmus_schedule() returns NULL instead of Task A.
> 
> This sequence doesn't look too bad, but something about it leaves the run queues in a corrupted state.  The warning at kernel/sched_fair.c:1269 gets raised on the CPU Task A was migrating *away from* when CPU 2 picks its next task to schedule [sched_fair.c::hrtick_start_fair()::WARN_ON(task_rq(p) != rq);].  

This invariant is central to the Linux scheduler. The trace afterwards likely contains bogus warnings and errors resulting from this inconsistency.

> A few more warnings by CPU 2, before a NULL pointer dereference on CPU 0 causes a crash.
> 
> Can anyone with more experience with the Linux run queues offer more insights?  Does litmus_schedule() need to do more to complete Task A's migration before returning NULL?  Can you say what exactly is corrupted by a failed litmus_schedule() call?

litmus_schedule() is not the problem. The problem is in the abandoned migration attempt. Try moving the bailout code before the line that records the migration in Linux's data structure (in litmus/sched_litmus.c):

129:	set_task_cpu(next, smp_processor_id());

This should prevent the run queue invariant from becoming invalid.
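
To illustrate why the ordering matters, here is a small user-space toy model, not kernel code. Everything prefixed toy_ is hypothetical: the struct, the "queued_on" field, and the assumption that the bailout fires when the task is no longer real-time or no longer runnable (which is what the "rt = 0, running = 1" message suggests). The assignment to "cpu" stands in for set_task_cpu(), and toy_rq_invariant_holds() stands in for the task_rq(p) == rq condition behind the WARN_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a task; NOT the kernel's task_struct. */
struct toy_task {
    int cpu;        /* CPU recorded for the task (what set_task_cpu() updates) */
    int queued_on;  /* CPU whose run queue actually holds the task */
    bool is_rt;     /* still a real-time task? */
    bool runnable;  /* still runnable? */
};

/* Models the run queue invariant behind WARN_ON(task_rq(p) != rq). */
static bool toy_rq_invariant_holds(const struct toy_task *t)
{
    return t->cpu == t->queued_on;
}

/* Assumed bailout condition: the task left the RT class or stopped. */
static bool toy_must_bail_out(const struct toy_task *t)
{
    return !t->is_rt || !t->runnable;
}

/* Problematic order: record the migration first, then bail out. */
static struct toy_task *toy_migrate_record_first(struct toy_task *next,
                                                 int this_cpu)
{
    assert(next != NULL);
    next->cpu = this_cpu;           /* stands in for set_task_cpu() */
    if (toy_must_bail_out(next))
        return NULL;                /* dropped, but next->cpu already changed */
    next->queued_on = this_cpu;     /* migration completes */
    return next;
}

/* Suggested order: bail out before anything is recorded. */
static struct toy_task *toy_migrate_check_first(struct toy_task *next,
                                                int this_cpu)
{
    assert(next != NULL);
    if (toy_must_bail_out(next))
        return NULL;                /* task state left untouched */
    next->cpu = this_cpu;           /* stands in for set_task_cpu() */
    next->queued_on = this_cpu;     /* migration completes */
    return next;
}
```

With a task that starts on CPU 2 and has lost its real-time status, toy_migrate_record_first() returns NULL but leaves the recorded CPU at 0 while the task is still queued on CPU 2, so the invariant is broken. toy_migrate_check_first() also returns NULL but leaves both fields at 2.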

- Björn
