[LITMUS^RT] A task_exit()/litmus_schedule() race condition.
Björn Brandenburg
bbb at mpi-sws.org
Wed Apr 24 07:53:51 CEST 2013
On Apr 24, 2013, at 5:50 AM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
> CPU 0: sched_litmus.c::litmus_schedule: Task A is selected to run by litmus->schedule().
> CPU 0: sched_litmus.c::litmus_schedule: Task A must migrate from CPU 2 to CPU 0, so it starts a song and dance with Linux rq locks.
> CPU 0: sched_litmus.c::litmus_schedule: The rq lock for CPU 0 is dropped so that the double_rq_lock() can be requested.
>
> CPU 1: A different thread forces Task A to exit Litmus by calling "sched_setscheduler_nocheck(Task A, SCHED_NORMAL)."
> CPU 1: cedf_task_exit() is called on Task A. (NOTE: Task A did not make this call itself.)
This is unfortunately yet another untested case. In the past, we didn't force tasks out of the real-time class, so this code path was never fully fleshed out.
> CPU 1: Task A exits Litmus.
>
> CPU 0: The double_rq_lock is acquired.
> CPU 0: The state of Task A has changed unexpectedly. It is running, but it's no longer real-time. This triggers the "Bad: migration invariant FAILED: rt = 0, running = 1" message.
This warning exists to indicate that the migration code doesn't know how to handle this properly. You'll need to patch up the task state to deal with the unexpected exit.
> CPU 0: litmus_schedule() returns NULL instead of Task A.
>
> This sequence doesn't look too bad, but something about it leaves the run queues in a corrupted state. The warning at kernel/sched_fair.c:1269 gets raised on the CPU Task A was migrating *away from* when CPU 2 picks its next task to schedule [sched_fair.c::hrtick_start_fair()::WARN_ON(task_rq(p) != rq);].
This invariant is central to the Linux scheduler. The trace afterwards likely contains bogus warnings and errors resulting from this inconsistency.
> A few more warnings by CPU 2, before a NULL pointer dereference on CPU 0 causes a crash.
>
> Can anyone with more experience with the Linux run queues offer more insights? Does litmus_schedule() need to do more to complete Task A's migration before returning NULL? Can you say what exactly is corrupted by a failed litmus_schedule() call?
litmus_schedule() returning NULL is not the problem. The problem is the abandoned migration attempt, which has already been recorded in Linux's data structures by the time the bailout happens. Try moving the bailout code before the line that records the migration (in litmus/sched_litmus.c):
129: set_task_cpu(next, smp_processor_id());
This should prevent the run queue invariant from becoming invalid.
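
To illustrate why the ordering matters, here is a minimal user-space sketch. It is a toy model, not the actual code in litmus/sched_litmus.c: every toy_* name is made up for illustration, the "rt && running" check stands in for the migration invariant from the warning message, and the assumption that the current code records the CPU before it reaches the bailout is inferred from the suggestion above. It is single-threaded and does not model the rq locks at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-ins for struct rq and struct task_struct. */
struct toy_rq { int cpu; };

struct toy_task {
    int  cpu;          /* CPU whose run queue the task is recorded on */
    bool is_realtime;  /* false once the task was forced out of LITMUS^RT */
    bool running;      /* stand-in for state == TASK_RUNNING */
};

static struct toy_rq toy_rqs[TOY_NR_CPUS] = { {0}, {1}, {2}, {3} };

/* Toy analogue of task_rq(p): the run queue the task is recorded on. */
static struct toy_rq *toy_task_rq(const struct toy_task *p)
{
    assert(p->cpu >= 0 && p->cpu < TOY_NR_CPUS);
    return &toy_rqs[p->cpu];
}

/* Toy analogue of the migration invariant: rt = 1 and running = 1. */
static bool toy_migration_invariant_ok(const struct toy_task *p)
{
    return p->is_realtime && p->running;
}

/*
 * Assumed problematic ordering: the migration is recorded first, and the
 * bailout only happens afterwards. On bailout the task stays recorded on
 * this_cpu even though it was never actually moved there.
 */
static struct toy_task *toy_pick_record_first(struct toy_task *next,
                                              int this_cpu)
{
    next->cpu = this_cpu;                 /* like set_task_cpu(next, ...) */
    if (!toy_migration_invariant_ok(next))
        return NULL;                      /* bail out, record is now stale */
    return next;
}

/*
 * Suggested ordering: bail out before recording anything, so an abandoned
 * migration leaves the task recorded on its original CPU.
 */
static struct toy_task *toy_pick_check_first(struct toy_task *next,
                                             int this_cpu)
{
    if (!toy_migration_invariant_ok(next))
        return NULL;                      /* bail out, nothing was recorded */
    next->cpu = this_cpu;                 /* like set_task_cpu(next, ...) */
    return next;
}
```

With a task that sits on CPU 2 and has already left the real-time class, toy_pick_record_first(&task, 0) returns NULL but leaves toy_task_rq(&task) pointing at CPU 0's run queue, which is the toy equivalent of task_rq(p) != rq on CPU 2. toy_pick_check_first(&task, 0) also returns NULL, but the task stays recorded on CPU 2. In the real code the check would of course have to be made while holding the appropriate rq locks.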
- Björn