[LITMUS^RT] A task_exit()/litmus_schedule() race condition.
Glenn Elliott
gelliott at cs.unc.edu
Wed Apr 24 05:50:28 CEST 2013
I have a race condition in the scheduler and I'd like to get some input:
CPU 0: sched_litmus.c::litmus_schedule: Task A is selected to run by litmus->schedule().
CPU 0: sched_litmus.c::litmus_schedule: Task A must migrate from CPU 2 to CPU 0, so it starts a song and dance with Linux rq locks.
CPU 0: sched_litmus.c::litmus_schedule: The rq lock for CPU 0 is dropped so that the double_rq_lock() can be requested.
CPU 1: A different thread forces Task A to exit Litmus by calling "sched_setscheduler_nocheck(Task A, SCHED_NORMAL)".
CPU 1: cedf_task_exit() is called on Task A. (NOTE: Task A did not make this call itself.)
CPU 1: Task A exits Litmus.
CPU 0: The double_rq_lock is acquired.
CPU 0: The state of Task A has changed unexpectedly. It is running, but it's no longer real-time. This triggers the "Bad: migration invariant FAILED: rt = 0, running = 1" message.
CPU 0: litmus_schedule() returns NULL instead of Task A.
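To make the interleaving concrete, here is a small userspace toy model of the sequence above. It is not kernel code and not the actual litmus_schedule() logic: struct toy_task, toy_force_exit(), and toy_schedule() are hypothetical names invented for this sketch, the rq locks are represented only by comments, and the "other CPU" is modelled as a callback that runs in the window where the local rq lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a task; only the fields the race touches. */
struct toy_task {
    bool is_realtime; /* still managed by the real-time plugin? */
    bool running;     /* currently executing on some CPU? */
    int cpu;          /* CPU whose run queue the task sits on */
};

/* Hook type: work done by another CPU while no rq lock is held. */
typedef void (*toy_window_fn)(struct toy_task *);

/* What "CPU 1" does in the report: force the task out of real-time mode.
 * Setting running = true is an assumption that mirrors the reported
 * "rt = 0, running = 1" values. */
void toy_force_exit(struct toy_task *t)
{
    t->is_realtime = false;
    t->running = true;
}

/* Model of the migration path. Returns the task to run, or NULL. */
struct toy_task *toy_schedule(struct toy_task *next, int this_cpu,
                              toy_window_fn window)
{
    assert(next != NULL);
    if (next->cpu == this_cpu)
        return next; /* no migration needed */

    /* The local rq lock is dropped here; another CPU may now run. */
    if (window != NULL)
        window(next);

    /* Both rq locks are held again: re-check the migration invariant. */
    if (!next->is_realtime || next->running)
        return NULL; /* invariant failed; next->cpu is left untouched */

    next->cpu = this_cpu; /* migration completes */
    return next;
}
```

With a task that starts on CPU 2, toy_schedule(&a, 0, toy_force_exit) returns NULL and leaves a.cpu at 2, while passing NULL for the hook lets the migration complete. The model only shows where the state change slips in; it says nothing about which run queue fields end up inconsistent in the real kernel.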
This sequence doesn't look too bad, but something about it leaves the run queues in a corrupted state. The warning at kernel/sched_fair.c:1269 gets raised on CPU 2, the CPU Task A was migrating *away from*, when it picks its next task to schedule [sched_fair.c::hrtick_start_fair()::WARN_ON(task_rq(p) != rq);]. A few more warnings follow on CPU 2 before a NULL pointer dereference on CPU 0 causes a crash.
Can anyone with more experience with the Linux run queues offer more insights? Does litmus_schedule() need to do more to complete Task A's migration before returning NULL? Can you say what exactly is corrupted by a failed litmus_schedule() call?
Thanks,
Glenn