[LITMUS^RT] A task_exit()/litmus_schedule() race condition.

Wed Apr 24 05:50:28 CEST 2013

I have a race condition in the scheduler and I'd like to get some input:

CPU 0: sched_litmus.c::litmus_schedule: Task A is selected to run by litmus->schedule().
CPU 0: sched_litmus.c::litmus_schedule: Task A must migrate from CPU 2 to CPU 0, so it starts a song and dance with Linux rq locks.
CPU 0: sched_litmus.c::litmus_schedule: The rq lock for CPU 0 is dropped so that the double_rq_lock() can be requested.

CPU 1: A different thread forces Task A to exit Litmus by calling "sched_setscheduler_nocheck(Task A, SCHED_NORMAL)".
CPU 1: cedf_task_exit() is called on Task A.  (NOTE: Task A did not make this call itself.)
CPU 1: Task A exits Litmus.

CPU 0: The double_rq_lock is acquired.
CPU 0: The state of Task A has changed unexpectedly.  It is running, but it's no longer real-time.  This triggers the "Bad: migration invariant FAILED: rt = 0, running = 1" message.
CPU 0: litmus_schedule() returns NULL instead of Task A.
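
The sequence above can be replayed deterministically in user space. The sketch below is not the LITMUS^RT code: every toy_* name is hypothetical, the "locks" are plain flags, and the re-validation test is an assumption modeled on the wording of the warning message. It only shows the shape of the problem: the rq lock is dropped, another CPU changes the task in that window, and the check after the double lock rejects the task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's struct task_struct / struct rq. */
struct toy_task {
    bool is_realtime;  /* "rt" in the warning message */
    bool running;      /* "running" in the warning message */
    int  cpu;          /* CPU whose run queue the task belongs to */
};

struct toy_rq {
    int  cpu;
    bool locked;       /* a flag, not a real spinlock */
};

static void toy_lock(struct toy_rq *rq)   { assert(!rq->locked); rq->locked = true; }
static void toy_unlock(struct toy_rq *rq) { assert(rq->locked);  rq->locked = false; }

/* Models CPU 1 forcing the task out of real-time mode. */
static void toy_remote_exit(struct toy_task *t)
{
    t->is_realtime = false;  /* task keeps running, but is no longer rt */
}

/* Called while no rq lock is held; stands in for "what other CPUs do". */
typedef void (*toy_window_fn)(struct toy_task *);

/*
 * Caller holds this_rq's lock and wants to pull `next` over from src_rq.
 * Returns `next` on success, or NULL if the task changed in the window.
 * On return the caller holds this_rq's lock again.
 */
static struct toy_task *toy_pick_with_migration(struct toy_rq *this_rq,
                                                struct toy_rq *src_rq,
                                                struct toy_task *next,
                                                toy_window_fn in_window)
{
    assert(this_rq != src_rq);

    toy_unlock(this_rq);            /* drop our lock before double locking */
    if (in_window)
        in_window(next);            /* the race window */

    /* Take both locks in a fixed order (by CPU index in this toy). */
    if (src_rq->cpu < this_rq->cpu) {
        toy_lock(src_rq);
        toy_lock(this_rq);
    } else {
        toy_lock(this_rq);
        toy_lock(src_rq);
    }

    /* Re-validate: assumed check, based on the "rt = ..., running = ..." text. */
    if (!next->is_realtime || !next->running)
        next = NULL;                /* give up; task stays on src_rq */
    else
        next->cpu = this_rq->cpu;   /* migration completes */

    toy_unlock(src_rq);
    return next;
}
```

Calling toy_pick_with_migration() with toy_remote_exit as the window callback returns NULL and leaves the task's cpu field untouched; passing NULL for the callback lets the migration finish. The toy has no state that can be left half-updated, which is exactly what the real code path may differ in.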

This sequence doesn't look too bad, but something about it leaves the run queues in a corrupted state.  The warning at kernel/sched_fair.c:1269 [sched_fair.c::hrtick_start_fair()::WARN_ON(task_rq(p) != rq);] is raised on CPU 2, the CPU that Task A was migrating *away from*, when CPU 2 picks its next task to schedule.  CPU 2 raises a few more warnings before a NULL pointer dereference on CPU 0 causes a crash.
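
To make the WARN_ON(task_rq(p) != rq) condition concrete, here is a small user-space model of the invariant it checks. All demo_* names are hypothetical, and the two-step migration that can stop halfway is purely an illustrative assumption, not a diagnosis of what litmus_schedule() actually leaves behind.

```c
#include <stdbool.h>

/* Hypothetical model of one task's placement; not kernel data structures. */
struct demo_task {
    int queued_on;  /* CPU whose run queue actually holds the task */
    int cpu;        /* CPU the task itself claims (what task_rq(p) would use) */
};

/*
 * Same shape as WARN_ON(task_rq(p) != rq): returns true if the warning
 * would fire when CPU `rq_cpu` handles task p from its own run queue.
 */
static bool demo_would_warn(const struct demo_task *p, int rq_cpu)
{
    return p->cpu != rq_cpu;
}

/*
 * A migration modeled as two separate updates. stop_halfway simulates an
 * abort between them (assumed failure mode, for illustration only).
 */
static void demo_migrate(struct demo_task *p, int dst, bool stop_halfway)
{
    p->cpu = dst;           /* step 1: task now claims the destination CPU */
    if (stop_halfway)
        return;             /* queue membership was never updated */
    p->queued_on = dst;     /* step 2: task moves to the destination queue */
}
```

In this model, a task that starts on CPU 2 and goes through demo_migrate(&t, 0, true) is still queued on CPU 2 while claiming CPU 0, so demo_would_warn(&t, 2) is true when CPU 2 next looks at it. A completed migration keeps the two fields in agreement and the check stays silent.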

Can anyone with more experience with the Linux run queues offer more insights?  Does litmus_schedule() need to do more to complete Task A's migration before returning NULL?  Can you say what exactly is corrupted by a failed litmus_schedule() call?

Thanks,
Glenn