[LITMUS^RT] preparing a new release

Thu Nov 29 18:42:15 CET 2012

On Nov 29, 2012, at 5:52 PM, Jonathan Herman <hermanjl at cs.unc.edu> wrote:

> What is the following line defending against:
> 
> litmus/preempt.c:
>   31│                 /* Litmus tasks should never be subject to a remote
>   32│                  * set_tsk_need_resched(). */
>   33│                 BUG_ON(is_realtime(tsk));
>   34│                 //TRACE_TASK(tsk, "SUPERBAD"); /* I added this */

It defends against misuse of set_tsk_need_resched(). You can't safely use set_tsk_need_resched() for non-local tasks without acquiring the task's corresponding runqueue lock. 

> 
> I keep hitting this when I test with a full schedule under GSN-EDF. Oddly, when I debug using gdb and view t->comm it is "rtspin", but this is not the case in the trace log. If I remove the BUG_ON so that TRACE_TASK is hit, I get the following lines:
> 158079 P0 [sched_state_will_schedule at litmus/preempt.c:34]: (kworker/0:0/0:0) SUPERBAD
> 158080 P0 [sched_state_will_schedule at litmus/preempt.c:37]: (kworker/0:0/0:0) set_tsk_need_resched() ret:ffffffff810268d4
> 

What is the symbolic name of ret:ffffffff810268d4? You might want to look at __builtin_return_address(1) instead. Do you have a backtrace?
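
For illustration, here is a minimal user-space sketch of how the GCC builtin identifies the call site. It is not the LITMUS^RT code: record_caller(), remote_path(), local_path() and caller_of() are hypothetical names made up for this example. The executable part uses level 0 only, since levels above 0 walk the frame chain and are only reliable when frame pointers are available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the traced function: it remembers the address
 * it will return to, i.e. an instruction inside its direct caller. */
static void *last_caller;

__attribute__((noinline)) static void record_caller(void)
{
	/* Level 0 = return address of the current function.
	 * __builtin_return_address(1) would give the caller's own return
	 * address, but that requires a walkable frame chain. */
	last_caller = __builtin_return_address(0);
	assert(last_caller != NULL);
}

/* Two distinct (hypothetical) call paths into record_caller(). */
static volatile int remote_calls;
static volatile int local_calls;

__attribute__((noinline)) void remote_path(void)
{
	record_caller();
	/* Barrier plus work after the call, so it cannot become a tail call. */
	__asm__ volatile("" ::: "memory");
	remote_calls++;
}

__attribute__((noinline)) void local_path(void)
{
	record_caller();
	__asm__ volatile("" ::: "memory");
	local_calls++;
}

/* Run one path and report the return address that was recorded for it. */
void *caller_of(void (*path)(void))
{
	path();
	return last_caller;
}
```

Each path yields a different recorded address, which is what lets you tell call sites apart. In the kernel, printk's %pS format prints such an address as symbol+offset; whether TRACE_TASK accepts %pS is an assumption you would have to check. Offline, addr2line -f -e vmlinux <address> resolves it from an image built with debug info.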

> Which, as far as I can tell, is only possible if tsk->comm == "kworker/0:0". But then why is there no pid? This is on the current staging of liblitmus and litmus-rt.

Is it reproducible? Can you bisect the recent patches to see where it crept in?

> 
> 
> Ideas? I'm assuming race condition, as Glenn suggested, because usually if it's insane, it's a race.

This does look quite strange. What happens just before this bug? A context switch? A migration? Nonsensical races and panics can also be an indicator of stack corruption.

Thanks,
Björn