[LITMUS^RT] bug in overhead tracing

Björn Brandenburg bbb at mpi-sws.org
Wed Apr 16 08:13:12 CEST 2014



> On 15.04.2014, at 01:23, Glenn Elliott <gelliott at cs.unc.edu> wrote:
> 
> 
>> On Apr 14, 2014, at 3:48 PM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
>> 
>> Hi Everyone,
>> 
>> I’m afraid that I’ve found a pretty bad bug in Litmus when overhead tracing is enabled.  The bug is pretty easy to reproduce: with a kernel compiled for overhead tracing, just set up a real-time task that loops on sched_yield().  Note that sched_yield() is on the code path for exiting np-sections in liblitmus.
>> 
>> BUG: The TS_SYSCALL_IN_END macro inadvertently enables interrupts when it should not.
>> 
>> Code path:
>> 1) [user] sched_yield()
>> 2) sys_sched_yield(): Disables interrupts and acquires the run-queue lock. (https://github.com/LITMUS-RT/litmus-rt/blob/master/kernel/sched/core.c#L4447)
>> 3) sys_sched_yield() calls sched_class->yield_task();
>> 4) [for SCHED_LITMUS tasks] yield_task_litmus() calls TS_SYSCALL_IN_END. (https://github.com/LITMUS-RT/litmus-rt/blob/master/kernel/sched/litmus.c#L216)
>> 5) TS_SYSCALL_IN_END re-enables interrupts unconditionally: https://github.com/LITMUS-RT/litmus-rt/blob/master/include/litmus/litmus.h#L314
>> 
>> Ouch!  My system was locking up because the tick interrupt was being handled while the rq lock (from step 2, above) was still held (the tick interrupt handler acquires the rq lock so it can update scheduling statistics).  Anyway, the CPU deadlocked on itself.
>> 
>> What’s the fix?  I see three options:
>> 1) We give up on instrumenting sched_yield.  We just delete TS_SYSCALL* from yield_task_litmus().
>> 2) We push the TS_SYSCALL* probes up into sys_sched_yield()
>> 3) We make TS_SYSCALL_IN_END interrupt-flag aware.  An easy fix is to avoid the disable/enable interrupt code if interrupts are already disabled (code branch).  Another fix is just to save/disable/restore the current interrupt flags.  However, is there a more elegant solution?  Do we need to do any irq-related accounting in TS_SYSCALL_IN_END if interrupts are already disabled?  This code was developed at MPI, so I defer to their expertise.
>> 
>> Regardless of what we decide, I would like to see these IRQ-tracing macros converted into inline functions with normal function-like names.  This bug was particularly difficult to diagnose (it took me four days) because there is logic hidden in TS_SYSCALL_IN_END.  I overlooked this macro in my debugging because TS_* macros are normally enabled/disabled by feather-trace at runtime.  However, TS_SYSCALL_IN_END is nothing like a normal TS_* overhead tracing macro.  It still does work even if I don’t trace SYSCALL overheads.
>> 
>> Anyway, I’d like to hear back from someone at MPI on a suggested fix.  I’d be happy to put the patch together.
>> 
>> Thanks,
>> Glenn
> 
> 
> 
> Hi All,
> 
> I pushed a simple fix for the bug above: https://github.com/LITMUS-RT/litmus-rt/commit/fe59630160d179703b4fc20131c9fbef8efcdf39
> 
> This resolves the deadlock I had been observing.

The patch looks good, thanks for finding and resolving the issue. As you pointed out, we might also want to move more of the functionality into the Feather-Trace callback.  If I get an SSH-friendly connection at RTAS I can look into it; otherwise it'll have to wait a couple of days.

Thanks,
Björn 


