[LITMUS^RT] litmus-dev Digest, Vol 79, Issue 2

Fri Dec 21 12:58:06 CET 2018

On 21. Dec 2018, at 03:24, Ricardo Teixeira <ricardo.btxr at gmail.com> wrote:
> 
> I am implementing a protocol based on MrsP, which involves the migration of tasks, but the protocol does not contemplate the suspension of tasks. I have not tried to reproduce this for the stock plugins. The plugin I'm using was implemented by a colleague, it also involves the migration of tasks.

Sorry, not much I can say in that case. Migrations are tricky. There are many ways in which a race condition might arise such that LITMUS^RT’s migration sanity checker is triggered (which is the code you pointed to). 

Basically, the check that fired observes that the ‘next’ task, which the plugin picked to be dispatched next and which needs to be pulled from another core’s runqueue, somehow changed state while the migration path had dropped the remote core’s runqueue lock. Maybe it self-suspended due to blocking I/O, maybe it received a signal, maybe it hit a page fault, maybe it called a library function not yet resolved by the dynamic linker and thereby triggered a page fault, maybe something else?
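
To make the shape of that race concrete, here is a minimal user-space sketch of the "drop the lock, re-acquire, re-validate" pattern. It is not the LITMUS^RT migration code: every name in it (sketch_task, sketch_rq, sketch_pull_task, the window callback) is hypothetical, the "lock" is a plain flag, and the unlocked window is modelled by a callback so that the outcome is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical and exist only for this sketch. */
enum sketch_state { SKETCH_RUNNABLE, SKETCH_SUSPENDED };

struct sketch_task {
    int pid;
    int cpu;                    /* CPU whose runqueue currently holds the task */
    enum sketch_state state;
};

struct sketch_rq {
    int cpu;
    bool locked;                /* stand-in for a real spinlock */
};

/* Runs while no runqueue lock is held; models "anything can happen here". */
typedef void (*sketch_window_fn)(struct sketch_task *t);

static void sketch_rq_lock(struct sketch_rq *rq)
{
    assert(!rq->locked);
    rq->locked = true;
}

static void sketch_rq_unlock(struct sketch_rq *rq)
{
    assert(rq->locked);
    rq->locked = false;
}

/*
 * Try to pull 'next' from 'remote' to 'local'. Returns true on success and
 * false if 'next' changed state while the remote lock was not held.
 */
static bool sketch_pull_task(struct sketch_rq *local, struct sketch_rq *remote,
                             struct sketch_task *next, sketch_window_fn window)
{
    bool ok;

    /* The task is picked while it is runnable on the remote CPU. */
    sketch_rq_lock(remote);
    assert(next->cpu == remote->cpu);
    ok = (next->state == SKETCH_RUNNABLE);
    sketch_rq_unlock(remote);
    if (!ok)
        return false;

    /* Window without the remote lock: the task may block, fault, etc. */
    if (window)
        window(next);

    /* Re-acquire the lock and re-validate before committing the migration. */
    sketch_rq_lock(remote);
    ok = (next->state == SKETCH_RUNNABLE && next->cpu == remote->cpu);
    if (ok)
        next->cpu = local->cpu;
    sketch_rq_unlock(remote);

    return ok;
}

/* Example window event: the task self-suspends, e.g. on blocking I/O. */
static void sketch_self_suspend(struct sketch_task *t)
{
    t->state = SKETCH_SUSPENDED;
}
```

Calling sketch_pull_task() with a NULL window moves the task to the local CPU. Passing sketch_self_suspend as the window makes the re-validation fail and leaves the task where it was, which corresponds to the situation the sanity checker complains about.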

In all likelihood, this is a fatal bug in the scheduler or locking protocol, and you should just fix the bug. However, for edge cases, LITMUS^RT provides the next_became_invalid() callback in the plugin API, which allows you to do something more clever than just dropping all references to the task (which causes it to become an unkillable “zombie” task). That said, I would strongly urge you to not just work around the problem using next_became_invalid() unless you fully understand the root cause and realize that it’s a fundamental limitation and not just a simple bug. 

My approach to debugging a problem like this would be to simply “carpet-bomb” the scheduler plugin and locking protocol implementations with TRACE_TASK() and stare at debug traces until I understand exactly what sequence of events leads to the sanity checker triggering. Of course, with a sufficient number of TRACE_TASK() instances, you might just change the timing enough to hide the race, which would make it quite a bit harder to debug. Good luck. :-)
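
As a rough illustration of that kind of instrumentation, the sketch below uses a user-space stand-in for per-task tracing. The real TRACE_TASK() is a kernel macro and is not reproduced here; sketch_trace_task(), sketch_log, and sketch_dispatch() are hypothetical names invented for this example. Lines are appended to an in-memory buffer rather than printed immediately, on the assumption that this perturbs timing less than writing to a console would.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* All names below are hypothetical and exist only for this sketch. */
struct sketch_task {
    int pid;
};

#define SKETCH_LOG_SIZE 1024

static char sketch_log[SKETCH_LOG_SIZE];   /* in-memory trace buffer */
static size_t sketch_log_len;               /* bytes used, excluding the NUL */

/* Append one trace line, prefixed with the task's pid, to sketch_log. */
static void sketch_trace_task(const struct sketch_task *t, const char *fmt, ...)
{
    char line[128];
    va_list ap;
    size_t len;
    int n, m;

    n = snprintf(line, sizeof(line), "(pid %d) ", t->pid);
    if (n < 0 || (size_t)n >= sizeof(line))
        return;

    va_start(ap, fmt);
    m = vsnprintf(line + n, sizeof(line) - (size_t)n, fmt, ap);
    va_end(ap);
    if (m < 0)
        return;

    /* Drop the line if the buffer is full instead of blocking or flushing. */
    len = strlen(line);
    if (sketch_log_len + len + 1 > SKETCH_LOG_SIZE)
        return;
    memcpy(sketch_log + sketch_log_len, line, len + 1);
    sketch_log_len += len;
}

/* Example of instrumenting both sides of a (simulated) migration step. */
static void sketch_dispatch(struct sketch_task *next, int from_cpu, int to_cpu)
{
    sketch_trace_task(next, "picked, pulling from CPU %d\n", from_cpu);
    /* ... the actual migration work would go here ... */
    sketch_trace_task(next, "now on CPU %d\n", to_cpu);
}
```

Reading sketch_log after a run gives the per-task sequence of events in order, which is the thing to stare at when reconstructing the interleaving that trips the check.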

Generally, protocols like MrsP are very difficult to implement correctly precisely because it’s tricky to get the migrations right. This also applies, to varying degrees, to the MBWI, OMIP, and MC-IPC protocols. Just my 2 cents…

- Björn
