[LITMUS^RT] SRP -> PCP

Glenn Elliott gelliott at cs.unc.edu
Tue May 15 16:49:26 CEST 2012


On Apr 26, 2012, at 2:58 AM, Björn Brandenburg wrote:

> 
> On Apr 25, 2012, at 9:04 PM, Björn Brandenburg wrote:
> 
>> On Apr 25, 2012, at 8:37 PM, Glenn Elliott <gelliott at cs.unc.edu> wrote:
>> 
>>> I don't believe SRP can support suspensions, since it could violate stack constraints.  To be honest, I haven't thoroughly reviewed the Litmus SRP implementation, so maybe this really isn't an issue and I can just use SRP as-is.
>>> 
>> 
>> Processes in Linux don't share a stack, so the constraint is irrelevant.  The important thing is that a resuming process must check the system ceiling, since it may have been raised while the process was suspended. The current implementation *should* do that.
> 
> It was late last night (and there was a great soccer match on TV ;-) ); let me elaborate a bit. When a resource-holding job suspends, the system ceiling remains unchanged (i.e., elevated). Therefore, other jobs with priority equal to or lower than the system ceiling will not be eligible to execute. Mutual exclusion is still ensured. LITMUS^RT *should* enforce this in the current implementation, but I don't think it has seen much testing recently. There's a call to srp_ceiling_block() at the end of schedule(); it *should* prevent lower-priority jobs from being scheduled while the system ceiling is elevated (again, this hasn't been tested in ages).
> 
> When a job suspends for any reason, the system ceiling may be raised while it is absent. In this case, it is not eligible to execute when it resumes until the ceiling is lowered, just as if it were a new release.  A job that suspends N times can thus incur pi-blocking due to the SRP for the duration of (N+1) critical sections (with included suspension lengths).
> 
> Note that higher-priority jobs that suspend during a critical section still prevent lower-priority jobs from executing. Under s-oblivious analysis, this is implicitly taken care of, but under s-aware analysis, you probably have to account for this explicitly. I'm not aware of any published analysis that takes self-suspensions in combination with the SRP into account.
> 
> Anyway, under the SRP, maximum pi-blocking incurred by a job that does not require shared resources is bounded by the maximum critical section length (including any suspensions). Under the PCP, the maximum pi-blocking incurred by independent jobs due to shared resources is bounded by the maximum CPU demand of any resource-holding job (i.e., suspensions of lower-priority resource-holding jobs do not cause pi-blocking). If you have both jobs that do not share *any* resources and jobs that suspend for (relatively) long times while holding shared resources, then you may be better served by the PCP. Note that this difference disappears for jobs that request *any* shared resource at all due to ceiling blocking.
> 
> I hope this helps a bit. Let me know if you have questions. A good way to check out the current SRP implementation would be to add a test case to liblitmus that checks whether suspensions are handled correctly.
> 
> - Björn
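
Regarding the test-case suggestion: below is a rough skeleton of the resource-holding task I have in mind for exercising suspensions under the SRP (to be paired with a second, lower-priority task on the same partition that records when it gets to run). This is only a sketch that I have not compiled; in particular, the lock-opening call, the rt_task setup, and the resource id/namespace are placeholders and may need adjusting to whatever the current liblitmus API actually looks like.

/* Sketch of the resource-holding test task (NOT compiled/tested; helper
 * names and signatures are from memory and may differ between liblitmus
 * versions). */
#include <string.h>
#include <unistd.h>
#include <litmus.h>

#define MS2NS(ms) ((ms) * 1000000LL)

int main(void)
{
	struct rt_task param;
	int od; /* object descriptor of the SRP semaphore */

	init_litmus();

	/* Arbitrary parameters for illustration; SRP is per-partition
	 * under PSN-EDF, so both test tasks must run on the same CPU. */
	memset(&param, 0, sizeof(param));
	param.exec_cost = MS2NS(10);
	param.period    = MS2NS(100);
	param.cpu       = 0;
	set_rt_task_param(gettid(), &param);

	task_mode(LITMUS_RT_TASK);

	/* The semaphore should be opened by a real-time task so the ceiling
	 * can be computed; resource id 0 and the "srp_test" namespace are
	 * made up for this example. */
	od = litmus_open_lock(SRP_SEM, 0, "srp_test", NULL);

	litmus_lock(od);
	/* Self-suspend while holding the resource: the system ceiling should
	 * remain raised, so the lower-priority task on this CPU must not be
	 * scheduled until we unlock. */
	usleep(20000);
	litmus_unlock(od);

	sleep_next_period();
	task_mode(BACKGROUND_TASK);
	return 0;
}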


Following up on our prior SRP discussions, I am currently trying to use PSN-EDF + SRP + self-suspensions and the system seems to be deadlocking on me.  I'm still trying to diagnose the root cause, but I do see the following warning printed out to the syslog:

[ 1104.358092] WARNING: at litmus/edf_common.c:42 __edf_higher_prio+0x7b/0x153()
[ 1104.358095] Hardware name: FT72-B7015
[ 1104.358096] Modules linked in: kvm_intel kvm nvidia(P)
[ 1104.358106] Pid: 4159, comm: klt_tracker Tainted: P        W   3.0.0-litmus-rtss12-dgl-dbg-srp #292
[ 1104.358109] Call Trace:
[ 1104.358115]  [<ffffffff81032ce3>] warn_slowpath_common+0x80/0x98
[ 1104.358119]  [<ffffffff81032d10>] warn_slowpath_null+0x15/0x17
[ 1104.358124]  [<ffffffff811d8e13>] __edf_higher_prio+0x7b/0x153
[ 1104.358128]  [<ffffffff811d8f69>] edf_higher_prio+0x16/0x18
[ 1104.358132]  [<ffffffff811d8fd0>] edf_preemption_needed+0x65/0x83
[ 1104.358137]  [<ffffffff811dded1>] psnedf_check_resched+0x18/0x3b
[ 1104.358142]  [<ffffffff811d8c64>] __add_ready+0x9a/0xa3
[ 1104.358146]  [<ffffffff811ddc7b>] requeue+0x76/0xaa
[ 1104.358150]  [<ffffffff811de181>] psnedf_schedule+0x28d/0x31a
[ 1104.358154]  [<ffffffff8102b9dd>] pick_next_task_litmus+0x3e/0x3de
[ 1104.358159]  [<ffffffff8104f184>] ? sched_clock_cpu+0x45/0xd8
[ 1104.358165]  [<ffffffff8141f43f>] schedule+0x491/0x99d
[ 1104.358172]  [<ffffffff81024ee8>] ? __wake_up_common+0x49/0x7f
[ 1104.358176]  [<ffffffff8142121f>] ? _raw_spin_unlock_irqrestore+0x12/0x2d
[ 1104.358181]  [<ffffffff8102b33f>] ? get_parent_ip+0x11/0x41
[ 1104.358185]  [<ffffffff8102c29a>] ? sub_preempt_count+0x92/0xa5
[ 1104.358189]  [<ffffffff811da88a>] ? unlock_srp_semaphore+0x11a/0x141
[ 1104.358193]  [<ffffffff811d8645>] complete_job+0x19/0x1d
[ 1104.358198]  [<ffffffff811d71be>] sys_complete_job+0x2a/0x35
[ 1104.358202]  [<ffffffff81421dbb>] system_call_fastpath+0x16/0x1b

This gets printed whenever edf_higher_prio() is called to compare the priority of a task against itself.  From what I can tell, a task is being added to the ready queue while it is still scheduled, so the task returned by __next_ready() is the currently scheduled task (see edf_preemption_needed() in edf_common.c).  This worries me a little bit: is it okay for a task to appear in the ready queue while it is actively scheduled?  Given that psnedf_schedule() calls requeue(pedf->scheduled, edf) (i.e., it requeues the scheduled task), I suppose it is safe?
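
For reference, here is the relevant logic as I read it (paraphrased from litmus/edf_common.c, so not verbatim):

/* Paraphrase of edf_preemption_needed() as I understand it. */
int edf_preemption_needed(rt_domain_t* rt, struct task_struct *t)
{
	/* nothing to do if no job is pending */
	if (!__jobs_pending(rt))
		return 0;
	/* reschedule if nothing is scheduled */
	if (!t)
		return 1;

	/* If the currently scheduled task t was itself just requeued
	 * (psnedf_schedule() -> requeue() -> __add_ready() ->
	 * psnedf_check_resched()), then __next_ready(rt) can return t,
	 * and we end up evaluating edf_higher_prio(t, t): exactly the
	 * self-comparison that triggers the warning above. */
	return !is_realtime(t) || edf_higher_prio(__next_ready(rt), t);
}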

My kernel differs from mainline LITMUS^RT a little in that I refactored edf_higher_prio() to give me lower-level control over comparisons; I need to compare the base priority of one task against the effective priority of another, hence the appearance of __edf_higher_prio in the above trace.  Those changes shouldn't matter on this code path: I haven't modified PSN-EDF in any way, and edf_higher_prio() behaves exactly like the original implementation.
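
To make that concrete, the shape of the refactoring is roughly the following (the mode names are my own; I include this only to explain why __edf_higher_prio shows up in the trace):

/* Sketch of my refactoring (simplified). The mode argument selects whether
 * a task is compared by its base priority (its own deadline) or by its
 * effective priority (an inherited deadline, if any). */
typedef enum {
	BASE,
	EFFECTIVE
} comparison_mode_t;

int __edf_higher_prio(struct task_struct* first,  comparison_mode_t first_mode,
		      struct task_struct* second, comparison_mode_t second_mode);

/* The original entry point keeps its old semantics. */
int edf_higher_prio(struct task_struct* first, struct task_struct* second)
{
	return __edf_higher_prio(first, EFFECTIVE, second, EFFECTIVE);
}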

Thanks,
Glenn


