[LITMUS^RT] Help with Understanding Scheduling Results
Björn Brandenburg
bbb at mpi-sws.org
Fri Mar 13 04:39:09 CET 2015
> On 12 Mar 2015, at 23:06, Geoffrey Tran <gtran at isi.edu> wrote:
>
> I have been working with LITMUS-RT, but have come across some weird results
> and was hoping someone could help clarify them.
Hi Geoffrey,
thanks for your interest in LITMUS^RT. I’ve not observed any similar behavior; it indeed looks pretty strange. Some questions/comments inline below.
>
> The machine used is a 16-core system. 8 cores are dedicated to the host,
> while up to 8 of the remaining cores are used to run a Xen virtual machine.
> Just in case more background is needed, Xen is a popular hypervisor. The
> RTDS scheduler is a new real-time scheduler that originated in the RT-Xen
> project and is now part of upstream Xen. The scheduler gives each VCPU a
> guaranteed capacity based on the input parameters.
>
> LITMUS-RT v2014.2 is installed in the guest. The hypervisor is using the
> rtds scheduler, with each VCPU receiving the full allocation (Budget: 10000,
> Period: 10000).
Which LITMUS^RT scheduler are you using?
Can you reproduce the issue with LITMUS^RT running on bare metal?
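For reference, a quick sanity check of both layers might look roughly like the following. This is only a sketch; the exact paths and xl flags depend on your LITMUS^RT and Xen versions:

    # inside the guest: report the active LITMUS^RT scheduler plugin
    cat /proc/litmus/active_plugin

    # inside the guest: switch plugins if needed (liblitmus helper)
    setsched GSN-EDF

    # in dom0: list the RTDS budget/period currently assigned to the domains
    xl sched-rtds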
>
> The issue is as follows: when running two simple tasks, each with less than 50%
> single-core utilization (Cost=200ms, Period=500ms), jobs still miss their deadlines
> in certain cases.
> - When we use only 2VCPUs for the guest, success in all jobs is observed
> - When we use 3 or more VCPUs, misses are observed
Do you use budget enforcement?
> The results from st_job_stats and an excerpt of the logs for the 3 VCPU case
> are below:
> # Task, Job, Period, Response, DL Miss?, Lateness, Tardiness
> # task NAME=myapp PID=14442 COST=250000000 PERIOD=500000000 CPU=0
> 14442, 2, 500000000, 32592, 0, -499967408, 0
> 14442, 3, 500000000, 61805, 0, -499938195, 0
> 14442, 4, 500000000, 291581439, 0, -208418561, 0
> 14442, 5, 500000000, 251964671, 0, -248035329, 0
> 14442, 6, 500000000, 252424182, 0, -247575818, 0
> 14442, 7, 500000000, 251938074, 0, -248061926, 0
> 14442, 8, 500000000, 252145862, 0, -247854138, 0
> 14442, 9, 500000000, 251845811, 0, -248154189, 0
> 14442, 10, 500000000, 257706935, 0, -242293065, 0
> 14442, 11, 500000000, 251850581, 0, -248149419, 0
> 14442, 12, 500000000, 252553597, 0, -247446403, 0
> 14442, 13, 500000000, 251765063, 0, -248234937, 0
> 14442, 14, 500000000, 252902538, 0, -247097462, 0
> 14442, 15, 500000000, 1009091185, 1, 509091185, 509091185
> 14442, 16, 500000000, 760966632, 1, 260966632, 260966632
> 14442, 17, 500000000, 512866266, 1, 12866266, 12866266
> 14442, 18, 500000000, 264818921, 0, -235181079, 0
> 14442, 19, 500000000, 253024397, 0, -246975603, 0
> 14442, 20, 500000000, 252785150, 0, -247214850, 0
> 14442, 21, 500000000, 252466946, 0, -247533054, 0
> 14442, 22, 500000000, 1459862887, 1, 959862887, 959862887
> 14442, 23, 500000000, 1211903080, 1, 711903080, 711903080
> 14442, 24, 500000000, 963919848, 1, 463919848, 463919848
> # task NAME=myapp PID=14443 COST=200000000 PERIOD=500000000 CPU=0
> 14443, 2, 500000000, 58150, 0, -499941850, 0
> 14443, 3, 500000000, 3202178, 0, -496797822, 0
> 14443, 4, 500000000, 201662924, 0, -298337076, 0
> 14443, 5, 500000000, 213828161, 0, -286171839, 0
> 14443, 6, 500000000, 202532002, 0, -297467998, 0
> 14443, 7, 500000000, 961643647, 1, 461643647, 461643647
> 14443, 8, 500000000, 663707479, 1, 163707479, 163707479
> 14443, 9, 500000000, 365603701, 0, -134396299, 0
> 14443, 10, 500000000, 201910605, 0, -298089395, 0
> 14443, 11, 500000000, 209025099, 0, -290974901, 0
> 14443, 12, 500000000, 210602663, 0, -289397337, 0
> 14443, 13, 500000000, 247544048, 0, -252455952, 0
> 14443, 14, 500000000, 459680759, 0, -40319241, 0
> 14443, 15, 500000000, 202763167, 0, -297236833, 0
> 14443, 16, 500000000, 202368570, 0, -297631430, 0
> 14443, 17, 500000000, 201857551, 0, -298142449, 0
> 14443, 18, 500000000, 201999263, 0, -298000737, 0
> 14443, 19, 500000000, 958909178, 1, 458909178, 458909178
> 14443, 20, 500000000, 660591085, 1, 160591085, 160591085
> 14443, 21, 500000000, 362025653, 0, -137974347, 0
> 14443, 22, 500000000, 202565748, 0, -297434252, 0
> 14443, 23, 500000000, 202456104, 0, -297543896, 0
> 14443, 24, 500000000, 202528628, 0, -297471372, 0
>
> .....
> [ 328638446] RELEASE 14443/14 on CPU 0 328638946.00ms
> [ 328638446] RELEASE 14442/14 on CPU 1 328638946.00ms
> [ 328638446] SWITCH_TO 14442/14 on CPU 0
> [ 328638698] SWITCH_FROM 14442/14 on CPU 0
> [ 328638698] COMPLETION 14442/14 on CPU 0
> [ 328638704] SWITCH_TO 14443/14 on CPU 1
> [ 328638905] SWITCH_FROM 14443/14 on CPU 1
> [ 328638905] COMPLETION 14443/14 on CPU 1
> [ 328638946] RELEASE 14442/15 on CPU 0 328639446.00ms
> [ 328638946] RELEASE 14443/15 on CPU 1 328639446.00ms
> [ 328638946] SWITCH_TO 14443/15 on CPU 0
> [ 328639148] SWITCH_FROM 14443/15 on CPU 0
> [ 328639148] COMPLETION 14443/15 on CPU 0
> [ 328639446] RELEASE 14443/16 on CPU 0 328639946.00ms
> [ 328639446] RELEASE 14442/16 on CPU 1 328639946.00ms
Random observation: the next release of 14443 happens on CPU 0, whereas 14442 is still being released on CPU 1. This means CPU 1 must have set a timer for the release to occur. Usually, it shouldn’t do that unless it either (a) processed a completion or (b) prevented a budget overrun. Either event should have left a trace record (I think). Are you sure the trace is complete (i.e., no missing writes)?
> [ 328639446] SWITCH_TO 14443/16 on CPU 0
> [ 328639648] SWITCH_FROM 14443/16 on CPU 0
> [ 328639648] COMPLETION 14443/16 on CPU 0
> [ 328639703] SWITCH_TO 14442/15 on CPU 1
> [ 328639946] RELEASE 14442/17 on CPU 1 328640446.00ms
> [ 328639946] RELEASE 14443/17 on CPU 0 328640446.00ms
> [ 328639946] SWITCH_TO 14443/17 on CPU 0
> [ 328639955] SWITCH_FROM 14442/15 on CPU 1
> [ 328639955] COMPLETION 14442/15 on CPU 1
> [ 328639955] SWITCH_TO 14442/16 on CPU 1
> [ 328640147] SWITCH_FROM 14443/17 on CPU 0
> [ 328640147] COMPLETION 14443/17 on CPU 0
> [ 328640206] SWITCH_FROM 14442/16 on CPU 1
> [ 328640206] COMPLETION 14442/16 on CPU 1
> [ 328640206] SWITCH_TO 14442/17 on CPU 1
> .....
>
>
> It is interesting to note that when both jobs 15 are released, the job for PID 14443 is
> allowed to run. However, the job for PID 14442 does not run until the next period!
>
> It has also been observed that running the two tasks on a guest with 8 VCPUs produces
> the same deadline misses. However, when the system is artificially loaded by running 6
> instances of "yes > /dev/null", all jobs meet their deadlines.
Do these instances of yes run as LITMUS^RT tasks or just as regular SCHED_OTHER tasks?
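The distinction matters, because plain background load and real-time load interact with the plugin very differently. Purely as an illustrative sketch (the rt_launch cost/period below are made-up values in milliseconds; check your liblitmus version for the exact syntax):

    # background load as ordinary SCHED_OTHER processes
    for i in $(seq 6); do yes > /dev/null & done

    # the same load admitted as LITMUS^RT tasks via liblitmus' rt_launch
    # (hypothetical parameters: cost 100ms, period 500ms per instance)
    for i in $(seq 6); do rt_launch 100 500 yes > /dev/null & done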
Also, can you provide matching output of the debug trace log (the TRACE() stream)?
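In case it helps, the TRACE() stream can be captured in the guest roughly like this, assuming the kernel was built with CONFIG_SCHED_DEBUG_TRACE enabled:

    # start recording the debug log before launching the tasks
    cat /dev/litmus/log > debug-trace.txt &

    # ... run the experiment ...

    # stop recording afterwards
    kill %1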
I’ve not used LITMUS^RT on Xen. I don’t mean to blame Xen, but it would make it easier for us if you could reproduce this behavior on bare metal.
Thanks,
Björn