[LITMUS^RT] Question from experimenting with Litmus-RT Performance

Wed Jan 13 06:52:55 CET 2016

> On 12 Jan 2016, at 04:18, Yu-An(Victor) Chen <chen116 at usc.edu> wrote:
> 
> Hi,
> 
> I am running some experiments with RT-Xen and LITMUS^RT. I am trying to observe the schedulability of real-time tasks in one VM while the other VM is fully utilized. Both guest VMs use LITMUS^RT.
> 
> The setup is the following:
> 
> Using Xen 4.5.0. 
> 1.  Two VMs sharing cores 0-7 (both VMs can access cores 0-7), with the RTDS scheduler; both have a period of 4000 us and a budget of 2000 us.
> 2.  Dom0 uses one core from CPUs 8-15, with the RTDS scheduler, a period of 10000 us, and a budget of 10000 us.
> 3.  Both guest VMs run Ubuntu 12.04 with "litmus-rt-2014.2.patch" and Geoffrey's patch for IPI interrupts (https://github.com/LITMUS-RT/liblitmus/pull/1/files).
> 
> The taskset is generated as follows:
> 
> A taskset is composed of a collection of real-time tasks, and each real-time task is a sequence of jobs that are released periodically... All tasks are periodic, where each task Ti is defined by a period (and deadline) pi and a worst-case execution time ei, with pi ≥ ei ≥ 0 and pi, ei ∈ integers. Each job consists of a number of iterations of floating-point operations. This is based on the base_task.c provided with the LITMUS^RT userspace library.
> 
> The period of a task is drawn from a uniform distribution over (10 ms, 100 ms),
> and the utilization of a task is drawn from a uniform distribution over (0.1, 0.4).
> 
> In my experiment:
> 
> Step0: disable networking and other unused services.
> Step1: I load VM#2 with constantly running tasks with a total utilization of 4 cores.
> Step2: In VM#1, I run many iterations of tasksets with total utilization from 0.2 cores all the way to 4.6 cores and record their schedulability using st_trace.
> 
> In my results, I do see schedulability drop to zero at a total utilization of either 4.2 or 4.4. (We use the worst-case execution time when benchmarking the base amount of computation, which is why it takes a total utilization of more than 4 for schedulability to drop to 0.)
> 
> What puzzles me is why there are two groups of results, as shown in the attached graph (one group whose schedulability reaches 0 at a total utilization of 4.2 and another group whose schedulability reaches 0 at a total utilization of 4.4). (I used " * " in the legend to indicate the groups.)
> 
> Shouldn't the runs be somewhat close to each other, or show randomness, instead of forming two groups of performance curves?
> 
> I wonder if my base computation is wrong, but that still would not explain why there are two types of performance curves.
> 
> Any advice or suggestion on how I can go about this will be helpful!

Dear Chen,

thanks for your interest in LITMUS^RT. Concerning your observations, I can’t say for sure what’s going on, but a couple of issues stand out that you might want to consider.

First, schedulability is an analytical property that you cannot measure by observation. You can only observe the *lack* of schedulability (i.e., deadline misses), just like you cannot establish correctness by testing.

So based on your description, the question really is “why do we observe more deadline misses in some VMs than in others”. 

One possibility is that the VMs in both groups are actually “equally schedulable”, but that some of them are “getting lucky” in your experiment. That is, you may not *observe* deadline misses in some workloads, even though they were actually not schedulable.

There are also a couple of other possible causes.

- Do you control page coloring? If not, some of your processes may be subject to more cache misses and/or cache interference than others, which would affect their execution times, which in turn could translate into a higher or lower likelihood of missing a deadline.

- As far as I know, Xen uses coarse-grained resource accounting, based on a periodic tick with some coarse resolution. Your VMs might not actually be getting precisely what you allocated to them. Maybe RT-Xen fixed this, I don’t know. (LITMUS^RT uses fine-grained accounting based on one-shot timers.)

- Cache interference etc. will drive up execution costs. So if you calibrated your “burn CPU time” loop in isolation, your tasks might take longer when run in parallel with contention on other cores. Make sure you log actual execution times with sched_trace to see if you are actually getting what you wanted.

- Of course, there could also be some bug in LITMUS^RT, but based on your description we don’t have enough detail to suspect anything in particular. When suspecting a bug in LITMUS^RT, please reproduce the problem on bare metal first.
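
To illustrate the check on logged execution times, here is a minimal sketch in Python. It assumes the trace has already been converted into per-job records; the JobRecord fields and the summarize() helper are hypothetical names chosen for illustration and are not part of sched_trace, st_trace, or liblitmus. It reports, per task, how often a job missed its deadline and how often the measured execution time exceeded the provisioned cost.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical per-job record (an assumption, not the real trace format).
    # All times are in milliseconds.
    task_id: int
    release: float
    deadline: float    # absolute deadline
    completion: float  # absolute completion time
    exec_time: float   # measured execution time of this job
    wcet: float        # provisioned cost from calibration, must be > 0


def summarize(jobs):
    """Group jobs by task and compute miss and overrun statistics."""
    by_task = {}
    for job in jobs:
        by_task.setdefault(job.task_id, []).append(job)

    summary = {}
    for task_id, task_jobs in sorted(by_task.items()):
        misses = sum(1 for j in task_jobs if j.completion > j.deadline)
        overruns = sum(1 for j in task_jobs if j.exec_time > j.wcet)
        summary[task_id] = {
            "jobs": len(task_jobs),
            "miss_ratio": misses / len(task_jobs),
            "overrun_ratio": overruns / len(task_jobs),
            "max_exec_over_wcet": max(j.exec_time / j.wcet for j in task_jobs),
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = [
        JobRecord(1, 0.0, 10.0, 4.0, 3.0, 4.0),
        JobRecord(1, 10.0, 20.0, 21.0, 5.0, 4.0),
        JobRecord(2, 0.0, 50.0, 30.0, 10.0, 12.0),
    ]
    for task_id, stats in summarize(sample).items():
        print(task_id, stats)
```

If the overrun ratio differs noticeably between the two groups of runs, the calibration of the “burn CPU time” loop (rather than the scheduler) would be the first thing to look at.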

I hope this gives you some pointers to investigate the issue. Please let us know what you find out.

Regards,
Björn

PS: please make sure you are subscribed to the list before posting to it. Thanks.
