[LITMUS^RT] LITMUS^RT experiments using python scripts

Glenn Elliott gelliott at cs.unc.edu
Wed May 28 00:08:52 CEST 2014


On May 27, 2014, at 10:51 AM, Chandrasekaran Sivakumar <chandru.a6 at gmail.com> wrote:

> Hi all,
> I'm performing some experiments on LITMUS^RT using the scripts on GitHub. When I executed parse_exps.py, I got the overhead measurements printed on the terminal. Could anyone please tell me the unit of those numbers? Are they in milliseconds?
> Also, what does 'record loss' mean in the output?
> 
>    
> Thanks and Regards,
> chandru



Hi Chandru,

== Summary ==
1) Overheads are reported in microseconds.
2) Task properties (e.g., response time and tardiness) are reported in milliseconds.
3) Records can be lost if the kernel's internal ring buffers used to store logs overflow. This can occur when the system is overutilized by real-time tasks. (I list ways to detect and mitigate record loss below.)

== Overheads ==
Overheads (e.g., scheduling, release, and context-switch overheads) are in microseconds. The parse scripts scale from processor cycles (Feather-Trace records events in terms of cycles) to microseconds here: https://github.com/LITMUS-RT/experiment-scripts/blob/master/parse/ft.py#L35.  You can trace the ‘cycles’ parameter’s value back to config/config.py, where it is set: https://github.com/LITMUS-RT/experiment-scripts/blob/master/config/config.py#L46
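Conceptually, the conversion is just a division by the clock frequency in MHz (i.e., cycles per microsecond). Here is a minimal sketch of that step; the frequency value and function name below are mine for illustration, not the actual code in parse/ft.py:

# Sketch: convert Feather-Trace cycle counts to microseconds.
# CPU_MHZ is a hypothetical, fixed clock frequency in MHz; experiment-scripts
# obtains the real value via ft_freq() (see below).
CPU_MHZ = 2993.0

def cycles_to_us(cycles, mhz=CPU_MHZ):
    # MHz == cycles per microsecond, so dividing yields microseconds.
    return cycles / mhz

print(cycles_to_us(9000))  # a ~9,000-cycle overhead on a ~3 GHz CPU is ~3 us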

Observe that it calls the function “ft_freq()” to try to determine the processor frequency (ft_freq() is in common.py).  This makes a few (potentially dangerous) assumptions: that you’re processing the data on the same machine where it was recorded, that the CPU frequency did not change during measurement, and that the CPUs were running at their advertised speed.  (No DVFS! We always recommend that Litmus users disable all power-saving modes in order to keep clocks consistent.)
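As a sanity check before parsing, you could verify that all CPUs report the same frequency. This is only a rough sketch of the idea, independent of how ft_freq() is actually implemented, and the “cpu MHz” field is x86-specific:

# Sketch: warn if CPUs report different frequencies in /proc/cpuinfo,
# which would suggest DVFS is active and cycle-based timing is unreliable.
def cpu_mhz_values():
    mhz = []
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.lower().startswith('cpu mhz'):
                mhz.append(float(line.split(':')[1]))
    return mhz

freqs = cpu_mhz_values()
if len(set(round(f) for f in freqs)) > 1:
    print('Warning: CPUs report different frequencies; is DVFS enabled?')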

Long story short, the frequency is expressed in megahertz (cycles per microsecond), so dividing cycle counts by it yields microseconds. For example, an event that took 6,000 cycles on a 3,000 MHz (3 GHz) processor took 6,000 / 3,000 = 2 microseconds.

== Task Response Times ==
Being based upon a general-purpose OS, Litmus isn’t really aimed at supporting real-time tasks with timing requirements on the order of tens of microseconds.  We usually evaluate tasks with execution requirements on the millisecond time scale.  This is why statistics for task properties, such as response time, are reported in milliseconds.  You can see the conversion from nanoseconds to milliseconds here: https://github.com/LITMUS-RT/experiment-scripts/blob/master/parse/sched.py#L324
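For concreteness, here is a small sketch (with made-up field names) of the kind of nanosecond-to-millisecond conversion involved when computing per-job response time and tardiness:

# Sketch: derive response time and tardiness (ms) from sched_trace-style
# timestamps recorded in nanoseconds. Field names are illustrative only.
NS_PER_MS = 1e6

def job_stats(release_ns, completion_ns, deadline_ns):
    response_ms = (completion_ns - release_ns) / NS_PER_MS
    tardiness_ms = max(0.0, (completion_ns - deadline_ns) / NS_PER_MS)
    return response_ms, tardiness_ms

# A job released at t=0 with a 10 ms deadline that completes after 12 ms:
print(job_stats(0, 12_000_000, 10_000_000))  # -> (12.0, 2.0)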

== Record Loss ==
Trace events recorded by Litmus are stored in ring buffers.  At runtime, a userspace application (feather-trace-tools/ftcat) reads from the ring buffers and stores recorded events to disk.  Under heavy system utilization, ftcat can be starved of CPU time by real-time applications.  This prevents the ring buffers from being flushed, so they overflow and unflushed data is overwritten by new events.  Hope is not lost, however, because:

(1) Litmus reports dropped events when all ftcat instances exit.  This message is printed to the console with printk(), so you should be able to examine your kernel log (with ‘dmesg’, for example) and grep for “Failed trace writes:”.  There should be 0 failed writes if every event was recorded.  (Here is where the message is printed in the Litmus kernel code: https://github.com/LITMUS-RT/litmus-rt/blob/master/litmus/ftdev.c#L149)
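If you want to automate that check, something along these lines should work (just a sketch; it shells out to dmesg and needs permission to read the kernel log):

# Sketch: scan the kernel log for the "Failed trace writes:" message,
# equivalent to `dmesg | grep "Failed trace writes:"`.
import subprocess

log = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
for line in log.splitlines():
    if 'Failed trace writes:' in line:
        print(line)  # anything other than 0 failed writes means records were dropped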

(2) I’ve had moderate success instructing ftcat to dump logs to a RAM disk instead of the actual disk.  This lets ftcat dump records more quickly, since writing to a RAM disk has less overhead than writing to an actual disk.  On Ubuntu, you can dump logs to /dev/shm.  You need plenty of system memory for this to work; if you exceed it, I believe items in /dev/shm will spill into your swap.
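After the experiment ends, you will want to move the traces off the RAM disk before rebooting. A trivial sketch (the directory names and file pattern are made up):

# Sketch: copy trace files from the RAM disk to persistent storage.
import glob, os, shutil

RAMDISK_DIR = '/dev/shm/traces'        # hypothetical: where ftcat wrote its output
ARCHIVE_DIR = '/home/user/exp-traces'  # hypothetical: persistent location

os.makedirs(ARCHIVE_DIR, exist_ok=True)
for path in glob.glob(os.path.join(RAMDISK_DIR, '*.bin')):
    shutil.copy(path, ARCHIVE_DIR)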

(3) You can compile Litmus with really big ring buffers.  As long as system overutilization is not persistent or too prolonged, the buffers should be large enough to prevent overflow.  (A rough sizing sketch follows after the note below.)
* For big overhead trace buffers, set CONFIG_SCHED_OVERHEAD_TRACE_SHIFT to be sufficiently large (https://github.com/LITMUS-RT/litmus-rt/blob/master/litmus/Kconfig#L280)
* For big sched trace buffers (the buffers that trace information for determining job response time, tardiness, etc.), set CONFIG_SCHED_TASK_TRACE_SHIFT to be sufficiently large (https://github.com/LITMUS-RT/litmus-rt/blob/master/litmus/Kconfig#L234)

Note, regarding CONFIG_SCHED_TASK_TRACE_SHIFT, I believe that you can manually edit litmus/Kconfig to increase the limit from 13 to something larger.
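If the shift options work the way their names suggest (buffer capacity as a power of two; please treat that as my assumption and check the Kconfig help text for the exact unit), then each increment of the shift doubles the buffer:

# Sketch under the assumption that a *_TRACE_SHIFT of k yields a buffer of
# 2**k units (verify the exact unit in the Kconfig help text).
def buffer_units(shift):
    return 2 ** shift

for shift in (13, 16, 20, 24):
    print(shift, buffer_units(shift))  # each +1 to the shift doubles the capacity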

(4) parse_exps.py ignores trace events with lost records.  The data you get out of it is generally good.  However, be aware that your results can be skewed by survivorship bias if record loss is too heavy (a classic story about survivorship bias: http://dish.andrewsullivan.com/2013/05/27/the-failures-you-cant-see/).  parse_exps.py is configured to tolerate up to 20% record loss.  As I recall, the scripts will print overhead data to the console if they believe there is not enough information to generate CSV files.  Record loss tolerance is set here: https://github.com/LITMUS-RT/experiment-scripts/blob/master/config/config.py#L62
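To make that 20% threshold concrete, here is a loose sketch of how such a tolerance check might look; the variable names and formula are mine, not the actual code in the parse scripts:

# Sketch: decide whether measured record loss is within a 20% tolerance.
MAX_LOSS = 0.2  # matches the 20% tolerance set in config/config.py

def loss_acceptable(records_kept, records_lost, max_loss=MAX_LOSS):
    total = records_kept + records_lost
    return total == 0 or (records_lost / total) <= max_loss

print(loss_acceptable(records_kept=9000, records_lost=1000))  # 10% loss -> True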

-Glenn


