[LITMUS^RT] Outlier

Tue Aug 6 08:54:24 CEST 2013

On Aug 5, 2013, at 9:14 PM, Hiroyuki Chishiro <chishiro at cs.unc.edu> wrote:

> I measured overheads in the recent developer version of LITMUS^RT, which I got in github (linux-kernel version 3.0 and modified version of 2012.3).
> However, I found an outlier though "statistical outlier filtering is no longer required" in https://wiki.litmus-rt.org/litmus/Releases.

Let me clarify. Here's what I meant by "statistical outlier filtering is no longer required".

Previously, we had measurement errors that produced outliers that did not correspond to true events that actually happened. Let's call those "erroneous outliers", because they were created by weaknesses in the measurement process.

Of course there is always the possibility of "true outliers", that is, samples that are much higher than the average or median or xth percentile or whatever your definition of "typical" is. True outliers reflect unpredictability or bad worst-case behavior of the actual system.

The difference is that erroneous outliers *should* be filtered, as they distort the observations, whereas true outliers *must not* be filtered, as discarding them introduces inaccuracy (essentially, you are underestimating worst-case costs when accidentally filtering true outliers).

Because we had no good way of identifying which samples were true outliers and which samples were erroneous outliers, and because the rate of erroneous outliers was high enough to prevent drawing useful conclusions w.r.t. maxima from the collected data, in the past we had to accept the risk of removing true outliers as a side effect of statistical outlier removal techniques.

With the changes introduced as part of my RTAS'13 work, it is now possible (I believe) to reliably tell erroneous outliers apart from true outliers. This is accomplished by tracking and identifying the causes of erroneous outliers (interrupts, out-of-order samples, gaps in the traces), and by filtering ONLY samples that are known to have been disturbed.

Outlier filtering is still required, but not *statistical*, indiscriminate outlier filtering. Erroneous outliers are thus removed, whereas true outliers are left untouched.
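
To make the distinction concrete, here is a minimal sketch in Python. It is not how ft2csv is implemented; the `Sample` record, its `disturbed` flag, the IQR rule standing in for "statistical filtering", and all sample values are made up for illustration. It only shows that an indiscriminate statistical filter discards the true outlier along with the erroneous one, whereas filtering by known cause keeps it.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Sample:
    overhead_ns: int   # measured overhead (hypothetical record layout)
    disturbed: bool    # True if a known disturbance (e.g., an interrupt) hit this sample


def statistical_filter(samples):
    """Indiscriminate IQR-based filtering: drops everything far above the bulk."""
    values = [s.overhead_ns for s in samples]
    q1, _, q3 = quantiles(values, n=4)
    cutoff = q3 + 1.5 * (q3 - q1)
    return [s for s in samples if s.overhead_ns <= cutoff]


def cause_based_filter(samples):
    """Drops ONLY samples that are known to have been disturbed."""
    return [s for s in samples if not s.disturbed]


def observed_max(samples):
    return max(s.overhead_ns for s in samples)


# Made-up data: eight typical samples, one true outlier, one erroneous outlier.
samples = [Sample(v, False) for v in (1000, 1100, 1200, 1300, 1500, 1600, 1800, 1900)]
samples.append(Sample(120_000, False))  # true outlier: really happened, not disturbed
samples.append(Sample(300_000, True))   # erroneous outlier: interrupted during measurement

if __name__ == "__main__":
    print("statistical filter max:", observed_max(statistical_filter(samples)))
    print("cause-based filter max:", observed_max(cause_based_filter(samples)))
```

With this data, the statistical filter reports a maximum that badly underestimates the worst case, while the cause-based filter still reports the true outlier.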

> For example, when I measure SEND-RESCHED overhead using many task sets in the G-EDF plugin, most of worst-case overheads are less than 2us but an outlier is about 120us.
> I think that this is an outlier.

Yes, but is it a true or an erroneous outlier? If there are long code segments that disable interrupts, it is entirely possible that true outliers of that magnitude exist.

> Overhead data are collected by experiment-scripts in https://github.com/brandenburg/experiment-scripts.

These are not my scripts; they are all due to Jonathan afaik. From all that I've heard, they are really great, but I have not yet had time to study them and don't know how they work or how they process sampling data. By default, ft2csv should reject samples that correspond to erroneous outliers, but it's possible to suppress interrupt filtering.

Note that you need to run ftsort *prior* to running ft2csv. I don't know if Jonathan's scripts do this yet, as my RTAS'13 work happened in parallel with Jonathan's work.

So there are several possibilities:

1) The outlier that you observed is a true outlier (likely, in my opinion).

2) There's a bug in ft2csv, ftsort, or the kernel that makes some erroneous outliers appear to be true outliers (possible, but I have collected ~500GB of data over 24+ hours without ever encountering such a problem).

3) Jonathan's scripts disable interrupt filtering (unlikely).

4) Jonathan's scripts do not sort the trace files (I have no idea). However, even if this is the case, I would expect this to mask true outliers, not generate erroneous outliers.

I hope this clarifies the situation. Obviously, I can't claim that LITMUS^RT or any other Linux-based system is free from true outliers. I meant exactly what I said: *statistical* outlier filtering is no longer required, because we now have better, more accurate methods of identifying erroneous outliers.

Btw, if I had to guess, I'd bet your outlier is caused by sleep states. Waking a core that went into a deep sleep state can easily take 100+ microseconds. I'd try disabling everything related to power management in the BIOS and the kernel configuration.
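
One quick way to check whether deep sleep states are even available on a machine is to read the exit latencies that the kernel's cpuidle framework exports via sysfs. The following is a minimal sketch, assuming that the kernel was built with cpuidle support and exposes the usual `cpuidle/stateN/name` and `cpuidle/stateN/latency` files (latency in microseconds); the function name and its parameters are made up for illustration, and the sketch only inspects the states without disabling anything.

```python
from pathlib import Path


def idle_state_latencies(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return a list of (state name, exit latency in us) for one CPU.

    Returns an empty list if the cpuidle directory does not exist
    (e.g., cpuidle is not configured in the kernel).
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    # Sort numerically so that state10 comes after state2.
    state_dirs = sorted(base.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text())
        states.append((name, latency_us))
    return states


if __name__ == "__main__":
    for name, latency_us in idle_state_latencies(cpu=0):
        print(f"{name}: exit latency {latency_us} us")
```

If any listed state has an exit latency in the range of 100 microseconds or more, a wake-up from that state is a plausible source of an outlier of the observed magnitude.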

- Björn