[LITMUS^RT] Question about TSC

Andrea Bastoni bastoni at cs.unc.edu
Wed Jan 18 10:44:59 CET 2012


2012/1/18 Björn Brandenburg <bbb at mpi-sws.org>:
>
> On Jan 18, 2012, at 8:33 AM, Mac Mollison wrote:
>
>> This question seems most appropriate for the public LITMUS list, so here
>> goes.
>>
>> Bjoern, on p. 300 of your dissertation, you state that TSC time stamps
>> are comparable across all cores on your particular test platform because
>> (1) all TSCs are driven by the same clock signal
>> (2) processors are prohibited from entering sleep states during
>> experiments
>>
>> However, according to Thomas Gleixner, the TSC is never trustworthy (not
>> even with the invariant TSC bit that comes with Nehalem and later CPUs)
>> [1] - see [2] for broader context. It seems like the primary concern
>> here is SMIs modifying the TSC.
>>
>> Do you have any reason to believe you can "trust" the TSC for your
>> experiments, other than (presumably) never having any problems with it?
>> In particular, what about SMIs? Frankly, I think never seeing any
>> problems with it is more than "good enough" for experimentation
>> purposes, but is there additional evidence?
>>
>> I'm interested in constructing the strongest argument for using the TSC
>> for profiling something in userspace, hence my question. I'd like
>> to be able to say something stronger than "I know it's OK because it
>> works, even though the experts tell me not to trust it." I think right
>> now I can pretty much assume I don't have any nasty SMIs, so I can make
>> a decent argument, but not a great one. Maybe that was your logic as
>> well?
>>
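>> For concreteness, the kind of userspace read I have in mind is
>> roughly the following (a minimal sketch, assuming x86-64 and
>> GCC/Clang; RDTSCP waits for prior instructions to complete and also
>> returns the IA32_TSC_AUX value, which Linux sets up so that its low
>> bits encode the CPU number, so migration between reads is detectable):
>>
>>     #include <stdint.h>
>>     #include <stdio.h>
>>     #include <x86intrin.h>
>>
>>     /* One sample: TSC value plus the core it was read on. */
>>     struct tsc_sample {
>>         uint64_t tsc;
>>         unsigned int aux;  /* IA32_TSC_AUX; on Linux, low bits = CPU */
>>     };
>>
>>     static inline struct tsc_sample read_tsc(void)
>>     {
>>         struct tsc_sample s;
>>         s.tsc = __rdtscp(&s.aux);
>>         return s;
>>     }
>>
>>     int main(void)
>>     {
>>         struct tsc_sample a = read_tsc();
>>         struct tsc_sample b = read_tsc();
>>         if ((a.aux & 0xfff) != (b.aux & 0xfff))
>>             puts("migrated between reads; discard sample");
>>         else
>>             printf("delta: %llu cycles\n",
>>                    (unsigned long long)(b.tsc - a.tsc));
>>         return 0;
>>     }
>>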
>> [1] http://lwn.net/Articles/388286/
>> [2] http://lwn.net/Articles/388188/
>>
>> Thanks,
>> Mac
>
> Hi Mac,
>
> given the closed/under-documented nature of most x86 hardware, this is hard to argue conclusively. In general, I believe Thomas Gleixner that you can't trust the TSC *if you want your code to work on all existing hardware*. However, there exist many systems (and particular configurations) in which the TSC is perfectly fine.

Mac,

Section 17.12 of Volume 3B of the Intel Software Developer's Manual
lists the Intel chipsets on which the TSC is synchronized across
cores, as well as those on which skew errors may occur.
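
If it helps, whether a CPU advertises an invariant TSC can also be
queried directly via CPUID: leaf 0x80000007, EDX bit 8 (on Linux the
same information shows up as the constant_tsc/nonstop_tsc flags in
/proc/cpuinfo). A minimal sketch, assuming GCC or Clang on x86:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0x80000007 ("Advanced Power Management Information"):
         * EDX bit 8 is the Invariant TSC flag, i.e., the TSC runs at
         * a constant rate in all ACPI P-, C-, and T-states. */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 0x80000007 not supported");
            return 1;
        }
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
        return 0;
    }

Note that the invariant bit only guarantees a constant rate across
P-/C-/T-states; it says nothing about SMIs rewriting the counter, so
it complements rather than replaces the evidence Björn lists below.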

Thanks,
- Andrea

> It's always difficult to experimentally prove the impossibility of something. The best I can say is that my trace data didn't show evidence of SMIs on Ludwig, the host used for the experiments.
>
> 1) To the best of my knowledge, TSC readings were always continuously increasing; there were no strange holes, jumps, decreasing readings, or any other signs of manipulated TSC values. (A sketch of this check appears after this list.)
>
> 2) Even in the unlikely case that some samples were corrupted by SMI-related TSC corruption, such events would be so rare that they would hardly have any noticeable impact on reported averages (due to the large number of samples).
>
> 3) Reported worst-case overheads can be classified into two groups: A) those affected by (regular, maskable) interrupts, and B) those known not to be affected by maskable interrupts due to interrupt masking. Notably, no outliers were observed in *any* trace of overheads in group B), whereas traces in group A) are commonly affected by outliers. This implies that all sources of interrupt-related noise in the traces are maskable (and hence are not SMIs). [Of course, this doesn't rule out the existence of SMIs that do not affect the TSC and cause less delay than regular interrupts.]
>
> 4) SMIs are commonly triggered by power management functionality. All those features were disabled on the experimental platform (since I was looking exclusively at identical multiprocessors).
>
> 5) This is pure speculation, but I think SMIs are most frequently encountered in cheap consumer class hardware (think netbooks, cheap laptops, etc.). The experimental platform was (at the time of purchase) a high-end system using server class hardware.
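>
> Regarding point 1: checking for manipulated TSC values amounts to
> scanning each per-core sequence of timestamps for backward jumps. A
> rough sketch (the actual traces are Feather-Trace binary records; a
> plain array stands in for them here):
>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     /* Report any non-monotonic TSC reading in a per-core trace;
>      * a backward jump would hint at a rewritten counter. */
>     static int check_monotonic(const uint64_t *tsc, size_t n)
>     {
>         int ok = 1;
>         for (size_t i = 1; i < n; i++) {
>             if (tsc[i] < tsc[i - 1]) {
>                 fprintf(stderr, "backward jump at sample %zu\n", i);
>                 ok = 0;
>             }
>         }
>         return ok;
>     }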
>
> There's also a pragmatic way to look at the SMI problem: for any real real-time system, you'd be well-advised to choose hardware that is fully documented and not subject to SMIs, since SMIs can cause much more damage than just fiddling with the TSC (e.g., flushing the cache, arbitrarily delaying tasks, etc.). If there are uncontrollable SMIs that change observable hardware state, then a platform is ill-suited for real-time purposes. Why should we support platforms that are fundamentally flawed in a real-time context?
>
> Nonetheless, if you have an alternative low-overhead tracing method, I'd be very interested to see it integrated with LITMUS^RT.
>
> Thanks,
> Björn
>
>
>
> _______________________________________________
> litmus-dev mailing list
> litmus-dev at lists.litmus-rt.org
> https://lists.litmus-rt.org/listinfo/litmus-dev



