[LITMUS^RT] high miss ratio for low task utilization
jp walters
mailinglistjp39 at gmail.com
Fri Nov 21 16:29:20 CET 2014
Hi Glenn,
>
>
> Hi JP,
>
> If you compiled Litmus with “CONFIG_SCHED_DEBUG_TRACE” enabled, then the
> experiment-scripts run.py should have created a file named “trace.slog”.
> Can you find such a file? Take a look at the file and see if any error
> messages jump out at you. It might give you a hint as to what is wrong.
> If you can’t make heads or tails of the log file (which is completely
> understandable), tgz it up and post it to the mailing list (unless it’s
> tremendously big).
>
Yes, I have the trace.slog. I've attached the one from the VM run. At a high
level, a few things jump out at me, though I don't know whether they're cause
for concern:
1) There are hundreds of lines like the following (758 to be exact, over a
10-second run):
(rtspin/3181:170) scheduled_on = NO_CPU
2) I see roughly 500 lines like the following (the CPU number varies):
(rtspin/3180:503) linking to local CPU 2 to avoid IPI
3) On the VM only, I see about 170 instances of the following:
[gsnedf_get_nearest_available_cpu at litmus/sched_gsn_edf.c:275]: Could not
find an available CPU close to P4
>
> If I may inquire, what x86 system do you have that has 128 physical
> cores? That’s pretty incredible! I don’t believe that anyone has ever run
> Litmus on anything with more than 64 cores. Come to think of it, some of
> liblitmus’s routines will break when P-EDF (and C-EDF with L1 clustering)
> is used on a system with more than 64 CPUs. This is a limitation of the
> user-space, not the Litmus kernel.
>
>
This is running on top of an SGI UV100:
https://www.sgi.com/products/servers/uv/
Ours is an older generation than the machine I linked to, but the
architecture hasn't changed much. It's a bunch of Xeon X7550s connected over
NUMAlink, which provides the cache coherence.
Once I figure out what I'm doing, I'd be happy to run some experiments on it.
Feel free to follow up with me off-list if this is of interest to you or your
group.
> Here are links to the broken user-space code:
> https://github.com/LITMUS-RT/liblitmus/blob/master/src/migration.c#L105
> https://github.com/LITMUS-RT/liblitmus/blob/master/src/migration.c#L127
>
> The limitation is this: a 64-bit mask is used to report the mapping
> between CPU clusters (“domains”) and CPUs. If you have more than 64 CPUs
> or more than 64 domains, then these routines will break. You haven’t hit
> this limit yet since your VM is constrained to 8 cores. I must say, I
> didn’t think Litmus users would hit the 64-CPU limit so soon. Maybe it’s
> time to come up with a solution. Björn, should we use a __uint128_t, a
> struct, or CPU_SET? I prefer __uint128_t to keep things simple, but I
> admit that only buys us time. We’d have to do something else if someone
> wanted to run Litmus on Xeon Phi (available today), which has 244 hardware
> threads.
>
> -Glenn
> _______________________________________________
> litmus-dev mailing list
> litmus-dev at lists.litmus-rt.org
> https://lists.litmus-rt.org/listinfo/litmus-dev
>
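Regarding the 64-bit domain mask above: to make sure I understand the CPU_SET
option, here is a minimal sketch of how the domain-to-CPU mapping could be
kept in a cpu_set_t instead of a 64-bit integer, so it isn't capped at 64
CPUs. This is not the actual liblitmus code; cpu_in_domain() and
domain_cpus() are made-up names, and the membership test is a placeholder.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical membership test; in liblitmus this information would come
 * from whatever topology/domain files the library already parses. */
static int cpu_in_domain(int domain, int cpu)
{
        /* placeholder: pretend each domain owns 8 consecutive CPUs */
        return cpu / 8 == domain;
}

/* Build the CPU set for one domain. cpu_set_t is sized by CPU_SETSIZE
 * (1024 by default on glibc), so >64 CPUs or >64 domains is not a problem. */
static void domain_cpus(int domain, int num_cpus, cpu_set_t *set)
{
        int cpu;
        CPU_ZERO(set);
        for (cpu = 0; cpu < num_cpus; cpu++)
                if (cpu_in_domain(domain, cpu))
                        CPU_SET(cpu, set);
}

int main(void)
{
        cpu_set_t set;
        int cpu;

        domain_cpus(3, 128, &set);  /* e.g., a 128-core UV100 */
        printf("domain 3 has %d CPUs:", CPU_COUNT(&set));
        for (cpu = 0; cpu < 128; cpu++)
                if (CPU_ISSET(cpu, &set))
                        printf(" %d", cpu);
        printf("\n");
        return 0;
}

If I'm reading things right, the resulting set could then be handed straight
to sched_setaffinity() for the migration itself, rather than converting a
64-bit mask first.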
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trace-vm.slog.gz
Type: application/x-gzip
Size: 55637 bytes
Desc: not available
URL: <http://lists.litmus-rt.org/pipermail/litmus-dev/attachments/20141121/835c0c37/attachment.bin>