[LITMUS^RT] high miss ratio for low task utilization
jp walters
mailinglistjp39 at gmail.com
Fri Nov 21 16:29:20 CET 2014
Hi Glenn,
>
>
> Hi JP,
>
> If you compiled Litmus with “CONFIG_SCHED_DEBUG_TRACE” enabled, then the
> experiment-scripts run.py should have created a file named “trace.slog”.
> Can you find such a file? Take a look at the file and see if any error
> messages jump out at you. It might give you a hint as to what is wrong.
> If you can’t make heads or tails of the log file (which is completely
> understandable), tgz it up and post it to the mailing list (unless it’s
> tremendously big).
>
Yes, I have the trace.slog. I've attached the one from the VM run. At a high
level, a few things jump out at me, though I don't know whether they're cause
for concern:
1) There are hundreds of lines like the following (758 to be exact, over a
10-second run):
(rtspin/3181:170) scheduled_on = NO_CPU
2) I see roughly 500 lines like the following (the CPU number varies):
(rtspin/3180:503) linking to local CPU 2 to avoid IPI
3) On the VM only, I see about 170 instances of the following:
[gsnedf_get_nearest_available_cpu at litmus/sched_gsn_edf.c:275]: Could not
find an available CPU close to P4
>
> If I may inquire, what x86 system do you have that has 128 physical
> cores? That’s pretty incredible! I don’t believe that anyone has ever run
> Litmus on anything with more than 64 cores. Come to think of it, some of
> liblitmus’s routines will break when P-EDF (and C-EDF with L1 clustering)
> is used on a system with more than 64 CPUs. This is a limitation of the
> user-space, not the Litmus kernel.
>
>
This is running on top of an SGI UV100:
https://www.sgi.com/products/servers/uv/
Ours is an older generation than the machine I linked to, but the
architecture hasn't changed much. It's a bunch of Xeon X7550s connected over
NUMAlink, which provides the cache coherence.
Once I figure out what I'm doing, I'd be happy to run some experiments on it.
Feel free to follow up with me off-list if this is of interest to you or your
group.
> Here are links to the broken user-space code:
> https://github.com/LITMUS-RT/liblitmus/blob/master/src/migration.c#L105
> https://github.com/LITMUS-RT/liblitmus/blob/master/src/migration.c#L127
>
> The limitation is this: a 64-bit mask is used to report the mapping
> between CPU clusters (“domains”) and CPUs. If you have more than 64 CPUs
> or more than 64 domains, then these routines will break. You haven’t hit
> this limit yet since your VM is constrained to 8 cores. I must say, I
> didn’t think Litmus users would hit the 64-CPU limit so soon. Maybe it’s
> time to come up with a solution. Björn, should we use a __uint128_t, a
> struct, or CPU_SET? I prefer __uint128_t to keep things simple, but I
> admit that only buys us time. We’d have to do something else if someone
> wanted to run Litmus on Xeon Phi (available today), which has 244 hardware
> threads.
>
> -Glenn
> _______________________________________________
> litmus-dev mailing list
> litmus-dev at lists.litmus-rt.org
> https://lists.litmus-rt.org/listinfo/litmus-dev
>
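Regarding the 64-bit domain mask above: to make sure I understand the CPU_SET
option, here is a minimal sketch of how the domain-to-CPU mapping could be
kept in a cpu_set_t instead of a 64-bit integer, so it isn't capped at 64
CPUs. This is not the actual liblitmus code; cpu_in_domain() and
domain_cpus() are made-up names, and the membership test is a placeholder.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical membership test; in liblitmus this information would come
 * from whatever topology/domain files the library already parses. */
static int cpu_in_domain(int domain, int cpu)
{
        /* placeholder: pretend each domain owns 8 consecutive CPUs */
        return cpu / 8 == domain;
}

/* Build the CPU set for one domain. cpu_set_t is sized by CPU_SETSIZE
 * (1024 by default on glibc), so >64 CPUs or >64 domains is not a problem. */
static void domain_cpus(int domain, int num_cpus, cpu_set_t *set)
{
        int cpu;
        CPU_ZERO(set);
        for (cpu = 0; cpu < num_cpus; cpu++)
                if (cpu_in_domain(domain, cpu))
                        CPU_SET(cpu, set);
}

int main(void)
{
        cpu_set_t set;
        int cpu;

        domain_cpus(3, 128, &set);  /* e.g., a 128-core UV100 */
        printf("domain 3 has %d CPUs:", CPU_COUNT(&set));
        for (cpu = 0; cpu < 128; cpu++)
                if (CPU_ISSET(cpu, &set))
                        printf(" %d", cpu);
        printf("\n");
        return 0;
}

If I'm reading things right, the resulting set could then be handed straight
to sched_setaffinity() for the migration itself, rather than converting a
64-bit mask first.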
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trace-vm.slog.gz
Type: application/x-gzip
Size: 55637 bytes
Desc: not available
URL: <http://lists.litmus-rt.org/pipermail/litmus-dev/attachments/20141121/835c0c37/attachment.bin>