[LITMUS^RT] Help with Understanding Scheduling Results
Glenn Elliott
gelliott at cs.unc.edu
Fri Apr 3 17:20:57 CEST 2015
> On Apr 2, 2015, at 12:16 PM, Geoffrey Tran <gtran at isi.edu> wrote:
>
> Hi Björn,
>
> Thank you again for your detailed reply!
>
> For the logs, I have disabled CONFIG_PREEMPT_STATE_TRACE since it produced huge logs. Did you mean that I should turn this on
> before collecting logs? I just wanted to clear that up before I go recompiling the kernel to disable the affinity code.
>
> One of my colleagues did test with the affinity disabled, and while it did improve scheduling results for this two-task case, it
> did not solve the issue with a larger task set. However, that image is no longer available, and no logs were stored, so I will
> redo it after receiving your comments on the log settings.
>
> The in-depth explanation of the log is again helpful. I am not sure whether Xen always delivers all IPIs.
>
> When you say enable SCHED_STATE, do you mean the above tracing parameter? Everything else is enabled in the LITMUS^RT->Tracing
> menu for the kernel config.
>
> Thanks,
> Geoffrey
>
>
> ----- Original Message -----
> From: "Björn Brandenburg" <bbb at mpi-sws.org>
> To: "Geoffrey Tran" <gtran at isi.edu>, "Glenn Elliott" <gelliott at cs.unc.edu>
> Cc: "Meng Xu" <xumengpanda at gmail.com>, litmus-dev at lists.litmus-rt.org, "Mikyung Kang" <mkkang at isi.edu>, "Stephen Crago" <crago at isi.edu>, "John Walters" <jwalters at isi.edu>
> Sent: Thursday, April 2, 2015 9:48:04 AM
> Subject: Re: [LITMUS^RT] Help with Understanding Scheduling Results
>
>
>> On 01 Apr 2015, at 04:22, Geoffrey Tran <gtran at isi.edu> wrote:
>>
>> Hi Meng,
>>
>> I received this email on the KVM setup:
>> <each VM>
>> 8 cores
>> 100% budget
>> raw qemu image
>> vda virtio
>> with the following kernel: Linux localhost 3.4.43-WR5.0.1.10_standard #1 SMP PREEMPT Sat Dec 21 16:28:51 EST 2013 x86_64 GNU/Linux
>>
>> Also, we observe similar anomalies with the credit scheduler.
>>
>>
>>
>> Hello again Björn,
>>
>> Would you happen to have any more comments or suggestions we could try?
>>
>> Thank you very much,
>> Geoffrey
>
> Hi Geoffrey,
>
> (CC'ing Glenn, who knows the affinity code best)
>
> sorry for the long pause. I finally had some time to look at the traces.
>
> First, it looks like your bm_log.txt trace contains ONLY messages related to SCHED_STATE tracing, which is not very useful.
>
> Second, could you please retrace what happens under Xen with CONFIG_SCHED_CPU_AFFINITY disabled? Something goes wrong around the affinity check. I'm not quite sure what's happening, but let's try disabling that code to narrow down the issue.
>
> The first job to miss a deadline is myapp/1535:15, so let's look at that.
>
> The predecessor myapp/1535:14 completes before its deadline; the task is hence added to the release queue.
>
> 278 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:442]: (myapp/1535:14) blocks:0 out_of_time:0 np:0 sleep:1 preempt:0 state:0 sig:0
> 279 P1 [job_completion at litmus/sched_gsn_edf.c:364]: (myapp/1535:14) job_completion().
> 280 P1 [__add_release at litmus/rt_domain.c:348]: (myapp/1535:15) add_release(), rel=620115000000
> 281 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:490]: (myapp/1535:15) scheduled_on = NO_CPU
>
> Some time later, the release interrupt is handled by CPU 1. It links myapp/1536:15 to itself...
>
> 286 P1 [check_for_preemptions at litmus/sched_gsn_edf.c:302]: (myapp/1536:15) linking to local CPU 1 to avoid IPI
>
> ... and then initially tries to link myapp/1535:15 to CPU 4, which, however, we never see in the trace.
>
> 287 P1 [check_for_preemptions at litmus/sched_gsn_edf.c:314]: check_for_preemptions: attempting to link task 1535 to 4
>
> Then the affinity code kicks in and determines that CPU 0 is much closer to CPU 1, which scheduled the task last.
>
> 288 P1 [gsnedf_get_nearest_available_cpu at litmus/sched_gsn_edf.c:275]: P0 is closest available CPU to P1
>
> So far so good. Now myapp/1535:15 gets linked to P0. However, we never hear from P0 in the trace until much later.
>
> 289 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:445]: (swapper/1/0:0) will be preempted by myapp/1536
> 290 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:485]: (myapp/1536:15) scheduled_on = P1
> 291 P1 [litmus_schedule at kernel/sched/litmus.c:51]: (myapp/1536:15) migrate from 0
> 292 P1 [litmus_schedule at kernel/sched/litmus.c:65]: (myapp/1536:15) stack_in_use=-1
> 293 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:442]: (myapp/1536:15) blocks:0 out_of_time:0 np:0 sleep:1 preempt:0 state:0 sig:0
> 294 P1 [job_completion at litmus/sched_gsn_edf.c:364]: (myapp/1536:15) job_completion().
> 295 P1 [__add_release at litmus/rt_domain.c:348]: (myapp/1536:16) add_release(), rel=620615000000
> 296 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:490]: (myapp/1536:16) scheduled_on = NO_CPU
> 297 P1 [check_for_preemptions at litmus/sched_gsn_edf.c:302]: (myapp/1536:16) linking to local CPU 1 to avoid IPI
> 298 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:445]: (swapper/1/0:0) will be preempted by myapp/1536
> 299 P1 [gsnedf_schedule at litmus/sched_gsn_edf.c:485]: (myapp/1536:16) scheduled_on = P1
> 300 P0 [gsnedf_schedule at litmus/sched_gsn_edf.c:445]: (swapper/0/0:0) will be preempted by myapp/1535
> 301 P0 [gsnedf_schedule at litmus/sched_gsn_edf.c:485]: (myapp/1535:15) scheduled_on = P0
>
> At this point, the **next** job of myapp/1536 has already been released, i.e., P0 "was out to lunch" for an entire period and so naturally myapp/1535 has a couple of tardy jobs while it catches up.
>
> The more I think about it, the less I think it has anything to do with the affinity code (but try disabling it anyway). Rather, this could be explained by a "lost" IPI, in which case P0 simply idled until it received an interrupt for some other reason. Are you sure Xen always delivers all IPIs?
>
> By the way, this would also explain why adding 'yes' background tasks helps: they prevent the CPUs from going to sleep and frequently cause I/O, which triggers the scheduler, so a lost IPI wouldn't be noticed.
>
> (Alternatively, it could also be the case that P0 was not scheduled by the hypervisor for an extended amount of time, but I'm sure you've checked this already.)
>
> To further investigate this issue, it would help to enable SCHED_STATE tracing in the kernel running under Xen. I'm happy to take another look once you've collected further traces.
>
> Thanks,
> Björn
Hi All,
I just want to reiterate what has already been said about cache-affinity-aware CPU scheduling. With this feature enabled, I believe a given scheduling decision is more likely to result in the scheduling of a remote processor rather than the processor that is making the decision. If there is a delay in activating the remote processor (either in IPI delivery or in the hypervisor scheduling the corresponding virtual CPU thread), then the actual application of the scheduling decision on the remote processor is delayed. There is no such delay when the local processor is the one scheduled.
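To make the local-versus-remote distinction concrete, here is a rough, illustrative sketch of the kind of selection the affinity code performs. This is not the actual litmus/sched_gsn_edf.c code; cache_distance() and nearest_idle_cpu() are made-up names for this example, and the real implementation consults the cache topology (SMT siblings, shared caches) rather than the toy metric used here.

#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
    int id;
    bool idle;   /* no real-time job currently linked to this CPU */
};

/* Made-up stand-in metric: 0 = same CPU as before, 1 = anything else.
 * The actual affinity code walks the cache topology instead. */
static int cache_distance(int a, int b)
{
    return (a == b) ? 0 : 1;
}

/* Return the idle CPU "closest" to where the job last ran, or -1 if no
 * CPU is idle. On -1 the caller falls back to preempting the
 * lowest-priority CPU, which is often a remote CPU -- and a remote link
 * only takes effect after the IPI is delivered and, under Xen/KVM, the
 * hypervisor actually runs that VCPU. */
static int nearest_idle_cpu(const struct cpu_state *cpus, int n, int last_cpu)
{
    int best = -1;
    int best_dist = 1 << 30;

    for (int i = 0; i < n; i++) {
        if (!cpus[i].idle)
            continue;
        int d = cache_distance(cpus[i].id, last_cpu);
        if (d < best_dist) {
            best_dist = d;
            best = cpus[i].id;
        }
    }
    return best;
}

int main(void)
{
    /* Toy example mirroring trace line 288: the release is handled on
     * CPU 1, CPU 1 is busy, so the closest available CPU (here CPU 0)
     * is picked -- a remote decision that needs an IPI to take effect. */
    struct cpu_state cpus[] = {
        { 0, true }, { 1, false }, { 2, false }, { 3, false },
        { 4, true }, { 5, false }, { 6, false }, { 7, false },
    };
    int target = nearest_idle_cpu(cpus, 8, 1);
    printf("link to CPU %d%s\n", target,
           target == 1 ? " (local, no IPI)" : " (remote, needs IPI)");
    return 0;
}

Whether the chosen CPU turns out to be the local one (the decision applies immediately) or a remote one (IPI delivery plus, in a virtualized setup, a hypervisor dispatch must happen first) is exactly the latency difference I described above.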
An increase in the number of tasks will increase the number of scheduling decisions that need to be made, which may in turn increase the number of remote scheduling decisions. On the other hand, the affinity-aware scheduling only kicks in if desired CPUs are available, and more tasks may decrease the likelihood that desired CPUs are available, which would reduce the number of remote scheduling decisions. I would expect the behavior you see to depend greatly upon the task set properties (utilization, cache working set size, etc.). I’ve thought about making affinity-aware scheduling a per-task property: the user would selectively enable the behavior for tasks with large cache working set sizes. However, we generally want to avoid adding such complexity to a research kernel, and we also lack a good benchmark for demonstrating effectiveness. Without any metrics, it’s hard to justify the additional complexity.
It occurs to me that increasing the number of available CPUs may increase the number of remote scheduling decisions (maybe Björn already mentioned this). Perhaps this is the problem that you are seeing? Here’s something that you may want to try:
Using Linux’s CPU affinity masks, partition tasks among the CPUs. That is, map each task to one CPU. Try to balance CPU utilization or number of tasks per CPU. Do this even if you’re using global or clustered schedulers—just make sure that the selected CPU is within the task’s desired cluster. What we’re doing here is setting an _initial_ CPU affinity for each task. Next, transition the tasks into the real-time scheduler. Hopefully, with a little initial balancing, CPU migrations will be rare. I haven’t tried this technique with CPUs, but this approach was absolutely critical to obtaining good performance in my work with clustered GPU scheduling.
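In code, the initial pinning step could look roughly like the untested sketch below. It only uses plain sched_setaffinity(); the liblitmus side (rt_task parameters, task_mode(LITMUS_RT_TASK)) is just hinted at in a comment, so adapt it to whatever your task launcher already does.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single CPU. pid 0 means "this process". */
static int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

int main(int argc, char **argv)
{
    /* The target CPU would come from your initial assignment, e.g. a
     * simple worst-fit partitioning of task utilizations. */
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;

    if (pin_to_cpu(cpu) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... then set up the rt_task parameters and transition into the
     * plugin (task_mode(LITMUS_RT_TASK) in liblitmus) as usual. The
     * only change is that the task enters the real-time scheduler
     * while already sitting on its assigned CPU. */

    return 0;
}

The idea is simply that, by the time the task enters the plugin, it is already sitting on its assigned CPU, so the scheduler does not have to migrate it to get started.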
Geoffrey, I’ve lost track of which Litmus schedulers you have tried. Have you tried one of the partitioned schedulers (P-EDF or P-FP)? If not, please do try one of them and report whether you see the same weird behavior when you scale up the number of CPUs. If your troubles are due to delayed responses to remote scheduling decisions, then I would expect that you _won’t_ observe the weird scaling issues under P-EDF or P-FP, since remote scheduling decisions are impossible there.
-Glenn