[LITMUS^RT] Questions about Process Migration

Christopher Kenna cjk at cs.unc.edu
Sun Aug 5 01:08:09 CEST 2012


Something I forgot to mention: see also "Runqueue invariant." in
Björn's dissertation for additional information about why what you are
trying to do is difficult, and how LITMUS^RT abstracts away some of
the difficulties of working with runqueues for you.
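
Roughly speaking, a plugin supplies callbacks and lets the LITMUS^RT
core deal with the underlying Linux runqueues. A simplified sketch of
the idea (field names abbreviated and the struct renamed; see
sched_plugin.h in the LITMUS^RT sources for the real interface):

    /* Sketch of the plugin interface (simplified, not the actual
     * definition). The core invokes these callbacks and performs the
     * runqueue manipulation on the plugin's behalf. */
    struct sched_plugin_sketch {
            const char *plugin_name;
            /* Pick the next real-time task on this CPU; returning
             * NULL yields the CPU to non-real-time tasks. */
            struct task_struct *(*schedule)(struct task_struct *prev);
            /* Notifications about task lifecycle events. */
            void (*task_new)(struct task_struct *t, int on_rq, int running);
            void (*task_exit)(struct task_struct *t);
    };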

On Sat, Aug 4, 2012 at 4:02 PM, Christopher Kenna <cjk at cs.unc.edu> wrote:
> Hi Hang,
>
> It is difficult for me to understand some of the code you have
> provided without more information about your data structures, which
> macros you are using (or whether you have redefined them), what locks
> are taken when, etc. However, I might be able to offer you some
> general advice.
>
> You said that your global scheduler is protected by only a single
> global lock. Why, then, do you have one run queue per processor?
> I'm not saying that this approach is wrong, but I am not sure that
> you gain any additional concurrency from it. It might also make race
> conditions and deadlocks in your code harder to debug.
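>
> For illustration, with a single global lock the simplest design is
> one shared ready queue, which is roughly what the global LITMUS^RT
> plugins do. A minimal sketch (the rt_list member and the names here
> are made up for the example, not taken from your code or from
> LITMUS^RT):
>
>     #include <linux/spinlock.h>
>     #include <linux/list.h>
>     #include <linux/sched.h>
>
>     /* One lock, one queue: no less concurrent than per-CPU queues
>      * under the same single lock, and much easier to reason about. */
>     static DEFINE_RAW_SPINLOCK(global_lock);
>     static LIST_HEAD(global_ready_queue);
>
>     static struct task_struct *pick_next_global(void)
>     {
>             struct task_struct *next = NULL;
>             unsigned long flags;
>
>             raw_spin_lock_irqsave(&global_lock, flags);
>             if (!list_empty(&global_ready_queue)) {
>                     /* rt_list: hypothetical list_head embedded in
>                      * the per-task scheduling state. */
>                     next = list_first_entry(&global_ready_queue,
>                                             struct task_struct,
>                                             rt_list);
>                     list_del(&next->rt_list);
>             }
>             raw_spin_unlock_irqrestore(&global_lock, flags);
>             return next;
>     }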
>
> A "retry-loop" is not the best way to handle process migration.
>
> The approach usually taken in LITMUS^RT for migrating processes is
> link-based scheduling. Björn's dissertation provides a good
> explanation of what this is in the "Lazy preemptions." section on
> page 201 (at the time of this writing, assuming that Björn does not
> change the PDF I link to):
> http://www.cs.unc.edu/~bbb/diss/brandenburg-diss.pdf
> Take a look at some of the link/unlink code in the existing LITMUS^RT
> plugins for another approach to handling migrations.
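>
> Roughly, the per-CPU state involved looks like this (a sketch loosely
> modeled on the cpu_entry_t used by the global plugins; names
> simplified, not the actual plugin code):
>
>     /* Where a task *should* run vs. where it *is* running. */
>     struct cpu_state {
>             int cpu;
>             struct task_struct *linked;    /* assigned to this CPU */
>             struct task_struct *scheduled; /* running on this CPU */
>     };
>
>     /* Linking only records the decision; the context switch happens
>      * later, in the plugin's schedule() callback on the target CPU,
>      * when it sees scheduled != linked. The linking CPU never waits
>      * for the switch to complete -- it just posts an IPI. */
>     static void link_task(struct cpu_state *entry,
>                           struct task_struct *tsk)
>     {
>             entry->linked = tsk;
>             if (entry->scheduled != entry->linked)
>                     smp_send_reschedule(entry->cpu);
>     }
>
> The key point is that linking separates "where a task should run"
> from "where it is running right now," so no CPU ever busy-waits on
> another CPU's context switch.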
>
> I hope that this is helpful to you. Other group members might have
> additional advice beyond what I've given here.
>
>  -- Chris
>
> On Sat, Aug 4, 2012 at 3:05 PM, Hang Su <hangsu.cs at gmail.com> wrote:
>> Hi all:
>>
>> I am implementing a global scheduler. However, a process migration
>> problem has confused me for a couple of days. Let me first describe
>> the problem.
>>
>> I have a processor with two cores, and each core maintains a
>> runqueue (rq). I want to migrate a running process (PID 1401) from
>> RQ1 to RQ0 in the following case: when a tick occurs on CPU0, it
>> enters my scheduling algorithm, and the piece of code about process
>> migration below freezes my system, because it never breaks out of
>> the while loop.
>>
>>                 int this_cpu = smp_processor_id();
>>                 /* My target task (PID 1401) is currently running on CPU1. */
>>                 src_rq = task_rq(job->task);
>>                 /* My target task's target CPU is CPU0. */
>>                 des_rq = cpu_rq(job->mapping_info[0].cpu);
>>
>>                 if (src_rq != des_rq) {
>>                         /* count = 100; */
>>                         while (task_running(src_rq, job->task) /* && count > 0 */) {
>>                                 /* I tried each of these alternatives: */
>>                                 smp_send_reschedule(src_rq->cpu);
>>                                 /* set_tsk_need_resched(src_rq->curr); */
>>                                 /* count--; */
>>                         }
>>                         /*
>>                         if (task_running(src_rq, job->task)) {
>>                                 snprintf(msg, MSG_SIZE,
>>                                          "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d",
>>                                          src_rq->cpu, des_rq->cpu,
>>                                          src_rq->curr->pid, job->task->pid);
>>                                 register_event(sched_clock(), msg, this_cpu);
>>                                 return 1;
>>                         }
>>                         */
>>                 }
>>
>> However, my tracing info shows that my target task (PID 1401) was
>> switched out and replaced by the process with PID 23; at that time,
>> rq(1)->curr is 23. So I do not know why this piece of code on CPU0
>> cannot break out of the while loop. My scheduling algorithm is
>> protected by a single, global spin_lock.
>>
>> Time:63625164872 MSG:context_switch prev->(1401),next->(23)
>> Time:63625166671 MSG:rq:1  curr(23)
>>
>>
>> In order to avoid freezing the system and to print error
>> information, I had to change my code to:
>>
>>                 int this_cpu = smp_processor_id();
>>                 /* My target task (PID 1401) is currently running on CPU1. */
>>                 src_rq = task_rq(job->task);
>>                 /* My target task's target CPU is CPU0. */
>>                 des_rq = cpu_rq(job->mapping_info[0].cpu);
>>
>>                 if (src_rq != des_rq) {
>>                         count = 1000;
>>                         while (task_running(src_rq, job->task) && count > 0) {
>>                                 /* Again, I tried each of these alternatives: */
>>                                 smp_send_reschedule(src_rq->cpu);
>>                                 /* set_tsk_need_resched(src_rq->curr); */
>>                                 count--;
>>                         }
>>                         if (task_running(src_rq, job->task)) {
>>                                 snprintf(msg, MSG_SIZE,
>>                                          "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d",
>>                                          src_rq->cpu, des_rq->cpu,
>>                                          src_rq->curr->pid, job->task->pid);
>>                                 register_event(sched_clock(), msg, this_cpu);
>>                                 return 1;
>>                         }
>>                 }
>>
>> The tracing info at CPU0 is:
>> Time:63625166136 MSG:zigzag_pack src_rq:1 des_rq:0 src_rq_curr_pid:1401
>> pid:1401
>>
>> I tried both ways of triggering CPU1 to reschedule,
>> smp_send_reschedule(src_rq->cpu) and set_tsk_need_resched(src_rq->curr),
>> but neither works.
>>
>>
>> If any of you are experts on this issue, please give me some tips.
>>
>> Thanks.
>>
>>
>> _______________________________________________
>> litmus-dev mailing list
>> litmus-dev at lists.litmus-rt.org
>> https://lists.litmus-rt.org/listinfo/litmus-dev
>>