[LITMUS^RT] Questions about Process Migration

Hang Su hangsu.cs at gmail.com
Sun Aug 5 01:17:47 CEST 2012


Hi Chris:

Thanks for your explanation. Although it is a global scheduler, it keeps a
run queue for each core. Much like the PFair scheduling algorithm, we do
some calculation at each quantum boundary and assign processes to each core
for the next quantum.
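
Roughly, the per-quantum assignment looks like the sketch below (plain C,
written just for this email; the names quantum_plan and plan_next_quantum
are illustrative and not my actual plugin code):

    #define NR_CORES 2

    struct quantum_plan {
        int pid[NR_CORES];  /* task chosen for each core for the next quantum */
    };

    /* At each quantum boundary, under the single global lock, decide which
     * task should run on each core for the next quantum (PFair-style); here
     * simply the first NR_CORES entries of an eligibility-ordered list. */
    static void plan_next_quantum(struct quantum_plan *plan,
                                  const int *eligible, int n)
    {
        int cpu;
        for (cpu = 0; cpu < NR_CORES; cpu++)
            plan->pid[cpu] = (cpu < n) ? eligible[cpu] : -1;  /* -1 = idle */
    }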

Let me first have a look at Björn's solution.

Thanks.

On Sat, Aug 4, 2012 at 6:02 PM, Christopher Kenna <cjk at cs.unc.edu> wrote:

> Hi Hang,
>
> It is difficult for me to understand some of the code you have
> provided without more information about what your data structures are,
> which macros you are using (or if you've redefined them), what locks
> are taken when, etc. However, I might be able to provide you with some
> advice in general.
>
> You said that your global scheduler is protected by only a single
> global lock. Why do you have one run queue per processor then?
> I'm not saying that this approach is wrong, but I am not sure that you
> are gaining additional concurrency in doing this. Also, it might make
> it more difficult for you to debug race conditions and deadlock in
> your code.
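>
> (For contrast, with a single global lock the state it protects is
> effectively serialized anyway, so you could just as well keep one shared
> ready queue; a rough kernel-style sketch, not actual plugin code:)
>
>     #include <linux/spinlock.h>
>     #include <linux/list.h>
>
>     /* One lock, one queue: every CPU must hold global_lock before it
>      * touches ready_queue, so splitting it into per-CPU queues would
>      * not add any real parallelism.  E.g., in the plugin's schedule()
>      * callback:
>      *     raw_spin_lock(&global_lock);
>      *     next = pick_highest_prio(&ready_queue);  // hypothetical helper
>      *     raw_spin_unlock(&global_lock);
>      */
>     static DEFINE_RAW_SPINLOCK(global_lock);
>     static LIST_HEAD(ready_queue);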
>
> A "retry-loop" is not the best way to handle process migration.
>
> The approach taken in LITMUS^RT for migrating processes is usually to
> use link-based scheduling. Björn's dissertation provides a good
> explanation of what this is in the "Lazy preemptions" section on
> page 201 (at the time of this writing, assuming that Björn does not
> change the PDF I link to):
> http://www.cs.unc.edu/~bbb/diss/brandenburg-diss.pdf
> Take a look at some of the link/unlink code in the existing LITMUS^RT
> plugins for another approach to handling migrations.
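>
> To give you a rough idea of the concept (a simplified sketch, not the
> actual plugin code -- cpu_entry and link_task_to_cpu here only loosely
> mirror the real structures and functions):
>
>     #include <linux/sched.h>
>     #include <linux/smp.h>
>
>     struct cpu_entry {
>         int cpu;
>         struct task_struct *linked;    /* task this CPU *should* run next */
>         struct task_struct *scheduled; /* task this CPU is actually running */
>     };
>
>     /* Called with the global scheduler lock held.  Instead of busy-waiting
>      * until the remote CPU has switched away from the task, we only record
>      * the desired assignment ("link") and poke the remote CPU once; the
>      * actual migration happens later, inside that CPU's own schedule()
>      * call, when it sees that entry->linked != entry->scheduled. */
>     static void link_task_to_cpu(struct task_struct *t, struct cpu_entry *entry)
>     {
>         entry->linked = t;
>         if (entry->scheduled != t)
>             smp_send_reschedule(entry->cpu);  /* asynchronous IPI, no waiting */
>     }
>
> The key difference from a retry loop is that the linking CPU never waits
> for the remote CPU; the remote CPU's next schedule() completes the
> hand-off.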
>
> I hope that this is helpful to you. Other group members might have
> additional advice beyond what I've given here.
>
>  -- Chris
>
> On Sat, Aug 4, 2012 at 3:05 PM, Hang Su <hangsu.cs at gmail.com> wrote:
> > Hi all:
> >
> > I am implementing a global scheduler. However, a process migration
> > problem has confused me for a couple of days. Let me first describe my
> > problem.
> >
> > I have a processor with two cores, and each core maintains an rq. I want
> > to migrate a running process (PID 1401) from RQ1 to RQ0 in the following
> > case. When a tick occurs on CPU0, it enters my scheduling algorithm, and
> > the following piece of process-migration code freezes my system, since it
> > never breaks out of the while loop.
> >
> >     int this_cpu = smp_processor_id();
> >     /* my target task (PID 1401) is currently running on and located at CPU1 */
> >     src_rq = task_rq(job->task);
> >     /* my target task's target cpu is CPU0 */
> >     des_rq = cpu_rq(job->mapping_info[0].cpu);
> >
> >     if (src_rq != des_rq) {
> >         /* count = 100; */
> >         while (task_running(src_rq, job->task) /* && count > 0 */) {
> >             smp_send_reschedule(src_rq->cpu);
> >             /* or: set_tsk_need_resched(src_rq->curr); */
> >             /* count--; */
> >         }
> >         /*
> >         if (task_running(src_rq, job->task)) {
> >             snprintf(msg, MSG_SIZE,
> >                      "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d ",
> >                      src_rq->cpu, des_rq->cpu, src_rq->curr->pid, job->task->pid);
> >             register_event(sched_clock(), msg, this_cpu);
> >             return 1;
> >         }
> >         */
> >     }
> >
> > However, my tracing info shows that my target task (PID 1401) has been
> > switched out and replaced by a process with PID 23. At that time,
> > rq(1)->curr is 23. So I do not know why my piece of code on CPU0 cannot
> > break out of the while loop. My scheduling algorithm is protected by a
> > single global spin_lock.
> >
> > Time:63625164872 MSG:context_switch prev->(1401),next->(23)
> > Time:63625166671 MSG:rq:1  curr(23)
> >
> >
> > In order to avoid the system freezing and to print error information, I
> > changed my code to:
> >
> >     int this_cpu = smp_processor_id();
> >     /* my target task (PID 1401) is currently running on and at CPU1 */
> >     src_rq = task_rq(job->task);
> >     /* my target task's target cpu is CPU0 */
> >     des_rq = cpu_rq(job->mapping_info[0].cpu);
> >
> >     if (src_rq != des_rq) {
> >         count = 1000;
> >         while (task_running(src_rq, job->task) && count > 0) {
> >             smp_send_reschedule(src_rq->cpu);
> >             /* or: set_tsk_need_resched(src_rq->curr); */
> >             count--;
> >         }
> >         if (task_running(src_rq, job->task)) {
> >             snprintf(msg, MSG_SIZE,
> >                      "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d ",
> >                      src_rq->cpu, des_rq->cpu, src_rq->curr->pid, job->task->pid);
> >             register_event(sched_clock(), msg, this_cpu);
> >             return 1;
> >         }
> >     }
> >
> > The tracing info at CPU0 is:
> > Time:63625166136 MSG:zigzag_pack src_rq:1 des_rq:0 src_rq_curr_pid:1401 pid:1401
> >
> > I tried both solutions to trigger CPU1 to reschedule,
> > smp_send_reschedule(src_rq->cpu) and set_tsk_need_resched(src_rq->curr),
> > but neither works.
> >
> >
> > If any of you are experts on this issue, please give me some tips.
> >
> > Thanks.
> >
> >

