[LITMUS^RT] Questions about Process Migration

Sun Aug 5 01:02:04 CEST 2012

Hi Hang,

It is difficult for me to understand some of the code you have
provided without more information about what your data structures are,
which macros you are using (or whether you've redefined them), what
locks are taken when, etc. However, I can offer some general advice.

You said that your global scheduler is protected by a single global
lock. Why do you have one run queue per processor, then? I'm not
saying that this approach is wrong, but I am not sure that you gain
any additional concurrency from it. It might also make race
conditions and deadlocks in your code more difficult to debug.

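A "retry-loop" is not the best way to handle process migration.
Busy-waiting on another CPU from inside the scheduler, particularly
while holding the global lock, can hang if that CPU needs the same
lock before it can reschedule.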

The approach usually taken in LITMUS^RT for migrating processes is
link-based scheduling. Björn's dissertation explains it well in the
"Lazy preemptions" section on page 201 (as of this writing, assuming
that Björn does not change the PDF I link to):
http://www.cs.unc.edu/~bbb/diss/brandenburg-diss.pdf
Take a look at the link/unlink code in the existing LITMUS^RT plugins
to see an alternative way of handling migrations.
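
To make the link-based idea a bit more concrete, here is a minimal
user-space sketch. It is a toy model under my own assumptions, not code
from any LITMUS^RT plugin: the struct layouts, the need_resched flag
(standing in for a reschedule IPI), and the helper names init_cpus(),
link_task_to_cpu(), and local_schedule() are all chosen for
illustration. The point it tries to show is that the CPU making the
scheduling decision only updates "linked" and never waits; each CPU
later reconciles "scheduled" with "linked" on its own.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2

/* Hypothetical task state: where it should run vs. where it runs. */
struct task {
    int pid;
    int linked_on;     /* CPU the scheduler wants it on, or -1 */
    int scheduled_on;  /* CPU it is actually running on, or -1 */
};

/* Hypothetical per-CPU state. */
struct cpu_entry {
    int cpu;
    struct task *linked;     /* task that should run here */
    struct task *scheduled;  /* task that is running here */
    int need_resched;        /* stand-in for a reschedule IPI */
};

static struct cpu_entry cpus[NR_CPUS];

static void init_cpus(void)
{
    for (int i = 0; i < NR_CPUS; i++) {
        cpus[i].cpu = i;
        cpus[i].linked = NULL;
        cpus[i].scheduled = NULL;
        cpus[i].need_resched = 0;
    }
}

/*
 * Record the scheduling decision "t should run on e". In a real
 * scheduler this would run under the global lock. It only updates
 * links and requests reschedules; it never waits for another CPU.
 */
static void link_task_to_cpu(struct task *t, struct cpu_entry *e)
{
    /* Drop whatever was linked to this CPU before. */
    if (e->linked)
        e->linked->linked_on = -1;

    /* If t is still linked elsewhere, break that link and poke that CPU. */
    if (t && t->linked_on != -1) {
        struct cpu_entry *old = &cpus[t->linked_on];
        old->linked = NULL;
        old->need_resched = (old->linked != old->scheduled);
    }

    e->linked = t;
    if (t)
        t->linked_on = e->cpu;
    e->need_resched = (e->linked != e->scheduled);
}

/*
 * Runs on CPU e in response to need_resched: make "scheduled" follow
 * "linked". If the linked task is still running on another CPU, stay
 * idle and keep need_resched set so that this CPU tries again later.
 * Returns the PID now running on e, or -1 if e is idle.
 */
static int local_schedule(struct cpu_entry *e)
{
    if (e->scheduled != e->linked) {
        if (e->scheduled) {
            e->scheduled->scheduled_on = -1;
            e->scheduled = NULL;
        }
        if (e->linked && e->linked->scheduled_on == -1) {
            e->scheduled = e->linked;
            e->scheduled->scheduled_on = e->cpu;
        }
    }
    e->need_resched = (e->scheduled != e->linked);
    return e->scheduled ? e->scheduled->pid : -1;
}
```

In this model, moving PID 1401 from CPU1 to CPU0 amounts to calling
link_task_to_cpu(&t, &cpus[0]) once. CPU0 stays idle until CPU1 has
run local_schedule() and let go of the task, and only then picks it
up; nothing spins while the decision is being made.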

I hope that this is helpful to you. Other group members might have
additional advice beyond what I've given here.

 -- Chris

On Sat, Aug 4, 2012 at 3:05 PM, Hang Su <hangsu.cs at gmail.com> wrote:
> Hi all:
>
> I am implementing a global scheduler. However, a process migration problem
> has confused me for a couple of days. Let me first describe my problem.
>
> I have a processor with two cores, and each core maintains its own rq. I want
> to migrate a running process (PID 1401) from RQ1 to RQ0 in the following case.
> When a tick occurs on CPU0, it enters my scheduling algorithm, and the
> following piece of process migration code freezes my system, because it never
> breaks out of the while loop.
>
>     int this_cpu = smp_processor_id();
>     /* target task (PID 1401), currently running on and located at CPU1 */
>     src_rq = task_rq(job->task);
>     /* the target task's target CPU is CPU0 */
>     des_rq = cpu_rq(job->mapping_info[0].cpu);
>
>     if (src_rq != des_rq) {
>         // count = 100;
>         while (task_running(src_rq, job->task) /* && count > 0 */) {
>             // alternative tried: set_tsk_need_resched(src_rq->curr);
>             smp_send_reschedule(src_rq->cpu);
>             // count--;
>         }
>         /*
>         if (task_running(src_rq, job->task)) {
>             snprintf(msg, MSG_SIZE,
>                      "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d ",
>                      src_rq->cpu, des_rq->cpu, src_rq->curr->pid, job->task->pid);
>             register_event(sched_clock(), msg, this_cpu);
>             return 1;
>         }
>         */
>     }
>
> However, my tracing info shows that my target task (PID 1401) has been
> switched out and replaced by a process with PID 23. At that time, rq(1)->curr
> is 23. So I do not know why my code on CPU0 cannot break out of the while
> loop. My scheduling algorithm is protected by a single global spin_lock.
>
> Time:63625164872 MSG:context_switch prev->(1401),next->(23)
> Time:63625166671 MSG:rq:1  curr(23)
>
>
> To avoid freezing the system and to print error information, I had to change
> my code to:
>
>     int this_cpu = smp_processor_id();
>     /* target task (PID 1401), currently running on and located at CPU1 */
>     src_rq = task_rq(job->task);
>     /* the target task's target CPU is CPU0 */
>     des_rq = cpu_rq(job->mapping_info[0].cpu);
>
>     if (src_rq != des_rq) {
>         count = 1000;
>         while (task_running(src_rq, job->task) && count > 0) {
>             // alternative tried: set_tsk_need_resched(src_rq->curr);
>             smp_send_reschedule(src_rq->cpu);
>             count--;
>         }
>         if (task_running(src_rq, job->task)) {
>             snprintf(msg, MSG_SIZE,
>                      "zigzag_pack src_rq:%d des_rq:%d src_rq_curr_pid:%d pid:%d ",
>                      src_rq->cpu, des_rq->cpu, src_rq->curr->pid, job->task->pid);
>             register_event(sched_clock(), msg, this_cpu);
>             return 1;
>         }
>     }
>
> The tracing info at CPU0 is:
> Time:63625166136 MSG:zigzag_pack src_rq:1 des_rq:0 src_rq_curr_pid:1401 pid:1401
>
> I tried both ways of triggering a reschedule on CPU1,
> smp_send_reschedule(src_rq->cpu) and set_tsk_need_resched(src_rq->curr), but
> neither works.
>
>
> If any of you are experts on this issue, please give me some tips.
>
> Thanks.