[LITMUS^RT] domU crashes (RT-Xen)

Glenn Elliott gelliott at cs.unc.edu
Mon Apr 28 17:24:47 CEST 2014


On Apr 28, 2014, at 10:30 AM, Thijs Raets <thijsraets at gmail.com> wrote:

> 
> No usable information: meaning I do have information but nothing out of the ordinary. I'm running GSN-EDF. I also looked at /var/log/kernel.log but nothing special. I'm new to kernel debugging so I'm having a hard time finding the right information. But firstly I would like to know if it is normal for Litmus-RT to crash the system when overloaded with RT tasks? 
> 
> If I only run the first 7 tasks on 6 VCPUs there's no crash, also if I run all the tasks on all of my 15 VCPUs there's no crash.  
> 
> 2014-04-28 15:02 GMT+02:00 Björn Brandenburg <bbb at mpi-sws.org>:
> On 28 Apr 2014, at 14:56, Thijs Raets <thijsraets at gmail.com> wrote:
> >
> > I'm running a virtual machine with 6 VCPUs with the following task set:
> > ./rtspin 12 601 120 -w &
> > ./rtspin 478 634 120 -w &
> > ./rtspin 334 636 120 -w &
> > ./rtspin 128 644 120 -w &
> > ./rtspin 460 645 120 -w &
> > ./rtspin 10 662 120 -w &
> > ./rtspin 103 666 120 -w &
> > ./rtspin 276 678 120 -w &
> > ./rtspin 137 691 120 -w &
> >
> > When I release this task set, my system crashes. I calculated the MPR interface and 6 VCPUs should be enough. Also /dev/litmus/log does not give me any usable information. I would like to test my virtual machine for real time behavior, so it's ok if deadlines are missed, a crash however is unacceptable. Can anyone help (how can I avoid this behavior or what could be the problem)?
> 
> We're happy to help debugging crashes, but you'll need to provide more information. Which scheduler plugin are you using? What do you mean by "crash"? Are you getting a kernel panic? An OOPS? Is a BUG_ON() triggering? Do you have a backtrace? Is it a lock-up?
> 
> Also, what do you mean by "usable information"? Are you getting any info at all?
> 
> - Björn
> 



Hi Thijs,

Litmus should not be crashing under overload.  It is possible that you’re not getting any useful information out of the logs because the kernel is crashing before they get written to disk or even printed to the console.

Would you be able to test in qemu, or is xen fundamental to your research?  With qemu, it is possible to attach gdb over a local socket connection.  This is extremely helpful because you can:

1) Break into the system to diagnose a deadlock
2) Set breakpoints
3) Dump log buffers
4) Examine thread/task/scheduler state
5) And probably most important to you: get backtraces. A backtrace is the first step in figuring out what’s going wrong.

If you’re lucky, the bug is captured immediately in the backtrace.  However, crashes are often caused by earlier state corruption (most often due to a race condition).  In this case, armed with a backtrace, you can go to the log from /dev/litmus/log and try to determine the events that directly led to the crash (it helps to know the particular CPU where the crash occurred; this should be in the backtrace).  Then you must meticulously trace backwards through the log to understand the sequence of events that led to the crash.  You’ll need the kernel code at hand to piece the emitted log messages together into sequences of executed operations.  You’ll start feeling like Cypher from The Matrix after a while.
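
If the system stays up for a while before dying, you can also capture the TRACE() log at runtime, before the crash takes the kernel (and any unwritten log messages) down with it.  For example (the output file name is just a suggestion):

cat /dev/litmus/log > litmus-trace.txt &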

Here’s some info that can get you started with qemu: https://wiki.litmus-rt.org/litmus/VirtualMachineImages

I’m not entirely sure the qemu command-line arguments in the above link are complete enough to debug with gdb.  Here are the command-line arguments I use:

qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -hda ./<your disk image> -m 1024 -name "blah" -nographic -kernel litmus-rt/arch/x86/boot/bzImage -append "console=ttyS0,115200 root=/dev/hda1 nopat" -gdb tcp::12345 -net nic -net user,hostfwd=tcp::2222-:22
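
A few notes on these flags: -gdb tcp::12345 makes qemu listen for a gdb connection on TCP port 12345, and hostfwd=tcp::2222-:22 forwards port 2222 on the host to the guest’s SSH port.  Assuming an SSH server is running inside the guest, you can then log in with something like (the user name depends on your disk image):

ssh -p 2222 root@localhost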

Then to attach with gdb:
gdb litmus-rt/vmlinux

(in the gdb console...)
> target remote localhost:12345
> <set breakpoints>
> continue
(once the kernel crashes, a breakpoint hits, or you interrupt with ctrl-c)
(to dump the printk buffer:)
> dump binary value printk.txt __log_buf
(to dump the litmus TRACE() buffer:)
> dump binary memory trace.txt debug_buffer.buf debug_buffer.buf+<number of bytes in the buffer, i.e., 2^(CONFIG_SCHED_DEBUG_TRACE_SHIFT)>
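
If you don’t yet know where to set breakpoints, one reasonable starting point is to break on the kernel’s fatal-error paths, so that gdb stops at the moment of the crash.  For example (panic and oops_enter are standard kernel functions):

> break panic
> break oops_enter
> continue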

You may want to run strings on the *.txt output to filter out garbage characters.  Note that both are ring buffers, so the “start” of the log may be somewhere in the middle of the file.  Scan through the files and find where the timestamp wraps around to determine the start/end of the trace.
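
For example, something like this (the file names match the dump commands above; the output names are arbitrary):

strings printk.txt > printk-clean.txt
strings trace.txt > trace-clean.txt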

Another useful debugging tool is kdump, especially when used with Red Hat’s crash utility (a gdb wrapper with some built-in kernel knowledge).  I use kdump when I must debug a natively running kernel (as opposed to one running within a virtual machine).  There are several tutorials on the internet on how to use kdump.  I’m not sure whether you can use kdump within xen, but it might be worth a try.
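
In case you do try kdump, a rough sketch of the workflow (the image paths and reserved-memory size below are placeholders; consult a tutorial for your distribution): boot the target kernel with crashkernel=128M on its command line to reserve memory for a capture kernel, load the capture kernel with kexec -p, and after a crash the system reboots into the capture kernel, where the old kernel’s memory appears at /proc/vmcore:

kexec -p /boot/vmlinuz-capture --initrd=/boot/initrd-capture.img --append="root=/dev/hda1 single irqpoll"
(after the crash, booted into the capture kernel:)
crash litmus-rt/vmlinux /proc/vmcore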

In any case, don’t forget to compile your kernel with debug symbols and frame pointers (CONFIG_DEBUG_INFO and CONFIG_FRAME_POINTER).
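
If you’d rather not dig through menuconfig, the kernel’s scripts/config helper can flip these options from the command line; for example, from the top of the litmus-rt source tree (assuming a .config already exists):

scripts/config --enable CONFIG_DEBUG_INFO --enable CONFIG_FRAME_POINTER
make oldconfig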

Finally, I’ve attached the .config that I use for debugging with qemu/gdb.  Hopefully it can help you out, but I cannot guarantee that it will work on your system.

-Glenn

-------------- next part --------------
A non-text attachment was scrubbed...
Name: example-litmus-kvm-config
Type: application/octet-stream
Size: 65041 bytes
Desc: not available
URL: <http://lists.litmus-rt.org/pipermail/litmus-dev/attachments/20140428/463882ef/attachment.obj>

