[LITMUS^RT] [PATCH] EXPLANATION: Architecture dependent uncachable control page.

Björn Brandenburg bbb at mpi-sws.org
Sat Nov 30 09:13:48 CET 2013


On 07 Oct 2012, at 21:25, Christopher Kenna <cjk at cs.unc.edu> wrote:

> I encountered some problems with the LITMUS^RT control page. To elaborate upon
> what is stated in the patch's commit message (maybe go read that now), imagine
> that we have a control page C that is mapped in both the kernel address space
> and the user address space (like we already do in LITMUS^RT), and the following
> kernel system call and userland code interact:
> 
> Kernel system call S():
>    1) Read C->val and print it using TRACE().
>    2) C->val := 12345
> 
> User code:
>    1) C->val := 98765
>    2) S()
>    3) printf(C->val)
> 
> On x86 (tested on an Intel i7), the result is that the kernel sees the value
> 98765 from the userland and prints it to the LITMUS^RT log using TRACE, while
> the userland code sees the kernel value 12345 and prints it using printf().
> This happened in 100% of my test cases.
> 
> On ARM (tested on an ODROID-X / Samsung Exynos4412), the result is
> non-deterministic. Usually, each address space prints only the value that it
> wrote (12345 in the kernel and 98765 in the userland), but adding sleep()
> between steps 1 and 2 of the userland code results in the kernel occasionally
> printing the "correct" value.
> 
> I explored a few solutions to the problem. One was to restrict the task to run
> on a single CPU. That did not work. I also tried inserting general memory
> barriers in various places, but this also had no effect.
> 
> What did work was to make the mapping of the control page uncachable in both
> user space and kernel space. One theory that could explain why this fixes the
> problem involves caches indexed or tagged with virtual addresses. Shared memory
> accessed via different virtual addresses may need to be made uncached, because
> the cache would contain different entries for each virtual address range. If
> any of these mappings are writable, it causes a coherency problem because
> modifications made through one mapping aren't visible through the cache entries
> for other mappings. However, the documentation I found for the Cortex-A9 says
> that the data cache is Physically Indexed and Physically Tagged (PIPT), and
> that the instruction cache is Virtually Indexed and Physically Tagged
> (VIPT). Since the control page is data, I thought it should be using the
> PIPT cache, and the "aliasing" of virtual addresses should not be an issue.
> 
> Taking the above into account, if anyone can offer an explanation as to why
> this patch fixes the problem, I am very interested to know.

It appears that we finally have a fix for this bug. Roy Spliet reproduced the issue on our ARM hardware and, with some help from Alex Züpke, tracked it down to incorrect bits in the page table. In short, the mapping in userspace indeed happened to be uncached, whereas the kernel mapping was cached. Thus updates in userspace went straight to memory, without evicting any cache contents, whereas the kernel would read stale cache lines.
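
To make the failure mode concrete, here is a toy model in C. It is only a sketch under simplifying assumptions, not the LITMUS^RT code and not the actual fix: a single word of "memory" stands in for C->val, a single write-back cache line stands in for the data cache, and the kernel mapping is always treated as cached. All names (toy_system, run_round, and so on) are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical): one memory word plus one write-back cache line. */
struct toy_system {
    uint32_t memory;     /* backing "physical" word, i.e. C->val */
    uint32_t line;       /* cached copy of that word */
    bool     line_valid; /* the cache currently holds a copy */
    bool     line_dirty; /* the cached copy is newer than memory */
};

/* Accesses through an uncached mapping bypass the cache entirely. */
void uncached_write(struct toy_system *s, uint32_t v) { s->memory = v; }
uint32_t uncached_read(const struct toy_system *s) { return s->memory; }

/* Read through a cached mapping: fill the line from memory on a miss. */
uint32_t cached_read(struct toy_system *s)
{
    if (!s->line_valid) {
        s->line = s->memory;
        s->line_valid = true;
        s->line_dirty = false;
    }
    return s->line;
}

/* Write through a cached mapping: memory is not updated until eviction. */
void cached_write(struct toy_system *s, uint32_t v)
{
    s->line = v;
    s->line_valid = true;
    s->line_dirty = true;
}

/* Evict the line: write it back if dirty, then invalidate it. */
void evict(struct toy_system *s)
{
    if (s->line_valid && s->line_dirty)
        s->memory = s->line;
    s->line_valid = false;
    s->line_dirty = false;
}

struct observed {
    uint32_t kernel_saw; /* value the system call S() would TRACE() */
    uint32_t user_saw;   /* value userland would printf() afterwards */
};

/*
 * One round of the experiment from the original report.
 * user_cached selects the cacheability of the user mapping; the kernel
 * mapping is assumed to be cached in this model.
 */
struct observed run_round(struct toy_system *s, bool user_cached)
{
    struct observed r;

    assert(s);

    /* User step 1: C->val := 98765 */
    if (user_cached)
        cached_write(s, 98765);
    else
        uncached_write(s, 98765);

    /* User step 2: S(), which reads C->val and then sets it to 12345 */
    r.kernel_saw = cached_read(s);
    cached_write(s, 12345);

    /* User step 3: printf(C->val) */
    r.user_saw = user_cached ? cached_read(s) : uncached_read(s);
    return r;
}
```

In this model, calling run_round(&s, false) twice on a zero-initialized toy_system yields kernel_saw == 12345 and user_saw == 98765 on the second call: each side sees only what it wrote itself. Calling evict(&s) between rounds lets the kernel see 98765 again, which is at least in the spirit of the non-determinism described above. With run_round(&s, true), i.e. matching attributes on both mappings, the kernel sees 98765 and userland sees 12345, as on x86.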

The patch is in staging:

	https://github.com/LITMUS-RT/litmus-rt/commit/5b50be03ba9b9515e24924cc07f8c58d9baa0960

As usual in these cases, it's essentially only a two-line fix, but it took weeks to figure out exactly what was going on. A big thank you to Roy for finally tracking this tricky bug down, and thank you Alex for pointing us in the right direction! We owe you two beers. ;-)

Thanks,
Björn




