<html>
<head>
<base href="https://bugs.webkit.org/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - WebAssembly: eliminate redundant ARM64 TLS load"
href="https://bugs.webkit.org/show_bug.cgi?id=169815">169815</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>WebAssembly: eliminate redundant ARM64 TLS load
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr>
<tr>
<th>Product</th>
<td>WebKit
</td>
</tr>
<tr>
<th>Version</th>
<td>WebKit Nightly Build
</td>
</tr>
<tr>
<th>Hardware</th>
<td>Unspecified
</td>
</tr>
<tr>
<th>OS</th>
<td>Unspecified
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>Normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P2
</td>
</tr>
<tr>
<th>Component</th>
<td>JavaScriptCore
</td>
</tr>
<tr>
<th>Assignee</th>
<td>webkit-unassigned@lists.webkit.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>jfbastien@apple.com
</td>
</tr>
<tr>
<th>CC</th>
<td>fpizlo@apple.com, jfbastien@apple.com, keith_miller@apple.com, mark.lam@apple.com, msaboff@apple.com, sbarati@apple.com
</td>
</tr>
<tr>
<th>Depends on</th>
<td>169611
</td>
</tr>
<tr>
<th>Blocks</th>
<td>159775
</td>
</tr></table>
<p>
<div>
<pre>This is a small optimization; I'm not sure it'll pay off much, but it's neat.
As part of <a class="bz_bug_link
bz_status_ASSIGNED "
title="ASSIGNED - WebAssembly: store state in TLS instead of on VM"
href="show_bug.cgi?id=169611">bug #169611</a> we're moving the WebAssembly context to a TLS slot. On x86 that's a single load / store off the segment register, but on ARM64 it takes mrs + mask + {load,store}. The `mrs TPIDRRO_EL0` instruction, coupled with the mask and the address generation, simply returns the location of our TLS slot (the offset is defined as WTF_WASM_CONTEXT_KEY in wtf/FastTLS.h). That location is idempotent as long as we're executing on the same thread, and that's an invariant of WebAssembly: different instances get set in that context, but the location of the slot is the same per thread.
Right now this mrs+mask+memory combo is generated as a unit by the ARM64 macro assembler. This is inefficient because the slot address gets recomputed on every access. We could instead teach the compiler about the idempotent part (i.e. "get TLS slot #x") and then split off the load / store from that slot. For x86 that could mean combining both operations after the fact or keeping the same model we have now. For ARM64 that would allow us to eliminate redundant mrs+mask when profitable, or to rematerialize them under register pressure.</pre>
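<pre>To make the split concrete, here's a rough sketch in plain C++. It's a software model of the idea, not JavaScriptCore code: FakeThread, contextKey, tlsSlotAddress and the run* helpers are all made up for illustration, and a counter stands in for executing the mrs. It only shows why separating "get TLS slot #x" from the load / store lets the address computation be done once.

```cpp
// Sketch only: a software model of the idea, not JavaScriptCore code.
// Every name here (FakeThread, contextKey, tlsSlotAddress, ...) is hypothetical.

struct FakeThread {
    void* slots[4];     // stands in for this thread's TLS slot array
    unsigned mrsCount;  // counts simulated reads of the thread register
};

constexpr unsigned contextKey = 1; // hypothetical stand-in for WTF_WASM_CONTEXT_KEY

// Idempotent part ("get TLS slot #x"): models mrs + mask + address generation.
// On a given thread it always returns the same address.
void** tlsSlotAddress(FakeThread* thread, unsigned key)
{
    thread->mrsCount++;
    return thread->slots + key;
}

// Fused model: every access redoes the address computation.
void* loadContextFused(FakeThread* thread)
{
    return *tlsSlotAddress(thread, contextKey);
}

void storeContextFused(FakeThread* thread, void* context)
{
    *tlsSlotAddress(thread, contextKey) = context;
}

// Set context a, save it, switch to b, then restore: four fused accesses.
unsigned runFused(FakeThread* thread, void* a, void* b)
{
    thread->mrsCount = 0;
    storeContextFused(thread, a);
    void* saved = loadContextFused(thread);
    storeContextFused(thread, b);
    storeContextFused(thread, saved);
    return thread->mrsCount;
}

// Same accesses, but the slot address is computed once and reused.
unsigned runSplit(FakeThread* thread, void* a, void* b)
{
    thread->mrsCount = 0;
    void** slot = tlsSlotAddress(thread, contextKey); // hoisted: one simulated mrs
    *slot = a;
    void* saved = *slot;
    *slot = b;
    *slot = saved;
    return thread->mrsCount;
}
```

In this model runFused counts four simulated mrs reads and runSplit counts one, and both leave the same context in the slot. A real implementation would express the address computation as a compiler-visible pure operation rather than a helper function, so that the usual redundancy elimination and rematerialization decisions can apply to it.</pre>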
</div>
</p>
</body>
</html>