<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://adaptivecpp.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://adaptivecpp.github.io/" rel="alternate" type="text/html" /><updated>2024-03-18T21:19:41+01:00</updated><id>https://adaptivecpp.github.io/feed.xml</id><title type="html">AdaptiveCpp.</title><subtitle>The independent, community-driven platform for heterogeneous programming in C++</subtitle><author><name>Aksel Alpay</name></author><entry><title type="html">hipSYCL: The first single-pass SYCL implementation with unified code representation</title><link href="https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp/" rel="alternate" type="text/html" title="hipSYCL: The first single-pass SYCL implementation with unified code representation" /><published>2023-02-01T20:00:00+01:00</published><updated>2023-02-01T20:00:00+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp/"><![CDATA[<h1 id="heterogeneous-utopia">Heterogeneous Utopia?</h1>

<ul>
  <li>Imagine you didn’t have to specify any targets when compiling with SYCL.</li>
  <li>Imagine compilation took not much longer than a regular clang compilation for your CPU.</li>
  <li>Imagine the resulting SYCL binary could run on CPUs - but also magically on any NVIDIA, Intel and AMD ROCm GPU. A <em>universal</em> binary.</li>
</ul>

<p>And imagine you could even use several of these devices at the same time.</p>

<h1 id="hipsycl-reality">hipSYCL reality!</h1>

<p>What may sound like science fiction is now reality in hipSYCL. hipSYCL has a new compilation flow that we call the <em>generic SSCP compiler</em>. SSCP stands for <em>single-source, single compiler pass</em>. We will discuss what this means later.</p>

<p>To enable the magic, just compile like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>syclcc --hipsycl-targets=generic -o test test.cpp
</code></pre></div></div>
<p>In the future, the plan is to make <code class="language-plaintext highlighter-rouge">--hipsycl-targets=generic</code> the default, so that this can be omitted as well. But what does it do? Let’s first discuss the state of the art.</p>

<h1 id="current-status-quo">Current status quo</h1>

<p>Current SYCL implementations, including hipSYCL and DPC++, rely on the <em>single-source, multiple compiler passes</em> (SMCP) model. This means that they invoke a separate SYCL device compiler that parses and compiles the code, and generates a device binary. Afterwards, the usual host compiler is invoked, which again parses and compiles the code for the host, and also ensures that device binaries are embedded in the application. The result is a host binary with embedded device code containing kernels. Compilers for other programming models, such as NVIDIA’s nvcc CUDA compiler or AMD’s HIP compiler, work similarly.</p>

<p>Maybe you have spotted one issue already: The code is parsed multiple times – once for host, and once for device. And with C++ being C++, this can take time.</p>

<h2 id="but-it-gets-worse">But it gets worse</h2>

<p>But that’s not all there is to it. As it turns out, there is no unified code representation that Intel, NVIDIA and AMD GPU compute drivers can all understand. Intel GPUs want SPIR-V. SPIR-V is supported by neither AMD nor NVIDIA compute drivers. NVIDIA has their own code representation called PTX. AMD GPUs want amdgcn code.</p>

<p>So, if you want to have a binary that runs <em>everywhere</em>, SYCL implementations actually need to invoke the device compiler several times: Once for every code format that is required (PTX, SPIR-V, amdgcn). This means we are already looking at compiling the source code four times: Once for the host, and three times for the GPUs.</p>

<h2 id="but-it-gets-worse-again">But it gets worse (again)</h2>

<p>But it’s even worse than that. AMD’s ROCm compute platform does not have a device-independent code representation. So, in order to support every AMD GPU, you actually need to compile separately for each GPU supported by ROCm. My ROCm 5.3 installation can generate code for 38 different AMD GPU architectures (just count the number of ISA bitcode files in the <code class="language-plaintext highlighter-rouge">rocm/amdgcn/bitcode</code> directory). This means that in total we are now looking at parsing and compiling code once for the host, once for PTX, once for SPIR-V, and 38 times for AMD. 41 times in total. Clearly this is not practical.</p>

<p>And this approach also does not scale if we think about potentially supporting more backends in the future.</p>

<h1 id="hipsycl-generic-sscp-compiler-to-the-rescue">hipSYCL generic SSCP compiler to the rescue!</h1>

<p>So, what is this generic SSCP thing? It actually combines two ideas:</p>
<ol>
  <li>It is a single-pass compiler. This is what SSCP means: the device code is extracted during the compilation for the host, so the code is only parsed a single time.</li>
  <li>It introduces a generic, backend and device-independent code representation based on LLVM IR. The SSCP compiler stores this code representation in the application. At runtime, hipSYCL will then translate the generic code representation to whatever is needed: PTX, SPIR-V, or amdgcn code for one of the 38 AMD ROCm GPUs. Effectively, this means that we now have a unified code representation across all backends, even if they by themselves do not support one.</li>
</ol>

<p>The consequence is that we get a binary that can run on all supported devices, while parsing the code only once – just like a regular C++ compilation.</p>

<h1 id="what-are-the-costs-at-runtime">What are the costs at runtime?</h1>

<p>Now you might be asking: Hang on! You are now compiling at runtime, so you have just moved the cost to runtime! But that’s not true: Most likely your system does not contain all of the 41 different GPU architectures, so hipSYCL never actually has to generate code for all these targets. Runtime compilation only compiles for the <em>individual need</em> of the user, while the binary retains the <em>capability</em> to run on all supported devices. Additionally, even if you did run on all of these devices, you would still have saved parsing the code 41 additional times, because runtime compilation starts from the stored code representation and does not involve source code that needs to be parsed.</p>

<p>But there’s another important point: SYCL implementations effectively already do runtime compilation! If a SYCL implementation feeds PTX code to the CUDA driver, the CUDA driver will already compile this PTX code to machine code at runtime. The same is true for SPIR-V code. So, runtime compilation is not new behavior in SYCL, but something that SYCL applications already need to deal with today: It is quite likely that your first kernel launch will take longer due to drivers compiling the kernel on the fly. The additional step that we introduce roughly doubles that existing runtime compilation time. In other words, there is additional overhead, but it does not change the fundamental order of magnitude of existing runtime compilation costs. If your SYCL application can tolerate current runtime compilation costs, it will likely be able to tolerate the additional step too.</p>

<h1 id="compile-time-improvements">Compile time improvements</h1>

<p>What does this bring in terms of compile times? The graph below shows the time I measured for compiling the <a href="https://github.com/uob-hpc/babelstream">BabelStream</a> benchmark with various compilation flows in hipSYCL:</p>

<p><img src="/assets/images/sscp_babelstream_compiletime.png" alt="Compile time improvements of the new generic SSCP compiler" /></p>

<p>The <em>host</em> case describes a regular clang compilation for CPU without specific SYCL compiler logic. This is our baseline. The <em>host,gfx900,…</em> cases correspond to compiling for 1, 2, and 3 AMD GPUs with the old multipass compiler based on the clang HIP toolchain. <em>nvc++</em> refers to the case where hipSYCL operates as a CUDA library for NVIDIA’s nvc++ compiler.</p>

<p>The <em>host,generic</em> bar shows the time when our new generic SSCP compiler is enabled. As can be seen, the new compiler takes only roughly 15% longer than the host compilation. But it is over twice as fast compared to compiling for the three AMD GPUs with the previous compiler. And remember that the resulting binary supports not only 3 GPUs, but 38 AMD GPUs, plus any NVIDIA GPU, plus any Intel GPU. You can imagine how long it would have taken to build a binary with equal portability using the older hipSYCL compiler, or any other SYCL implementation.</p>

<h1 id="performance">Performance</h1>

<p>What does performance look like with the new compiler? The boring answer is: It’s similar to the old one, typically within 10% in either direction. So you really get the same kernel performance, but with more portability of the resulting binary and lower compile times. And we have not even started optimizing for performance yet, as the development focus so far has mainly been on functionality.</p>

<h1 id="conclusion">Conclusion</h1>

<p>hipSYCL has a major new feature: A compiler that can generate ultra-portable binaries with lower compile times than other approaches and without sacrificing performance. If you want to play with it, it is part of the main <a href="https://github.com/illuhad/hipSYCL">hipSYCL repository</a>. It can run some very complex applications, but be aware that a couple of SYCL features are not yet implemented because they are still being worked on - in particular atomics, the SYCL 2020 group algorithm library and SYCL 2020 reductions.</p>

<p><a href="https://github.com/zjin-lcf/hecbench">HeCBench</a> is a large benchmark collection that provides applications in various programming models, gathered from various sources. The fact that it contains SYCL ports makes it interesting for evaluating hipSYCL, as the performance of the SYCL versions compiled with hipSYCL can be compared to the native programming models. In this blog post, we compare hipSYCL performance with native HIP performance on an AMD Radeon Pro VII.</p>

<h1 id="benchmark-selection">Benchmark selection</h1>

<p>HeCBench overall contains over 280 benchmarks, and hence evaluating all of them is very time-consuming. Some of them don’t run yet with hipSYCL, e.g. because they rely on DPC++-specific extensions or non-standard SYCL behavior (more details on these issues can be found in <a href="https://dl.acm.org/doi/10.1145/3529538.3530005">this paper</a>), but the majority works. So, to simplify the problem at hand, we select the first ~30 benchmarks in alphabetical order that work with hipSYCL. Additionally, we include four benchmarks that we already had data on from prior work: XSBench, RSBench, md5hash and nbody.</p>

<p>Following these criteria, we have selected the following applications:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aligned-types
amgmk
aobench
asta
atomicCAS
atomicIntrinsics
atomicReduction
attention
babelstream
bezier-surface
binomial
bitonic-sort
bsearch
bspline-vgh
ccsd-trpdrv
clenergy
convolutionSeparable
crc64
damage
dp
dslash
expdist
extend2
extrema
fft
filter
floydwarshall
fpc
gamma-correction
XSBench
RSBench
md5hash
nbody
</code></pre></div></div>

<p>Some of these applications are functional tests rather than benchmarks (e.g. <code class="language-plaintext highlighter-rouge">aligned-types</code>), some are memory-bound (e.g. <code class="language-plaintext highlighter-rouge">babelstream</code>), and others are compute-bound (e.g. <code class="language-plaintext highlighter-rouge">fft</code>). So, we have a good mixture of different use cases at hand that is hopefully representative of common real-world scenarios.</p>

<h1 id="results">Results</h1>

<p>The plot below shows the relative performance between the hipSYCL results and the native HIP results. Some applications return more than one result, in which case multiple results are shown for one application. This is prominently the case for BabelStream.
Where the application itself did not provide performance results (e.g. for some functional tests), the wall time of the application execution was measured. The vertical red lines indicate performance parity within 20%.</p>

<p>As can be seen, the vast majority of applications perform within 20% of the native HIP performance. Those applications that perform worse are almost exclusively ones that are not necessarily geared towards performance measurements, such as aligned/unaligned copy microbenchmarks or functional tests.</p>

<p>On the other hand, there are also numerous cases where hipSYCL substantially outperforms HIP, such as <code class="language-plaintext highlighter-rouge">aobench</code> at almost twice the performance, and some CAS tests with an even higher relative performance. In fact, the CAS tests for an atomic maximum implementation even outperform HIP by over 20x, and are not shown in the plot in order to retain a reasonable axis range.</p>

<p><img src="/assets/images/hipsycl-relative-perf.png" alt="relative HeCBench performance between hipSYCL and HIP" /></p>

<h1 id="conclusion">Conclusion</h1>

<p>It is apparent that hipSYCL can reliably deliver good performance when looking at the HeCBench applications on the investigated AMD hardware. While there are a few cases where HIP outperforms hipSYCL, there are also cases where hipSYCL substantially outperforms HIP.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="amd" /><category term="hecbench" /><summary type="html"><![CDATA[HeCBench]]></summary></entry><entry><title type="html">hipSYCL 0.9.2 - compiler-accelerated CPU backend, nvc++ support and more</title><link href="https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2/" rel="alternate" type="text/html" title="hipSYCL 0.9.2 - compiler-accelerated CPU backend, nvc++ support and more" /><published>2022-03-11T21:30:00+01:00</published><updated>2022-03-11T21:30:00+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2/"><![CDATA[<p>In February this year, hipSYCL 0.9.2 was <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.2">released</a>. This release includes major new features, some of which I want to discuss here.</p>

<h1 id="compiler-accelerated-cpu-support">Compiler-accelerated CPU support</h1>

<p>One of the major features of hipSYCL 0.9.2 is dedicated compiler support for the CPU backend. This can increase performance by several orders of magnitude in some cases, and deliver high performance on <em>any</em> CPU supported by LLVM. This is big news in the SYCL ecosystem, because until now, affected code could only be run efficiently on CPUs if an OpenCL implementation existed for the particular CPU.</p>

<p>Let me describe what this is about:</p>

<h2 id="previous-support-library-only-cpu-backend">Previous support: Library-only CPU backend</h2>

<p>hipSYCL’s CPU backend has traditionally been implemented as an OpenMP library. Consequently, it can be used with any OpenMP C++ compiler, which can be a portability advantage - it allows us to run SYCL code on <em>any</em> CPU for which an OpenMP compiler exists. Practically, this is everywhere.</p>

<p>The backend can provide good performance for kernels written in SYCL’s high-level <code class="language-plaintext highlighter-rouge">parallel_for(sycl::range&lt;Dimension&gt;, Kernel)</code> model. However, the lower-level <code class="language-plaintext highlighter-rouge">parallel_for(sycl::nd_range&lt;Dimension&gt;, Kernel)</code> model is quite different: While the high-level <code class="language-plaintext highlighter-rouge">parallel_for</code> does not allow for work group barriers to occur, the <code class="language-plaintext highlighter-rouge">nd_range</code> model allows users to have explicit barriers in their code:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;sycl/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">global_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">local_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">nd_range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">global_size</span><span class="p">,</span> <span class="n">local_size</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Code before barrier here</span>
    
    <span class="c1">// Waits until all items from the work group have</span>
    <span class="c1">// executed the previous code</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">group_barrier</span><span class="p">(</span><span class="n">idx</span><span class="p">.</span><span class="n">get_group</span><span class="p">());</span>
    
    <span class="c1">// Code after the barrier here - will be executed </span>
    <span class="c1">// once *all* items in the group</span>
    <span class="c1">// have finished the previous code.</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>

<p>On CPU, for performance we generally want to employ multithreading across work groups, and then iterate across work items within a group. Ideally, the compiler can then (auto-)vectorize this inner loop across work items. This maps well to hierarchical parallelism from SYCL 1.2.1, or hipSYCL’s <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">scoped parallelism</a> programming model extension.</p>

<p>For <code class="language-plaintext highlighter-rouge">nd_range</code> barriers however, all code for all items in a group needs to finish before we can proceed. 
This is an issue for library implementations of SYCL, because it prevents us from implementing work items as iterations of an (auto-vectorized) loop. Instead, each work item must live within its own mechanism that provides concurrency, so that all work items can reach the barrier at the same time. hipSYCL uses fibers for this purpose. While fibers have much lower overhead compared to actual full-blown threads, there are still issues with this approach:</p>

<ul>
  <li>A barrier requires context-switching through all fibers to make sure they have reached the barrier. A fiber context switch is effectively a switch to a different stack. The relative cost of this operation is much higher compared to a barrier on e.g. a GPU, so typical GPU fine-grained parallelism patterns will not run efficiently on CPUs with this model. This is a performance portability issue.</li>
  <li>Additionally, code cannot be vectorized across multiple fibers since each fiber runs independently. Therefore, there is no vectorization across work items. Code that wants to benefit from vectorization has to explicitly employ inner loops for each work item that can be vectorized, e.g. by using the <code class="language-plaintext highlighter-rouge">sycl::vec</code> class. This is another performance portability issue, since this is not how typical GPU code is written.</li>
</ul>

<h2 id="alternative-sycl-implementations-cpu-support-via-opencl">Alternative SYCL implementations: CPU support via OpenCL</h2>

<p>So how do other SYCL implementations solve this issue? It is clear that if we can transform the kernel to something like this, the problem is solved:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;sycl/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">global_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">group_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  
  <span class="c1">// Parallelize across work groups</span>
  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">global_size</span><span class="o">/</span><span class="n">group_size</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">group_idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Compiler should vectorize this loop</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">work_item</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">work_item</span> <span class="o">&lt;</span> <span class="n">group_size</span><span class="p">;</span> <span class="o">++</span><span class="n">work_item</span><span class="p">){</span>
      <span class="c1">// Code before barrier here</span>
    <span class="p">}</span>
    <span class="c1">// The barrier is now a no-op since the loop already guarantees</span>
    <span class="c1">// barrier semantics</span>
    <span class="c1">// sycl::group_barrier(idx.get_group());</span>
    
    <span class="c1">// Compiler should vectorize this loop</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">work_item</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">work_item</span> <span class="o">&lt;</span> <span class="n">group_size</span><span class="p">;</span> <span class="o">++</span><span class="n">work_item</span><span class="p">){</span>
      <span class="c1">// Code after barrier here</span>
    <span class="p">}</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>

<p><em>Remark: This effectively means automatically transforming nd_range kernels to patterns that resembles SYCL hierarchical parallelism or hipSYCL scoped parallelism models</em></p>

<p>And this transformation is basically what existing CPU OpenCL implementations do when compiling code. Since other SYCL implementations such as DPC++ or ComputeCpp mostly rely on OpenCL to deliver performance on CPUs, these SYCL implementations have effectively offloaded the issue to the OpenCL implementation.</p>

<p>However, there is one problem: Very few CPU vendors actually provide an OpenCL implementation for their hardware. So, unless we are only interested in running on Intel CPUs, we have a portability issue on our hands.
Additionally, what if we don’t want to use OpenCL as the SYCL runtime backend, but OpenMP or TBB for CPUs? Wouldn’t it make sense to pull the required compiler transformations from the OpenCL layer into the layer of the SYCL compiler?</p>

<h2 id="our-solution-combining-the-advantages-of-both">Our solution: Combining the advantages of both</h2>

<p>This is exactly what we have done in hipSYCL. We have integrated these compiler transformations into the hipSYCL infrastructure. If this feature is enabled, it will apply those transformations to the regular host compilation pass - which currently uses OpenMP, but could just as well work with other runtimes such as TBB.</p>

<p>The consequence: We can support efficient <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for on any CPU supported by LLVM. No need for an OpenCL implementation anymore, as the transformations run as part of clang’s regular compilation for the host CPU.</p>

<p>If you want to use this feature, you can just pass <code class="language-plaintext highlighter-rouge">omp.accelerated</code> as target to the <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code> argument. Details on using it can be found <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/using-hipsycl.md">here</a>.</p>
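<p>A typical invocation might then look as follows (the source and output file names are placeholders):</p>

```
syclcc --hipsycl-targets=omp.accelerated -O3 -o app app.cpp
```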

<p>More technical details on how it works <em>exactly</em> can be found <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/compilation.md#compiler-support-to-accelerate-nd_range-parallel_for-on-cpus-ompaccelerated">here</a>.</p>

<p>And <a href="https://github.com/illuhad/hipSYCL/pull/682">benchmark results</a> can be found in the original pull request.</p>

<h2 id="immediate-support-for-new-nvidia-hardware-nvc-backend">Immediate support for new NVIDIA hardware: NVC++ backend</h2>

<p>Another large, new feature in hipSYCL 0.9.2 is nvc++ support. We have added <code class="language-plaintext highlighter-rouge">cuda-nvcxx</code> as an option that can be passed to <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code>. In this case, the nvc++ compilation flow is activated, in which hipSYCL acts as a regular CUDA library for nvc++ - without any compiler magic.</p>

<p>Since nvc++ is part of NVIDIA’s HPC SDK, and hence an officially supported compiler from NVIDIA, this means that with hipSYCL’s nvc++ backend, it is possible to use hipSYCL on NVIDIA GPUs with the very latest CUDA versions, or latest hardware from day one after release.</p>

<p>Currently, all SYCL implementations with CUDA backends (including hipSYCL) rely on clang, which may not always support the latest CUDA versions immediately, or just assumes that they behave similarly to older versions. With hipSYCL’s nvc++ backend, the SYCL ecosystem becomes independent of the CUDA support level in clang.</p>

<p>Additionally, the nvc++ backend does not require LLVM at all. Therefore, if only the nvc++ backend is required, hipSYCL can now be deployed <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/install-cuda.md#if-using-nvc">without LLVM dependency</a>. This can significantly simplify deployment e.g. on existing NVIDIA HPC systems, where nvc++ and the NVIDIA HPC SDK might already be preinstalled. Just point hipSYCL to nvc++ and you are good to go.</p>

<p>nvc++ works slightly differently on a technical level compared to clang-based compilation flows. clang parses the source code multiple times (for the host and all targeted devices). Macros can then be used to detect which compilation pass currently takes place, and code paths can be specialized accordingly.
nvc++ on the other hand parses the code only once. It is therefore not possible with nvc++ to use macros to detect e.g. whether host or device is currently targeted.
<em>Note: This behavior does not violate the SYCL specification, which defines both the single-source, multiple compiler pass (SMCP) and single-source, single compiler pass (SSCP) models. SMCP is what clang does, while nvc++ follows SSCP.</em></p>

<p>Consequently, the recommended way to detect the targeted backend in source code is no longer using macros such as <code class="language-plaintext highlighter-rouge">__SYCL_DEVICE_ONLY__</code>. Instead, we have introduced the <code class="language-plaintext highlighter-rouge">__hipsycl_if_target</code> mechanism which generalizes both to the clang as well as nvc++ cases. See <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/macros.md#macros-to-specialize-code-paths-based-on-backend">here</a> for details.</p>

<h2 id="scoped-parallelism-v2">Scoped parallelism v2</h2>

<p>Scoped parallelism is a hipSYCL-specific programming model that is designed to expose all the low-level control that the <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for model provides, while additionally remaining more performance portable. This affects in particular library-only compilation flows, such as hipSYCL’s OpenMP backend when the new <code class="language-plaintext highlighter-rouge">omp.accelerated</code> flow is not used.</p>

<p>hipSYCL has already had the scoped parallelism programming model in earlier versions. hipSYCL 0.9.2 cranks it up to the next level and significantly improves and extends the model (<a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a>).
For example, it now allows the implementation to expose structure below sub-group granularity by allowing infinite nesting of groups - even in multiple dimensions.</p>

<p>Here is an example:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>

<span class="n">q</span><span class="p">.</span><span class="n">parallel</span><span class="p">(</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">global_range</span> <span class="o">/</span> <span class="n">local_range</span><span class="p">},</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">local_range</span><span class="p">},</span>
  <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">group</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Optionally, groups can be decomposed into subunits</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_groups</span><span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span> <span class="n">subgroup</span><span class="p">)</span> <span class="p">{</span>
      <span class="c1">// This can be nested arbitrarily deep</span>
      <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_groups</span><span class="p">(</span><span class="n">subgroup</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span> <span class="n">subsubgroup</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_items</span><span class="p">(</span><span class="n">subsubgroup</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">s_item</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
          <span class="c1">// execute code for each work item</span>
        <span class="p">});</span>
        <span class="c1">// Explicit group barriers and group algorithms are allowed</span>
        <span class="n">sycl</span><span class="o">::</span><span class="n">group_barrier</span><span class="p">(</span><span class="n">subgroup</span><span class="p">);</span>
      <span class="p">});</span>
    <span class="p">});</span>
  <span class="p">});</span></code></pre></figure>

<p>Details and more examples can be found in the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a>.</p>

<h2 id="but-wait-theres-more">But wait, there’s more!</h2>

<p>hipSYCL 0.9.2 contains more new features, such as</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">atomic_ref</code></li>
  <li>Better explicit multipass support</li>
  <li>New extensions such as asynchronous buffers</li>
  <li>More :-)</li>
</ul>
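
<p>As a small teaser for <code class="language-plaintext highlighter-rouge">atomic_ref</code>, here is an illustrative sketch of how the SYCL 2020 interface can be used to atomically increment a counter. This is hypothetical example code, assuming <code class="language-plaintext highlighter-rouge">q</code> is a <code class="language-plaintext highlighter-rouge">sycl::queue</code>:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Allocate and zero-initialize a shared counter in device memory
int* counter_ptr = sycl::malloc_device&lt;int&gt;(1, q);
q.fill(counter_ptr, 0, 1).wait();

q.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){
  // Atomically increment the shared counter from every work item
  sycl::atomic_ref&lt;int, sycl::memory_order::relaxed,
                   sycl::memory_scope::device&gt; counter{*counter_ptr};
  counter.fetch_add(1);
}).wait();

sycl::free(counter_ptr, q);</code></pre></figure>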

<p>The release can be found <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.2">here</a>. Of course, you can also always just clone the latest develop branch for even more new features and fixes!</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="release" /><category term="cpu" /><category term="extension" /><category term="nvc++" /><summary type="html"><![CDATA[In february this year, hipSYCL 0.9.2 was released. This release includes major new features, some of which I want to discuss here.]]></summary></entry><entry><title type="html">hipSYCL 0.9.1 features: buffer-USM interoperability</title><link href="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop/" rel="alternate" type="text/html" title="hipSYCL 0.9.1 features: buffer-USM interoperability" /><published>2021-05-26T19:30:00+02:00</published><updated>2021-05-26T19:30:00+02:00</updated><id>https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop/"><![CDATA[<p>This post is part of a series where we discuss some features of hipSYCL 0.9.1. Today’s topic is interoperability between buffers and USM pointers.</p>

<h1 id="why-it-matters">Why it matters</h1>

<p>SYCL 2020 features two major memory management models, both of which are supported by hipSYCL:</p>
<ol>
  <li>The traditional buffer-accessor model that has already been available in the old SYCL 1.2.1. In this model, a task graph is constructed automatically based on access conflicts between access specifications described by <code class="language-plaintext highlighter-rouge">accessor</code> objects. These <code class="language-plaintext highlighter-rouge">accessor</code> objects are also used to access data in kernels. The buffer-accessor model provides the SYCL runtime with a lot of information about how much data is used and how it is used. This can help scheduling, and enables automatic optimizations such as overlap of data transfers and kernels.</li>
  <li>The pointer-based USM model that was introduced in SYCL 2020. Here, allocations are managed explicitly and (unless shared allocations are used) data must be copied explicitly between host and device. The USM model provides more control to the user compared to the buffer-accessor model, at the cost of requiring the user to do work that the runtime can do automatically in the buffer-accessor model. It also forces the programmer to think in a model of a host-device dichotomy, which may not be an ideal fit when CPUs are targeted. On the other hand, it is usually considerably easier to port existing pointer-based code to SYCL using the USM model compared to the buffer-accessor model.</li>
</ol>
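
<p>To make the contrast concrete, here is a small sketch of the same computation expressed in both models. This is illustrative code, assuming a SYCL 2020 implementation such as hipSYCL:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">sycl::queue q;
std::size_t n = 1024;

// Buffer-accessor model: the runtime infers dependencies and data transfers.
{
  std::vector&lt;int&gt; data(n, 0);
  sycl::buffer&lt;int&gt; buf{data.data(), sycl::range{n}};
  q.submit([&amp;](sycl::handler&amp; cgh){
    sycl::accessor acc{buf, cgh};
    cgh.parallel_for(sycl::range{n}, [=](sycl::id&lt;1&gt; idx){ acc[idx] += 1; });
  });
} // The buffer destructor synchronizes and writes back to data.

// USM model: explicit allocation and explicit synchronization.
int* mem = sycl::malloc_device&lt;int&gt;(n, q);
q.parallel_for(sycl::range{n}, [=](sycl::id&lt;1&gt; idx){ mem[idx[0]] = 1; }).wait();
sycl::free(mem, q);</code></pre></figure>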

<p>It is apparent that both models have valid use cases and are complementary. However, in SYCL 2020, there is barely any interoperability between the two. Accessing data that is stored in a buffer using a USM pointer requires launching a custom kernel that explicitly copies all data elements from the buffer into a USM pointer. This is both cumbersome and comes at a performance cost.</p>

<p>Consequently, once a codebase has started using one particular model, it is effectively locked into it. This is problematic for several reasons:</p>

<ol>
  <li>As the SYCL software ecosystem grows, there is a <strong>real danger of ecosystem bifurcation</strong> if no mechanisms are provided to cross from USM-land to buffer-land and vice versa. A SYCL library with a USM pointer API will be of little use for a SYCL application that is written using buffers and accessors.</li>
  <li>SYCL is all about taking control when you want it, and letting SYCL do what it thinks is best otherwise. This allows to combine the best of two worlds: Low-level kernel optimizations for critical code paths, and the convenience of a high-level programming model for the remaining program. Consequently, <strong>it should be possible to use USM pointers whenever we want detailed low-level control, and move to a more high-level model for other parts of the program</strong>. Not having interoperability between them <strong>can block potential incremental optimization paths during software development</strong>.</li>
  <li>Which model will be better in terms of performance or clarity is not always apparent, and might be different for different parts of the program. As outlined above, both have strengths and weaknesses, and are complementary. <strong>We should therefore be able to mix buffers and USM pointers.</strong></li>
</ol>

<h1 id="buffer-usm-interoperability">buffer-USM interoperability</h1>

<p>To address these issues, hipSYCL 0.9.1 has introduced a comprehensive API for interoperability between USM pointers and buffers. In hipSYCL, you can always construct a buffer on top of existing USM pointers, or extract a USM pointer from a buffer – completely without additional data copies.</p>

<p>hipSYCL is the first SYCL implementation to expose such a feature, and the reason is easy to see: Buffer-USM interoperability in a meaningful, convenient and efficient way requires guarantees about the internal buffer behavior and SYCL implementation design that far exceed anything the SYCL specification provides.</p>

<p>We have therefore introduced an additional <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/runtime-spec.md">hipSYCL runtime specification</a> that more rigorously defines buffer behavior. In particular hipSYCL makes the following guarantees that are crucial for buffer-USM interoperability:</p>
<ul>
  <li>Buffers use USM pointers internally. All allocations a buffer performs are USM allocations, and buffers are entirely implemented on top of USM pointers.</li>
  <li>Allocations are persistent. Buffers guarantee that allocations, once they have been made, will remain valid at least until the end of buffer lifetime. Buffers will manage exactly one allocation per (physical) device.</li>
  <li>Buffers allocate lazily. When the buffer is used for the first time on a particular device, it will allocate memory large enough for all of the data such that no reallocations are needed for the lifetime of the buffer.</li>
</ul>

<p>There are two cases to distinguish for buffer-USM interoperability:</p>
<ol>
  <li>Temporal composition: Here we just move memory allocations from USM pointers into a buffer or vice versa; at each point in time only either a USM pointer or a buffer exists for a given allocation.</li>
  <li>The more complex case: Simultaneously accessing the same allocation as USM pointer and buffer. This is more complicated as it requires some correctness considerations by the programmer.</li>
</ol>

<h2 id="temporal-composition">Temporal composition</h2>

<p>Let’s focus on the simple case first: Assume we only want to turn an existing buffer into a USM pointer (or vice versa), but don’t want to use them simultaneously. hipSYCL has a fairly intuitive API for that: <code class="language-plaintext highlighter-rouge">buffer::get_pointer()</code> to extract USM pointers and a special buffer constructor that accepts USM pointers.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">mem</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_device</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>

<span class="c1">// Use mem as USM pointer</span>
<span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span> <span class="n">mem</span><span class="p">[</span><span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="p">});</span>
<span class="c1">// Make sure that USM operations terminate before</span>
<span class="c1">// using mem as buffer</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

<span class="c1">// Construct buffer on top of existing USM pointer</span>
<span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">device</span> <span class="n">dev</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">get_device</span><span class="p">();</span>
  <span class="c1">// Use mem for all operations for device dev. view() assumes</span>
  <span class="c1">// that the pointer holds valid data. If it should be considered empty,</span>
  <span class="c1">// use empty_view() instead.</span>
  <span class="c1">// Note the {} around the view: This is because we are actually passing</span>
  <span class="c1">// an std::vector. You can feed multiple USM pointers (one for each device)</span>
  <span class="c1">// into a buffer! Here, we only use one device.</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">buff</span><span class="p">{</span>
    <span class="p">{</span><span class="n">sycl</span><span class="o">::</span><span class="n">buffer_allocation</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">mem</span><span class="p">,</span> <span class="n">dev</span><span class="p">)},</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">}};</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
    <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
      <span class="n">acc</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">});</span>
  <span class="p">});</span>
  
  <span class="c1">// Turn buffer into USM pointer again.</span>
  <span class="c1">// Note: get_pointer() returns nullptr if no allocation is available on a device,</span>
  <span class="c1">// e.g. if a buffer hasn't yet been used on a device (remember: lazy allocation!) </span>
  <span class="c1">// or was not initialized with an appropriate view() object.</span>
  <span class="c1">// In this example, we know that the buffer has an allocation for this</span>
  <span class="c1">// device because we have given one in the constructor.</span>
  <span class="kt">int</span><span class="o">*</span> <span class="n">mem_extracted</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span>
  <span class="n">assert</span><span class="p">(</span><span class="n">mem_extracted</span> <span class="o">==</span> <span class="n">mem</span><span class="p">);</span>
  
  <span class="c1">// This makes sure that the buffer won't delete the allocation when</span>
  <span class="c1">// it goes out of scope, so we can use it afterwards.</span>
  <span class="c1">// By default, view() is non-owning, so in this example it's</span>
  <span class="c1">// not strictly necessary.</span>
  <span class="n">buff</span><span class="p">.</span><span class="n">disown_allocation</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span>
<span class="p">}</span> <span class="c1">// Closing scope synchronizes all tasks operating on the buffer.</span>

<span class="c1">// Use USM pointer again</span>
<span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">...).</span><span class="n">wait</span><span class="p">();</span>

<span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">mem</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span></code></pre></figure>

<h2 id="simultaneous-usm-pointers-and-buffers">Simultaneous USM pointers and buffers</h2>

<p>If we want to have both USM pointers and buffers accessing the same allocation simultaneously, things get more complicated. In this scenario, it is crucial to understand that</p>
<ol>
  <li>Buffers automatically calculate dependencies to other operations by detecting conflicting accessors. If operations use the same allocation but without going through accessors, buffers cannot know about these additional dependencies – the programmer must insert them manually.</li>
  <li>Buffers automatically calculate necessary data transfers by tracking whether data is valid or outdated on a particular device. If data is modified through USM pointers without the buffer knowing of it, the internal data tracking of the buffer is off and no longer reflects reality. This can cause the buffer to emit data transfers that shouldn’t take place, or omit data transfers when they might actually be required. To avoid this, we need to manually update the buffer’s data tracking.</li>
</ol>

<p>Here’s an example that shows how it’s done.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="c1">// Queue on a different device for later use</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">device</span> <span class="n">other_dev</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q2</span><span class="p">{</span><span class="n">other_dev</span><span class="p">};</span>

<span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">buff</span><span class="p">{</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">}};</span>

<span class="c1">// Extract USM pointer - at this point we are not yet guaranteed</span>
<span class="c1">// that an allocation exists because memory is allocated lazily.</span>
<span class="c1">// We can however force preallocation of memory using the hipSYCL </span>
<span class="c1">// handler::update extension (Not yet in hipSYCL 0.9.1, but in </span>
<span class="c1">// current develop branch on github).</span>
<span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="c1">// Also preallocate on another device for later use.</span>
<span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="n">q2</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

<span class="c1">// Since memory has now been allocated by the buffer, we can extract</span>
<span class="c1">// a USM pointer.</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">usm_ptr</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">q</span><span class="p">.</span><span class="n">get_device</span><span class="p">());</span>

<span class="c1">// Submit a kernel operating on buff</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use acc here</span>
  <span class="p">});</span>
<span class="p">});</span>
<span class="c1">// Submit a USM kernel</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt2</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Important: Add dependency to the other kernel!</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt</span><span class="p">);</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use usm_ptr here</span>
  <span class="p">});</span>
<span class="p">});</span></code></pre></figure>

<p>So far no surprises – we just had to insert dependencies manually as expected. Let’s now look at submitting work to a different device. When submitting USM operations to another device, we need to inform the buffer that there are writes taking place on that device, and that it should consider allocations on other devices as outdated after this point. We again use <code class="language-plaintext highlighter-rouge">handler::update()</code> for this.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// This is necessary to allow the buffer to infer necessary data transfers correctly.</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt3</span> <span class="o">=</span> <span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Depend on previous USM operation</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt2</span><span class="p">);</span>
  <span class="c1">// This is a read-write accessor - it's important that there's</span>
  <span class="c1">// a write in the access mode if we want to write to usm_ptr</span>
  <span class="c1">// in the next kernel.</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">usm_ptr2</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">q2</span><span class="p">.</span><span class="n">get_device</span><span class="p">());</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt4</span> <span class="o">=</span> <span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt3</span><span class="p">);</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use usm_ptr2 here</span>
  <span class="p">});</span>
<span class="p">});</span>
<span class="c1">// End with operation on first device</span>
<span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Buffer cannot know that USM kernel operates on same data,</span>
  <span class="c1">// so we need to manually insert a dependency.</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt4</span><span class="p">);</span>
  <span class="c1">// This accessor will trigger data migration back to</span>
  <span class="c1">// the first device because we are submitting to q</span>
  <span class="c1">// instead of q2</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use acc here</span>
  <span class="p">});</span>
<span class="p">});</span></code></pre></figure>

<p>In summary, even using buffers and USM pointers simultaneously for the same data is possible, but requires a solid understanding of SYCL and the guarantees that hipSYCL makes specifically.</p>

<p>Remember that buffers cannot know about USM kernels that utilize the same allocations, so always, always make sure to insert correct dependencies. Also, make sure to inform the buffer that an allocation has been <em>modified</em> so that it can correctly emit data transfers when an accessor is used for the buffer on a different device (including the host device). This can be done by constructing an accessor with a suitable access mode – either by using <code class="language-plaintext highlighter-rouge">handler::update()</code>, or by submitting a kernel that uses accessors.</p>

<p>In practice, things can be much simpler. If you are not working with complex task graphs, you can just use a SYCL 2020 in-order queue to avoid inserting all those dependencies manually. And if you are only working on a single device, the <code class="language-plaintext highlighter-rouge">handler::update()</code> calls may not be required at all.</p>
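
<p>As an illustrative sketch of this simpler scenario – assuming a single device and that the buffer has already allocated memory on it, e.g. via the <code class="language-plaintext highlighter-rouge">handler::update()</code> mechanism shown above – an in-order queue version might look like this:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">sycl::queue q{sycl::property::queue::in_order{}};
sycl::buffer&lt;int&gt; buff{sycl::range{1024}};
// ... force allocation on q's device, e.g. via handler::update() ...
int* usm_ptr = buff.get_pointer(q.get_device());

// On an in-order queue, operations run in submission order,
// so no explicit depends_on() calls are needed.
q.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){ usm_ptr[idx[0]] = idx[0]; });
q.submit([&amp;](sycl::handler&amp; cgh){
  sycl::accessor acc{buff, cgh};
  cgh.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){ acc[idx] += 1; });
});</code></pre></figure>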

<h2 id="api-reference">API reference</h2>

<p>For the full API reference, see the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/buffer-usm-interop.md">hipSYCL documentation</a>.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="extension" /><summary type="html"><![CDATA[This post is part of a series where we discuss some features of hipSYCL 0.9.1. Today’s topic is interoperability between buffers and USM pointers.]]></summary></entry><entry><title type="html">hipSYCL 0.9.1 features: Asynchronous buffers and explicit buffer policies</title><link href="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies/" rel="alternate" type="text/html" title="hipSYCL 0.9.1 features: Asynchronous buffers and explicit buffer policies" /><published>2021-04-11T20:28:06+02:00</published><updated>2021-04-11T20:28:06+02:00</updated><id>https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies/"><![CDATA[<p>This post is part of a series where we discuss some features of the brand-new hipSYCL 0.9.1. Today I want to take a closer look at</p>

<h1 id="asynchronous-buffers-and-explicit-buffer-policies">Asynchronous buffers and explicit buffer policies</h1>

<p>This is a new extension in hipSYCL that can make code using <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects <strong>much clearer while also improving performance</strong>. Interested? Then this blog post is for you.</p>

<h2 id="motivation-1-buffers-are-complicated">Motivation 1: Buffers are complicated</h2>

<p>A <code class="language-plaintext highlighter-rouge">sycl::buffer</code> is a very complicated object. Depending on a combination of multiple factors, the semantics of a <code class="language-plaintext highlighter-rouge">sycl::buffer</code> can be very different. Will it operate directly on input pointers, or will it copy input data to some internal storage? Will it submit a writeback in the destructor to copy data back to host?</p>

<p>I have frequently noticed users getting this wrong. This can either lead to correctness issues, for example</p>
<ul>
  <li>the buffer operates directly on the input pointer, while the user only intended to provide it as a source of initial data and wanted to reuse it after buffer construction</li>
  <li>no writeback is issued even though the user expected data to be copied back to host.</li>
</ul>

<p>Or performance bugs might be introduced - these are arguably even worse because you might not notice them right away and they might be difficult to find. Some performance bugs that I have seen in user code are:</p>
<ul>
  <li>The buffer issued an unexpected writeback, and thus copied data back to host without the user intending it</li>
  <li>The buffer did not operate directly on the pointer provided in the constructor, but instead first copied the data to internal storage which broke performance assumptions on the CPU backend.</li>
</ul>

<h2 id="motivation-2-the-buffer-destructor-antipattern">Motivation 2: The buffer destructor antipattern</h2>

<p>In addition, there is a related performance antipattern that I have noticed frequently. Consider the following code:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">T</span><span class="o">*</span> <span class="n">ptr1</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">T</span><span class="o">*</span> <span class="n">ptr2</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">T</span><span class="o">*</span> <span class="n">ptr3</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">size</span> <span class="o">=</span> <span class="p">...;</span>

<span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b1</span><span class="p">{</span><span class="n">ptr1</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b2</span><span class="p">{</span><span class="n">ptr2</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b3</span><span class="p">{</span><span class="n">ptr3</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>

  <span class="c1">// Kernels using b1, b2, b3</span>

<span class="p">}</span> <span class="c1">// Destructors issue write-back</span></code></pre></figure>

<p>We construct three buffers that get an input pointer and then, when the scope closes, issue a writeback in their destructors. The problem is that this pattern executes the writebacks inefficiently: the SYCL specification requires that, in the destructor, a <code class="language-plaintext highlighter-rouge">buffer</code> has to wait for the completion of all operations that use it. This means that the following sequence of operations will be executed:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">b3.~buffer()</code> runs: submit writeback, wait for completion</li>
  <li><code class="language-plaintext highlighter-rouge">b2.~buffer()</code> runs: submit writeback, wait for completion</li>
  <li><code class="language-plaintext highlighter-rouge">b1.~buffer()</code> runs: submit writeback, wait for completion</li>
</ol>

<p>Here we have multiple unnecessary cases of synchronization. For performance it is always better to submit all available work asynchronously, and then wait as late as possible with as few wait calls as possible. So, something like the following will in general perform better:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">b3.~buffer()</code> runs: submit writeback asynchronously</li>
  <li><code class="language-plaintext highlighter-rouge">b2.~buffer()</code> runs: submit writeback asynchronously</li>
  <li><code class="language-plaintext highlighter-rouge">b1.~buffer()</code> runs: submit writeback asynchronously</li>
  <li>Maybe do some other work while the writebacks are being processed</li>
  <li>Wait for all writebacks to complete</li>
</ol>

<p>This has multiple advantages:</p>
<ol>
  <li>The SYCL implementation can process a larger task graph consisting of multiple writebacks as well as any other operations that might have been submitted previously, allowing for more optimization opportunities</li>
  <li>There is less latency between the writebacks when they are processed by the SYCL backend and hardware, because there is no synchronization in between them.</li>
  <li>The execution of writeback can be overlapped with other work on the host if the wait is executed later.</li>
</ol>

<p><em>Note:</em> While the worst case is clearly when the buffers submit writebacks as in this example, even if the buffers do not submit a writeback, there might still be a negative performance impact: Because all buffer destructors need to wait individually, they cause individual and potentially unnecessary flushes of the SYCL task graph.</p>

<h2 id="enter-explicit-buffer-policies">Enter explicit buffer policies</h2>

<p>To address both the destructor antipattern as well as the complexity of buffers, hipSYCL 0.9.1 introduces <em>explicit buffer policies</em>, which allow the user to explicitly specify the desired behavior of a buffer. We introduce the following terminology:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Destructor blocks?</th>
      <th>Writes back?</th>
      <th>Uses external storage?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>yes</td>
      <td><code class="language-plaintext highlighter-rouge">sync_</code></td>
      <td><code class="language-plaintext highlighter-rouge">_writeback_</code></td>
      <td><code class="language-plaintext highlighter-rouge">view</code></td>
    </tr>
    <tr>
      <td>no</td>
      <td><code class="language-plaintext highlighter-rouge">async_</code></td>
      <td>-</td>
      <td><code class="language-plaintext highlighter-rouge">buffer</code></td>
    </tr>
  </tbody>
</table>

<p>For example, a <code class="language-plaintext highlighter-rouge">sync_writeback_view</code> refers to the behavior where the destructor blocks (<code class="language-plaintext highlighter-rouge">sync</code>), a writeback will be issued in the destructor (<code class="language-plaintext highlighter-rouge">writeback</code>)  and the buffer will operate directly on provided input data pointers (<code class="language-plaintext highlighter-rouge">view</code>).</p>

<p>These behaviors are not expressed as new C++ types, but as regular <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects that are initialized with special buffer properties. Buffers with explicit behaviors are constructed using factory functions such as <code class="language-plaintext highlighter-rouge">buffer&lt;T, Dim&gt; make_sync_buffer(...)</code>.
Since these functions still return a <code class="language-plaintext highlighter-rouge">sycl::buffer&lt;T, Dim&gt;</code>, explicit buffer behaviors integrate well with existing SYCL code that relies on the <code class="language-plaintext highlighter-rouge">sycl::buffer</code> type.</p>

<p>Using those factory functions instead of directly constructing <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects significantly improves code clarity - the programmer can now see with one quick glance at the function call what is going to happen, and what performance implications there are.</p>

<h3 id="view">View</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">view</code> behavior operate directly on the provided input pointer when running on the CPU backend. The pointer must be considered as being in use by the buffer until all operations that the buffer is involved in have completed, including potential writebacks.</p>

<h3 id="buffer">Buffer</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">buffer</code> behavior will not operate directly on optionally provided input pointers. If an input data pointer is provided, the data content will be copied to internal storage. The pointer is safe to use (or delete) as desired by the user after the buffer constructor returns.</p>

<h3 id="writeback">Writeback</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">writeback</code> behavior will submit a writeback operation to migrate data back to host in the destructor. This will only lead to an actual data copy if the data on the host is outdated. With hipSYCL explicit buffer behaviors, a writeback needs to be explicitly requested by invoking a buffer factory function with <code class="language-plaintext highlighter-rouge">writeback</code> in its name. This prevents users accidentally introducing performance bugs by means of unnecessary writebacks.</p>

<h3 id="syncasync">sync/async</h3>

<p>Only buffers with <code class="language-plaintext highlighter-rouge">sync</code> behavior block in their destructor. Buffers of <code class="language-plaintext highlighter-rouge">async</code> behavior do not - and therefore can be used to solve the buffer destructor performance antipattern:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="p">{</span>
  <span class="k">auto</span> <span class="n">b1</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">b2</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr2</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">b3</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr3</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  
  <span class="c1">// Submit kernels operating on b1,b2,b3 here</span>
<span class="p">}</span> <span class="c1">// Non-blocking buffer destructors</span>

<span class="c1">// At some later point, use q.wait() to wait</span>
<span class="c1">// for all writebacks</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span></code></pre></figure>

<p>Here async writeback views are used that do not block in their destructor. hipSYCL guarantees that memory allocated by buffer objects will not be freed if there are still operations in flight utilizing those allocations, so kernels and other operations using the buffer objects will complete successfully even if the user-facing buffer object has already been destroyed.</p>

<p><strong>For performance it should be considered best practice to use the async behaviors by default and only use the sync variants when it is absolutely necessary.</strong></p>

<h2 id="api-reference">API reference</h2>

<p>Not every combination of buffer behaviors makes sense. hipSYCL currently supports the following factory functions:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">/// Only uses internal storage, </span>
<span class="c1">/// no writeback, </span>
<span class="c1">/// blocking destructor</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_buffer</span><span class="p">(</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only uses internal storage,</span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Data pointed to by ptr is copied to internal storage.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_buffer</span><span class="p">(</span>
    <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only internal storage, </span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// non-blocking destructor</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_buffer</span><span class="p">(</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only internal storage,</span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Data pointed to by ptr is copied to internal storage.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_buffer</span><span class="p">(</span>
    <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// writes back,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_writeback_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// writes back,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="c1">/// The provided queue can be used by the user to </span>
<span class="c1">/// wait for the writeback to complete.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_writeback_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">,</span>
    <span class="k">const</span> <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span><span class="o">&amp;</span> <span class="n">q</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// does not write back,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// does not write back,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Additional factory functions exist for </span>
<span class="c1">/// buffer-USM interoperability.</span>
<span class="c1">/// Those will be covered in more detail in a future blog post.</span></code></pre></figure>

<p>For the full API reference, see the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/explicit-buffer-policies.md">hipSYCL documentation</a>.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="extension" /><summary type="html"><![CDATA[This post is part of a series where we discuss some features of the brand-new hipSYCL 0.9.1. Today I want to take a closer look at]]></summary></entry><entry><title type="html">hipSYCL 0.9.0 - SYCL 2020 and oneAPI DPC++ features coming to hipSYCL</title><link href="https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9/" rel="alternate" type="text/html" title="hipSYCL 0.9.0 - SYCL 2020 and oneAPI DPC++ features coming to hipSYCL" /><published>2021-02-22T13:38:06+01:00</published><updated>2021-02-22T13:38:06+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9/"><![CDATA[<p>On December 10, 2020, hipSYCL 0.9.0 was <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">released</a>. This release is significant for several reasons. As we are now on the final trajectory to releasing hipSYCL 0.9.1 as another big update, I felt that it is useful to take a step back and look at some of the highlights of what is already in 0.9.0 - ready for everybody to use.</p>

<h1 id="support-for-key-sycl-2020-features">Support for key SYCL 2020 features</h1>

<p>hipSYCL 0.9.0 is the first release that incorporates <a href="https://github.com/hipSYCL/featuresupport">features</a> from the SYCL 2020 specification.</p>

<p>SYCL 2020 is a major update on the older SYCL 1.2.1. Its highlights include a substantial amount of features that originally came from oneAPI DPC++ and have since been contributed to the SYCL 2020 specification. In particular, this includes:</p>
<ul>
  <li><em>Unified shared memory</em>, a pointer-based memory management interface as an alternative to the buffer-accessor model;</li>
  <li><em>Parallel reductions</em>;</li>
  <li><em>Subgroups</em> that can expose the inner workings of the hardware below work group level;</li>
  <li>Optimized <em>work group and subgroup primitives</em> such as reductions or scans;</li>
  <li><em>In-order queues</em> and the ability to explicitly specify dependencies in the DAG;</li>
  <li><em>Unnamed kernel lambdas</em> that reduce verbosity and simplify development.</li>
</ul>

<p>These are important features, as they allow more control over the hardware, enable more flexible usage patterns, or can reduce verbosity for programmers. It is therefore important that these features are well supported across implementations, such that developers can rely on them without limiting code portability.</p>

<p>This is why we felt it was important for hipSYCL 0.9.0 to move towards SYCL 2020. Developers can now write code using SYCL 2020 features with, say, DPC++ and maybe initially target Intel devices, but then seamlessly transition to hipSYCL when, for example, AMD GPUs need to be targeted.</p>

<p>Of course, switching between multiple implementations as needed only works because SYCL is an open standard. Without open standards, it is difficult to imagine ecosystems with multiple strong implementations. The SYCL implementation ecosystem is a great example of the power of standards - both in terms of the extremely broad hardware range that SYCL implementations collectively target, and because the design differences between implementations give each one unique strengths and weaknesses. For each use case, there is most likely a SYCL implementation that is a great fit or was maybe even designed explicitly with that use case in mind.</p>

<h2 id="code-example-with-sycl-2020">Code example with SYCL 2020</h2>

<p>In the past, SYCL was sometimes criticized for being too verbose. The following example uses unified shared memory, unnamed kernel lambdas and queue shortcuts from SYCL 2020. It’s hard to see how this code could be any <em>less</em> verbose.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;SYCL/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">4096</span><span class="p">;</span>
  <span class="kt">int</span> <span class="o">*</span><span class="n">shared_allocation</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>

  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">size</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Do some meaningful computation here instead of this :-)</span>
    <span class="kt">size_t</span> <span class="n">gid</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">shared_allocation</span><span class="p">[</span><span class="n">gid</span><span class="p">]</span> <span class="o">=</span> <span class="n">gid</span><span class="p">;</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

  <span class="c1">// Access result of your computation here</span>
  <span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">shared_allocation</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  
  <span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">shared_allocation</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>With hipSYCL 0.9.0, this kind of code works on all hardware that it supports: Any CPU, NVIDIA GPUs and AMD GPUs.</p>

<p><em>(As a sidenote, it’s important to realize that the ability to write such code does not mean that the old buffer-accessor model is obsolete and should never be used again - it is still great if you require the features that the buffer-accessor model additionally and automatically provides. For example, the buffer-accessor model provides automatic task graph construction which allows for automatic overlap of data transfers and kernels.)</em></p>

<h1 id="new-runtime-and-architecture">New runtime and architecture</h1>

<p>hipSYCL 0.9.0 is also the first release containing a new runtime library, entirely rewritten from scratch. As part of this work, so much of hipSYCL was changed and restructured that working with it now really <em>feels</em> like a completely different SYCL implementation. If you have some experience with the earlier 0.8 series, now might be a good time to check on hipSYCL again.</p>

<p>A look at the diff stats between the previous release, 0.8.2, and 0.9.0 shows that pretty much every file was modified, with a net increase of more than 16000 lines of code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git diff v0.8.2 v0.9.0 --stat
...
289 files changed, 28297 insertions(+), 11641 deletions(-)
</code></pre></div></div>

<p>This new runtime library was redesigned from the ground up with a multi-backend architecture in mind that allows using multiple backends simultaneously, with the final goal of being able to compile source files to a single binary that can run on all of hipSYCL's targets: CPUs, NVIDIA GPUs and AMD GPUs. This is in contrast to earlier hipSYCL versions, where the user had to decide at compile time which backend was targeted.</p>

<p>While hipSYCL 0.9.0 contains all the necessary runtime and SYCL kernel header support to target all backends simultaneously, it still misses some compiler components. As a consequence, hipSYCL 0.9.0 can target CPUs and <em>either</em> AMD <em>or</em> NVIDIA GPUs at the same time.</p>

<p>As the last missing piece of the puzzle, the required compiler support will be part of hipSYCL 0.9.1, and is in fact already merged and available in the <code class="language-plaintext highlighter-rouge">develop</code> branch on GitHub. In short: if you install the latest hipSYCL git version, you can already compile to a single binary that runs on CPUs, NVIDIA GPUs, and AMD GPUs.</p>

<p>To express the ability to target multiple backends simultaneously, hipSYCL 0.9.0 deprecates the old <code class="language-plaintext highlighter-rouge">--hipsycl-platform</code> and <code class="language-plaintext highlighter-rouge">--hipsycl-gpu-arch</code> arguments and introduces a new, unified way to specify compilation targets using the new <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code> argument. For example, to compile kernels for the OpenMP CPU backend as well as AMD gfx906 chips (Radeon VII/Instinct MI50 GPUs), the compiler argument <code class="language-plaintext highlighter-rouge">--hipsycl-targets=omp;hip:gfx906</code> can be used.</p>
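<p>Assuming hipSYCL's compiler wrapper <code class="language-plaintext highlighter-rouge">syclcc</code> is on the path, a full invocation for the example above might look like this (file names are illustrative):</p>

```shell
# Compile one binary with kernels for the OpenMP CPU backend
# and for AMD gfx906 GPUs, using the new unified target syntax.
syclcc --hipsycl-targets="omp;hip:gfx906" -O2 -o app app.cpp
```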

<p>The new runtime also introduces a lot of other features, such as memory management at a granularity below buffer size, and a different model for SYCL queues. These features are mainly important for future development and, at least for now, have limited impact for end users. The <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">release page</a> lists some more features.</p>

<h1 id="big-performance-improvements-for-nd_range-parallel_for-on-cpus">Big performance improvements for nd_range parallel_for on CPUs</h1>

<p>One of the more apparent improvements is the performance of <code class="language-plaintext highlighter-rouge">nd_range</code> <code class="language-plaintext highlighter-rouge">parallel_for</code> on CPUs. <code class="language-plaintext highlighter-rouge">nd_range</code> <code class="language-plaintext highlighter-rouge">parallel_for</code> is notoriously difficult to implement for pure-library CPU backends such as hipSYCL’s OpenMP backend, because the <code class="language-plaintext highlighter-rouge">nd_range</code> model allows for explicit work group barriers and collective group algorithms. This in turn requires independent forward-progress guarantees for each work item, which does not map well to CPUs - at least with the methods that pure C++ provides.</p>

<p>In hipSYCL 0.9.0, we transition from using threads for work items to a hybrid approach where multithreading is only used across work groups, and work items are represented using fibers (lightweight userspace threads with cooperative scheduling).</p>

<p>As an additional optimization, hipSYCL first attempts to execute a work group with a single fiber and a loop across work items, which the compiler may be able to vectorize. Only when independent forward progress for work items is actually needed, for example when a work group barrier or a collective group algorithm is encountered, will hipSYCL dynamically switch to a model where each work item is mapped to its own fiber. 
If no barrier is encountered, performance is similar to the hierarchical or basic parallel for execution models, which can be implemented very efficiently in pure-library SYCL backends.</p>
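<p>To see why barriers are the crux, here is a plain C++ sketch (not hipSYCL's actual implementation, and with hypothetical names): a pure-library backend can run a work group as a simple loop over work items, but a barrier forces that loop to be split, because every work item must reach the barrier before any may proceed past it. Work-item-local state that has to survive such a split is what makes one fiber per work item, each with its own stack, the general fallback:</p>

```cpp
#include <array>
#include <cstddef>

// A work group of 4 work items, executed by a single host thread.
constexpr std::size_t group_size = 4;

// The "kernel": each work item writes its value to scratch, hits a
// barrier, then reads a neighbour's slot. Without the barrier, item 0
// could read scratch[1] before item 1 has written it.
std::array<int, group_size> run_group_with_barrier() {
  std::array<int, group_size> scratch{};
  std::array<int, group_size> result{};

  // Loop over all work items up to the barrier.
  for (std::size_t lid = 0; lid < group_size; ++lid)
    scratch[lid] = static_cast<int>(lid) + 1;

  // The barrier becomes the point where the loop is split: the first
  // loop has finished for *all* work items before the second starts.

  // Loop over all work items after the barrier.
  for (std::size_t lid = 0; lid < group_size; ++lid)
    result[lid] = scratch[lid] + scratch[(lid + 1) % group_size];

  return result;
}
```

<p>Both loops are barrier-free and thus candidates for compiler vectorization; only when work-item-local variables would have to live across the split does the backend need genuinely independent execution contexts per work item.</p>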

<p>Overall, compared to hipSYCL 0.8.0, performance can increase by several orders of magnitude for typical workloads.</p>

<h1 id="extensions">Extensions</h1>

<p>hipSYCL 0.9.0 also introduces a couple of new SYCL extensions, most notably a new execution model: Scoped parallelism allows for a performance-portable formulation of kernels across backends that still provides access to lower-level features such as local memory or work group barriers that are otherwise only available in <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for. See the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a> for more information. 
In short, the idea behind scoped parallelism is to distinguish between the user-requested <em>logical parallelism</em> within a work group, which describes the number of work items that should be processed in a group, and the implementation-provided <em>physical parallelism</em>, which refers to the actual work group parallelism running in the backend. In scoped parallelism, the SYCL implementation decides on a degree of physical parallelism that is well suited for the hardware, and then distributes the logical work items across the physical resources. This additional freedom for the SYCL implementation to choose the actual work group parallelism is what makes the execution model more performance-portable than <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for.</p>

<p>As a brief teaser, this is what work group reduction using local memory looks like in scoped parallelism:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;SYCL/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
  
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">input_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="kt">int</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  
  <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">input_size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
    <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
  
  <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">Group_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  <span class="n">q</span><span class="p">.</span><span class="n">parallel</span><span class="p">(</span>
    <span class="c1">//number of groups</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">input_size</span> <span class="o">/</span> <span class="n">Group_size</span><span class="p">},</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">Group_size</span><span class="p">},</span> <span class="c1">//logical group size</span>
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">group</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">grp</span><span class="p">,</span> 
        <span class="n">sycl</span><span class="o">::</span><span class="n">physical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">physical_idx</span><span class="p">){</span>
      <span class="c1">// Code in this scope is executed</span>
      <span class="c1">// within the implementation-defined</span>
      <span class="c1">// physical iteration space.</span>

      <span class="c1">// Local memory can be allocated using the</span>
      <span class="c1">// sycl::local_memory extension.</span>
      <span class="n">sycl</span><span class="o">::</span><span class="n">local_memory</span><span class="o">&lt;</span><span class="kt">int</span> <span class="p">[</span><span class="n">Group_size</span><span class="p">]</span><span class="o">&gt;</span> <span class="n">scratch</span><span class="p">{</span><span class="n">grp</span><span class="p">};</span>
      
      <span class="c1">// `distribute_for` distributes the logical,</span>
      <span class="c1">// user-provided iteration space across the</span>
      <span class="c1">// physical one from the outer scope</span>
      <span class="n">grp</span><span class="p">.</span><span class="n">distribute_for</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">sub_group</span> <span class="n">sg</span><span class="p">,</span>
                             <span class="n">sycl</span><span class="o">::</span><span class="n">logical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
          <span class="n">scratch</span><span class="p">[</span><span class="n">idx</span><span class="p">.</span><span class="n">get_local_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)]</span> <span class="o">=</span>
                <span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">.</span><span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)];</span>
      <span class="p">});</span> 
      <span class="c1">// implicit barrier here</span>

      <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">Group_size</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">){</span>
        <span class="n">grp</span><span class="p">.</span><span class="n">distribute_for</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">sub_group</span> <span class="n">sg</span><span class="p">,</span>
                               <span class="n">sycl</span><span class="o">::</span><span class="n">logical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
          <span class="kt">size_t</span> <span class="n">lid</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">get_local_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
          <span class="k">if</span><span class="p">(</span><span class="n">lid</span> <span class="o">&lt;</span> <span class="n">i</span><span class="p">)</span>
            <span class="n">scratch</span><span class="p">[</span><span class="n">lid</span><span class="p">]</span> <span class="o">+=</span> <span class="n">scratch</span><span class="p">[</span><span class="n">lid</span><span class="o">+</span><span class="n">i</span><span class="p">];</span>
        <span class="p">});</span>
      <span class="p">}</span>
      
      <span class="n">grp</span><span class="p">.</span><span class="n">single_item</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](){</span>
        <span class="n">data</span><span class="p">[</span><span class="n">grp</span><span class="p">.</span><span class="n">get_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">*</span><span class="n">Group_size</span><span class="p">]</span> <span class="o">=</span> <span class="n">scratch</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
      <span class="p">});</span>
    <span class="p">});</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
  <span class="c1">// Use results here</span>
  <span class="c1">// ...</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>Scoped parallelism can be implemented both on top of <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for, for backends that support it well, and on top of hierarchical parallel for from SYCL 1.2.1. It can therefore be seen as a generalization or abstraction of both models.</p>

<p><em>(Note: We might adapt the interface of scoped parallelism slightly in the future to align better with some of the patterns found in the final SYCL 2020 specification)</em></p>

<h1 id="get-it">Get it!</h1>

<p>If I have piqued your interest in hipSYCL, head over to the <a href="https://github.com/illuhad/hipSYCL">GitHub repository</a> and download the <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">release</a>, or for even more new features, clone the repository from the <code class="language-plaintext highlighter-rouge">develop</code> branch!</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="sycl2020" /><category term="release" /><summary type="html"><![CDATA[On December 10, 2020, hipSYCL 0.9.0 was released. This release is significant for several reasons. As we are now on the final trajectory to releasing hipSYCL 0.9.1 as another big update, I felt it would be useful to take a step back and look at some of the highlights of what is already in 0.9.0 - ready for everybody to use.]]></summary></entry></feed>