<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://adaptivecpp.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://adaptivecpp.github.io/" rel="alternate" type="text/html" /><updated>2024-03-18T21:19:41+01:00</updated><id>https://adaptivecpp.github.io/feed.xml</id><title type="html">AdaptiveCpp.</title><subtitle>The independent, community-driven platform for heterogeneous programming in C++</subtitle><author><name>Aksel Alpay</name></author><entry><title type="html">hipSYCL: The first single-pass SYCL implementation with unified code representation</title><link href="https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp/" rel="alternate" type="text/html" title="hipSYCL: The first single-pass SYCL implementation with unified code representation" /><published>2023-02-01T20:00:00+01:00</published><updated>2023-02-01T20:00:00+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/sscp/compiler/generic-sscp/"><![CDATA[<h1 id="heterogeneous-utopia">Heterogeneous Utopia?</h1>

<ul>
  <li>Imagine you didn’t have to specify any targets when compiling with SYCL.</li>
  <li>Imagine compilation took not much longer than a regular clang compilation for your CPU.</li>
  <li>Imagine the resulting SYCL binary could run on CPUs - but also magically on any NVIDIA, Intel and AMD ROCm GPU. A <em>universal</em> binary.</li>
</ul>

<p>And imagine you could even use several of these devices at the same time.</p>

<h1 id="hipsycl-reality">hipSYCL reality!</h1>

<p>What may sound like science fiction is now reality in hipSYCL. hipSYCL has a new compilation flow that we call the <em>generic SSCP compiler</em>. SSCP stands for <em>single-source, single compiler pass</em>. We will discuss what this means later.</p>

<p>To enable the magic, just compile like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>syclcc --hipsycl-targets=generic -o test test.cpp
</code></pre></div></div>
<p>In the future, the plan is to make <code class="language-plaintext highlighter-rouge">--hipsycl-targets=generic</code> the default, so that this can be omitted as well. But what does it do? Let’s first discuss the state of the art.</p>

<h1 id="current-status-quo">Current status quo</h1>

<p>Current SYCL implementations, including hipSYCL and DPC++, rely on the <em>single-source, multiple compiler passes</em> (SMCP) model. This means that they invoke a separate SYCL device compiler that parses and compiles the code, and generates a device binary. Afterwards, the usual host compiler is invoked, which again parses and compiles the code for the host, and also ensures that device binaries are embedded in the application. The result is a host binary with embedded device code containing kernels. Compilers for other programming models, such as NVIDIA’s nvcc CUDA compiler or AMD’s HIP compiler, work similarly.</p>

<p>Maybe you have spotted one issue already: The code is parsed multiple times – once for host, and once for device. And with C++ being C++, this can take time.</p>

<h2 id="but-it-gets-worse">But it gets worse</h2>

<p>But that’s not all there is to it. As it turns out, there is no unified code representation that Intel, NVIDIA and AMD GPU compute drivers can all understand. Intel GPUs want SPIR-V. SPIR-V is supported by neither AMD nor NVIDIA compute drivers. NVIDIA has their own code representation called PTX. AMD GPUs want amdgcn code.</p>

<p>So, if you want to have a binary that runs <em>everywhere</em>, SYCL implementations actually need to invoke the device compiler several times: Once for every code format that is required (PTX, SPIR-V, amdgcn). This means we are already looking at compiling the source code four times: Once for the host, and three times for the GPUs.</p>

<h2 id="but-it-gets-worse-again">But it gets worse (again)</h2>

<p>But it’s even worse than that. AMD’s ROCm compute platform does not have a device-independent code representation. So, in order to support every AMD GPU, you actually need to compile separately for each GPU supported by ROCm. My ROCm 5.3 installation can generate code for 38 different AMD GPU architectures (just count the number of ISA bitcode files in the <code class="language-plaintext highlighter-rouge">rocm/amdgcn/bitcode</code> directory). This means that in total we are now looking at parsing and compiling code once for the host, once for PTX, once for SPIR-V, and 38 times for AMD. 41 times in total. Clearly this is not practical.</p>

<p>And this approach also does not scale if we think about potentially supporting more backends in the future.</p>

<h1 id="hipsycl-generic-sscp-compiler-to-the-rescue">hipSYCL generic SSCP compiler to the rescue!</h1>

<p>So, what is this generic SSCP thing? It actually combines two ideas:</p>
<ol>
  <li>It is a single-pass compiler. This is what SSCP means: the device code is extracted during the compilation for the host, so the code is only parsed a single time.</li>
  <li>It introduces a generic, backend and device-independent code representation based on LLVM IR. The SSCP compiler stores this code representation in the application. At runtime, hipSYCL will then translate the generic code representation to whatever is needed: PTX, SPIR-V, or amdgcn code for one of the 38 AMD ROCm GPUs. Effectively, this means that we now have a unified code representation across all backends, even if they by themselves do not support one.</li>
</ol>

<p>The consequence is that we get a binary that can run on all supported devices, while parsing the code only once – just like a regular C++ compilation.</p>

<h1 id="what-are-the-costs-at-runtime">What are the costs at runtime?</h1>

<p>Now you might be asking: Hang on! You are now compiling at runtime, so you have just moved the cost to runtime! But that’s not true: Most likely your system does not contain all of the 41 different GPU architectures, so hipSYCL never actually has to generate code for all these targets. Runtime compilation only compiles for the <em>individual need</em> of the user, while the binary retains the <em>capability</em> to run on all supported devices. Additionally, even if you did run on all of these devices, you would still have saved parsing the code 41 additional times, because runtime compilation starts from the stored code representation and does not involve source code that needs to be parsed.</p>

<p>But there’s another important point: SYCL implementations effectively already do runtime compilation! If a SYCL implementation feeds PTX code to the CUDA driver, the CUDA driver will already compile this PTX code to machine code at runtime. The same is true for SPIR-V code. So, runtime compilation is not new behavior in SYCL, but something that SYCL applications already need to deal with today: It is quite likely that your first kernel launch will take longer due to drivers compiling the kernel on the fly. The additional step that we introduce roughly doubles that existing runtime compilation time. In other words, there is additional overhead, but it does not change the fundamental order of magnitude of existing runtime compilation costs. If your SYCL application can tolerate current runtime compilation costs, it will likely be able to tolerate the additional step too.</p>

<h1 id="compile-time-improvements">Compile time improvements</h1>

<p>What does this bring in terms of compile times? The graph below shows the time I measured for compiling the <a href="https://github.com/uob-hpc/babelstream">BabelStream</a> benchmark with various compilation flows in hipSYCL:</p>

<p><img src="/assets/images/sscp_babelstream_compiletime.png" alt="Compile time improvements of the new generic SSCP compiler" /></p>

<p>The <em>host</em> case describes a regular clang compilation for CPU without specific SYCL compiler logic. This is our baseline. The <em>host,gfx900,…</em> cases correspond to compiling for 1, 2, and 3 AMD GPUs with the old multipass compiler based on the clang HIP toolchain. <em>nvc++</em> refers to the case where hipSYCL operates as a CUDA library for NVIDIA’s nvc++ compiler.</p>

<p>The <em>host,generic</em> bar shows the time when our new generic SSCP compiler is enabled. As can be seen, the new compiler takes only roughly 15% longer than the host compilation. But it is over twice as fast compared to compiling for the three AMD GPUs with the previous compiler. And remember that the resulting binary supports not only 3 GPUs, but 38 AMD GPUs, plus any NVIDIA GPU, plus any Intel GPU. You can imagine how long it would have taken to build a binary with equal portability using the older hipSYCL compiler, or any other SYCL implementation.</p>

<h1 id="performance">Performance</h1>

<p>What does performance look like with the new compiler? The boring answer is: It’s similar to the old one, typically within 10% in either direction. So you really get the same kernel performance, but with more portability of the resulting binary and lower compile times. And we have not even started optimizing for performance yet, as the development focus so far has mainly been on functionality.</p>

<h1 id="conclusion">Conclusion</h1>

<p>hipSYCL has a major new feature: A compiler that can generate ultra-portable binaries with lower compile times than other approaches and without sacrificing performance. If you want to play with it, it is part of the main <a href="https://github.com/illuhad/hipSYCL">hipSYCL repository</a>. It can run some very complex applications, but be aware that a couple of SYCL features are not yet implemented because they are still being worked on - in particular atomics, the SYCL 2020 group algorithm library and SYCL 2020 reductions.</p>

<p><a href="https://github.com/zjin-lcf/hecbench">HeCBench</a> is a large benchmark collection that provides applications in various programming models, gathered from various sources. The fact that it contains SYCL ports makes it interesting for evaluating hipSYCL, as the performance of the SYCL versions compiled with hipSYCL can be compared to the native programming models. In this blog post, we compare hipSYCL performance with native HIP performance on an AMD Radeon Pro VII.</p>

<h1 id="benchmark-selection">Benchmark selection</h1>

<p>HeCBench overall contains over 280 benchmarks, and hence evaluating all of them is very time-consuming. Some of them don’t run yet with hipSYCL, e.g. because they rely on DPC++-specific extensions or non-standard SYCL behavior (more details on these issues can be found in <a href="https://dl.acm.org/doi/10.1145/3529538.3530005">this paper</a>), but the majority works. So, to simplify the problem at hand, we select the first ~30 benchmarks in alphabetical order that work with hipSYCL. Additionally, we include four benchmarks that we already had data on from prior work: XSBench, RSBench, md5hash and nbody.</p>

<p>Following these criteria, we have selected the following applications:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aligned-types
amgmk
aobench
asta
atomicCAS
atomicIntrinsics
atomicReduction
attention
babelstream
bezier-surface
binomial
bitonic-sort
bsearch
bspline-vgh
ccsd-trpdrv
clenergy
convolutionSeparable
crc64
damage
dp
dslash
expdist
extend2
extrema
fft
filter
floydwarshall
fpc
gamma-correction
XSBench
RSBench
md5hash
nbody
</code></pre></div></div>

<p>Some of these applications are functional tests rather than benchmarks (e.g. <code class="language-plaintext highlighter-rouge">aligned-types</code>), some are memory-bound (e.g. <code class="language-plaintext highlighter-rouge">babelstream</code>), and others are compute-bound (e.g. <code class="language-plaintext highlighter-rouge">fft</code>). So, we have a good mixture of different use cases at hand that is hopefully representative of common real-world scenarios.</p>

<h1 id="results">Results</h1>

<p>The plot below shows the relative performance between the hipSYCL results and the native HIP results. Some applications return more than one result, in which case multiple results are shown for one application. This is prominently the case for BabelStream.
Where the application itself did not provide performance results (e.g. for some functional tests), the wall time of the application execution was measured. The vertical red lines indicate performance parity within 20%.</p>

<p>As can be seen, the vast majority of applications perform within 20% of the native HIP performance. Those applications that perform worse are almost exclusively ones that are not necessarily geared towards performance measurements, such as aligned/unaligned copy microbenchmarks or functional tests.</p>

<p>On the other hand, there are also numerous cases where hipSYCL substantially outperforms HIP, such as <code class="language-plaintext highlighter-rouge">aobench</code> at almost twice the performance, and some CAS tests with an even higher relative performance. In fact, the CAS tests for an atomic maximum implementation even outperform HIP by over 20x, and are not shown in the plot in order to retain a reasonable axis range.</p>

<p><img src="/assets/images/hipsycl-relative-perf.png" alt="relative HeCBench performance between hipSYCL and HIP" /></p>

<h1 id="conclusion">Conclusion</h1>

<p>It is apparent that hipSYCL can reliably deliver good performance when looking at the HeCBench applications on the investigated AMD hardware. While there are a few cases where HIP outperforms hipSYCL, there are also cases where hipSYCL substantially outperforms HIP.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="amd" /><category term="hecbench" /><summary type="html"><![CDATA[HeCBench]]></summary></entry><entry><title type="html">hipSYCL 0.9.2 - compiler-accelerated CPU backend, nvc++ support and more</title><link href="https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2/" rel="alternate" type="text/html" title="hipSYCL 0.9.2 - compiler-accelerated CPU backend, nvc++ support and more" /><published>2022-03-11T21:30:00+01:00</published><updated>2022-03-11T21:30:00+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/release/cpu/extension/nvc++/hipsycl-0.9.2/"><![CDATA[<p>In February this year, hipSYCL 0.9.2 was <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.2">released</a>. This release includes major new features, some of which I want to discuss here.</p>

<h1 id="compiler-accelerated-cpu-support">Compiler-accelerated CPU support</h1>

<p>One of the major features of hipSYCL 0.9.2 is dedicated compiler support for the CPU backend. This can increase performance by several orders of magnitude in some cases, and deliver high performance on <em>any</em> CPU supported by LLVM. This is big news in the SYCL ecosystem, because until now, affected code could only be run efficiently on CPUs if an OpenCL implementation existed for the particular CPU.</p>

<p>Let me describe what this is about:</p>

<h2 id="previous-support-library-only-cpu-backend">Previous support: Library-only CPU backend</h2>

<p>hipSYCL’s CPU backend has traditionally been implemented as an OpenMP library. Consequently, it can be used with any OpenMP C++ compiler, which can be a portability advantage - it allows us to run SYCL code on <em>any</em> CPU for which an OpenMP compiler exists. Practically, this is everywhere.</p>

<p>The backend can provide good performance for kernels written in SYCL’s high-level <code class="language-plaintext highlighter-rouge">parallel_for(sycl::range&lt;Dimension&gt;, Kernel)</code> model. However, the lower-level <code class="language-plaintext highlighter-rouge">parallel_for(sycl::nd_range&lt;Dimension&gt;, Kernel)</code> model is quite different: While the high-level <code class="language-plaintext highlighter-rouge">parallel_for</code> does not allow for work group barriers to occur, the <code class="language-plaintext highlighter-rouge">nd_range</code> model allows users to have explicit barriers in their code:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;sycl/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">global_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">local_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">nd_range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">global_size</span><span class="p">,</span> <span class="n">local_size</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Code before barrier here</span>
    
    <span class="c1">// Waits until all items from the work group have</span>
    <span class="c1">// executed the previous code</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">group_barrier</span><span class="p">(</span><span class="n">idx</span><span class="p">.</span><span class="n">get_group</span><span class="p">());</span>
    
    <span class="c1">// Code after the barrier here - will be executed </span>
    <span class="c1">// once *all* items in the group</span>
    <span class="c1">// have finished the previous code.</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>

<p>On CPU, for performance we generally want to employ multithreading across work groups, and then iterate across work items within a group. Ideally, the compiler can then (auto-)vectorize this inner loop across work items. This maps well to hierarchical parallelism from SYCL 1.2.1, or hipSYCL’s <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">scoped parallelism</a> programming model extension.</p>

<p>For <code class="language-plaintext highlighter-rouge">nd_range</code> barriers however, all code for all items in a group needs to finish before we can proceed. 
This is an issue for library implementations of SYCL, because it prevents us from implementing work items as iterations of an (auto-vectorized) loop. Instead, each work item must live within its own mechanism that provides concurrency, so that all work items can reach the barrier at the same time. hipSYCL uses fibers for this purpose. While fibers have much lower overhead compared to actual full-blown threads, there are still issues with this approach:</p>

<ul>
  <li>A barrier requires context-switching through all fibers to make sure they have reached the barrier. A fiber context switch is effectively a switch to a different stack. The relative cost of this operation is much higher compared to a barrier on e.g. a GPU, so typical GPU fine-grained parallelism patterns will not run efficiently on CPUs with this model. This is a performance portability issue.</li>
  <li>Additionally, code cannot be vectorized across multiple fibers since each fiber runs independently. Therefore, there is no vectorization across work items. Code that wants to benefit from vectorization has to explicitly employ inner loops for each work item that can be vectorized, e.g. by using the <code class="language-plaintext highlighter-rouge">sycl::vec</code> class. This is another performance portability issue, since this is not how typical GPU code is written.</li>
</ul>

<h2 id="alternative-sycl-implementations-cpu-support-via-opencl">Alternative SYCL implementations: CPU support via OpenCL</h2>

<p>So how do other SYCL implementations solve this issue? It is clear that if we can transform the kernel to something like this, the problem is solved:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;sycl/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">global_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">group_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  
  <span class="c1">// Parallelize across work groups</span>
  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">global_size</span><span class="o">/</span><span class="n">group_size</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">group_idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Compiler should vectorize this loop</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">work_item</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">work_item</span> <span class="o">&lt;</span> <span class="n">group_size</span><span class="p">;</span> <span class="o">++</span><span class="n">work_item</span><span class="p">){</span>
      <span class="c1">// Code before barrier here</span>
    <span class="p">}</span>
    <span class="c1">// The barrier is now a no-op since the loop already guarantees</span>
    <span class="c1">// barrier semantics</span>
    <span class="c1">// sycl::group_barrier(idx.get_group());</span>
    
    <span class="c1">// Compiler should vectorize this loop</span>
    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">work_item</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">work_item</span> <span class="o">&lt;</span> <span class="n">group_size</span><span class="p">;</span> <span class="o">++</span><span class="n">work_item</span><span class="p">){</span>
      <span class="c1">// Code after barrier here</span>
    <span class="p">}</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>

<p><em>Remark: This effectively means automatically transforming nd_range kernels to patterns that resembles SYCL hierarchical parallelism or hipSYCL scoped parallelism models</em></p>

<p>And this transformation is basically what existing CPU OpenCL implementations do when compiling code. Since other SYCL implementations such as DPC++ or ComputeCpp mostly rely on OpenCL to deliver performance on CPUs, these SYCL implementations have effectively offloaded the issue to the OpenCL implementation.</p>

<p>However, there is one problem: Very few CPU vendors actually provide an OpenCL implementation for their hardware. So, unless we are only interested in running on Intel CPUs, we have a portability issue on our hands.
Additionally, what if we don’t want to use OpenCL as the SYCL runtime backend, but OpenMP or TBB for CPUs? Wouldn’t it make sense to pull the required compiler transformations from the OpenCL layer into the layer of the SYCL compiler?</p>

<h2 id="our-solution-combining-the-advantages-of-both">Our solution: Combining the advantages of both</h2>

<p>This is exactly what we have done in hipSYCL. We have integrated these compiler transformations into the hipSYCL infrastructure. If this feature is enabled, it will apply those transformations to the regular host compilation pass - which currently uses OpenMP, but could just as well work with other runtimes such as TBB.</p>

<p>The consequence: We can support efficient <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for on any CPU supported by LLVM. No need for an OpenCL implementation anymore, as the transformations run as part of clang’s regular compilation for the host CPU.</p>

<p>If you want to use this feature, you can just pass <code class="language-plaintext highlighter-rouge">omp.accelerated</code> as target to the <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code> argument. Details on using it can be found <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/using-hipsycl.md">here</a>.</p>
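<p>A typical invocation might then look as follows (the source and output file names are placeholders):</p>

```
syclcc --hipsycl-targets=omp.accelerated -O3 -o app app.cpp
```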

<p>More technical details on how it works <em>exactly</em> can be found <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/compilation.md#compiler-support-to-accelerate-nd_range-parallel_for-on-cpus-ompaccelerated">here</a>.</p>

<p>And <a href="https://github.com/illuhad/hipSYCL/pull/682">benchmark results</a> can be found in the original pull request.</p>

<h2 id="immediate-support-for-new-nvidia-hardware-nvc-backend">Immediate support for new NVIDIA hardware: NVC++ backend</h2>

<p>Another large, new feature in hipSYCL 0.9.2 is nvc++ support. We have added <code class="language-plaintext highlighter-rouge">cuda-nvcxx</code> as an option that can be passed to <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code>. In this case, the nvc++ compilation flow is activated, in which hipSYCL acts as a regular CUDA library for nvc++ - without any compiler magic.</p>

<p>Since nvc++ is part of NVIDIA’s HPC SDK, and hence an officially supported compiler from NVIDIA, this means that with hipSYCL’s nvc++ backend, it is possible to use hipSYCL on NVIDIA GPUs with the very latest CUDA versions, or latest hardware from day one after release.</p>

<p>Currently, all SYCL implementations with CUDA backends (including hipSYCL) rely on clang, which may not always support the latest CUDA versions immediately, or just assumes that they behave similarly to older versions. With hipSYCL’s nvc++ backend, the SYCL ecosystem becomes independent of the CUDA support level in clang.</p>

<p>Additionally, the nvc++ backend does not require LLVM at all. Therefore, if only the nvc++ backend is required, hipSYCL can now be deployed <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/install-cuda.md#if-using-nvc">without LLVM dependency</a>. This can significantly simplify deployment e.g. on existing NVIDIA HPC systems, where nvc++ and the NVIDIA HPC SDK might already be preinstalled. Just point hipSYCL to nvc++ and you are good to go.</p>

<p>nvc++ works slightly differently on a technical level compared to clang-based compilation flows. clang parses the source code multiple times (for the host and all targeted devices). Macros can then be used to detect which compilation pass currently takes place, and code paths can be specialized accordingly.
nvc++ on the other hand parses the code only once. It is therefore not possible with nvc++ to use macros to detect e.g. whether host or device is currently targeted.
<em>Note: This behavior does not violate the SYCL specification, which defines both the single-source, multiple compiler pass (SMCP) and single-source, single compiler pass (SSCP) models. SMCP is what clang does, while nvc++ follows SSCP.</em></p>

<p>Consequently, the recommended way to detect the targeted backend in source code is no longer using macros such as <code class="language-plaintext highlighter-rouge">__SYCL_DEVICE_ONLY__</code>. Instead, we have introduced the <code class="language-plaintext highlighter-rouge">__hipsycl_if_target</code> mechanism which generalizes both to the clang as well as nvc++ cases. See <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/macros.md#macros-to-specialize-code-paths-based-on-backend">here</a> for details.</p>

<h2 id="scoped-parallelism-v2">Scoped parallelism v2</h2>

<p>Scoped parallelism is a hipSYCL-specific programming model that is designed to expose all the low-level control that the <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for model provides, while additionally remaining more performance portable. This affects in particular library-only compilation flows, such as hipSYCL’s OpenMP backend when the new <code class="language-plaintext highlighter-rouge">omp.accelerated</code> flow is not used.</p>

<p>hipSYCL has already had the scoped parallelism programming model in earlier versions. hipSYCL 0.9.2 cranks it up to the next level and significantly improves and extends the model (<a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a>).
For example, it now allows the implementation to expose structure below sub-group granularity by allowing infinite nesting of groups - even in multiple dimensions.</p>

<p>Here is an example:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>

<span class="n">q</span><span class="p">.</span><span class="n">parallel</span><span class="p">(</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">global_range</span> <span class="o">/</span> <span class="n">local_range</span><span class="p">},</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">local_range</span><span class="p">},</span>
  <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="k">auto</span> <span class="n">group</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Optionally, groups can be decomposed into subunits</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_groups</span><span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span> <span class="n">subgroup</span><span class="p">)</span> <span class="p">{</span>
      <span class="c1">// This can be nested arbitrarily deep</span>
      <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_groups</span><span class="p">(</span><span class="n">subgroup</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="k">auto</span> <span class="n">subsubgroup</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">sycl</span><span class="o">::</span><span class="n">distribute_items</span><span class="p">(</span><span class="n">subsubgroup</span><span class="p">,</span> <span class="p">[</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">s_item</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
          <span class="c1">// execute code for each work item</span>
        <span class="p">});</span>
        <span class="c1">// Explicit group barriers and group algorithms are allowed</span>
        <span class="n">sycl</span><span class="o">::</span><span class="n">group_barrier</span><span class="p">(</span><span class="n">subgroup</span><span class="p">);</span>
      <span class="p">});</span>
    <span class="p">});</span>
  <span class="p">});</span></code></pre></figure>

<p>Details and more examples can be found in the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a>.</p>

<h2 id="but-wait-theres-more">But wait, there’s more!</h2>

<p>hipSYCL 0.9.2 contains more new features, such as</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">atomic_ref</code></li>
  <li>Better explicit multipass support</li>
  <li>New extensions such as asynchronous buffers</li>
  <li>More :-)</li>
</ul>
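
<p>As a small teaser for <code class="language-plaintext highlighter-rouge">atomic_ref</code>, here is an illustrative sketch of how the SYCL 2020 interface can be used to atomically increment a counter. This is hypothetical example code, assuming <code class="language-plaintext highlighter-rouge">q</code> is a <code class="language-plaintext highlighter-rouge">sycl::queue</code>:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">// Allocate and zero-initialize a shared counter in device memory
int* counter_ptr = sycl::malloc_device&lt;int&gt;(1, q);
q.fill(counter_ptr, 0, 1).wait();

q.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){
  // Atomically increment the shared counter from every work item
  sycl::atomic_ref&lt;int, sycl::memory_order::relaxed,
                   sycl::memory_scope::device&gt; counter{*counter_ptr};
  counter.fetch_add(1);
}).wait();

sycl::free(counter_ptr, q);</code></pre></figure>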

<p>The release can be found <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.2">here</a>. Of course, you can also always just clone the latest develop branch for even more new features and fixes!</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="release" /><category term="cpu" /><category term="extension" /><category term="nvc++" /><summary type="html"><![CDATA[In february this year, hipSYCL 0.9.2 was released. This release includes major new features, some of which I want to discuss here.]]></summary></entry><entry><title type="html">hipSYCL 0.9.1 features: buffer-USM interoperability</title><link href="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop/" rel="alternate" type="text/html" title="hipSYCL 0.9.1 features: buffer-USM interoperability" /><published>2021-05-26T19:30:00+02:00</published><updated>2021-05-26T19:30:00+02:00</updated><id>https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-usm-interop/"><![CDATA[<p>This post is part of a series where we discuss some features of hipSYCL 0.9.1. Today’s topic is interoperability between buffers and USM pointers.</p>

<h1 id="why-it-matters">Why it matters</h1>

<p>SYCL 2020 features two major memory management models, both of which are supported by hipSYCL:</p>
<ol>
  <li>The traditional buffer-accessor model that has already been available in the old SYCL 1.2.1. In this model, a task graph is constructed automatically based on access conflicts between access specifications described by <code class="language-plaintext highlighter-rouge">accessor</code> objects. These <code class="language-plaintext highlighter-rouge">accessor</code> objects are also used to access data in kernels. The buffer-accessor model provides the SYCL runtime with a lot of information about how much data is used and how it is used. This can help scheduling, and enables automatic optimizations such as overlap of data transfers and kernels.</li>
  <li>The pointer-based USM model that was introduced in SYCL 2020. Here, allocations are managed explicitly and (unless shared allocations are used) data must be copied explicitly between host and device. The USM model provides more control to the user compared to the buffer-accessor model, at the cost of requiring the user to do work that the runtime can do automatically in the buffer-accessor model. It also forces the programmer to think in a model of a host-device dichotomy, which may not be an ideal fit when CPUs are targeted. On the other hand, it is usually considerably easier to port existing pointer-based code to SYCL using the USM model compared to the buffer-accessor model.</li>
</ol>
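
<p>To make the contrast concrete, here is a small sketch of the same computation expressed in both models. This is illustrative code, assuming a SYCL 2020 implementation such as hipSYCL:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">sycl::queue q;
std::size_t n = 1024;

// Buffer-accessor model: the runtime infers dependencies and data transfers.
{
  std::vector&lt;int&gt; data(n, 0);
  sycl::buffer&lt;int&gt; buf{data.data(), sycl::range{n}};
  q.submit([&amp;](sycl::handler&amp; cgh){
    sycl::accessor acc{buf, cgh};
    cgh.parallel_for(sycl::range{n}, [=](sycl::id&lt;1&gt; idx){ acc[idx] += 1; });
  });
} // The buffer destructor synchronizes and writes back to data.

// USM model: explicit allocation and explicit synchronization.
int* mem = sycl::malloc_device&lt;int&gt;(n, q);
q.parallel_for(sycl::range{n}, [=](sycl::id&lt;1&gt; idx){ mem[idx[0]] = 1; }).wait();
sycl::free(mem, q);</code></pre></figure>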

<p>It is apparent that both models have valid use cases and are complementary. However, in SYCL 2020, there is barely any interoperability between the two. Accessing data that is stored in a buffer using a USM pointer requires launching a custom kernel that explicitly copies all data elements from the buffer into a USM pointer. This is both cumbersome and comes at a performance cost.</p>

<p>Consequently, once a codebase has started using one particular model, it is effectively locked into it. This is problematic for several reasons:</p>

<ol>
  <li>As the SYCL software ecosystem grows, there is a <strong>real danger of ecosystem bifurcation</strong> if no mechanisms are provided to cross from USM-land to buffer-land and vice versa. A SYCL library with a USM pointer API will be of little use for a SYCL application that is written using buffers and accessors.</li>
  <li>SYCL is all about taking control when you want it, and letting SYCL do what it thinks is best otherwise. This allows to combine the best of two worlds: Low-level kernel optimizations for critical code paths, and the convenience of a high-level programming model for the remaining program. Consequently, <strong>it should be possible to use USM pointers whenever we want detailed low-level control, and move to a more high-level model for other parts of the program</strong>. Not having interoperability between them <strong>can block potential incremental optimization paths during software development</strong>.</li>
  <li>Which model will be better in terms of performance or clarity is not always apparent, and might be different for different parts of the program. As outlined above, both have strengths and weaknesses, and are complementary. <strong>We should therefore be able to mix buffers and USM pointers.</strong></li>
</ol>

<h1 id="buffer-usm-interoperability">buffer-USM interoperability</h1>

<p>To address these issues, hipSYCL 0.9.1 has introduced a comprehensive API for interoperability between USM pointers and buffers. In hipSYCL, you can always construct a buffer on top of existing USM pointers, or extract a USM pointer from a buffer – completely without additional data copies.</p>

<p>hipSYCL is the first SYCL implementation to expose such a feature, and the reason is easy to see: Buffer-USM interoperability in a meaningful, convenient and efficient way requires guarantees about the internal buffer behavior and SYCL implementation design that far exceed anything the SYCL specification provides.</p>

<p>We have therefore introduced an additional <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/runtime-spec.md">hipSYCL runtime specification</a> that more rigorously defines buffer behavior. In particular hipSYCL makes the following guarantees that are crucial for buffer-USM interoperability:</p>
<ul>
  <li>Buffers use USM pointers internally. All allocations a buffer performs are USM allocations, and buffers are entirely implemented on top of USM pointers.</li>
  <li>Allocations are persistent. Buffers guarantee that allocations, once they have been made, will remain valid at least until the end of buffer lifetime. Buffers will manage exactly one allocation per (physical) device.</li>
  <li>Buffers allocate lazily. When the buffer is used for the first time on a particular device, it will allocate memory large enough for all of the data such that no reallocations are needed for the lifetime of the buffer.</li>
</ul>

<p>There are two cases to distinguish for buffer-USM interoperability:</p>
<ol>
  <li>Temporal composition: Here we just move memory allocations from USM pointers into a buffer or vice versa; at each point in time only either a USM pointer or a buffer exists for a given allocation.</li>
  <li>The more complex case: Simultaneously accessing the same allocation as USM pointer and buffer. This is more complicated as it requires some correctness considerations by the programmer.</li>
</ol>

<h2 id="temporal-composition">Temporal composition</h2>

<p>Let’s focus on the simple case first: Assume we only want to turn an existing buffer into a USM pointer (or vice versa), but don’t want to use them simultaneously. hipSYCL has a fairly intuitive API for that: <code class="language-plaintext highlighter-rouge">buffer::get_pointer()</code> to extract USM pointers and a special buffer constructor that accepts USM pointers.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">mem</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_device</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>

<span class="c1">// Use mem as USM pointer</span>
<span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> 
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span> <span class="n">mem</span><span class="p">[</span><span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="p">});</span>
<span class="c1">// Make sure that USM operations terminate before</span>
<span class="c1">// using mem as buffer</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

<span class="c1">// Construct buffer on top of existing USM pointer</span>
<span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">device</span> <span class="n">dev</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">get_device</span><span class="p">();</span>
  <span class="c1">// Use mem for all operations for device dev. view() assumes</span>
  <span class="c1">// that the pointer holds valid data. If it should be considered empty,</span>
  <span class="c1">// use empty_view() instead.</span>
  <span class="c1">// Note the {} around the view: This is because we are actually passing</span>
  <span class="c1">// an std::vector. You can feed multiple USM pointers (one for each device)</span>
  <span class="c1">// into a buffer! Here, we only use one device.</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">buff</span><span class="p">{</span>
    <span class="p">{</span><span class="n">sycl</span><span class="o">::</span><span class="n">buffer_allocation</span><span class="o">::</span><span class="n">view</span><span class="p">(</span><span class="n">mem</span><span class="p">,</span> <span class="n">dev</span><span class="p">)},</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">}};</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
    <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
      <span class="n">acc</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">});</span>
  <span class="p">});</span>
  
  <span class="c1">// Turn buffer into USM pointer again.</span>
  <span class="c1">// Note: get_pointer() returns nullptr if no allocation is available on a device,</span>
  <span class="c1">// e.g. if a buffer hasn't yet been used on a device (remember: lazy allocation!) </span>
  <span class="c1">// or was not initialized with an appropriate view() object.</span>
  <span class="c1">// In this example, we know that the buffer has an allocation for this</span>
  <span class="c1">// device because we have given one in the constructor.</span>
  <span class="kt">int</span><span class="o">*</span> <span class="n">mem_extracted</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span>
  <span class="n">assert</span><span class="p">(</span><span class="n">mem_extracted</span> <span class="o">==</span> <span class="n">mem</span><span class="p">);</span>
  
  <span class="c1">// This makes sure that the buffer won't delete the allocation when</span>
  <span class="c1">// it goes out of scope, so we can use it afterwards.</span>
  <span class="c1">// By default, view() is non-owning, so in this example it's</span>
  <span class="c1">// not strictly necessary.</span>
  <span class="n">buff</span><span class="p">.</span><span class="n">disown_allocation</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span>
<span class="p">}</span> <span class="c1">// Closing scope synchronizes all tasks operating on the buffer.</span>

<span class="c1">// Use USM pointer again</span>
<span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">...).</span><span class="n">wait</span><span class="p">();</span>

<span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">mem</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span></code></pre></figure>

<h2 id="simultaneous-usm-pointers-and-buffers">Simultaneous USM pointers and buffers</h2>

<p>If we want to have both USM pointers and buffers accessing the same allocation simultaneously, things get more complicated. In this scenario, it is crucial to understand that</p>
<ol>
  <li>Buffers automatically calculate dependencies to other operations by detecting conflicting accessors. If operations use the same allocation but without going through accessors, buffers cannot know about these additional dependencies – the programmer must insert them manually.</li>
  <li>Buffers automatically calculate necessary data transfers by tracking whether data is valid or outdated on a particular device. If data is modified through USM pointers without the buffer knowing of it, the internal data tracking of the buffer is off and no longer reflects reality. This can cause the buffer to emit data transfers that shouldn’t take place, or omit data transfers when they might actually be required. To avoid this, we need to manually update the buffer’s data tracking.</li>
</ol>

<p>Here’s an example that shows how it’s done.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="c1">// Queue on a different device for later use</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">device</span> <span class="n">other_dev</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q2</span><span class="p">{</span><span class="n">other_dev</span><span class="p">};</span>

<span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">buff</span><span class="p">{</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">}};</span>

<span class="c1">// Extract USM pointer - at this point we are not yet guaranteed</span>
<span class="c1">// that an allocation exists because memory is allocated lazily.</span>
<span class="c1">// We can however force preallocation of memory using the hipSYCL </span>
<span class="c1">// handler::update extension (Not yet in hipSYCL 0.9.1, but in </span>
<span class="c1">// current develop branch on github).</span>
<span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="c1">// Also preallocate on another device for later use.</span>
<span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span> <span class="n">q2</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

<span class="c1">// Since memory has now been allocated by the buffer, we can extract</span>
<span class="c1">// a USM pointer.</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">usm_ptr</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">q</span><span class="p">.</span><span class="n">get_device</span><span class="p">());</span>

<span class="c1">// Submit a kernel operating on buff</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use acc here</span>
  <span class="p">});</span>
<span class="p">});</span>
<span class="c1">// Submit a USM kernel</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt2</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Important: Add dependency to the other kernel!</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt</span><span class="p">);</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use usm_ptr here</span>
  <span class="p">});</span>
<span class="p">});</span></code></pre></figure>

<p>So far no surprises – we just had to insert dependencies manually as expected. Let’s now look at submitting work to a different device. When submitting USM operations to another device, we need to inform the buffer that there are writes taking place on that device, and that it should consider allocations on other devices as outdated after this point. We again use <code class="language-plaintext highlighter-rouge">handler::update()</code> for this.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// This is necessary to allow the buffer to infer necessary data transfers correctly.</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt3</span> <span class="o">=</span> <span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Depend on previous USM operation</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt2</span><span class="p">);</span>
  <span class="c1">// This is a read-write accessor - it's important that there's</span>
  <span class="c1">// a write in the access mode if we want to write to usm_ptr</span>
  <span class="c1">// in the next kernel.</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">acc</span><span class="p">);</span>
<span class="p">});</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">usm_ptr2</span> <span class="o">=</span> <span class="n">buff</span><span class="p">.</span><span class="n">get_pointer</span><span class="p">(</span><span class="n">q2</span><span class="p">.</span><span class="n">get_device</span><span class="p">());</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">event</span> <span class="n">evt4</span> <span class="o">=</span> <span class="n">q2</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt3</span><span class="p">);</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use usm_ptr2 here</span>
  <span class="p">});</span>
<span class="p">});</span>
<span class="c1">// End with operation on first device</span>
<span class="n">q</span><span class="p">.</span><span class="n">submit</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">handler</span><span class="o">&amp;</span> <span class="n">cgh</span><span class="p">){</span>
  <span class="c1">// Buffer cannot know that USM kernel operates on same data,</span>
  <span class="c1">// so we need to manually insert a dependency.</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">depends_on</span><span class="p">(</span><span class="n">evt4</span><span class="p">);</span>
  <span class="c1">// This accessor will trigger data migration back to</span>
  <span class="c1">// the first device because we are submitting to q</span>
  <span class="c1">// instead of q2</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">accessor</span> <span class="n">acc</span><span class="p">{</span><span class="n">buff</span><span class="p">,</span> <span class="n">cgh</span><span class="p">};</span>
  <span class="n">cgh</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
    <span class="c1">// Use acc here</span>
  <span class="p">});</span>
<span class="p">});</span></code></pre></figure>

<p>In summary, even using buffers and USM pointers simultaneously for the same data is possible, but requires a solid understanding of SYCL and the guarantees that hipSYCL makes specifically.</p>

<p>Remember that buffers cannot know about USM kernels that utilize the same allocations, so always, always make sure to insert correct dependencies. Also, make sure to inform the buffer that an allocation has been <em>modified</em> so that it can correctly emit data transfers when an accessor is used for the buffer on a different device (including the host device). This can be done by constructing an accessor with a suitable access mode – either by using <code class="language-plaintext highlighter-rouge">handler::update()</code>, or by submitting a kernel that uses accessors.</p>

<p>In practice, things can be much simpler. If you are not working with complex task graphs, you can just use a SYCL 2020 in-order queue to avoid inserting all those dependencies manually. And if you are only working on a single device, the <code class="language-plaintext highlighter-rouge">handler::update()</code> calls may not be required at all.</p>
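
<p>As an illustrative sketch of this simpler scenario – assuming a single device and that the buffer has already allocated memory on it, e.g. via the <code class="language-plaintext highlighter-rouge">handler::update()</code> mechanism shown above – an in-order queue version might look like this:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp">sycl::queue q{sycl::property::queue::in_order{}};
sycl::buffer&lt;int&gt; buff{sycl::range{1024}};
// ... force allocation on q's device, e.g. via handler::update() ...
int* usm_ptr = buff.get_pointer(q.get_device());

// On an in-order queue, operations run in submission order,
// so no explicit depends_on() calls are needed.
q.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){ usm_ptr[idx[0]] = idx[0]; });
q.submit([&amp;](sycl::handler&amp; cgh){
  sycl::accessor acc{buff, cgh};
  cgh.parallel_for(sycl::range{1024}, [=](sycl::id&lt;1&gt; idx){ acc[idx] += 1; });
});</code></pre></figure>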

<h2 id="api-reference">API reference</h2>

<p>For the full API reference, see the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/buffer-usm-interop.md">hipSYCL documentation</a>.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="extension" /><summary type="html"><![CDATA[This post is part of a series where we discuss some features of hipSYCL 0.9.1. Today’s topic is interoperability between buffers and USM pointers.]]></summary></entry><entry><title type="html">hipSYCL 0.9.1 features: Asynchronous buffers and explicit buffer policies</title><link href="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies/" rel="alternate" type="text/html" title="hipSYCL 0.9.1 features: Asynchronous buffers and explicit buffer policies" /><published>2021-04-11T20:28:06+02:00</published><updated>2021-04-11T20:28:06+02:00</updated><id>https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/extension/hipsycl-091-buffer-policies/"><![CDATA[<p>This post is part of a series where we discuss some features of the brand-new hipSYCL 0.9.1. Today I want to take a closer look at</p>

<h1 id="asynchronous-buffers-and-explicit-buffer-policies">Asynchronous buffers and explicit buffer policies</h1>

<p>This is a new extension in hipSYCL that can make code using <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects <strong>much clearer while also improving performance</strong>. Interested? Then this blog post is for you.</p>

<h2 id="motivation-1-buffers-are-complicated">Motivation 1: Buffers are complicated</h2>

<p>A <code class="language-plaintext highlighter-rouge">sycl::buffer</code> is a very complicated object. Depending on a combination of multiple factors, the semantics of a <code class="language-plaintext highlighter-rouge">sycl::buffer</code> can be very different. Will it operate directly on input pointers, or will it copy input data to some internal storage? Will it submit a writeback in the destructor to copy data back to host?</p>

<p>I have frequently noticed users getting this wrong. This can either lead to correctness issues, for example</p>
<ul>
  <li>the buffer operates directly on the input pointer, while the user only intended to provide it as a source of initial data and wanted to reuse it after buffer construction</li>
  <li>no writeback is issued even though the user expected data to be copied back to host.</li>
</ul>

<p>Or performance bugs might be introduced - these are arguably even worse because you might not notice them right away and they might be difficult to find. Some performance bugs that I have seen in user code are:</p>
<ul>
  <li>The buffer issued an unexpected writeback, and thus copied data back to host without the user intending it</li>
  <li>The buffer did not operate directly on the pointer provided in the constructor, but instead first copied the data to internal storage which broke performance assumptions on the CPU backend.</li>
</ul>

<h2 id="motivation-2-the-buffer-destructor-antipattern">Motivation 2: The buffer destructor antipattern</h2>

<p>In addition, there is a related performance antipattern that I have noticed frequently. Consider the following code:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">T</span><span class="o">*</span> <span class="n">ptr1</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">T</span><span class="o">*</span> <span class="n">ptr2</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">T</span><span class="o">*</span> <span class="n">ptr3</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">size</span> <span class="o">=</span> <span class="p">...;</span>

<span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b1</span><span class="p">{</span><span class="n">ptr1</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b2</span><span class="p">{</span><span class="n">ptr2</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">b3</span><span class="p">{</span><span class="n">ptr3</span><span class="p">,</span> <span class="n">size</span><span class="p">};</span>

  <span class="c1">// Kernels using b1, b2, b3</span>

<span class="p">}</span> <span class="c1">// Destructors issue write-back</span></code></pre></figure>

<p>We construct three buffers that get an input pointer and then, when the scope closes, issue a writeback in their destructors. The problem is that this pattern executes the writebacks inefficiently: the SYCL specification requires that, in the destructor, a <code class="language-plaintext highlighter-rouge">buffer</code> has to wait for the completion of all operations that use it. This means that the following sequence of operations will be executed:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">b3.~buffer()</code> runs: submit writeback, wait for completion</li>
  <li><code class="language-plaintext highlighter-rouge">b2.~buffer()</code> runs: submit writeback, wait for completion</li>
  <li><code class="language-plaintext highlighter-rouge">b1.~buffer()</code> runs: submit writeback, wait for completion</li>
</ol>

<p>Here we have multiple unnecessary cases of synchronization. For performance it is always better to submit all available work asynchronously, and then wait as late as possible with as few wait calls as possible. So, something like the following will in general perform better:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">b3.~buffer()</code> runs: submit writeback asynchronously</li>
  <li><code class="language-plaintext highlighter-rouge">b2.~buffer()</code> runs: submit writeback asynchronously</li>
  <li><code class="language-plaintext highlighter-rouge">b1.~buffer()</code> runs: submit writeback asynchronously</li>
  <li>Maybe do some other work while the writebacks are being processed</li>
  <li>Wait for all writebacks to complete</li>
</ol>

<p>This has multiple advantages:</p>
<ol>
  <li>The SYCL implementation can process a larger task graph consisting of multiple writebacks as well as any other operations that might have been submitted previously, allowing for more optimization opportunities</li>
  <li>There is less latency between the writebacks when they are processed by the SYCL backend and hardware, because there is no synchronization in between them.</li>
  <li>The execution of writeback can be overlapped with other work on the host if the wait is executed later.</li>
</ol>

<p><em>Note:</em> While the worst case is clearly when the buffers submit writebacks as in this example, even if the buffers do not submit a writeback, there might still be a negative performance impact: Because all buffer destructors need to wait individually, they cause individual and potentially unnecessary flushes of the SYCL task graph.</p>

<h2 id="enter-explicit-buffer-policies">Enter explicit buffer policies</h2>

<p>To address both the destructor antipattern as well as the complexity of buffers, hipSYCL 0.9.1 introduces <em>explicit buffer policies</em>, which allow the user to explicitly specify the desired behavior of a buffer. We introduce the following terminology:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Destructor blocks?</th>
      <th>Writes back?</th>
      <th>Uses external storage?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>yes</td>
      <td><code class="language-plaintext highlighter-rouge">sync_</code></td>
      <td><code class="language-plaintext highlighter-rouge">_writeback_</code></td>
      <td><code class="language-plaintext highlighter-rouge">view</code></td>
    </tr>
    <tr>
      <td>no</td>
      <td><code class="language-plaintext highlighter-rouge">async_</code></td>
      <td>-</td>
      <td><code class="language-plaintext highlighter-rouge">buffer</code></td>
    </tr>
  </tbody>
</table>

<p>For example, a <code class="language-plaintext highlighter-rouge">sync_writeback_view</code> refers to the behavior where the destructor blocks (<code class="language-plaintext highlighter-rouge">sync</code>), a writeback will be issued in the destructor (<code class="language-plaintext highlighter-rouge">writeback</code>)  and the buffer will operate directly on provided input data pointers (<code class="language-plaintext highlighter-rouge">view</code>).</p>

<p>These behaviors are not expressed as new C++ types, but as regular <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects that are initialized with special buffer properties. Buffers with explicit behaviors are constructed using factory functions such as <code class="language-plaintext highlighter-rouge">buffer&lt;T, Dim&gt; make_sync_buffer(...)</code>.
Since these functions still return a <code class="language-plaintext highlighter-rouge">sycl::buffer&lt;T, Dim&gt;</code>, explicit buffer behaviors integrate well with existing SYCL code that relies on the <code class="language-plaintext highlighter-rouge">sycl::buffer</code> type.</p>

<p>Using those factory functions instead of directly constructing <code class="language-plaintext highlighter-rouge">sycl::buffer</code> objects significantly improves code clarity - the programmer can now see with one quick glance at the function call what is going to happen, and what performance implications there are.</p>

<h3 id="view">View</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">view</code> behavior operate directly on the provided input pointer when running on the CPU backend. The pointer must be considered as being in use by the buffer until all operations that the buffer is involved in have completed, including potential writebacks.</p>

<h3 id="buffer">Buffer</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">buffer</code> behavior will not operate directly on optionally provided input pointers. If an input data pointer is provided, the data content will be copied to internal storage. The pointer is safe to use (or delete) as desired by the user after the buffer constructor returns.</p>

<h3 id="writeback">Writeback</h3>

<p>Buffers of <code class="language-plaintext highlighter-rouge">writeback</code> behavior will submit a writeback operation to migrate data back to host in the destructor. This will only lead to an actual data copy if the data on the host is outdated. With hipSYCL explicit buffer behaviors, a writeback needs to be explicitly requested by invoking a buffer factory function with <code class="language-plaintext highlighter-rouge">writeback</code> in its name. This prevents users accidentally introducing performance bugs by means of unnecessary writebacks.</p>

<h3 id="syncasync">sync/async</h3>

<p>Only buffers with <code class="language-plaintext highlighter-rouge">sync</code> behavior block in their destructor. Buffers of <code class="language-plaintext highlighter-rouge">async</code> behavior do not - and therefore can be used to solve the buffer destructor performance antipattern:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
<span class="p">{</span>
  <span class="k">auto</span> <span class="n">b1</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">b2</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr2</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">b3</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">make_async_writeback_view</span><span class="p">(</span><span class="n">ptr3</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  
  <span class="c1">// Submit kernels operating on b1,b2,b3 here</span>
<span class="p">}</span> <span class="c1">// Non-blocking buffer destructors</span>

<span class="c1">// At some later point, use q.wait() to wait</span>
<span class="c1">// for all writebacks</span>
<span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span></code></pre></figure>

<p>Here async writeback views are used that do not block in their destructor. hipSYCL guarantees that memory allocated by buffer objects will not be freed if there are still operations in flight utilizing those allocations, so kernels and other operations using the buffer objects will complete successfully even if the user-facing buffer object has already been destroyed.</p>

<p><strong>For performance it should be considered best practice to use the async behaviors by default and only use the sync variants when it is absolutely necessary.</strong></p>

<h2 id="api-reference">API reference</h2>

<p>Not every combination of buffer behaviors makes sense. hipSYCL currently supports the following factory functions:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">/// Only uses internal storage, </span>
<span class="c1">/// no writeback, </span>
<span class="c1">/// blocking destructor</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_buffer</span><span class="p">(</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only uses internal storage,</span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Data pointed to by ptr is copied to internal storage.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_buffer</span><span class="p">(</span>
    <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only internal storage, </span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// non-blocking destructor</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_buffer</span><span class="p">(</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Only internal storage,</span>
<span class="c1">/// no writeback,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Data pointed to by ptr is copied to internal storage.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_buffer</span><span class="p">(</span>
    <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// writes back,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_writeback_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// writes back,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="c1">/// The provided queue can be used by the user to </span>
<span class="c1">/// wait for the writeback to complete.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_writeback_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">,</span>
    <span class="k">const</span> <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span><span class="o">&amp;</span> <span class="n">q</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// does not write back,</span>
<span class="c1">/// blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_sync_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Uses provided storage,</span>
<span class="c1">/// does not write back,</span>
<span class="c1">/// non-blocking destructor.</span>
<span class="c1">/// Directly operates on host_view_ptr.</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">,</span> <span class="kt">int</span> <span class="n">Dim</span><span class="p">&gt;</span>
<span class="n">buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">Dim</span><span class="o">&gt;</span> <span class="n">make_async_view</span><span class="p">(</span>
    <span class="n">T</span><span class="o">*</span> <span class="n">host_view_ptr</span><span class="p">,</span> <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="n">Dim</span><span class="o">&gt;</span> <span class="n">r</span><span class="p">);</span>

<span class="c1">/// Additional factory functions exist for </span>
<span class="c1">/// buffer-USM interoperability.</span>
<span class="c1">/// Those will be covered in more detail in a future blog post.</span></code></pre></figure>

<p>For the full API reference, see the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/explicit-buffer-policies.md">hipSYCL documentation</a>.</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="extension" /><summary type="html"><![CDATA[This post is part of a series where we discuss some features of the brand-new hipSYCL 0.9.1. Today I want to take a closer look at]]></summary></entry><entry><title type="html">hipSYCL 0.9.0 - SYCL 2020 and oneAPI DPC++ features coming to hipSYCL</title><link href="https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9/" rel="alternate" type="text/html" title="hipSYCL 0.9.0 - SYCL 2020 and oneAPI DPC++ features coming to hipSYCL" /><published>2021-02-22T13:38:06+01:00</published><updated>2021-02-22T13:38:06+01:00</updated><id>https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9</id><content type="html" xml:base="https://adaptivecpp.github.io/hipsycl/sycl2020/release/hipsycl-0.9/"><![CDATA[<p>On December 10, 2020, hipSYCL 0.9.0 was <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">released</a>. This release is significant for several reasons. As we are now on the final trajectory to releasing hipSYCL 0.9.1 as another big update, I felt that it is useful to take a step back and look at some of the highlights of what is already in 0.9.0 - ready for everybody to use.</p>

<h1 id="support-for-key-sycl-2020-features">Support for key SYCL 2020 features</h1>

<p>hipSYCL 0.9.0 is the first release that incorporates <a href="https://github.com/hipSYCL/featuresupport">features</a> from the SYCL 2020 specification.</p>

<p>SYCL 2020 is a major update on the older SYCL 1.2.1. Its highlights include a substantial amount of features that originally came from oneAPI DPC++ and have since been contributed to the SYCL 2020 specification. In particular, this includes:</p>
<ul>
  <li><em>Unified shared memory</em>, a pointer-based memory management interface as an alternative to the buffer-accessor model;</li>
  <li><em>Parallel reductions</em>;</li>
  <li><em>Subgroups</em> that can expose the inner workings of the hardware below work group level;</li>
  <li>Optimized <em>work group and subgroup primitives</em> such as reductions or scans;</li>
  <li><em>In-order queues</em> and the ability to explicitly specify dependencies in the DAG;</li>
  <li><em>Unnamed kernel lambdas</em> that reduce verbosity and simplify development.</li>
</ul>

<p>These are important features, as they allow more control over the hardware, enable more flexible usage patterns, or can reduce verbosity for programmers. It is therefore important that these features are well supported across implementations, such that developers can rely on them without limiting code portability.</p>

<p>This is why we felt it was important for hipSYCL 0.9.0 to move towards SYCL 2020. Developers can now write code using SYCL 2020 features with, say, DPC++ and maybe initially target Intel devices, but then seamlessly transition to hipSYCL when, for example, AMD GPUs need to be targeted.</p>

<p>Of course, switching between multiple implementations as needed only works because SYCL is an open standard. Without open standards, it is difficult to imagine ecosystems with multiple strong implementations. The SYCL implementation ecosystem is a great example of the power of standards - both in terms of the extremely broad hardware range that SYCL implementations collectively target, and because the design differences between implementations give each one unique strengths and weaknesses. For each use case, there is most likely a SYCL implementation that is a great fit or was maybe even designed explicitly with that use case in mind.</p>

<h2 id="code-example-with-sycl-2020">Code example with SYCL 2020</h2>

<p>In the past, SYCL was sometimes criticized for being too verbose. The following example uses unified shared memory, unnamed kernel lambdas and queue shortcuts from SYCL 2020. It’s hard to see how this code could be any <em>less</em> verbose.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;SYCL/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">4096</span><span class="p">;</span>
  <span class="kt">int</span> <span class="o">*</span><span class="n">shared_allocation</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>

  <span class="n">q</span><span class="p">.</span><span class="n">parallel_for</span><span class="p">(</span><span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="p">{</span><span class="n">size</span><span class="p">},</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">id</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Do some meaningful computation here instead of this :-)</span>
    <span class="kt">size_t</span> <span class="n">gid</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">shared_allocation</span><span class="p">[</span><span class="n">gid</span><span class="p">]</span> <span class="o">=</span> <span class="n">gid</span><span class="p">;</span>
  <span class="p">});</span>

  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>

  <span class="c1">// Access result of your computation here</span>
  <span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">shared_allocation</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  
  <span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">shared_allocation</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>With hipSYCL 0.9.0, this kind of code works on all hardware that it supports: Any CPU, NVIDIA GPUs and AMD GPUs.</p>

<p><em>(As a sidenote, it’s important to realize that the ability to write such code does not mean that the old buffer-accessor model is obsolete and should never be used again - it is still great if you require the features that the buffer-accessor model additionally and automatically provides. For example, the buffer-accessor model provides automatic task graph construction which allows for automatic overlap of data transfers and kernels.)</em></p>

<h1 id="new-runtime-and-architecture">New runtime and architecture</h1>

<p>hipSYCL 0.9.0 is also the first release containing a new runtime library, entirely rewritten from scratch. As part of this work, so much of hipSYCL was changed and restructured that working with it now really <em>feels</em> like a completely different SYCL implementation. If you have some experience with the earlier 0.8 series, now might be a good time to check on hipSYCL again.</p>

<p>A look at the diff stats between the previous release, 0.8.2, and 0.9.0 shows that pretty much every file was modified, with a net increase of more than 16000 lines of code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git diff v0.8.2 v0.9.0 --stat
...
289 files changed, 28297 insertions(+), 11641 deletions(-)
</code></pre></div></div>

<p>This new runtime library was redesigned from the ground up with a multi-backend architecture in mind that allows using multiple backends simultaneously, with the final goal of being able to compile source files to a single binary that can run on all of hipSYCL's targets: CPUs, NVIDIA GPUs and AMD GPUs. This is in contrast to earlier hipSYCL versions, where the user had to decide at compile time which backend was targeted.</p>

<p>While hipSYCL 0.9.0 contains all the necessary runtime and SYCL kernel header support to target all backends simultaneously, it still misses some compiler components. As a consequence, hipSYCL 0.9.0 can target CPUs and <em>either</em> AMD <em>or</em> NVIDIA GPUs at the same time.</p>

<p>As the last missing piece of the puzzle, the required compiler support will be part of hipSYCL 0.9.1, and is in fact already merged and available in the <code class="language-plaintext highlighter-rouge">develop</code> branch on GitHub. In short: if you install the latest hipSYCL git version, you can already compile to a single binary that runs on CPUs, NVIDIA GPUs, and AMD GPUs.</p>

<p>To express the ability to target multiple backends simultaneously, hipSYCL 0.9.0 deprecates the old <code class="language-plaintext highlighter-rouge">--hipsycl-platform</code> and <code class="language-plaintext highlighter-rouge">--hipsycl-gpu-arch</code> arguments and introduces a new, unified way to specify compilation targets using the new <code class="language-plaintext highlighter-rouge">--hipsycl-targets</code> argument. For example, to compile kernels for the OpenMP CPU backend as well as AMD gfx906 chips (Radeon VII/Instinct MI50 GPUs), the compiler argument <code class="language-plaintext highlighter-rouge">--hipsycl-targets=omp;hip:gfx906</code> can be used.</p>
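<p>Assuming hipSYCL's compiler wrapper <code class="language-plaintext highlighter-rouge">syclcc</code> is on the path, a full invocation for the example above might look like this (file names are illustrative):</p>

```shell
# Compile one binary with kernels for the OpenMP CPU backend
# and for AMD gfx906 GPUs, using the new unified target syntax.
syclcc --hipsycl-targets="omp;hip:gfx906" -O2 -o app app.cpp
```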

<p>The new runtime also introduces a lot of other features, such as memory management at a granularity below buffer size, and a different model for SYCL queues. These features are mainly important for future development and, at least for now, have limited impact for end users. The <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">release page</a> lists some more features.</p>

<h1 id="big-performance-improvements-for-nd_range-parallel_for-on-cpus">Big performance improvements for nd_range parallel_for on CPUs</h1>

<p>One of the more apparent improvements is the performance of <code class="language-plaintext highlighter-rouge">nd_range</code> <code class="language-plaintext highlighter-rouge">parallel_for</code> on CPUs. <code class="language-plaintext highlighter-rouge">nd_range</code> <code class="language-plaintext highlighter-rouge">parallel_for</code> is notoriously difficult to implement for pure-library CPU backends such as hipSYCL’s OpenMP backend, because the <code class="language-plaintext highlighter-rouge">nd_range</code> model allows for explicit work group barriers and collective group algorithms. This in turn requires independent forward-progress guarantees for each work item, which does not map well to CPUs - at least with the methods that pure C++ provides.</p>

<p>In hipSYCL 0.9.0, we transition from using threads for work items to a hybrid approach where multithreading is only used across work groups, and work items are represented using fibers (lightweight userspace threads with cooperative scheduling).</p>

<p>As an additional optimization, hipSYCL first attempts to execute a work group with a single fiber and a loop across work items, which the compiler may be able to vectorize. Only when independent forward progress for work items is actually needed, for example when a work group barrier or a collective group algorithm is encountered, will hipSYCL dynamically switch to a model where each work item is mapped to its own fiber. 
If no barrier is encountered, performance is similar to the hierarchical or basic parallel for execution models, which can be implemented very efficiently in pure-library SYCL backends.</p>
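<p>To see why barriers are the crux, here is a plain C++ sketch (not hipSYCL's actual implementation, and with hypothetical names): a pure-library backend can run a work group as a simple loop over work items, but a barrier forces that loop to be split, because every work item must reach the barrier before any may proceed past it. Work-item-local state that has to survive such a split is what makes one fiber per work item, each with its own stack, the general fallback:</p>

```cpp
#include <array>
#include <cstddef>

// A work group of 4 work items, executed by a single host thread.
constexpr std::size_t group_size = 4;

// The "kernel": each work item writes its value to scratch, hits a
// barrier, then reads a neighbour's slot. Without the barrier, item 0
// could read scratch[1] before item 1 has written it.
std::array<int, group_size> run_group_with_barrier() {
  std::array<int, group_size> scratch{};
  std::array<int, group_size> result{};

  // Loop over all work items up to the barrier.
  for (std::size_t lid = 0; lid < group_size; ++lid)
    scratch[lid] = static_cast<int>(lid) + 1;

  // The barrier becomes the point where the loop is split: the first
  // loop has finished for *all* work items before the second starts.

  // Loop over all work items after the barrier.
  for (std::size_t lid = 0; lid < group_size; ++lid)
    result[lid] = scratch[lid] + scratch[(lid + 1) % group_size];

  return result;
}
```

<p>Both loops are barrier-free and thus candidates for compiler vectorization; only when work-item-local variables would have to live across the split does the backend need genuinely independent execution contexts per work item.</p>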

<p>Overall, compared to hipSYCL 0.8.0, performance can increase by several orders of magnitude for typical workloads.</p>

<h1 id="extensions">Extensions</h1>

<p>hipSYCL 0.9.0 also introduces a couple of new SYCL extensions, most notably a new execution model: Scoped parallelism allows for a performance-portable formulation of kernels across backends that still provides access to lower-level features such as local memory or work group barriers that are otherwise only available in <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for. See the <a href="https://github.com/illuhad/hipSYCL/blob/develop/doc/scoped-parallelism.md">documentation</a> for more information. 
In short, the idea behind scoped parallelism is to distinguish between the user-requested <em>logical parallelism</em> within a work group, which describes the number of work items that should be processed in a group, and the implementation-provided <em>physical parallelism</em>, which refers to the actual work group parallelism running in the backend. In scoped parallelism, the SYCL implementation decides on a degree of physical parallelism that is well suited for the hardware, and then distributes the logical work items across the physical resources. This additional freedom for the SYCL implementation to choose the actual work group parallelism is what makes the execution model more performance-portable than <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for.</p>

<p>As a brief teaser, this is what work group reduction using local memory looks like in scoped parallelism:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include</span> <span class="cpf">&lt;SYCL/sycl.hpp&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span>
  
  <span class="n">sycl</span><span class="o">::</span><span class="n">queue</span> <span class="n">q</span><span class="p">;</span>
  
  <span class="n">std</span><span class="o">::</span><span class="kt">size_t</span> <span class="n">input_size</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="kt">int</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">sycl</span><span class="o">::</span><span class="n">malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
  
  <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">input_size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
    <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
  
  <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">Group_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
  <span class="n">q</span><span class="p">.</span><span class="n">parallel</span><span class="p">(</span>
    <span class="c1">//number of groups</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">input_size</span> <span class="o">/</span> <span class="n">Group_size</span><span class="p">},</span>
    <span class="n">sycl</span><span class="o">::</span><span class="n">range</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">{</span><span class="n">Group_size</span><span class="p">},</span> <span class="c1">//logical group size</span>
    <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">group</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">grp</span><span class="p">,</span> 
        <span class="n">sycl</span><span class="o">::</span><span class="n">physical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">physical_idx</span><span class="p">){</span>
      <span class="c1">// Code in this scope is executed</span>
      <span class="c1">// within the implementation-defined</span>
      <span class="c1">// physical iteration space.</span>

      <span class="c1">// Local memory can be allocated using the</span>
      <span class="c1">// sycl::local_memory extension.</span>
      <span class="n">sycl</span><span class="o">::</span><span class="n">local_memory</span><span class="o">&lt;</span><span class="kt">int</span> <span class="p">[</span><span class="n">Group_size</span><span class="p">]</span><span class="o">&gt;</span> <span class="n">scratch</span><span class="p">{</span><span class="n">grp</span><span class="p">};</span>
      
      <span class="c1">// `distribute_for` distributes the logical,</span>
      <span class="c1">// user-provided iteration space across the</span>
      <span class="c1">// physical one from the outer scope</span>
      <span class="n">grp</span><span class="p">.</span><span class="n">distribute_for</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">sub_group</span> <span class="n">sg</span><span class="p">,</span>
                             <span class="n">sycl</span><span class="o">::</span><span class="n">logical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
          <span class="n">scratch</span><span class="p">[</span><span class="n">idx</span><span class="p">.</span><span class="n">get_local_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)]</span> <span class="o">=</span>
                <span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">.</span><span class="n">get_global_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)];</span>
      <span class="p">});</span> 
      <span class="c1">// implicit barrier here</span>

      <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">Group_size</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">){</span>
        <span class="n">grp</span><span class="p">.</span><span class="n">distribute_for</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">sycl</span><span class="o">::</span><span class="n">sub_group</span> <span class="n">sg</span><span class="p">,</span>
                               <span class="n">sycl</span><span class="o">::</span><span class="n">logical_item</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span> <span class="n">idx</span><span class="p">){</span>
          <span class="kt">size_t</span> <span class="n">lid</span> <span class="o">=</span> <span class="n">idx</span><span class="p">.</span><span class="n">get_local_id</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
          <span class="k">if</span><span class="p">(</span><span class="n">lid</span> <span class="o">&lt;</span> <span class="n">i</span><span class="p">)</span>
            <span class="n">scratch</span><span class="p">[</span><span class="n">lid</span><span class="p">]</span> <span class="o">+=</span> <span class="n">scratch</span><span class="p">[</span><span class="n">lid</span><span class="o">+</span><span class="n">i</span><span class="p">];</span>
        <span class="p">});</span>
      <span class="p">}</span>
      
      <span class="n">grp</span><span class="p">.</span><span class="n">single_item</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](){</span>
        <span class="n">data</span><span class="p">[</span><span class="n">grp</span><span class="p">.</span><span class="n">get_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">*</span><span class="n">Group_size</span><span class="p">]</span> <span class="o">=</span> <span class="n">scratch</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
      <span class="p">});</span>
    <span class="p">});</span>
  
  <span class="n">q</span><span class="p">.</span><span class="n">wait</span><span class="p">();</span>
  <span class="c1">// Use results here</span>
  <span class="c1">// ...</span>
  <span class="n">sycl</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">q</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>Scoped parallelism can be implemented both on top of <code class="language-plaintext highlighter-rouge">nd_range</code> parallel for, for backends that support it well, and on top of hierarchical parallel for from SYCL 1.2.1. It can therefore be seen as a generalization or abstraction of both models.</p>

<p><em>(Note: We might adapt the interface of scoped parallelism slightly in the future to align better with some of the patterns found in the final SYCL 2020 specification)</em></p>

<h1 id="get-it">Get it!</h1>

<p>If I have piqued your interest in hipSYCL, head over to the <a href="https://github.com/illuhad/hipSYCL">GitHub repository</a> and download the <a href="https://github.com/illuhad/hipSYCL/releases/tag/v0.9.0">release</a>, or for even more new features, clone the repository from the <code class="language-plaintext highlighter-rouge">develop</code> branch!</p>]]></content><author><name>Aksel Alpay</name></author><category term="hipsycl" /><category term="sycl2020" /><category term="release" /><summary type="html"><![CDATA[On December 10, 2020, hipSYCL 0.9.0 was released. This release is significant for several reasons. As we are now on the final trajectory to releasing hipSYCL 0.9.1 as another big update, I felt it would be useful to take a step back and look at some of the highlights of what is already in 0.9.0 - ready for everybody to use.]]></summary></entry></feed>