Jekyll2023-01-11T08:41:30+00:00https://willtrojak.org/feed.xmlWill TrojakDubiner Basis for Regular Tetrahedrons2022-03-30T15:00:12+00:002022-03-30T15:00:12+00:00https://willtrojak.org/polynomials/2022/03/30/duniner_basis_tet<p>I have recently been working on extending stable flux reconstruction methods to
polygons and polyhedra. One of the difficulties in doing this is imposing the
symmetry conditions. This can be done, but is made much easier if a regular
reference element is used. However, for tetrahedra, it is common to use an
irregular reference tetrahedron, specifically one with nodes:</p>
\[[-1, -1, -1], \; [1, -1, -1],\; [-1,1,-1],\; [-1,-1,1]\]
<p>The basis set out in Hesthaven and Warburton (2008) uses these points, but as I
say this isn’t very helpful for imposing rotational symmetries. Instead, I would
like to use the regular tetrahedron defined by the points:</p>
\[[\sqrt{8/9}, 0, -1/3],\; [-\sqrt{2/9}, \sqrt{2/3}, -1/3],\; [-\sqrt{2/9},-\sqrt{2/3}, -1/3],\; [0,0,1]\]
<p>I couldn’t find a reference that had considered a basis on these points before,
and, since making the basis was a bit fiddly, I decided to write it down for others.</p>
\[\phi_{i,j,k}(x,y,z) = \frac{2(c-1)^{i+j}(1-a)^j}{\sqrt{3}}J^{(0,0)}_{i}(a){J}^{(2i+1,0)}_j(b){J}^{(2i+2j+2,0)}_k(c)\]
<p>where \(J^{(\alpha,\beta)}_n\) are normalised Jacobi polynomials and</p>
\[a = -\frac{(4\sqrt{2}x + z - 1)}{3(z - 1)},\; b = -\frac{2\sqrt{3}y}{2x + \sqrt{2}(z - 1)},\; c = \frac{3z - 1}{2}.\]
<p>In the transformed coordinate system of \(a\), \(b\), and \(c\), the domain is
\([-1,1]^3\). One disadvantage of this basis is that it isn’t quite normalised,
as I couldn’t work the normalisation out. It is probably non-trivial due to
the \(x\) and \(z\) dependence in \(a\) and \(b\). However, this isn’t such a
big issue, as it just means you have to calculate the mass matrix explicitly,
which is a trade-off I’m willing to make for the utility this basis offers.</p>PyFR performance on Mac M1 Chip2021-06-17T04:26:25+00:002021-06-17T04:26:25+00:00https://willtrojak.org/pyfr/m1/2021/06/17/pyfr-M1-mac<p>The new processor being used by Apple is, as many will know by now, their own
custom ARM architecture. Claims aplenty have been made about its performance,
but when I hear these claims all I really want to know is: how fast can it run a
<a href="https://doi.org/10.1098/rspa.1937.0036">Taylor–Green vortex</a>?</p>
<p>To answer this I am going to turn to the high-order numerical method, flux
reconstruction. The reason being that I like it and have access to some
performant codes for it, in particular, <a href="https://github.com/PyFR/PyFR">PyFR</a>. Now, getting PyFR to work
on macOS 11 with the M1 chip was a bit of a faff and, for my own record in case I
brick my machine, I’ll document the standout steps.</p>
<ol>
<li>Install homebrew etc.</li>
<li>Use Homebrew to install: GCC, python3.9, open-mpi, hdf5, and numpy</li>
<li>In the dependencies of PyFR the really tricky one is h5py. Even though at the
time of writing, the homebrew numpy version was 1.20.3, h5py tries to build
numpy 1.19.3, and there seem to be some issues with compiling numpy (hence why
I’m using the brew version). The workaround I came up with was to clone the
h5py git repo and bump the numpy version number in setup.py, then install
this local version. I didn’t have any issues doing this, but PyFR doesn’t use
the full h5py feature set. (You may need to upgrade pip, setuptools, etc.)</li>
<li>Now you can do <code class="language-plaintext highlighter-rouge">pip install pyfr</code> to get the last of the dependencies, and
either use that, or uninstall and use a git clone (this is what I did).</li>
</ol>
<p>One of the big selling points of the M1 was the neural engine, a 16-core
accelerator aimed at applying neural networks. Apple doesn’t make it easy to use
the Neural Engine; I did have a cursory look at making a backend, but this
seemed far too involved for the limited hardware available. Moreover, PyFR is
generally memory bandwidth bound and so it likely wouldn’t benefit much from the
ANE. Sadly, I couldn’t find a detailed profiler to confirm that it is bandwidth
bound on the M1, as there is nothing like VTune or Nsight for Mac hardware, but
it seems a reasonable assumption given its behaviour on other architectures. As
a result of all this, I used the OpenMP backend; hence, I needed GCC. I also
used GiMMiK for matrix multiplication, which is probably the best option in
this case.</p>
<p>To measure performance, we can use the following metric (lower is better):</p>
\[\frac{\text{Runtime}}{\text{DoF}\times\text{RK steps}}.\]
<p>The specific setup I used I previously detailed <a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">here</a>, but here I
reduced the number of time steps, as a Mac isn’t a supercomputer and there was no
need to run it overnight. This case uses a \(p=3\) hexahedral mesh, which leads to
quite sparse operators in FR, and hence is why GiMMiK is the best option for the
matmul.</p>
<p>Using all 8 cores this was the result:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>233.399</td>
<td>441.365</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>28.49</td>
<td>53.88</td>
</tr>
</tbody>
</table>
<p>If instead I use just 4 threads, the performance is reduced, but not by much.
This is not atypical for OpenMP, as threads will end up spending time waiting
on other threads.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>244.541</td>
<td>460.131</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>29.85</td>
<td>56.17</td>
</tr>
</tbody>
</table>
<p>Something you can do instead is set the thread scheduler to dynamic, which
will allocate chunks of the loop to cores as they become available.
For this I used the default OpenMP chunk size. As you can see below, this was
somewhere between the static 8-core and static 4-core performance, so it seems
that the overhead of the dynamic allocation isn’t worth it in this instance.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>237.116</td>
<td>444.979</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>28.94</td>
<td>54.32</td>
</tr>
</tbody>
</table>
<p>To wrap up, the performance doesn’t seem that bad, all things considered. In
<a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">this</a> post I showed the results for the new Nvidia A100 GPUs and, for
reference, they seem to be about 21.6 times more performant than a single M1
chip. In the future, a cache-blocking update to PyFR will be pushed for the full
Navier–Stokes equations. This should give a reasonable performance bump to
CPUs, so that is something to watch for. Either way, with the memory bandwidth
on CPUs continuing to improve, for bandwidth-bound applications such as PyFR,
CPUs seem to be becoming more competitive.</p>Cuda Binary Partitions and Pipelines2021-06-09T22:26:11+00:002021-06-09T22:26:11+00:00https://willtrojak.org/cuda/2021/06/09/cuda-pipe-bin-part<p>Something I have recently been working on is fusing two GPU kernels in PyFR: one
kernel is a pointwise kernel and the other is a matrix multiplication kernel.
For more background you can watch this <a href="https://doi.org/10.52843/cassyni.2x9rkc">talk</a>. Both these kernels are
memory bandwidth bound, and so to increase speed we can reduce going out to main
memory by using shared memory.</p>
<p>Some background on shared memory: it sits at the same level as the L1 cache
and hence has much higher bandwidth than global memory, but, unlike cache, the
user can explicitly perform load and store operations on it. However, to load
something into shared
from global, the compiler will first load it from global into a register, and
then from the register to shared. The reason, at least as far as I can see, for
doing this is that shared memory is shared between threads in a block, and only
after a thread sync will it be guaranteed that the value will be resident in
shared. Therefore, putting it in a register would give the compiler more
flexibility when optimising. However, this doesn’t necessarily fit with what an
engineer might want.</p>
<p>Enter the Ampere series of GPUs by Nvidia. The interesting thing that was
introduced with Ampere was the ability to bypass the register stage, and even the
L1 and L2 caches, when <em>loading</em> global into shared. To achieve this you currently
have to make use of the <code class="language-plaintext highlighter-rouge">memcpy_async</code> functionality added in CUDA 11. There are
a couple of ways to use this but, at least to me, the most interesting
is pipelines.</p>
<p>A pipeline is a feature exposed to Volta (<code class="language-plaintext highlighter-rouge">sm_70</code>) and later GPUs; it is a
queue that can have multiple stages. Producers add jobs to the tail of the queue
and consumers remove jobs from the head. As the names suggest, producers
‘produce’ data to be used by the consumers. Why might you want to do this? Well,
Ampere has dedicated hardware to do the load into shared that bypasses
registers/cache. A simple example is shown below:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">__restrict__</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">block</span> <span class="o">=</span> <span class="n">cg</span><span class="o">::</span><span class="n">this_thread_block</span><span class="p">();</span>
<span class="k">extern</span> <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">s</span><span class="p">[];</span>
<span class="n">constexpr</span> <span class="kt">size_t</span> <span class="n">stages</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">constexpr</span> <span class="k">auto</span> <span class="n">scope</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">thread_scope</span><span class="o">::</span><span class="n">thread_scope_block</span><span class="p">;</span>
<span class="n">__shared__</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_shared_state</span><span class="o"><</span><span class="n">scope</span><span class="p">,</span> <span class="n">stages</span><span class="o">></span> <span class="n">shared_state</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_acquire</span><span class="p">();</span>
<span class="n">cuda</span><span class="o">::</span><span class="n">memcpy_async</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="n">g</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">pipe</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_commit</span><span class="p">();</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_wait</span><span class="p">();</span>
<span class="c1">//Some compute</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_release</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>This is a single-stage pipeline, where each thread simply loads two floats
from <code class="language-plaintext highlighter-rouge">g</code> into <code class="language-plaintext highlighter-rouge">s</code>. This works in chunks, so thread 0 will load <code class="language-plaintext highlighter-rouge">g[0]</code> and <code class="language-plaintext highlighter-rouge">g[1]</code>
into <code class="language-plaintext highlighter-rouge">s[0]</code> and <code class="language-plaintext highlighter-rouge">s[1]</code>, respectively. (This didn’t seem to be obviously
documented at the time I wrote this).</p>
<p>You can use this feature on Volta, but you don’t get the hardware acceleration
that Ampere has. For my application, what I wanted was to have some threads
working as producers and some as consumers; currently, all threads are both. To
achieve this it made the most sense to use the binary partition feature. We
start by defining the roles, for example like this:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">role</span> <span class="o">=</span> <span class="p">((</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span> <span class="o">:</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">;</span></code></pre></figure>
<p>This makes even threads producers and odd threads consumers. We can then pass
this when we make the pipeline to get what we want, for example:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">,</span> <span class="n">role</span><span class="p">);</span></code></pre></figure>
<p>Now, if you make those modifications to the simple <code class="language-plaintext highlighter-rouge">memcpy_async</code> example above, it
will hang at the consumer wait. What is going on? Well, there is nothing
currently stopping the threads that we want to be exclusively consumers from
executing the producer part. According to the C++ API documentation on git, the
behaviour in this case is undefined. But looking at the source, it seems that the
consumer threads get stuck waiting on a copy that never happens.</p>
<p>Instead, you have to add some protection to the producer and consumer
statements. So the complete example would be:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">block</span> <span class="o">=</span> <span class="n">cg</span><span class="o">::</span><span class="n">this_thread_block</span><span class="p">();</span>
<span class="k">extern</span> <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">s</span><span class="p">[];</span>
<span class="n">constexpr</span> <span class="kt">size_t</span> <span class="n">stages</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">constexpr</span> <span class="k">auto</span> <span class="n">scope</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">thread_scope</span><span class="o">::</span><span class="n">thread_scope_block</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">role</span> <span class="o">=</span> <span class="p">((</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span> <span class="o">:</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">;</span>
<span class="n">__shared__</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_shared_state</span><span class="o"><</span><span class="n">scope</span><span class="p">,</span> <span class="n">stages</span><span class="o">></span> <span class="n">shared_state</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">,</span> <span class="n">role</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">role</span> <span class="o">==</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_acquire</span><span class="p">();</span>
<span class="n">cuda</span><span class="o">::</span><span class="n">memcpy_async</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="n">g</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">pipe</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_commit</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">if</span><span class="p">(</span><span class="n">role</span> <span class="o">==</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_wait</span><span class="p">();</span>
<span class="c1">//Some compute</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_release</span><span class="p">();</span>
<span class="p">}</span> </code></pre></figure>
<p>I thought I would add this clarification, mainly as it caused me some issues and
the feature seemed to be a bit under-documented. You might be wondering how this
performed in my application: well, it seemed to lead to significant branch
divergence, which killed performance. It also seems to me that although
<code class="language-plaintext highlighter-rouge">memcpy_async</code> is supported on Volta, you really don’t get the benefits.
However, in my experience with A100s, it seems that the asynchronous paradigm
will prove to be quite important, but due to the dedicated hardware the
method I just described may not be that useful. More testing required.</p>Array pointers in F902021-06-08T01:31:21+00:002021-06-08T01:31:21+00:00https://willtrojak.org/fortran/2021/06/08/f90-array-pointers<p>I was recently playing around with pointers in Fortran; what I wanted to achieve
was an array of pointers where each pointed to a different element in an array.
In C/C++ this is simple to achieve, something a bit like this, for example:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="kt">float</span> <span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">b</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.;</span> <span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span><span class="p">.;</span>
<span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">&</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="o">&</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span></code></pre></figure>
<p>However, this isn’t natively supported in Fortran at the moment. This is perhaps
with good reason: in Fortran, by assuming that pointers to an array point to a
contiguous part of that array, aliasing is avoided, meaning the compiler can make
certain assumptions. An example of a pointer in Fortran would be:</p>
<figure class="highlight"><pre><code class="language-fortran" data-lang="fortran"><span class="w"> </span><span class="kt">real</span><span class="p">,</span><span class="w"> </span><span class="k">target</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="w">
</span><span class="kt">real</span><span class="p">,</span><span class="w"> </span><span class="k">pointer</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">b</span><span class="p">(:)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">4</span><span class="p">:</span><span class="mi">6</span><span class="p">)</span></code></pre></figure>
<p>To achieve the behaviour I’m interested in, one method is to declare a derived
type and then make an array of that type. For example, to match the behaviour of
the earlier C/C++ example, you could do:</p>
<figure class="highlight"><pre><code class="language-fortran" data-lang="fortran"><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="n">real_ptr</span><span class="w">
</span><span class="kt">real</span><span class="p">(</span><span class="nb">kind</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span><span class="w"> </span><span class="k">pointer</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">p</span><span class="w">
</span><span class="k">end</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="n">real_ptr</span><span class="w">
</span><span class="kt">real</span><span class="p">(</span><span class="nb">kind</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span><span class="w"> </span><span class="k">target</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="w">
</span><span class="k">type</span><span class="p">(</span><span class="n">real_ptr</span><span class="p">)</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="n">p</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">%</span><span class="n">p</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p>You might rightly ask: how does this perform? Surely using a derived type and
having to invoke a bit more machinery can’t be that performant? Well, below
is the interesting part of the assembly:</p>
<figure class="highlight"><pre><code class="language-asm" data-lang="asm"> movss xmm0, DWORD PTR .LC0[rip]
movss DWORD PTR [rbp-8], xmm0
movss xmm0, DWORD PTR .LC1[rip]
movss DWORD PTR [rbp-4], xmm0
lea rax, [rbp-8]
add rax, 4
mov QWORD PTR [rbp-32], rax
lea rax, [rbp-8]
mov QWORD PTR [rbp-24], rax</code></pre></figure>
<p>This was compiled with GCC 8.4.0 on an Intel-based system. The interesting bit
is that, barring some additional standard setup required by Fortran, the
assembly is <em>exactly</em> the same as that produced by the C/C++ version. So, to
answer the question of whether this approach is performant in Fortran: it is as
performant as C/C++ in this case.</p>
<p>I also tried this on an Arm-based system, but the differences were more
significant. Frankly, though, I put this down to the Fortran compiler for Arm.</p>