The new processor being used by Apple is, as many will know by now, their own custom ARM architecture. Claims aplenty have been made about its performance, but when I hear these claims all I really want to know is: how fast can it run a Taylor–Green vortex?

To answer this I am going to turn to the high-order numerical method flux reconstruction. The reason being that I like it and have access to some performant codes for it, in particular PyFR. Now, getting PyFR to work on macOS 11 with the M1 chip was a bit of a faff, and for my own record, in case I brick my machine, I'll document the standout steps.

  1. Install Homebrew etc.
  2. Use Homebrew to install: GCC, python3.9, open-mpi, hdf5, and numpy.
  3. Of PyFR's dependencies, the really tricky one is h5py. Even though, at the time of writing, the Homebrew numpy version was 1.20.3, h5py tries to build against numpy 1.19.3, and there seem to be some issues compiling numpy from source on the M1 (hence why I'm using the brew version). The workaround I came up with was to clone the h5py git repo, bump the numpy version number in setup.py, and then install that local version. I didn't have any issues doing this, but PyFR doesn't use the full h5py feature set. (You may need to upgrade pip, setuptools, etc.)
  4. Now you can do pip install pyfr to get the last of the dependencies, and either use that or uninstall it and use a git clone (this is what I did). A condensed sketch of these steps is given below.
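
For my own reference, here is a rough sketch of the above in shell form; the package names and the setup.py tweak reflect the situation at the time of writing, so treat it as a guide rather than a recipe:

```bash
# Core dependencies from Homebrew
brew install gcc python@3.9 open-mpi hdf5 numpy

# h5py workaround: build a local copy with the numpy pin bumped
git clone https://github.com/h5py/h5py.git
cd h5py
# (edit setup.py here to bump the pinned numpy version)
export HDF5_DIR="$(brew --prefix hdf5)"   # point the h5py build at the brewed HDF5
pip3 install .
cd ..

# Remaining dependencies (or swap this for a git clone of PyFR)
pip3 install pyfr
```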

One of the big selling points of the M1 was the Neural Engine, a 16-core accelerator aimed at applying neural networks. Apple doesn't make it easy to use the Neural Engine; I did have a cursory look at writing a backend for it, but this seemed far too involved for the limited hardware available. Moreover, PyFR is generally memory-bandwidth bound, and so it likely wouldn't benefit much from the ANE. Sadly, I couldn't find a detailed profiler to confirm that it is bandwidth bound on the M1 (there is nothing like VTune or Nsight for Mac hardware), but it seems a reasonable assumption. As a result of all this, I used the OpenMP backend, hence the need for GCC. I also used GiMMiK for the matrix multiplications, which is probably the best option in this case.
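
With everything installed, running the case on the OpenMP backend looks something like the following; the mesh and config file names are just placeholders for the Taylor–Green setup, and the thread count is controlled through the standard OpenMP environment variable:

```bash
# Use all eight cores (4 for the four-thread runs further down)
export OMP_NUM_THREADS=8

# Run on the OpenMP (CPU) backend with a progress bar
pyfr run -b openmp -p tgv.pyfrm tgv.ini
```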

To measure performance we can use the following metric:

\[\frac{\text{Runtime}}{\text{DoF}\times\text{RHS evaluations}}.\]

The specific setup I used is the one I previously detailed here, but with a reduced number of time steps, as a Mac isn't a supercomputer and there was no need to run it overnight. This case uses a $p=3$ hexahedral mesh, which leads to quite sparse operators in FR, and hence why GiMMiK is the best option for the matmuls.

Using all 8 cores this was the result:

|             | Single  | Double  |
|-------------|---------|---------|
| Runtime [s] | 233.399 | 441.365 |
| DoF         | 4096000 | 4096000 |
| RHS         | 2000    | 2000    |
| ns/DoF/RHS  | 28.49   | 53.88   |
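
As a sanity check on the metric, the single precision entry works out as

\[\frac{233.399\,\text{s}}{4096000\times 2000}\approx 28.49\,\text{ns}.\]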

If instead I just used 4 threads, the performance was reduced, but not by much. This is not untypical for OpenMP with static scheduling: the faster threads end up spending time waiting on the slower ones (on the M1, four of the eight cores are lower-performance efficiency cores), so doubling the thread count doesn't come close to doubling the throughput.

|             | Single  | Double  |
|-------------|---------|---------|
| Runtime [s] | 244.541 | 460.131 |
| DoF         | 4096000 | 4096000 |
| RHS         | 2000    | 2000    |
| ns/DoF/RHS  | 29.85   | 56.17   |

Something you can do instead is set the thread schedule to dynamic, which allocates chunks of the loop to cores as they become available. For this I used the default OpenMP chunk size. As you can see below, the result was somewhere between the static 8-core and static 4-core performance, so it seems that the overhead of the dynamic allocation isn't worth it in this instance.
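
I won't claim this is exactly how the schedule gets switched in PyFR's generated kernels, but assuming the relevant loops are compiled with schedule(runtime) (an assumption on my part), the schedule and chunk size can be chosen at run time via the standard OpenMP environment variable:

```bash
# Dynamic scheduling with the default chunk size (for dynamic this is 1)
export OMP_SCHEDULE="dynamic"
```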

|             | Single  | Double  |
|-------------|---------|---------|
| Runtime [s] | 237.116 | 444.979 |
| DoF         | 4096000 | 4096000 |
| RHS         | 2000    | 2000    |
| ns/DoF/RHS  | 28.94   | 54.32   |

To wrap it up, the performance doesn't seem that bad, all things considered. In a previous post I showed results for the new Nvidia A100 GPUs and, for reference, they seem to be about 21.6 times more performant than a single M1 chip. In the future, a cache-blocking update to PyFR will be pushed for the full Navier–Stokes equations; this should give a reasonable performance bump on CPUs, so that is something to watch for. Either way, with the memory bandwidth of CPUs continuing to improve, CPUs seem to be becoming more competitive for bandwidth-bound applications such as PyFR.