Jekyll2023-01-11T08:41:30+00:00https://willtrojak.org/feed.xmlWill TrojakDubiner Basis for Regular Tetrahedrons2022-03-30T15:00:12+00:002022-03-30T15:00:12+00:00https://willtrojak.org/polynomials/2022/03/30/duniner_basis_tet<p>I have recently been working on extending stable flux reconstruction methods to
polygons and polyhedra. One of the difficulties in doing this is imposing the
symmetry conditions. This can be done, but is made much easier if a regular
reference element is used. However, for tetrahedra, it is common to use an
irregular reference tetrahedron, specifically one with nodes:</p>
\[[-1, -1, -1], \; [1, -1, -1],\; [-1,1,-1],\; [-1,-1,1]\]
<p>The basis set out in Hesthaven and Warburton (2008) uses these points, but as I
say this isn’t very helpful for imposing rotational symmetries. Instead, I would
like to use the regular tetrahedron defined by the points:</p>
\[[\sqrt{8/9}, 0, -1/3],\; [-\sqrt{2/9}, \sqrt{2/3}, -1/3],\; [-\sqrt{2/9},-\sqrt{2/3}, -1/3],\; [0,0,1]\]
<p>I couldn’t find a reference that had considered a basis on these points before,
and, since making the basis was a bit fiddly, I decided to write it down for others.</p>
\[\phi_{i,j,k}(x,y,z) = \frac{2(c-1)^{i+j}(1-a)^j}{\sqrt{3}}J^{(0,0)}_{i}(a){J}^{(2i+1,0)}_j(b){J}^{(2i+2j+2,0)}_k(c)\]
<p>where \(J^{(\alpha,\beta)}_n\) are normalised Jacobi polynomials and</p>
\[a = -\frac{(4\sqrt{2}x + z - 1)}{3(z - 1)},\; b = -\frac{2\sqrt{3}y}{2x + \sqrt{2}(z - 1)},\; c = \frac{3z - 1}{2}.\]
<p>In the transformed coordinate system of \(a\), \(b\), and \(c\), the domain is
\([-1,1]^3\). One disadvantage of this basis is that it isn’t quite normalised,
as I couldn’t work the normalisation out. It is probably non-trivial due to
the \(x\) and \(z\) dependence in \(a\) and \(b\). However, this isn’t such a
big issue, as it just means you have to calculate the mass matrix explicitly,
which is a trade-off I’m willing to make for the utility this basis offers.</p>PyFR performance on Mac M1 Chip2021-06-17T04:26:25+00:002021-06-17T04:26:25+00:00https://willtrojak.org/pyfr/m1/2021/06/17/pyfr-M1-mac<p>The new processor being used by Apple is, as many will know by now, their own
custom ARM architecture. Claims aplenty have been made about its performance,
but when I hear these claims all I really want to know is: how fast can it run a
<a href="https://doi.org/10.1098/rspa.1937.0036">Taylor–Green vortex</a>?</p>
<p>To answer this I am going to turn to the high-order numerical method, flux
reconstruction. The reason being that I like it and have access to some
performant codes for it, in particular, <a href="https://github.com/PyFR/PyFR">PyFR</a>. Now, getting PyFR to work
on macOS 11 with the M1 chip was a bit of a faff and, for my own record in case I
brick my machine, I’ll document the standout steps.</p>
<ol>
<li>Install homebrew etc.</li>
<li>Use Homebrew to install: GCC, python3.9, open-mpi, hdf5, and numpy</li>
<li>In the dependencies of PyFR the really tricky one is h5py. Even though at the
time of writing, the homebrew numpy version was 1.20.3, h5py tries to build
numpy 1.19.3, and there seem to be some issues with compiling numpy (hence why
I’m using the brew version). The workaround I came up with was to clone the
h5py git repo and bump the numpy version number in setup.py, then install
this local version. I didn’t have any issues doing this, but PyFR doesn’t use
the full h5py feature set. (You may need to upgrade pip, setuptools, etc.)</li>
<li>Now you can do <code class="language-plaintext highlighter-rouge">pip install pyfr</code> to get the last of the dependencies, and
either use that, or uninstall and use a git clone (this is what I did).</li>
</ol>
<p>One of the big selling points of the M1 was the neural engine, a 16-core
accelerator aimed at applying neural networks. Apple doesn’t make it easy to use
the Neural Engine; I did have a cursory look at making a backend, but this
seemed far too involved for the limited hardware available. Moreover, PyFR is
generally memory bandwidth bound and so it likely wouldn’t benefit much from the
ANE. Sadly, I couldn’t find a detailed profiler to confirm that it is bandwidth
bound on the M1, as there is nothing like VTune or Nsight for Mac hardware, but
it seems a reasonable assumption given its behaviour on other architectures. As
a result of all this, I used the OpenMP backend; hence, I needed GCC. I also
used GiMMiK for matrix multiplication, which is probably the best option in
this case.</p>
<p>To measure performance, we can use the following metric (lower is better):</p>
\[\frac{\text{Runtime}}{\text{DoF}\times\text{RK steps}}.\]
<p>The specific setup I used I previously detailed <a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">here</a>, but here I
reduced the number of time steps, as a Mac isn’t a supercomputer and there was no
need to run it overnight. This case uses a \(p=3\) hexahedral mesh, which leads to
quite sparse operators in FR, and hence is why GiMMiK is the best option for the
matmul.</p>
<p>Using all 8 cores this was the result:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>233.399</td>
<td>441.365</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>28.49</td>
<td>53.88</td>
</tr>
</tbody>
</table>
<p>If instead I use just 4 threads, the performance is reduced, but not by much.
This is not atypical for OpenMP, as threads will end up spending time waiting
on other threads.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>244.541</td>
<td>460.131</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>29.85</td>
<td>56.17</td>
</tr>
</tbody>
</table>
<p>Something you can do instead is set the thread scheduler to dynamic, which
will allocate chunks of the loop to cores as they become available.
For this I used the default OpenMP chunk size. As you can see below, this was
somewhere between the static 8-core and static 4-core performance, so it seems
that the overhead of the dynamic allocation isn’t worth it in this instance.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Single</th>
<th>Double</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime [s]</td>
<td>237.116</td>
<td>444.979</td>
</tr>
<tr>
<td>DoF</td>
<td>4096000</td>
<td>4096000</td>
</tr>
<tr>
<td>RHS</td>
<td>2000</td>
<td>2000</td>
</tr>
<tr>
<td>ns/DoF/RHS</td>
<td>28.94</td>
<td>54.32</td>
</tr>
</tbody>
</table>
<p>To wrap up, the performance doesn’t seem that bad, all things considered. In
<a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">this</a> post I showed the results for the new Nvidia A100 GPUs and, for
reference, they seem to be about 21.6 times more performant than a single M1
chip. In the future, a cache-blocking update to PyFR will be pushed for the full
Navier–Stokes equations. This should give a reasonable performance bump to
CPUs, so that is something to watch for. Either way, with the memory bandwidth
on CPUs continuing to improve, for bandwidth-bound applications such as PyFR,
CPUs seem to be becoming more competitive.</p>Cuda Binary Partitions and Pipelines2021-06-09T22:26:11+00:002021-06-09T22:26:11+00:00https://willtrojak.org/cuda/2021/06/09/cuda-pipe-bin-part<p>Something I have recently been working on is fusing two GPU kernels in PyFR: one
kernel is a pointwise kernel and the other is a matrix multiplication kernel.
For more background you can watch this <a href="https://doi.org/10.52843/cassyni.2x9rkc">talk</a>. Both these kernels are
memory bandwidth bound, and so to increase speed we can reduce going out to main
memory by using shared memory.</p>
<p>Some background on shared memory: it sits at the same level as the L1 cache
and hence has much higher bandwidth than global memory, but, unlike cache, the
user can explicitly perform load and store operations on it. However, to load
something into shared
from global, the compiler will first load it from global into a register, and
then from the register to shared. The reason, at least as far as I can see, for
doing this is that shared memory is shared between threads in a block, and only
after a thread sync will it be guaranteed that the value will be resident in
shared. Therefore, putting it in a register would give the compiler more
flexibility when optimising. However, this doesn’t necessarily fit with what an
engineer might want.</p>
<p>Enter the Ampere series of GPUs by Nvidia. The interesting thing that was
introduced with Ampere was the ability to bypass the register stage, and even the
L1 and L2 caches, when <em>loading</em> global into shared. To achieve this you currently
have to make use of the <code class="language-plaintext highlighter-rouge">memcpy_async</code> functionality added in CUDA 11. There are
a couple of ways to use this but, at least to me, the most interesting
is pipelines.</p>
<p>A pipeline is a feature exposed to Volta (<code class="language-plaintext highlighter-rouge">sm_70</code>) and later GPUs; it is a
queue that can have multiple stages. Producers add jobs to the tail of the queue
and consumers remove jobs from the head. As the names suggest, producers
‘produce’ data to be used by the consumers. Why might you want to do this? Well,
Ampere has dedicated hardware to do the load into shared that bypasses
registers/cache. A simple example is shown below:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">__restrict__</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">block</span> <span class="o">=</span> <span class="n">cg</span><span class="o">::</span><span class="n">this_thread_block</span><span class="p">();</span>
<span class="k">extern</span> <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">s</span><span class="p">[];</span>
<span class="n">constexpr</span> <span class="kt">size_t</span> <span class="n">stages</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">constexpr</span> <span class="k">auto</span> <span class="n">scope</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">thread_scope</span><span class="o">::</span><span class="n">thread_scope_block</span><span class="p">;</span>
<span class="n">__shared__</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_shared_state</span><span class="o"><</span><span class="n">scope</span><span class="p">,</span> <span class="n">stages</span><span class="o">></span> <span class="n">shared_state</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_acquire</span><span class="p">();</span>
<span class="n">cuda</span><span class="o">::</span><span class="n">memcpy_async</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="n">g</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">pipe</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_commit</span><span class="p">();</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_wait</span><span class="p">();</span>
<span class="c1">//Some compute</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_release</span><span class="p">();</span>
<span class="p">}</span></code></pre></figure>
<p>This is a single-stage pipeline, where each thread simply loads two floats
from <code class="language-plaintext highlighter-rouge">g</code> into <code class="language-plaintext highlighter-rouge">s</code>. This works in chunks, so thread 0 will load <code class="language-plaintext highlighter-rouge">g[0]</code> and <code class="language-plaintext highlighter-rouge">g[1]</code>
into <code class="language-plaintext highlighter-rouge">s[0]</code> and <code class="language-plaintext highlighter-rouge">s[1]</code>, respectively. (This didn’t seem to be obviously
documented at the time I wrote this).</p>
<p>You can use this feature on Volta, but you don’t get the hardware acceleration
that Ampere has. For my application, what I wanted was to have some threads
working as producers and some as consumers; currently, all threads are both. To
achieve this it made the most sense to use the binary partition feature. We
start by defining the roles, for example like this:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">role</span> <span class="o">=</span> <span class="p">((</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span> <span class="o">:</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">;</span></code></pre></figure>
<p>This makes even threads producers and odd threads consumers. We can then pass
this when we make the pipeline to get what we want, for example:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">,</span> <span class="n">role</span><span class="p">);</span></code></pre></figure>
<p>Now, if you make those modifications to the simple <code class="language-plaintext highlighter-rouge">memcpy_async</code> example above, it
will hang at the consumer wait. What is going on? Well, there is nothing
currently stopping the threads that we want to be exclusively consumers from
executing the producer part. According to the C++ API documentation on git, the
behaviour in this case is undefined. But looking at the source, it seems that the
consumer threads get stuck waiting on a copy that never happens.</p>
<p>Instead, you have to add some protection to the producer and consumer
statements. So the complete example would be:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">auto</span> <span class="n">block</span> <span class="o">=</span> <span class="n">cg</span><span class="o">::</span><span class="n">this_thread_block</span><span class="p">();</span>
<span class="k">extern</span> <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">s</span><span class="p">[];</span>
<span class="n">constexpr</span> <span class="kt">size_t</span> <span class="n">stages</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">constexpr</span> <span class="k">auto</span> <span class="n">scope</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">thread_scope</span><span class="o">::</span><span class="n">thread_scope_block</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">role</span> <span class="o">=</span> <span class="p">((</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span> <span class="o">:</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">;</span>
<span class="n">__shared__</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_shared_state</span><span class="o"><</span><span class="n">scope</span><span class="p">,</span> <span class="n">stages</span><span class="o">></span> <span class="n">shared_state</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">pipe</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">::</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="o">&</span><span class="n">shared_state</span><span class="p">,</span> <span class="n">role</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">role</span> <span class="o">==</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">producer</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_acquire</span><span class="p">();</span>
<span class="n">cuda</span><span class="o">::</span><span class="n">memcpy_async</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="n">g</span> <span class="o">+</span> <span class="mi">2</span><span class="o">*</span><span class="n">block</span><span class="p">.</span><span class="n">thread_rank</span><span class="p">(),</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">pipe</span><span class="p">);</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">producer_commit</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">if</span><span class="p">(</span><span class="n">role</span> <span class="o">==</span> <span class="n">cuda</span><span class="o">::</span><span class="n">pipeline_role</span><span class="o">::</span><span class="n">consumer</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_wait</span><span class="p">();</span>
<span class="c1">//Some compute</span>
<span class="n">pipe</span><span class="p">.</span><span class="n">consumer_release</span><span class="p">();</span>
<span class="p">}</span> </code></pre></figure>
<p>I thought I would add this clarification, mainly as it caused me some issues and
the feature seemed to be a bit under-documented. You might be wondering how this
performed in my application: well, it seemed to lead to significant branch
divergence, which killed performance. It also seems to me that although
<code class="language-plaintext highlighter-rouge">memcpy_async</code> is supported on Volta, you really don’t get the benefits.
However, in my experience with A100s, it seems that the asynchronous paradigm
will prove to be quite important, but due to the dedicated hardware the
method I just described may not be that useful. More testing required.</p>Array pointers in F902021-06-08T01:31:21+00:002021-06-08T01:31:21+00:00https://willtrojak.org/fortran/2021/06/08/f90-array-pointers<p>I was recently playing around with pointers in Fortran; what I wanted to achieve
was an array of pointers where each pointed to a different element in an array.
In C/C++ this is simple to achieve, something a bit like this, for example:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="kt">float</span> <span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">b</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.;</span> <span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span><span class="p">.;</span>
<span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">&</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="o">&</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span></code></pre></figure>
<p>However, this isn’t natively supported in Fortran at the moment. This is perhaps
with good reason: in Fortran, by assuming that pointers to an array point to a
contiguous part of that array, aliasing is avoided, meaning the compiler can make
certain assumptions. An example of a pointer in Fortran would be:</p>
<figure class="highlight"><pre><code class="language-fortran" data-lang="fortran"><span class="w"> </span><span class="kt">real</span><span class="p">,</span><span class="w"> </span><span class="k">target</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="w">
</span><span class="kt">real</span><span class="p">,</span><span class="w"> </span><span class="k">pointer</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">b</span><span class="p">(:)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">4</span><span class="p">:</span><span class="mi">6</span><span class="p">)</span></code></pre></figure>
<p>To achieve the behaviour I’m interested in, one method is to declare a derived
type and then make an array of that type. For example, to match the behaviour of
the earlier C/C++ example, you could do:</p>
<figure class="highlight"><pre><code class="language-fortran" data-lang="fortran"><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="n">real_ptr</span><span class="w">
</span><span class="kt">real</span><span class="p">(</span><span class="nb">kind</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span><span class="w"> </span><span class="k">pointer</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">p</span><span class="w">
</span><span class="k">end</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="n">real_ptr</span><span class="w">
</span><span class="kt">real</span><span class="p">(</span><span class="nb">kind</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span><span class="w"> </span><span class="k">target</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="w">
</span><span class="k">type</span><span class="p">(</span><span class="n">real_ptr</span><span class="p">)</span><span class="w"> </span><span class="p">::</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="n">p</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="n">b</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">%</span><span class="n">p</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">a</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span></code></pre></figure>
<p>You might rightly ask: how does this perform? Surely using a derived type and
having to invoke a bit more machinery can’t be that performant? Well, below
is the interesting part of the assembly:</p>
<figure class="highlight"><pre><code class="language-asm" data-lang="asm"> movss xmm0, DWORD PTR .LC0[rip]
movss DWORD PTR [rbp-8], xmm0
movss xmm0, DWORD PTR .LC1[rip]
movss DWORD PTR [rbp-4], xmm0
lea rax, [rbp-8]
add rax, 4
mov QWORD PTR [rbp-32], rax
lea rax, [rbp-8]
mov QWORD PTR [rbp-24], rax</code></pre></figure>
<p>This was compiled with GCC 8.4.0 on an Intel-based system. The interesting bit
is that, barring some additional standard setup required by Fortran, the
assembly is <em>exactly</em> the same as that produced by the C/C++ version. So, to
answer the question of whether this approach is performant in Fortran: it is as
performant as C/C++ in this case.</p>
<p>I also tried this on an Arm-based system, but the differences were more
significant. Frankly, though, I put this down to the Fortran compiler for Arm.</p>