Will Trojak (https://willtrojak.org/feed.xml), Jekyll feed generated 2023-01-11T08:41:30+00:00

Dubiner Basis for Regular Tetrahedrons (2022-03-30T15:00:12+00:00) https://willtrojak.org/polynomials/2022/03/30/duniner_basis_tet

<p>I have recently been working on extending stable flux reconstruction methods to polygons and polyhedra. One of the difficulties in doing this is imposing the symmetry conditions. This can be done, but is made much easier if a regular reference element is used. However, for tetrahedra, it is common to use an irregular reference tetrahedron, specifically one with nodes:</p> $[-1, -1, -1], \; [1, -1, -1],\; [-1,1,-1],\; [-1,-1,1]$ <p>The basis set out in Hesthaven and Warburton (2008) uses these points but, as I say, this isn’t very helpful for imposing rotational symmetries. Instead, I would like to use the regular tetrahedron defined by the points:</p> $[\sqrt{8/9}, 0, -1/3],\; [-\sqrt{2/9}, \sqrt{2/3}, -1/3],\; [-\sqrt{2/9},-\sqrt{2/3}, -1/3],\; [0,0,1]$ <p>I couldn’t find a reference that had considered a basis on these points before, and as constructing the basis was a bit fiddly, I decided to write it down for others:</p> $\phi_{i,j,k}(x,y,z) = \frac{2(c-1)^{i+j}(1-a)^j}{\sqrt{3}}J^{(0,0)}_{i}(a){J}^{(2i+1,0)}_j(b){J}^{(2i+2j+2,0)}_k(c)$ <p>where $$J$$ are normalised Jacobi polynomials and</p> $a = -\frac{(4\sqrt{2}x + z - 1)}{3(z - 1)},\; b = -\frac{2\sqrt{3}y}{2x + \sqrt{2}(z - 1)},\; c = \frac{3z - 1}{2}.$ <p>In the transformed coordinate system of $$a$$, $$b$$, and $$c$$, the domain is $$[-1,1]^3$$. One disadvantage of this basis is that it isn’t quite normalised, as I couldn’t work the normalisation out. It is probably non-trivial due to the $$x$$ and $$z$$ dependence in $$a$$ and $$b$$. However, this isn’t such a big issue as it just means you have to calculate the mass matrix explicitly. 
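As a quick numerical check of the map above, the following sketch (plain Python; the function name is my own) evaluates $(a, b, c)$ at two of the base vertices of the regular tetrahedron. At the first base vertex and at the apex the denominators of $a$ and/or $b$ vanish, so those points are excluded; the remaining vertices should land on corners of the $[-1,1]^3$ domain, as expected for collapsed coordinates.

```python
import math

def collapsed_coords(x, y, z):
    """Map a point in the regular reference tetrahedron to (a, b, c)."""
    a = -(4*math.sqrt(2)*x + z - 1) / (3*(z - 1))
    b = -2*math.sqrt(3)*y / (2*x + math.sqrt(2)*(z - 1))
    c = (3*z - 1) / 2
    return a, b, c

# Second and third base vertices of the regular tetrahedron
v2 = (-math.sqrt(2/9),  math.sqrt(2/3), -1/3)
v3 = (-math.sqrt(2/9), -math.sqrt(2/3), -1/3)

print(collapsed_coords(*v2))  # approximately (-1,  1, -1)
print(collapsed_coords(*v3))  # approximately (-1, -1, -1)
```

The apex $[0,0,1]$ maps to $c = 1$ with $a$ and $b$ undefined there, consistent with the collapsed-coordinate picture.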
This is a trade-off I’m willing to make for the utility this basis offers.</p>

PyFR performance on Mac M1 Chip (2021-06-17T04:26:25+00:00) https://willtrojak.org/pyfr/m1/2021/06/17/pyfr-M1-mac

<p>The new processor being used by Apple is, as many will know by now, its own custom ARM architecture. Claims aplenty have been made about its performance, but when I hear these claims all I really want to know is: how fast can it run a <a href="https://doi.org/10.1098/rspa.1937.0036">Taylor–Green vortex</a>?</p> <p>To answer this I am going to turn to the high-order numerical method, flux reconstruction. The reason being that I like it and have access to some performant codes for it, in particular, <a href="https://github.com/PyFR/PyFR">PyFR</a>. Now, getting PyFR to work on MacOS 11 with the M1 chip was a bit of a faff, and for my own record in case I brick my machine, I’ll document the standout steps.</p> <ol> <li>Install Homebrew etc.</li> <li>Use Homebrew to install: GCC, python3.9, open-mpi, hdf5, and numpy.</li> <li>Of the dependencies of PyFR, the really tricky one is h5py. Even though, at the time of writing, the Homebrew numpy version was 1.20.3, h5py tries to build numpy 1.19.3, and there seem to be some issues with compiling numpy (hence why I’m using the brew version). The workaround I came up with was to clone the h5py git repo and bump the numpy version number in setup.py, then install this local version. I didn’t have any issues doing this, but PyFR doesn’t use the full h5py feature set. 
(You may need to upgrade pip, setuptools, etc.)</li> <li>Now you can do <code class="language-plaintext highlighter-rouge">pip install pyfr</code> to get the last of the dependencies, and either use that or uninstall it and use a git clone (this is what I did).</li> </ol> <p>One of the big selling points of the M1 was the Neural Engine, a 16-core accelerator aimed at applying neural networks. Apple doesn’t make it easy to use the Neural Engine; I did have a cursory look at making a backend, but this seemed far too involved for the limited hardware available. Moreover, PyFR is generally memory bandwidth bound and so it likely wouldn’t benefit much from the ANE. Sadly, I couldn’t find a detailed profiler to confirm it’s bandwidth bound on M1 (there is nothing like VTune or Nsight for Mac hardware), but it seems reasonable to assume given everything. As a result of all this, I used the OpenMP backend; hence, I needed GCC. I also used GiMMiK for matrix multiplication, which will probably be the best option in this case.</p> <p>To measure performance we can use the following metric:</p> $\frac{\text{Runtime}}{\text{DoF}\times\text{RK steps}}.$ <p>The specific setup I used I previously detailed <a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">here</a>, but here I reduced the number of time steps, as a Mac isn’t a supercomputer and there was no need to run it overnight. 
This case has a $p=3$ hexahedral mesh, which leads to quite sparse operators in FR, and hence why GiMMiK is the best option for the matmul.</p> <p>Using all 8 cores this was the result:</p> <table> <thead> <tr> <th> </th> <th>Single</th> <th>Double</th> </tr> </thead> <tbody> <tr> <td>Runtime [s]</td> <td>233.399</td> <td>441.365</td> </tr> <tr> <td>DoF</td> <td>4096000</td> <td>4096000</td> </tr> <tr> <td>RHS</td> <td>2000</td> <td>2000</td> </tr> <tr> <td>ns/DoF/RHS</td> <td>28.49</td> <td>53.88</td> </tr> </tbody> </table> <p>If instead I just use 4 threads, the performance was reduced, but not by much. This is not untypical for OpenMP, as threads will end up spending time waiting on other threads.</p> <table> <thead> <tr> <th> </th> <th>Single</th> <th>Double</th> </tr> </thead> <tbody> <tr> <td>Runtime [s]</td> <td>244.541</td> <td>460.131</td> </tr> <tr> <td>DoF</td> <td>4096000</td> <td>4096000</td> </tr> <tr> <td>RHS</td> <td>2000</td> <td>2000</td> </tr> <tr> <td>ns/DoF/RHS</td> <td>29.85</td> <td>56.17</td> </tr> </tbody> </table> <p>Something you can do instead is to set the thread scheduler to dynamic, which will dynamically allocate chunks of the loop to cores as they become available. For this I used the default OpenMP chunk size. As you can see below, this was somewhere between the static 8-core and static 4-core performance. So it seems that the overhead of the dynamic allocation isn’t worth it in this instance.</p> <table> <thead> <tr> <th> </th> <th>Single</th> <th>Double</th> </tr> </thead> <tbody> <tr> <td>Runtime [s]</td> <td>237.116</td> <td>444.979</td> </tr> <tr> <td>DoF</td> <td>4096000</td> <td>4096000</td> </tr> <tr> <td>RHS</td> <td>2000</td> <td>2000</td> </tr> <tr> <td>ns/DoF/RHS</td> <td>28.94</td> <td>54.32</td> </tr> </tbody> </table> <p>To wrap it up, the performance doesn’t seem that bad considering. 
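As a sanity check, the metric defined earlier can be recomputed directly from the 8-core numbers in the first table (a plain Python snippet; the variable names are mine):

```python
# Runtime in seconds for the 8-core run, taken from the first table
runtimes = {"single": 233.399, "double": 441.365}
dof, rhs = 4096000, 2000

for prec, t in runtimes.items():
    # Convert runtime to nanoseconds, then normalise by DoF and RHS evaluations
    print(f"{prec}: {t * 1e9 / (dof * rhs):.2f} ns/DoF/RHS")
```

This reproduces the 28.49 and 53.88 ns/DoF/RHS figures reported in the table.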
In <a href="https://pyfr.discourse.group/t/tgv-performance-numbers/407/12">this</a> post I showed the results for the new Nvidia A100 GPUs; for reference, they seem to be about 21.6 times more performant than a single M1 chip. In the future, a cache-blocking update to PyFR will be pushed for the full Navier–Stokes equations. This should give a reasonable performance bump on CPUs, so that is something to watch for. Either way, with the memory bandwidth on CPUs continuing to improve, for bandwidth-bound applications such as PyFR, CPUs seem to be becoming more competitive.</p>

Cuda Binary Partitions and Pipelines (2021-06-09T22:26:11+00:00) https://willtrojak.org/cuda/2021/06/09/cuda-pipe-bin-part

<p>Something I have recently been working on is fusing two GPU kernels in PyFR: one is a pointwise kernel and the other a matrix multiplication kernel. For more background you can watch this <a href="https://doi.org/10.52843/cassyni.2x9rkc">talk</a>. Both these kernels are memory bandwidth bound, and so to increase speed we can reduce trips out to main memory by using shared memory.</p> <p>Some background on shared memory: it sits at the same level as L1 cache, and hence has much higher bandwidth, but, unlike cache, the user can explicitly perform load and store operations on it. However, to load something into shared from global, the compiler will first load it from global into a register, and then from the register to shared. The reason for doing this, at least as far as I can see, is that shared memory is shared between threads in a block, and only after a thread sync will it be guaranteed that the value is resident in shared. 
Therefore, putting it in a register would give the compiler more flexibility when optimising. However, this doesn’t necessarily fit with what an engineer might want.</p> <p>Enter the Ampere series of GPUs by Nvidia. The interesting thing introduced with Ampere was the ability to bypass the register stage, and even the L1 and L2 caches, when <em>loading</em> global into shared. To achieve this you currently have to make use of the <code class="language-plaintext highlighter-rouge">memcpy_async</code> functionality added in CUDA 11. There are a couple of ways to use this but, at least to me, the more interesting one is pipelines.</p> <p>A pipeline is a feature exposed to Volta (<code class="language-plaintext highlighter-rouge">sm_70</code>) and later GPUs; it is a queue that can have multiple stages. Producers add jobs to the tail of the queue and consumers remove jobs from the head. As the names suggest, producers ‘produce’ data to be used by the consumers. Why might you want to do this? Well, Ampere has dedicated hardware to do the load into shared that bypasses registers/cache. 
A simple example is shown below:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c">__global__ void example(int n, float* __restrict__ g)
{
    auto block = cg::this_thread_block();
    extern __shared__ float s[];

    constexpr size_t stages = 1;
    constexpr auto scope = cuda::thread_scope::thread_scope_block;
    __shared__ cuda::pipeline_shared_state&lt;scope, stages&gt; shared_state;
    auto pipe = cuda::make_pipeline(block, &amp;shared_state);

    pipe.producer_acquire();
    cuda::memcpy_async(block, s + 2*block.thread_rank(),
                       g + 2*block.thread_rank(), 2*sizeof(float), pipe);
    pipe.producer_commit();

    pipe.consumer_wait();
    // Some compute
    pipe.consumer_release();
}</code></pre></figure> <p>This is a single-stage pipeline, where each thread simply loads two floats from <code class="language-plaintext highlighter-rouge">g</code> into <code class="language-plaintext highlighter-rouge">s</code>. This works in chunks, so thread 0 will load <code class="language-plaintext highlighter-rouge">g[0]</code> and <code class="language-plaintext highlighter-rouge">g[1]</code> into <code class="language-plaintext highlighter-rouge">s[0]</code> and <code class="language-plaintext highlighter-rouge">s[1]</code>, respectively. (This didn’t seem to be obviously documented at the time I wrote this.)</p> <p>You can use this feature on Volta but you don’t get the hardware acceleration that Ampere has. 
So for my application, what I wanted to do was have some threads working as producers and some as consumers; currently, all threads are both. To achieve this it made the most sense to use the binary partition feature. We start by defining the roles, for example like this:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c">auto role = ((block.thread_rank() % 2) == 0) ? cuda::pipeline_role::producer
                                             : cuda::pipeline_role::consumer;</code></pre></figure> <p>This makes even threads producers and odd threads consumers. We can then pass this when we make the pipeline to get what we want, for example:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c">auto pipe = cuda::make_pipeline(block, &amp;shared_state, role);</code></pre></figure> <p>Now, if you make those modifications to the simple <code class="language-plaintext highlighter-rouge">memcpy_async</code> example above, it will hang at the consumer wait. What is going on? 
Well, there is nothing currently stopping the threads that we want to be exclusively consumers from executing the producer part. According to the C++ API documentation on git, the behaviour in this case is undefined, but looking at the source it seems that the consumer threads get stuck waiting on the copy that never happens.</p> <p>Instead, you have to add some protection to the producer and consumer statements. So the complete example would be:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c">auto block = cg::this_thread_block();
extern __shared__ float s[];

constexpr size_t stages = 1;
constexpr auto scope = cuda::thread_scope::thread_scope_block;

auto role = ((block.thread_rank() % 2) == 0) ? cuda::pipeline_role::producer
                                             : cuda::pipeline_role::consumer;

__shared__ cuda::pipeline_shared_state&lt;scope, stages&gt; shared_state;
auto pipe = cuda::make_pipeline(block, &amp;shared_state, role);

if (role == cuda::pipeline_role::producer)
{
    pipe.producer_acquire();
    cuda::memcpy_async(block, s + 2*block.thread_rank(),
                       g + 2*block.thread_rank(), 2*sizeof(float), pipe);
    pipe.producer_commit();
}

if (role == cuda::pipeline_role::consumer)
{
    pipe.consumer_wait();
    // Some compute
    pipe.consumer_release();
}</code></pre></figure> <p>I thought I would add this clarification, mainly as it caused me some issues and the feature seemed to be a bit under-documented. You might be wondering how this performed in my application; well, it seemed to lead to significant branch divergence, which killed performance. It also seems to me that although <code class="language-plaintext highlighter-rouge">memcpy_async</code> is supported on Volta, you really don’t get the benefits. However, in my experience with A100s, it seems that the asynchronous paradigm will prove to be quite important, but due to the dedicated hardware the method I just described may not be that useful. More testing required.</p>

Array pointers in F90 (2021-06-08T01:31:21+00:00) https://willtrojak.org/fortran/2021/06/08/f90-array-pointers

<p>I was recently playing around with pointers in Fortran. What I wanted to achieve was an array of pointers where each points to a different element of an array. In C/C++ this is simple to achieve, something a bit like this for example:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c">float a[2];
float *b[2];

a[0] = 1.;
a[1] = 2.;

b[0] = &amp;a[1];
b[1] = &amp;a[0];</code></pre></figure> <p>However, this isn’t natively supported in Fortran at the moment. This is perhaps with good reason: in Fortran, by assuming that pointers to an array point to a contiguous part of the array, aliasing is avoided, meaning the compiler can make certain assumptions. 
An example of a pointer in Fortran would be:</p> <figure class="highlight"><pre><code class="language-fortran" data-lang="fortran">real, target  :: a(10)
real, pointer :: b(:)

b =&gt; a(4:6)</code></pre></figure> <p>To achieve the behaviour I’m interested in, one method is to declare a derived type, and then make an array of that type. 
For example, to match the behaviour of the earlier C/C++ example you could do:</p> <figure class="highlight"><pre><code class="language-fortran" data-lang="fortran">type real_ptr
   real(kind=4), pointer :: p
end type real_ptr

real(kind=4), target :: a(3)
type(real_ptr)       :: b(2)

b(1)%p =&gt; a(2); b(2)%p =&gt; a(1)</code></pre></figure> <p>You might rightly ask, how does this perform? Surely using a derived type and having to invoke a bit more heavy machinery can’t be too performant. Well, below is the interesting part of the assembly:</p> <figure class="highlight"><pre><code class="language-asm" data-lang="asm">movss   xmm0, DWORD PTR .LC0[rip]
movss   DWORD PTR [rbp-8], xmm0
movss   xmm0, DWORD PTR .LC1[rip]
movss   DWORD PTR [rbp-4], xmm0
lea     rax, [rbp-8]
add     rax, 4
mov     QWORD PTR [rbp-32], rax
lea     rax, [rbp-8]
mov     QWORD PTR [rbp-24], rax</code></pre></figure> <p>This was compiled with GCC 8.4.0 on an Intel-based system. The interesting bit is that, barring some additional standard setup required by Fortran, the assembly is <em>exactly</em> the same as that produced for the C/C++ version. So, to answer the question of whether this approach is performant in Fortran: it is as performant as C/C++ in this case.</p> <p>I also tried this on an Arm-based system, but the differences were more significant. Frankly, though, I put this down to the Fortran compiler for Arm.</p>