> This [restrict] keyword makes a dramatic performance difference.

Hyperbole; it's applicable in specific circumstances.

There are ways to code without restrict to get the same performance.

Sample code:

  struct node {
    struct node *next, *prev;
  };

  /* Original: pred and succ must not overlap, but this is not expressed. */

  void insert_after_A(struct node *pred, struct node *succ)
  {
    succ->prev = pred;
    succ->next = pred->next;
    pred->next->prev = succ;
    pred->next = succ;
  }

  /* Optimize with restrict: pred and succ do not overlap. */

  void insert_after_B(struct node *restrict pred, struct node *restrict succ)
  {
    succ->prev = pred;
    succ->next = pred->next;
    pred->next->prev = succ;
    pred->next = succ;
  }

  /* Optimize by hand in "load-store style". */

  void insert_after_C(struct node *pred, struct node *succ)
  {
    struct node *aft = pred->next;  /* load pred->next once, up front */
    succ->prev = pred;
    succ->next = aft;
    aft->prev = succ;
    pred->next = succ;
  }

Compiling with GCC, the first example generates 8 memory operations; both of the other two get it down to 7.

  insert_after_A:
          movl  4(%esp), %eax    ;; %eax = pred
          movl  8(%esp), %edx    ;; %edx = succ
          movl  (%eax), %ecx     ;; %ecx = pred->next
          movl  %eax, 4(%edx)    ;; {%edx:succ}->prev = pred
          movl  %ecx, (%edx)     ;; {%edx:succ}->next = {%ecx:pred->next}
          movl  (%eax), %ecx     ;; %ecx = pred->next !!! deja vu: this load was done a few instructions back!
          movl  %edx, (%eax)     ;; {%eax:pred}->next = {%edx:succ}
          movl  %edx, 4(%ecx)    ;; {%ecx:pred->next}->prev = {%edx:succ}
          ret

  insert_after_B:
          movl  4(%esp), %edx
          movl  8(%esp), %eax
          movl  (%edx), %ecx
          movl  %edx, 4(%eax)
          movl  %eax, (%edx)
          movl  %ecx, (%eax)
          movl  %eax, 4(%ecx)
          ret

  insert_after_C:
          movl  4(%esp), %edx
          movl  8(%esp), %eax
          movl  (%edx), %ecx
          movl  %edx, 4(%eax)
          movl  %ecx, (%eax)
          movl  %eax, 4(%ecx)
          movl  %eax, (%edx)
          ret

Both the use of restrict in B and the technique in C have cut out the wasteful memory access. In A, that second load of pred->next is done because of the suspicion that the stores through succ changed it, which would happen if the two objects overlapped.

Function C works by caching the pred->next value in a local variable and referring to that.

None of the assignments through the structure type can possibly affect the value of aft; the structures cannot overlap with the local variable. (This is an implicit non-overlap restriction similar to what restrict expresses for the two arguments.)

Once we establish aft, all of the pointers involved in the function are local variables, so none of the local->memb = val assignments raise any suspicion that the value of local has been overwritten and must be reloaded from memory. We coded five accesses and got five (plus the two that load the arguments from the stack, making seven).

Function B has the disadvantage that the behavior becomes undefined if pred and succ are pointers to the same node. Function C has no such problem. Even though the code is just as good as for B, the behavior is defined for overlapping pred and succ.
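To check that the load-store version still links nodes correctly, here is a minimal sketch. It assumes a circular list with a sentinel head node, which the original post does not specify; ring_ok and demo are made-up names for illustration only.

```c
struct node {
  struct node *next, *prev;
};

/* Load-store style, as in function C: pred->next is read once into a local. */
void insert_after_C(struct node *pred, struct node *succ)
{
  struct node *aft = pred->next;
  succ->prev = pred;
  succ->next = aft;
  aft->prev = succ;
  pred->next = succ;
}

/* Hypothetical helper: walk n links from head, checking that every
   prev pointer mirrors its next pointer and that we land back on head. */
int ring_ok(const struct node *head, int n)
{
  const struct node *p = head;
  for (int i = 0; i < n; i++) {
    if (p->next->prev != p)
      return 0;
    p = p->next;
  }
  return p == head;
}

/* Hypothetical demo: build head <-> a <-> b and verify the ring. */
int demo(void)
{
  struct node head = { &head, &head };  /* empty circular list (assumption) */
  struct node a, b;

  insert_after_C(&head, &a);
  insert_after_C(&a, &b);

  return ring_ok(&head, 3)
      && head.next == &a && a.next == &b && b.next == &head;
}
```

demo() returns 1 when the ring of three nodes is consistent in both directions.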

restrict is C trying to keep up with Fortran. What are situations when we can't use this type of load-store coding to reduce memory operations? Why, array processing!

Well, of course we can take the same approach in array processing; but the problem is that array processing is automatically unrolled by the compiler. Unrolling is hampered in some situations when we don't know whether the arrays overlap. If we simply introduce local variables into the loop, it still won't be unrolled. What we really have to do is manual unrolling. Manual unrolling is guesswork; whether unrolling helps or hurts depends on which specific member of which processor family we are compiling for (how big is its instruction cache and such).

E.g.

   void vector_add(double *sum, double *a, double *b, int n)
   {
      for (int i = 0; i < n; i++)
        sum[i] = a[i] + b[i];
   }

If we add restrict here, then none of the arrays overlap and this kind of optimization is valid. (Let's ignore the nuances of n not being divisible by 4):

    for (int i = 0; i < n; i += 4) {
      sum[i+0] = a[i+0] + b[i+0];
      sum[i+1] = a[i+1] + b[i+1];
      sum[i+2] = a[i+2] + b[i+2];
      sum[i+3] = a[i+3] + b[i+3];
    }

If we code this ourselves, it's a lot of work, and it could even slow the code down if the unrolling turns out to be bad for our target CPU.
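For completeness, here is roughly what the hand-unrolled version looks like once the leftover elements are handled. This is only a sketch: the name vector_add_unrolled, the const qualifiers, and the unroll factor of 4 are my own choices for illustration, and whether this beats the plain loop on a given CPU is exactly the guesswork described above.

```c
/* Sketch: unroll by 4, then mop up the 0..3 leftover elements.
   restrict promises the compiler that sum, a and b do not overlap;
   calling this with overlapping arrays is undefined behavior. */
void vector_add_unrolled(double *restrict sum, const double *restrict a,
                         const double *restrict b, int n)
{
  int i = 0;

  /* Main loop: runs while at least 4 elements remain. */
  for (; i <= n - 4; i += 4) {
    sum[i+0] = a[i+0] + b[i+0];
    sum[i+1] = a[i+1] + b[i+1];
    sum[i+2] = a[i+2] + b[i+2];
    sum[i+3] = a[i+3] + b[i+3];
  }

  /* Cleanup loop: the remainder when n is not divisible by 4. */
  for (; i < n; i++)
    sum[i] = a[i] + b[i];
}
```

With n = 7, the main loop handles elements 0 through 3 and the cleanup loop handles 4 through 6.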