Low-Lock Singletons In D

DConf 2013 was a huge success!  I’ve finally recovered enough from my red-eye flight back to the East Coast to blog about a topic from the conference that really stuck out in my mind.

At DConf, I gave a talk on D-specific design patterns.  (The video will be up on dconf.org shortly.)  One pattern in particular attracted much more audience attention than I thought it would, so I decided to blog about it:  The Low-Lock Singleton Pattern.

Most people who program for a living are probably familiar with the traditional Singleton Pattern, which is used to create a single, globally available instance of a class.  A canonical implementation in D is shown below:

class MySingleton {
  static MySingleton get() {
    if (instance_ is null) {
      instance_ = new MySingleton;
    }
    return instance_;
  }
 private:
  this() {}
  __gshared MySingleton instance_;
}

Most of these people also probably know that this naive implementation is a concurrency nightmare.  You can’t safely call get() from multiple threads simultaneously.  The most obvious problem is that two instances of MySingleton will be created if get() is called from a second thread before it’s finished running in the first thread that called it, i.e. before the first call has assigned a non-null value to instance_.

If you want thread safety, the most obvious solution is to wrap everything in a mutex, represented in D by a synchronized block.  Omitting all the boilerplate except the get() method, this would look something like:

  static MySingleton get() {
    synchronized {
      if (instance_ is null) {
        instance_ = new MySingleton;
      }
    }
    return instance_;
  }

Now, if Thread 1 calls get(), it has a chance to finish instantiating instance_ and storing a reference to it before Thread 2 checks whether instance_ is null.  There’s only one problem with this:  It comes at a huge performance cost (benchmarks forthcoming).  Entering a synchronized block is expensive.

One solution that looks correct to the naive eye but is subtly broken is double-checked locking.  The idea here is to check whether instance_ is null outside the synchronized block and, only if it is, check again inside one:

  static MySingleton get() {
    if (instance_ is null) {
      synchronized {
        if (instance_ is null) {
          instance_ = new MySingleton;
        }
      }
    }
    return instance_;
  }

Intuitively this seems reasonable.  instance_ can’t go from non-null to null, only from null to non-null.  Therefore, if it’s already non-null by the time we get to the first check (the one outside the synchronized block), it would seem we know everything we need to know and can simply return instance_.  Alas, the real world is not that simple.  The details are complicated and best explained by others, but the executive summary is this:  The sequentiality of program execution is an illusion maintained within a single thread.  Compilers and CPUs reorder instructions, and the illusion breaks down when dealing with multiple threads.  For example, instance_ can be assigned a non-null value before the object it refers to is fully constructed.

In D, though, thread-local storage (TLS) is a first class member of the language.  This leads to an interesting workaround that’s similar in spirit to double-checked locking but actually works.  It’s a previously obscure pattern that was independently invented by at least myself and Alexander Terekhov, an IBM researcher working in Java.   It never caught on for some reason, probably because of the abysmal performance of TLS in Java back when it was discovered, as illustrated by Doug Lea’s benchmarks from that era.  Anyhow, the pattern is implemented in D below:

class MySingleton {
  static MySingleton get() {
    if (!instantiated_) {
      synchronized {
        if (instance_ is null) {
          instance_ = new MySingleton;
        }
        instantiated_ = true;
      }
    }
    return instance_;
  }
 private:
  this() {}
  static bool instantiated_;  // Thread local
  __gshared MySingleton instance_;
}

The idea is simple:  We maintain a thread-local flag to track whether we’ve already entered the synchronized block from the current thread and ensured that instance_ is instantiated.  The synchronized block is entered once per thread, amortized over a whole run of the program.

Benchmarks

Ideally, we’d like thread-safe Singletons that are as fast as thread-unsafe Singletons.  How well does our low-lock TLS-based Singleton perform in practice?  I created a micro-benchmark calling get() in a loop 1 billion times and ran it for naive, thread-unsafe Singletons (unsafe), Singletons that enter a synchronized block on every call to get() (synchronized), and the low-lock TLS-based Singletons (TLS).  The results are shown in the graph below:

[Chart:  running time of 1 billion get() calls for the unsafe, synchronized and TLS implementations, compiled with DMD and GDC.]

On GDC the overhead of TLS vs. unsafe is virtually undetectable.  On DMD it’s noticeable but small.  In both cases it pales in comparison to the overhead of synchronizing on every call to get().  Note that this benchmark was run in a single thread, so the mutex used to implement the synchronized statement was never contended.  If it were, the results when synchronizing on every call to get() would be far worse.
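
The benchmark source isn’t included here, but a minimal timing harness along the following lines suffices.  This is a sketch rather than the original code:  the std.datetime.stopwatch module, the loop structure and the null check that keeps the call from being optimized away are all assumptions, and it presumes one of the MySingleton implementations above is in scope.

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;

void main() {
  enum iterations = 1_000_000_000;

  auto sw = StopWatch(AutoStart.yes);
  foreach (i; 0 .. iterations) {
    // Use the result so the compiler can't elide the call entirely.
    if (MySingleton.get() is null) assert(0);
  }
  sw.stop();

  writeln(iterations, " calls to get() took ", sw.peek.total!"msecs", " ms.");
}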

There you have it:  An obscure Java pattern that was ahead of its time, resurrected in the presence of more modern thread-local storage implementations in D, lets us have our cake and eat it too:  Fast Singletons that are thread safe without any dependence on obscure memory model details.


Concurrency, Parallelism and D

If you’ve followed advances in computer architecture over the past half-decade, you’ve probably noticed that more and more cores are showing up on CPUs.  The transistor budgets of CPUs are still increasing exponentially as Moore’s Law predicts, but CPU designers are finding it increasingly difficult to translate the increased transistor budgets into increased single thread performance.  Why the sudden change in paradigm?  Under the hood, modern CPUs execute code in parallel.  Examine the following source code:

i++;
j++;

This probably compiles down to two assembly instructions on the x86 architecture, assuming i and j live in EAX and EDX respectively:

inc EAX;
inc EDX;

Neither instruction depends on the result of the other.  They can be executed in parallel, and this is exactly what the hardware does.  This is called instruction level parallelism, or ILP.  The problem with ILP is that it doesn’t scale past a certain point.  There’s only so much parallelism available at such a low level.  Even where ILP exists, the CPU needs to prove on the fly that the relevant instructions can be executed in parallel, and work around issues like branching and register sharing to make parallel execution happen.  (For an excellent article that discusses these issues in a more detailed but still accessible way, see Modern Microprocessors: A 90-Minute Guide.)  What’s a CPU designer to do when he/she can’t find any more ways to translate an increased transistor budget into proportionally increased performance?  Punt that task to the programmer, of course.  The responsibility for exploiting parallelism in your code has shifted from the hardware, and by extension the hardware designer, to you, the programmer.

Notice that the word “concurrency” hasn’t been mentioned once yet, except in the title.  Wikipedia defines concurrency as “…a property of systems in which several computations are executing simultaneously, and potentially interacting with each other.”  At the lowest levels we definitely have concurrency.  Two increment operations are being executed simultaneously, yet somehow even assembly language programmers don’t need to have the slightest clue about deadlocks, race conditions or any of the other standard concurrency concepts, even if they want to write highly optimized code that benefits maximally from ILP.  Concurrency has been abstracted away as an implementation detail of parallelism.  In its place is the higher level language of parallelism, which speaks in terms of dependencies.  Two pieces of code that do not depend on each other’s results, including side effects, may be executed in parallel.  This excellent article discusses in more detail why concurrency and parallelism are not the same thing.

There are other applications of concurrency where abstracting away its existence is not desirable, let alone practical.  If you’re writing a server, handling multiple requests simultaneously is a fundamental part of the problem you’re solving, not an implementation detail.  If you’re writing a game, simultaneously processing graphics and sound is not an implementation detail.  If you’re writing an operating system, multitasking is not an implementation detail.  There are two traditional ways to handle explicit concurrency:  Multiple processes and multiple threads.  Threads share an address space.  This means that all of the memory in the process is implicitly shared between threads.  Change a value from one thread and all of the other threads “know” about it.  This makes communication between concurrently executing tasks easy, but makes ensuring correctness of the code hard, since any memory address may be read or written to by any thread, in a nondeterministic order.  Processes don’t share an address space and all sharing is explicit.  The downsides are that interprocess communication has substantial overhead and passing complex object graphs (i.e. the general case of objects that have pointers to other objects) between address spaces is a non-trivial problem.

D’s approach to general-case concurrency can be summarized as isolation via the type system, plus message passing and limited, explicit memory sharing.  If you use std.concurrency for all your multithreading needs and don’t use unsafe casts to subvert the type system, there can be no implicit sharing of mutable data between threads and no low-level data races.  This is accomplished using several features of D’s type system and the design of std.concurrency.  First, std.concurrency’s spawn() function can only take a function pointer, not a delegate or a class instance.  (For those not familiar with D, a delegate is a “fat pointer” that holds a pointer to a function and a pointer to a context, such as a class or struct instance, or a closure.)

import std.stdio, std.concurrency;

void fun() {
     writeln("Printing from a regular function.");
}

void main(string[] args) {
     void fun2() {
          writeln("Printing from a closure.");
     }

     auto t1 = spawn(&fun);   // Works.

     // Error.  Taking the address of a nested function produces a delegate 
     // and allows variables declared in the outer function to be accessed 
     // from the nested function's body.  Therefore, the local variables of 
     // main() would be accessible from multiple threads if this were
     // allowed.
     auto t2 = spawn(&fun2);  
}

Second, the arguments to functions started by spawn and messages passed between threads must not have unshared aliasing (using the standard library’s terminology).  This means that only data marked immutable or shared may be transitively reachable via pointers or references passed into a spawned function or passed as a message.  (The immutable and shared type constructors are transitive, meaning all data reachable via pointer/reference indirection from immutable/shared data is also immutable/shared.)

import std.stdio, std.concurrency;

void fun(string str) {
     writeln(str);
}

void fun2(char[] str) {
     writeln(str);
}

void fun3(int i) {
     writeln(i);
}

void main() {
     string str1 = "foo";  // string is an alias for immutable(char)[].
     char[] str2 = "foo".dup;
     auto t1 = spawn(&fun, str1);  // Works.  Pointers to immutable data.
     auto t2 = spawn(&fun2, str2);  // Error:  Pointers to mutable data.
     auto t3 = spawn(&fun3, 1);  // Works.  No pointer indirection at all.

     send(t1, str1);  // Pass a message.  Works.  Pointers to immutable data.
     send(t1, str2);  // Error:  Pointers to mutable data.
     send(t1, 1);     // Works.  No pointer indirection at all.
}

Third, all global or static variables not explicitly marked as shared or __gshared are implicitly thread-local.  (__gshared variables are classic C-style global variables.  They are intentionally ugly looking, are not allowed in code marked @safe, are easily greppable and are regarded as similar to using an unsafe I-know-what-I’m-doing cast.  __gshared is a storage class, not a type constructor, so __gshared variables have the same type as non-__gshared variables.)  Immutable variables are implicitly shared, since there are no concurrency issues when sharing immutable data.
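
As a minimal sketch of this default, the following example spawns a thread that modifies an unannotated module-level variable; the main thread’s copy is unaffected.  (The variable names are purely illustrative.)

import std.concurrency, std.stdio;

int perThread;                // no annotation: implicitly thread-local
shared int explicitlyShared;  // one copy, visible to all threads
__gshared int legacyGlobal;   // one unchecked copy; not allowed in @safe code

void worker() {
     perThread = 42;  // modifies this thread's copy only
     writeln("worker sees ", perThread);  // 42
     ownerTid.send(true);
}

void main() {
     spawn(&worker);
     receiveOnly!bool();
     writeln("main sees ", perThread);  // 0: main's copy was never touched
}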

Fourth, the compiler guarantees sequential consistency of all access to data marked shared.  Sequential consistency means that all reads and writes from a given thread happen in the order they were issued, as seen from all threads.  The details of shared for classes and structs are too complicated to discuss here.  The last chapter of Andrei Alexandrescu’s book, “The D Programming Language“, discusses them in detail.  Basically, shared classes and structs provide shared-memory concurrency that’s limited, explicit (because the type must be marked shared and designed for thread safety) and safe from low-level data races, though this does not prevent concurrency-related bugs in high-level invariants.
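
For basic types, a minimal sketch of this kind of limited, explicit sharing might look like the following.  (This example is an illustration rather than code from the post; core.atomic.atomicOp is used for the increment because ++ on a shared integer is not an atomic read-modify-write.)

import core.atomic, std.concurrency, std.stdio;

shared int counter;  // explicitly shared across all threads

void bump() {
     foreach (i; 0 .. 1_000) {
          atomicOp!"+="(counter, 1);  // atomic read-modify-write
     }
     ownerTid.send(true);
}

void main() {
     spawn(&bump);
     spawn(&bump);
     receiveOnly!bool();
     receiveOnly!bool();
     writeln(atomicLoad(counter));  // always 2000: no low-level data race
}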

Such a scheme provides the best of both worlds for concurrency use cases where the units of concurrency are not tightly coupled and complex state doesn’t need to be passed frequently.  You can do practical concurrent programming without the complexities of locks, atomic operations or memory models.  You don’t need to give up mutable state, which makes sense since mutable state isn’t a problem for concurrency as long as it’s private to a thread.  Communication between threads is cheap and even complex object graphs can be easily, cheaply and safely shared as long as they’re immutable.  Finally, if the rules need to be broken occasionally, memory can be shared in a limited way that’s guaranteed free from low-level data races.

What’s this got to do with parallelism?  Sometimes loose coupling between units of concurrency is an unaffordable luxury.  One such case is when implementing fine-grained parallelism.  (For the purpose of this article fine-grained parallelism is defined as parallelism where communication between units of concurrency happens several times per second.)  Unless the type system becomes so complex that computer science Ph.Ds can’t wrap their heads around it or major sacrifices are made in efficiency or expressiveness, this probably can’t be made statically checkable.  Nonetheless, there are safer ways of exploiting parallelism than using threads directly, while fitting into D’s current type system and preserving efficiency and expressiveness.

std.parallelism (source code) is a module I’ve been prototyping for several years that was recently accepted into Phobos, the D standard library.  To introduce some terminology before discussing this module in more detail, std.parallelism is based on the concept of a Task, which is the fundamental unit of work and may be executed in parallel with any other Task.  A TaskPool encapsulates a task queue and worker threads, which execute the Task at the front of the task queue.  The globally accessible instance of this class is called taskPool.  A task queue is a FIFO queue where Tasks are submitted for execution.  A Task can be used explicitly for future/promise parallelism, and is used under the hood to implement higher level primitives.
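
As a small sketch of Task-based future/promise parallelism using the task queue, here is an example that submits a Task to taskPool rather than to a dedicated new thread.  (fib is an arbitrary CPU-bound function chosen purely for illustration.)

import std.parallelism, std.stdio;

long fib(int n) {
     return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

void main() {
     // Wrap a call to fib(40) in a Task and put it on the global
     // task queue, to be picked up by a worker thread.
     auto t = task!fib(40);
     taskPool.put(t);

     // Do unrelated work on the main thread while the Task runs.
     auto other = fib(39);

     // yieldForce waits for the Task to finish (executing it in the
     // current thread if no worker has started it yet) and returns
     // its result.
     writeln(t.yieldForce + other);
}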

The fundamental compromise std.parallelism makes is that it’s up to you, the programmer, to understand how your program works and identify pieces of code that have no data dependencies and can be safely executed in parallel.  In the interest of preserving efficiency, expressiveness and flexibility, std.parallelism doesn’t try to hold your hand here.  However, once you’ve correctly identified parallelizable code, std.parallelism provides primitives that automatically handle the low-level concurrency issues related to parceling out the work to worker threads and getting the results back to the thread where they’re needed.  If you need to think about locks, race conditions, atomic operations, dining philosophers, sleeping barbers, or the general black magic of low-level concurrency, you’re probably doing it wrong.*

One way to think of std.parallelism’s model is that it uses a weaker version of the isolation principle of std.concurrency.  If used idiomatically, every piece of data is either owned by a single thread at any given instant or not mutable at that instant.  No two threads may even attempt to update the same piece of data at the same time.  If a piece of data were protected by a lock, this lock would never be contended.  The twist is that this ownership may change during the life of the program.  A memory fence is automatically inserted anytime data may change ownership.  Unfortunately, some discipline is required to write code in idiomatic std.parallelism form.  The concept of all data being owned by a single thread at any given time is a high-level invariant that is not statically checkable.  What std.parallelism provides are the mechanisms to make this model easy to implement without using locks or atomic operations, or generally thinking about concurrency or low-level memory model issues.  For example, the following code computes the logarithm of every number from 1 to 1,000,000 in parallel and stores the results in a single array:

import std.math, std.parallelism;

void main() {
     auto logs = new double[1_000_000];

     // A parallel foreach loop is just like a regular foreach loop,
     // except its body may be executed in parallel.
     foreach(i, ref num; taskPool.parallel(logs)) {
         num = log(i + 1.0);
     }
}

Let’s look at this code from the perspective of our weak isolation model.  Before the parallel foreach loop is entered, logs is owned by the program’s main thread.  The parallel foreach loop logic sends different loop iterations to different threads for execution.  The loop body is written such that all iterations are independent.  No two iterations write to the same piece of memory, and no iteration reads any piece of memory that another writes.  If worker thread A performs iteration 0, then worker thread A “owns” logs[0] while this iteration is occurring.  Adjacent elements may be owned by different threads, though in practice the design minimizes false sharing.  Once the loop is complete, ownership of all elements of logs is transferred back to the main thread.  The worker threads hold no reference to it and either terminate, sleep or execute unrelated work.

Let’s take a look at another example, this time using the parallel reduction function taskPool.reduce.  (Thanks to Russel Winder for this example.)

import std.algorithm, std.parallelism, std.range;

void main() {
     immutable n = 1_000_000_000;
     immutable delta = 1.0 / n;

     // Calculate pi by quadrature.  getTerm() gets the individual
     // terms in the summation and is evaluated in parallel by
     // taskPool.reduce, by reading from the std.algorithm.map
     // range.
     real getTerm(int i) {
          immutable x = ( i - 0.5 ) * delta;
          return delta / ( 1.0 + x * x ) ;
     }

     // The string "a + b" is a template parameter and is a shorthand
     // way of writing a simple lambda function in D.  It's equivalent
     // to a function that takes two arguments named a and b and returns
     // a + b.
     immutable pi = 4.0 * taskPool.reduce!"a + b"(
          std.algorithm.map!getTerm(iota(n))
     );
}

iota(n) returns a range (basically a pair of iterators for those from a C++ background) that holds all numbers in [0, n).  std.algorithm.map lazily computes getTerm() for each element of iota(n).  The data it contains is sliced up and sent to multiple threads.  Each temporarily owns a summation variable and computes its part of the sum.  Then, ownership of all of these summation variables is transferred to the main thread and the final reduction of each thread’s result is performed serially.  (The fact that the terms of summation are computed lazily instead of stored is an implementation detail.  Conceptually, the elements of the range are still parcelled out to multiple threads.)

In the reduce example, getTerm() could write to a global variable or do several other nasty things.  std.parallelism could require that the reduction function be pure, but I made the decision to favor flexibility over safety.  First, a function may have unimportant side effects.  It might allocate and free memory from a custom thread-safe allocator.  It might generate random numbers from a thread-local random number generator instance.  It might really be pure but not marked as such because the discipline required to recursively mark it and all functions it calls pure doesn’t scale much better than any other form of discipline in programming.

Similarly in the case of future/promise parallelism, tasks may have side effects as long as there’s no dependency.  This example writes to two different files in parallel:

import std.parallelism, std.file : write;

void main() {
     auto writeTask = task!write("foo.txt", "Writing from a task.");
     writeTask.executeInNewThread();

     write("bar.txt", "Writing from the main thread.");
     writeTask.yieldForce();  // Wait for writeTask to finish.
}

The result of these efforts is a compromise:  A library that allows you to program in parallel, fully utilizing your new-fangled multicore hardware, if you understand parallelism but not concurrency.  Concurrency only leaks out in that, if you try to parallelize something that can’t be parallelized, bugs will manifest themselves as erratic, non-deterministic behavior.  An approach that favored safety over efficiency and flexibility might try to determine what could be safely executed in parallel automatically, avoiding the requirement that the user understand parallelism.  In the extreme it might do away with mutable state entirely, making this problem trivial.  (This is what purely functional languages do.)  Such approaches would always be conservative because proving lack of data dependency in the general case is equivalent to solving the halting problem and the compiler doesn’t generally have access to the whole codebase.  Furthermore, I find the concept of writing parallel programs that look almost the same as their serial counterparts and don’t require a restrictive language design quite elegant.  In the end, D is a systems language and with great power comes great responsibility.

*  I occasionally use locks with parallel foreach loops when I want low-level control over how a reduction is performed, especially with respect to memory usage.  To me this is like using a goto:  Usually more structured, higher level primitives are better but there are occasional oddball cases.  Therefore, like goto, parallel foreach loops with locks to update shared data structures should be discouraged but not banned.
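
For concreteness, here is a deliberately simplified sketch of that pattern:  the expensive per-element work runs in parallel and a lock serializes only the cheap accumulation.  (This is an illustration of the footnote, not code from the post; in real code you would batch the locked updates or simply use taskPool.reduce.)

import std.math, std.parallelism, std.stdio;

void main() {
     auto numbers = new double[1_000_000];
     foreach (i, ref x; numbers) {
          x = i + 1.0;
     }

     double sum = 0;
     foreach (x; taskPool.parallel(numbers)) {
          immutable term = log(x);  // the expensive part, done in parallel
          synchronized {            // the cheap accumulation, serialized
               sum += term;
          }
     }
     writeln(sum);
}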

Creative Commons License
Concurrency, Parallelism and D by David Simcha is licensed under a Creative Commons Attribution 3.0 Unported License.

