Concurrency, Parallelism and D

If you’ve followed advances in computer architecture over the past half-decade, you’ve probably noticed that more and more cores are showing up on CPUs.  Transistor budgets are still increasing exponentially, as Moore’s Law predicts, but CPU designers are finding it increasingly difficult to translate those extra transistors into increased single-threaded performance.  Why the sudden change in paradigm?  Under the hood, modern CPUs already execute code in parallel.  Examine the following source code:

i++;
j++;

This probably compiles down to two assembly instructions on the x86 architecture, assuming i and j live in EAX and EDX respectively:

inc EAX
inc EDX

Neither instruction depends on the result of the other.  They can be executed in parallel, and this is exactly what the hardware does.  This is called instruction-level parallelism, or ILP.  The problem with ILP is that it doesn’t scale past a certain point; there’s only so much parallelism available at such a low level.  Even where ILP exists, the CPU needs to prove on the fly that the relevant instructions can be executed in parallel, and work around issues like branching and register sharing to make parallel execution happen.  (For an excellent article that discusses these issues in a more detailed but still accessible way, see Modern Microprocessors: A 90-Minute Guide.)  What’s a CPU designer to do when they can’t find any more ways to translate an increased transistor budget into proportionally increased performance?  Punt that task to the programmer, of course.  The responsibility for exploiting parallelism in your code has shifted from the hardware, and by extension the hardware designer, to you, the programmer.

Notice that the word “concurrency” hasn’t been mentioned once yet, except in the title.  Wikipedia defines concurrency as “…a property of systems in which several computations are executing simultaneously, and potentially interacting with each other.”  At the lowest levels we definitely have concurrency.  Two increment operations are being executed simultaneously, yet somehow even assembly language programmers don’t need to have the slightest clue about deadlocks, race conditions or any of the other standard concurrency concepts, even if they want to write highly optimized code that benefits maximally from ILP.  Concurrency has been abstracted away as an implementation detail of parallelism.  In its place is the higher level language of parallelism, which speaks in terms of dependencies.  Two pieces of code that do not depend on each other’s results, including side effects, may be executed in parallel.  This excellent article discusses in more detail why concurrency and parallelism are not the same thing.

There are other applications of concurrency where abstracting away its existence is not desirable, let alone practical.  If you’re writing a server, handling multiple requests simultaneously is a fundamental part of the problem you’re solving, not an implementation detail.  If you’re writing a game, simultaneously processing graphics and sound is not an implementation detail.  If you’re writing an operating system, multitasking is not an implementation detail.  There are two traditional ways to handle explicit concurrency:  Multiple processes and multiple threads.  Threads share an address space.  This means that all of the memory in the process is implicitly shared between threads.  Change a value from one thread and all of the other threads “know” about it.  This makes communication between concurrently executing tasks easy, but makes ensuring correctness of the code hard, since any memory address may be read or written to by any thread, in a nondeterministic order.  Processes don’t share an address space and all sharing is explicit.  The downsides are that interprocess communication has substantial overhead and passing complex object graphs (i.e. the general case of objects that have pointers to other objects) between address spaces is a non-trivial problem.

D’s approach to general-case concurrency can be summarized as isolation via the type system, plus message passing and limited, explicit memory sharing.  If you use std.concurrency for all your multithreading needs and don’t use unsafe casts to subvert the type system, there can be no implicit sharing of mutable data between threads and no low-level data races.  This is accomplished using several features of D’s type system and the design of std.concurrency.  First, std.concurrency’s spawn() function can only take a function pointer, not a delegate or a class instance.  (For those not familiar with D, a delegate is a “fat pointer” that holds a pointer to a function and a pointer to a context, such as a class or struct instance or a closure.)

import std.stdio, std.concurrency;

void fun() {
     writeln("Printing from a regular function.");
}

void main(string[] args) {
     void fun2() {
          writeln("Printing from a closure.");
     }

     auto t1 = spawn(&fun);   // Works.

     // Error.  Taking the address of a nested function produces a delegate 
     // and allows variables declared in the outer function to be accessed 
     // from the nested function's body.  Therefore, the local variables of 
     // main() would be accessible from multiple threads if this were
     // allowed.
     auto t2 = spawn(&fun2);  
}

Second, the arguments to functions started by spawn and messages passed between threads must not have unshared aliasing (using the standard library’s terminology).  This means that only data marked immutable or shared may be transitively reachable via pointers or references passed into a spawned function or passed as a message.  (The immutable and shared type constructors are transitive, meaning all data reachable via pointer/reference indirection from immutable/shared data is also immutable/shared.)

import std.stdio, std.concurrency;

void fun(string str) {
     writeln(str);
}

void fun2(char[] str) {
     writeln(str);
}

void fun3(int i) {
     writeln(i);
}

void main() {
     string str1 = "foo";  // string is an alias for immutable(char)[].
     char[] str2 = "foo".dup;
     auto t1 = spawn(&fun, str1);  // Works.  Pointers to immutable data.
     auto t2 = spawn(&fun2, str2);  // Error:  Pointers to mutable data.
     auto t3 = spawn(&fun3, 1);  // Works.  No pointer indirection at all.

     send(t1, str1);  // Pass a message.  Works.  Pointers to immutable data.
     send(t1, str2);  // Error:  Pointers to mutable data.
     send(t1, 1);     // Works.  No pointer indirection at all.
}

Third, all global or static variables not explicitly marked as shared or __gshared are implicitly thread-local.  (__gshared variables are classic C-style global variables:  they are intentionally ugly looking, are not allowed in code marked @safe, are easily greppable and are regarded as similar to using an unsafe I-know-what-I’m-doing cast.  __gshared is a storage class, not a type constructor, so __gshared variables have the same type as non-__gshared variables.)  Immutable variables are implicitly shared, since there are no concurrency issues when sharing immutable data.
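
A minimal sketch of these defaults (the variable and function names below are hypothetical, not part of any library):

import std.concurrency, std.stdio;

int perThread;               // Implicitly thread-local:  each thread gets its own copy.
shared int processGlobal;    // One instance shared by all threads; access is type-checked.
__gshared int legacyGlobal;  // One unchecked instance, like a classic C global.
immutable int limit = 100;   // Implicitly shared; safe because it can never change.

void worker() {
     perThread++;
     writeln(perThread);     // Always prints 1:  this thread's copy started at 0.
}

void main() {
     perThread++;            // Only main's copy becomes 1.
     spawn(&worker);
}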

Fourth, the compiler guarantees sequential consistency of all access to data marked shared.  Sequential consistency means that all reads and writes from a given thread happen in the order they were issued, as seen from all threads.  The details of shared for classes and structs are too complicated to discuss here; the last chapter of Andrei Alexandrescu’s book, “The D Programming Language”, discusses them in detail.  Basically, shared classes and structs provide shared-memory concurrency that’s limited and explicit (because the type must be marked shared and designed for thread safety) and safe from low-level data races, though it does not prevent concurrency-related bugs in high-level invariants.
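
As a hedged sketch of what explicit sharing looks like in practice, the example below uses core.atomic’s atomicOp for the increment, since a plain ++ on shared data is a non-atomic read-modify-write; the names are hypothetical:

import core.atomic, std.concurrency, std.stdio;

shared int hits;   // Explicitly shared:  one instance visible to every thread.

void worker(Tid owner) {
     foreach(i; 0 .. 1_000) {
          atomicOp!"+="(hits, 1);   // Atomic read-modify-write on shared data.
     }
     owner.send(true);              // Tell the spawning thread we're done.
}

void main() {
     foreach(i; 0 .. 4) {
          spawn(&worker, thisTid);
     }
     foreach(i; 0 .. 4) {
          receiveOnly!bool();       // Wait for all four workers to finish.
     }
     writeln(atomicLoad(hits));     // Always prints 4000:  no low-level data race.
}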

Such a scheme provides the best of both worlds for concurrency use cases where the units of concurrency are not tightly coupled and complex state doesn’t need to be passed frequently.  You can do practical concurrent programming without the complexities of locks, atomic operations or memory models.  You don’t need to give up mutable state, which makes sense since mutable state isn’t a problem for concurrency as long as it’s private to a thread.  Communication between threads is cheap, and even complex object graphs can be shared easily, cheaply and safely as long as they’re immutable.  Finally, if the rules need to be broken occasionally, memory can be shared in a limited way that’s guaranteed free from low-level data races.
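
To illustrate the point about immutable object graphs, here is a sketch in which an immutable linked list is handed to another thread by copying nothing but the head pointer (the Node type and the functions are hypothetical):

import std.concurrency, std.stdio;

// A hypothetical immutable singly linked list node.
struct Node {
     int value;
     immutable(Node)* next;
}

void printer() {
     // Blocks until a message of exactly this type arrives.
     auto head = receiveOnly!(immutable(Node)*)();
     for(auto n = head; n !is null; n = n.next) {
          writeln(n.value);
     }
}

void main() {
     // Build an immutable list.  Only the head pointer is copied when sending.
     immutable(Node)* list = null;
     foreach_reverse(i; 0 .. 3) {
          list = new immutable Node(i, list);
     }

     auto tid = spawn(&printer);
     tid.send(list);   // Works:  the whole graph is transitively immutable.
}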

What’s this got to do with parallelism?  Sometimes loose coupling between units of concurrency is an unaffordable luxury.  One such case is when implementing fine-grained parallelism.  (For the purpose of this article, fine-grained parallelism is defined as parallelism where communication between units of concurrency happens several times per second.)  Unless the type system becomes so complex that computer science Ph.D.s can’t wrap their heads around it, or major sacrifices are made in efficiency or expressiveness, this kind of sharing probably can’t be made statically checkable.  Nonetheless, there are safer ways of exploiting parallelism than using threads directly, ways that fit into D’s current type system and preserve efficiency and expressiveness.

std.parallelism is a module I’ve been prototyping for several years that was recently accepted into Phobos, the D standard library.  To introduce some terminology before discussing this module in more detail:  std.parallelism is based on the concept of a Task, the fundamental unit of work, which may be executed in parallel with any other Task.  A TaskPool encapsulates a task queue (a FIFO queue to which Tasks are submitted for execution) and the worker threads that execute the Task at the front of the queue.  The globally accessible default instance of this class is called taskPool.  A Task can be used explicitly for future/promise parallelism, and is used under the hood to implement higher-level primitives.
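
To make this terminology concrete, here is a hedged sketch of explicit future/promise use; slowSquare is a hypothetical stand-in for an expensive computation:

import std.parallelism, std.stdio;

// A hypothetical stand-in for an expensive computation.
double slowSquare(double x) {
     return x * x;
}

void main() {
     // Create a Task and submit it to the global taskPool's task queue.
     auto t = task!slowSquare(3.0);
     taskPool.put(t);

     // ... do other, unrelated work here ...

     // yieldForce() waits for the Task if it hasn't finished and returns its result.
     writeln(t.yieldForce());   // Prints 9.
}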

The fundamental compromise std.parallelism makes is that it’s up to you, the programmer, to understand how your program works and identify pieces of code that have no data dependencies and can safely be executed in parallel.  In the interest of preserving efficiency, expressiveness and flexibility, std.parallelism doesn’t try to hold your hand here.  However, once you’ve correctly identified parallelizable code, std.parallelism provides primitives that automatically handle the low-level concurrency issues involved in parceling out the work to worker threads and getting the results back to the thread where they’re needed.  If you need to think about locks, race conditions, atomic operations, dining philosophers, sleeping barbers, or the general black magic of low-level concurrency, you’re probably doing it wrong.*

One way to think of std.parallelism’s model is that it uses a weaker version of std.concurrency’s isolation principle.  If the library is used idiomatically, every piece of data is, at any given instant, either owned by a single thread or not being mutated.  No two threads may even attempt to update the same piece of data at the same time; if a piece of data were protected by a lock, that lock would never be contested.  The twist is that ownership may change over the life of the program, and a memory fence is automatically inserted any time data may change ownership.  Unfortunately, some discipline is required to write code in idiomatic std.parallelism form:  the concept of all data being owned by a single thread at any given time is a high-level invariant that is not statically checkable.  What std.parallelism provides are the mechanisms to make this model easy to implement without using locks or atomic operations, or generally thinking about concurrency or low-level memory model issues.  For example, the following code computes the logarithm of every number from 1 to 1,000,000 in parallel and stores the results in a single array:

import std.math, std.parallelism;

void main() {
     auto logs = new double[1_000_000];

     // A parallel foreach loop is just like a regular foreach loop,
     // except its body may be executed in parallel.
     foreach(i, ref num; taskPool.parallel(logs)) {
         num = log(i + 1.0);
     }
}

Let’s look at this code from the perspective of our weak isolation model.  Before the parallel foreach loop is entered, logs is owned by the program’s main thread.  The parallel foreach loop logic sends different loop iterations to different threads for execution.  The loop body is written such that all iterations are independent.  No two iterations write to the same piece of memory, and no iteration reads any piece of memory that another writes.  If worker thread A performs iteration 0, then worker thread A “owns” logs[0] while this iteration is occurring.  Adjacent elements may be owned by different threads, though in practice the design minimizes false sharing.  Once the loop is complete, ownership of all elements of logs is transferred back to the main thread.  The worker threads hold no reference to it and either terminate, sleep or execute unrelated work.
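
For contrast, here is a hypothetical sketch of a loop that must not be parallelized this way:  every iteration updates the same accumulator, so iterations are not independent and worker threads can race on the read-modify-write.  (A reduction like this is exactly what taskPool.reduce, shown next, is for.)

import std.parallelism;

void main() {
     auto data = new double[1_000_000];
     // ... fill data somehow ...

     double sum = 0;

     // WRONG:  every iteration updates the same variable, so two worker
     // threads can race on the read-modify-write and the result is
     // nondeterministic.  Iterations of a parallel foreach loop must be
     // independent; use taskPool.reduce for reductions like this instead.
     foreach(x; taskPool.parallel(data)) {
          sum += x;
     }
}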

Let’s take a look at another example, this time using the parallel reduction function taskPool.reduce.  (Thanks to Russel Winder for this example.)

import std.algorithm, std.parallelism, std.range;

void main() {
     immutable n = 1_000_000_000;
     immutable delta = 1.0 / n;

     // Calculate pi by quadrature.  getTerm() gets the individual
     // terms in the summation and is evaluated in parallel by
     // taskPool.reduce, by reading from the std.algorithm.map
     // range.
     real getTerm(int i) {
          immutable x = ( i - 0.5 ) * delta;
          return delta / ( 1.0 + x * x ) ;
     }

     // The string "a + b" is a template parameter and is a shorthand
     // way of writing a simple lambda function in D.  It's equivalent
     // to a function that takes two arguments named a and b and returns
     // a + b.
     immutable pi = 4.0 * taskPool.reduce!"a + b"(
          std.algorithm.map!getTerm(iota(n))
     );
}

iota(n) returns a range (basically a pair of iterators, for those from a C++ background) that holds all numbers in [0, n).  std.algorithm.map lazily computes getTerm() for each element of iota(n).  The elements of this range are sliced up and sent to multiple threads.  Each thread temporarily owns a summation variable and computes its part of the sum.  Then, ownership of all of these summation variables is transferred to the main thread and the final reduction of each thread’s result is performed serially.  (The fact that the terms of the summation are computed lazily instead of stored is an implementation detail.  Conceptually, the elements of the range are still parcelled out to multiple threads.)

In the reduce example, getTerm() could write to a global variable or do several other nasty things.  std.parallelism could require that the mapping and reduction functions be pure, but I made the decision to favor flexibility over safety.  A function may have side effects that are harmless here:  it might allocate and free memory from a custom thread-safe allocator, or generate random numbers from a thread-local random number generator instance.  It might also really be pure but not marked as such, because the discipline required to recursively mark it and every function it calls pure doesn’t scale much better than any other form of discipline in programming.
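
For instance, the following sketch fills an array with random samples from inside a parallel foreach loop.  uniform() is impure, but std.random’s default generator (rndGen) is a thread-local instance, so each worker thread draws from its own generator and no mutable state is shared:

import std.parallelism, std.random;

void main() {
     auto samples = new double[1_000_000];

     foreach(ref x; taskPool.parallel(samples)) {
          // uniform() is impure, but it draws from rndGen, a thread-local
          // generator, so no mutable state is shared between worker threads.
          x = uniform(0.0, 1.0);
     }
}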

Similarly in the case of future/promise parallelism, tasks may have side effects as long as there’s no dependency.  This example writes to two different files in parallel:

import std.parallelism, std.file : write;

void main() {
     auto writeTask = task!write("foo.txt", "Writing from a task.");
     writeTask.executeInNewThread();

     write("bar.txt", "Writing from the main thread.");
     writeTask.yieldForce();  // Wait for writeTask to finish.
}

The result of these efforts is a compromise:  A library that allows you to program in parallel, fully utilizing your new-fangled multicore hardware, if you understand parallelism but not concurrency.  Concurrency only leaks out in that, if you try to parallelize something that can’t be parallelized, bugs will manifest themselves as erratic, non-deterministic behavior.  An approach that favored safety over efficiency and flexibility might try to determine what could be safely executed in parallel automatically, avoiding the requirement that the user understand parallelism.  In the extreme it might do away with mutable state entirely, making this problem trivial.  (This is what purely functional languages do.)  Such approaches would always be conservative because proving lack of data dependency in the general case is equivalent to solving the halting problem and the compiler doesn’t generally have access to the whole codebase.  Furthermore, I find the concept of writing parallel programs that look almost the same as their serial counterparts and don’t require a restrictive language design quite elegant.  In the end, D is a systems language and with great power comes great responsibility.

*  I occasionally use locks with parallel foreach loops when I want low-level control over how a reduction is performed, especially with respect to memory usage.  To me this is like using a goto:  Usually more structured, higher level primitives are better but there are occasional oddball cases.  Therefore, like goto, parallel foreach loops with locks to update shared data structures should be discouraged but not banned.
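
A hypothetical sketch of one such oddball case:  the expensive work runs in parallel, while a bare synchronized block serializes only the cheap update of a result array that is shared across iterations:

import std.math, std.parallelism;

void main() {
     auto data = new double[1_000_000];
     // ... fill data somehow ...

     double[] bigOnes;   // Updated from multiple threads, so updates are locked.

     foreach(x; taskPool.parallel(data)) {
          immutable y = exp(x);    // The expensive part runs in parallel.
          if(y > 2.0) {
               synchronized {      // Serialize only the cheap append.
                    bigOnes ~= y;
               }
          }
     }
}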

Concurrency, Parallelism and D by David Simcha is licensed under a Creative Commons Attribution 3.0 Unported License.


Responses to Concurrency, Parallelism and D

  1. David Gileadi says:

    Great article, from the perspective of a novice in the world of concurrency and parallelism! It’s easy to understand and does a great job of introducing both your library and the ideas behind it.

    One place an edit may be needed: in the sentence “Let’s take a look at another example, this time using the parallel reduction function taskPool.reduce. (Thanks to Russel Winder for this idea.)” you are introducing both the example and the function taskPool.reduce. Thanking Russel Winder is a little ambiguous, since it’s unclear what he had the idea for: the example or introducing a reduce function into your parallelism library. You might want to clarify that the example was his idea.

  2. “The concept of all data being owned by a single thread at any given time is a high level invariant that is not statically checkable.” Not in the current version of D. But at some point I proposed adding unique types to D, a la Scala, that are statically checkable (think unique_ptr’s in C++). With unique types, you would be able to pass ownership securely between threads.

    • dsimcha says:

      Right. I understand that this can be statically checked to some degree, but due to the Halting Problem and the difficulties of interprocedural analysis, such checking will always be conservative, especially when dealing with different parts of the same data structure being owned by different threads (which is frequently the case in data parallelism). Therefore, I question whether such a system would be useful for data parallelism, though it would definitely be useful to add flexibility to message passing.


  3. anonymous says:

    Only two articles… what a loss :( Please come back to D and your blog!

