A Parallel Rigid Body Dynamics Simulator in the C-UP Language


C-UP is designed to allow programmers to utilise the multiple cores of a modern processor quickly, safely and in a platform independent way – that is to say without having to use threads, mutexes, atomics, inline assembly, pragmas, DMA, etc.

This case study shows how the rigid body dynamics library provided with C-UP was implemented with a specific focus on how the features of the language make this implementation much more straightforward than you might expect. Familiarity with the concepts and terminology of rigid body simulation is recommended although cursory explanations are given.

The first part describes the design of the rigid body simulator. The second and third parts focus on how the features of C-UP are used to make a parallel collision detection system and a parallel constraint solver. Results are presented in the fourth part and finally some underlying implementation details are provided in part five.


Part 1: Design



A simulation is comprised of a number of a number of physical entities known as bodies. Some bodies are fixed in position and are used to represent the immovable structure of the world whereas others can move around freely. The latter have physical properties like mass and momentum and are affected by forces such as gravity. Furthermore, their movement can be constrained by collisions with other bodies or by joints that connect pairs of bodies.

A standard rigid body has six degrees of freedom (3 axes of linear movement and 3 of angular movement.) In this system there is also support for a linear body type which only supports linear movement and is useful for simulating human characters which usually just slide around a level in games with rotation directly controlled by the player or AI. Any number of other body types can be supported (e.g. cloth body, soft body, water body) but the important point is that the concept of having different body types leads us to design a simple class hierarchy for bodies:

Body          // abstract





The idea behind sleeping is that many of the movable bodies in a scene will not currently be moving at all because they have come to rest. It’s expensive to simulate rigid bodies so it makes sense for bodies that are at a dead stop to go into a sleeping state where they aren’t consuming any processor cycles. A body that is awake is considered ‘active’ in this dynamics library. Note that all the immovable bodies that comprise the static scenery of your game world are always inactive.

It’s also the case that an inactive body doesn’t need to store much of the information that an active body does. Specifically, it’s known to have no linear or angular velocity and momentum. The library uses this fact to halve the memory used by an inactive body which is highly beneficial because as was there tend to be many more inactive bodies than active ones in a scene.

The management of active data in response to inactivity is automatically handled and the user of the dynamics library needn’t be aware of it, but it does lead to us having a hierarchy for active data that corresponds to the above body hierarchy:

BodyActiveData             // abstract





Each body can have one or more shapes attached to it which are used to determine if that body has collided with any other bodies in the scene.

The supported shapes are those in the geometry package (System.Geom) and are represented by this class hierarchy:

Shape         // abstract


    ConvexShape      // abstract







Convex shapes have their own abstract base class because certain operations only make sense on convex shapes or are much more optimal on convex shapes. You can create your own shapes by derivation and more will undoubtedly be added by me in the future.

To attach a shape to a rigid body it must be encapsulated in a ShapeInstance, which contains a pointer to the shape itself, material information (friction, restitution), a user data pointer and an optional transform relative to the owner body. Because more complex transforms require more memory to store and have worse performance a hierarchy of shape instances is defined:






A scene represents an entire simulated environment. A body must be added to a scene to be involved in that simulation. The scene is responsible for advancing time and orchestrating all collision detection and constraint solving. Constraint solving means applying forces to the various bodies so they:

a)      don’t move through one another when colliding

b)      stay connected when held together by a joint.

The basic steps involved in advancing the simulation are:

1.       Update velocities of all active rigid bodies in response to external forces.

2.       Detect collisions for active bodies.

3.       Solve constraints – i.e. modify the velocities of all bodies to enforce collisions and joints.

4.       Update positions of all active rigid bodies using the constrained velocities.

Steps 2 and 3 are by far the most processor intensive and the next two parts describe how each of them is implemented and optimised.


Part 2: Collision detection


Shape Collider

It doesn’t really make sense for collision detection code to reside in the shape classes themselves because a collision always involves 2 shapes, so where would the code for, say, box v sphere collision live – in box or in sphere? Neither makes sense so we introduce the concept of a separate shape collider which in this instance is just a single function, albeit an interesting one that introduces a bunch of new concepts:

public bool CollideShapes(CollideShapesResult& result, virtual const(Shape)& shapeA, float4x4 transformA, virtual const(Shape)& shapeB, float4x4 transformB) : PCollideShapes;


This function is declared at module scope which means it is static (i.e. it doesn’t have an implicit ‘this’ pointer.)

It’s public so it’s freely accessible by any other code.

It returns a bool indicating if a collision occurred or not and the details of that collision are returned in the result structure passed by local reference as the first parameter. This is not a c++ reference and null can be passed if the details aren’t needed. Because it’s a local reference it can (but doesn’t have to) point to stack data and uses the & symbol, whereas a normal pointer cannot refer to stack data and uses the * symbol.

The other parameters are the two shapes to collide and the local to world space transform of each shape. Notice that the individual shape parameters are marked virtual which is possible because C-UP allows virtual dispatch on the type of any number parameters, not just ‘this’. This is immediately more convenient for collision between two shapes than what you’d end up doing in c++ which is probably a nested switch statement or a matrix of function pointers both of which are tedious to maintain and can’t be expanded by the user adding new shapes.

The “: PCollideShapes” after the function declaration is a ‘parallel set’ and is explained in a minute.


Detecting Collisions

The simulator maintains an array of all bodies whose axis aligned bounding boxes overlap. The details of how this works aren’t relevant to this discussion but suffice it to say that there exists such an array. Each overlap structure is just a pair of body pointers.

This array is the input to the narrow phase collision detection (i.e. inter-shape collision detection.) The output being an array of contact points which describe the intersections between shapes used by the constraint solver in the next phase.

The serial code for performing this detection might look something like this:

Overlap[] overlaps;

List<Contact> contacts = new List<Contact>();


foreach (Overlap& overlap; overlaps)


    foreach (ShapeInstance& siA; overlap.BodyA.ShapeInstances)


        foreach (ShapeInstance& siB; overlap.BodyB.ShapeInstances)


            CollideShapesResult result;

            If (CollideShapes(&result, siA.Shape, siA.Xform, siB.Shape, siB.Xform))


                contacts.Add( new Contact(&result) );






It loops through all of the overlapping bodies, then over all the shapes in each body. Each pair of shapes is tested for intersection and if one occurs a new contact is allocated and added to the array list. Arrays in C-UP store their length alongside a pointer to the first element which is how the foreach statement can return each element in turn. Using foreach also eliminates the need for bounds checking as it’s implicit that the bounds won’t be exceeded.



Our next objective is to make the above code run in parallel. The full rules about what you can and can’t do in parallel code are beyond the scope of this tutorial so please refer to the Parallelism section in the C-UP language guide for more info.

The first observation is that we can’t call the general new operator in parallel code to allocate new contact structures. However, we can declare our own new operator and make that allocate from a parallel linear heap. We’ll declare an alias for the heap type so we don’t have to use the long winded name everywhere:

public alias ParallelLinearHeap<16> ConstraintHeap;   // 16 byte aligned allocs

public ConstraintHeap Heap = ConstraintHeap(1024 * 1024);     // 1Mb buffer


The ‘new’ operator is shown below – “: parallel” indicates that it can be called from parallel code and obeys the data access rules described in the language guide. This is different to using parallel before a function declaration which means that function can be queued for parallel execution as you’ll see in a minute. New is a special function which has the name of a keyword and must always return a particular type (array of bytes) which cannot be explicitly specified.

       new(long size, int align, ConstraintHeap& heap) : parallel


           return heap.Allocate(size, align);



This function is declared inside the Contact type so it’s only used for allocations of that type. The default new implementation uses garbage collecting memory manager, but this can be overridden at type, module or global scope. C-UP also supports multiple memory heaps in the language and each heap can use a separate allocation/freeing scheme.


Anyway, the next issue storing contact pointers in the contacts list. A list is just a wrapper around an array and you can write to an array in parallel code but you can’t access the same elements of the same array in parallel. We could solve this issue by creating as many contact lists as we have processor cores and pass a different one to each job, but that’s not very palatable as we’ll probably get a lot of wasted space. A cleaner way is to make a parallel array list which, as the name suggests, can have items added to it in parallel. Again, we use an alias to keep things tidy:

public alias ParallelList<Contact*> ContactList;

       ContactList contacts = ContactList();



Finally we observe that what is true of the contacts array above is also true of the input ‘overlaps’ array – that is, we’d like to be able to iterate it in parallel but we can’t do that with an array. Once again, we could do this by passing different parts of the array to individual calls to the collision function but this won’t lead to good load balancing. Instead we’re going to use a ParallelReadStream, which is just a parallel wrapper around an existing array (var is a type inferred from an assignment – useful to save our fingers from typing and crucial for implementing generics):

var overlapStream = ParallelReadStream<Overlap>(overlaps);



Now we’re getting somewhere and all that remains is to make a function that iterates the overlaps and if it detects a collision allocates a contact and adds it to the contacts list. As data in a parallel function can only be accessed via the parameters (and only through references if they are local – that’s &, not *) we must pass in all of the above collections as arguments:

parallel void CollideAllPairs(ParallelReadStream<Overlap>& overlaps,

     ConstraintList& list, ConstraintHeap& heap) : PCollideShapes


    Overlap pair;

    while (overaps.Read(&pair))


        CollideBodies(pair.BodyA, pair.BodyB, list, heap);




This calls the following function which collides all of the shape instance pairs in the given pair of bodies and actually creates and adds the new contacts:

void CollideBodies(Body& bodyA, Body& BodyB, ConstraintList& list,

ConstraintHeap& heap) : PCollideShapes


    foreach (ShapeInstance& siA; bodyA.ShapeInstances)


        foreach (ShapeInstance& siB; bodyB.ShapeInstances)


            CollideShapesResult result;

            if (CollideShapes(&result, siA.Shape, siA.Xform, siB.Shape, siB.Xform))


                list.Add( new(heap) Contact(&result) );







It’s worth noting that the ParallelReadStream, ParallelList and ParallelLinearHeap types aren’t some built in C++ magic that you can’t extend, they are entirely implemented in C-UP itself using the shared memory facilities and atomic operations. You can write your own parallel structures as you see fit but because shared memory is inherently unsafe a compile option must be passed when building any module that uses shared memory directly. Indirect uses as above don’t require the option.

The final piece of the puzzle is to call the above function once for each CPU core we want to utilise. Parallel functions only execute in parallel when called from within a parallel statement, in which case these calls are queued up and execute in parallel when the parallel statement block exits. The parallel execution is entirely safe (meaning correct results as per sequential processing are guaranteed) because the runtime system performs automatic dependency checking between the functions. It’s up to you to arrange things so there are as few dependencies as possible in order to get a speed up but no matter what you do you’ll always get the correct results out of your program.

parallel void Collide()


    parallel         // parallel statement


        for (int i = 0; i < JobQueue.GetNumThreads(); i++)

            CollideAllPairs(&pairStream, &Constraint.Constraints, &Constraint.Heap);




As an aside, it wouldn’t really matter if we just called this function an arbitrary number of times (say 16) as there are only as many worker threads under the hood as there are cores so on any processor with <= 16 cores you’d get exactly the same parallelism as above and the extra calls would just immediately exit as there are no more overlaps to consume. Of course making exactly the right number of jobs is optimal so that’s what we do.


Now, about that “: PCollideShapes” after the above function headers. That’s a parallel set and is an extension of “: parallel” used on our “new” function. The basic “: parallel” allows a function to be called from parallel code and states that it will only access data through its parameters (it’s as near to a pure function as you get in C-UP). Often this is enough but C-UP also allows you to declare that certain types or memory heaps are to be treated as constant for the duration of this parallel function and such types can be freely dereferenced. Here’s the declaration of PCollideShapes:

parallel PCollideShapes : const type(Shape, Float3, Short, Plane, ConvexHull.Face,

ShapeInstance, const(ShapeInstance*), Body);


This says that the given types are immutable in this scope and so references to them can be freely dereferenced in parallel code. Most of these types should look familiar from Part 1 – here’s why they’re needed:

Shape – the shape classes contain the the actual geometric shape data

Float3, Short, Plane, ConvexHull.Face – the ConvexHull shape contains arrays of vertices (Float3), vertex indices (Short), plane equations (Plane) and faces (ConvexHull.Face) all of which need to be accessed during intersection testing.

ShapeInstance, const(ShapeInstance*) – the shape instances in a body are stored as an array of const ShapeInstance pointers.

Body – the Body contains the main body transform needed for intersection testing


Part 3: Constraint Solving


Collision detection is in many ways a poster child for parallelism because testing for intersection between one pair of bodies is inherently independent of testing any other pair.

Constraint solving is quite the opposite because given a big stack of rigid bodies you get an interconnected system of constraints which all depend on one another either directly or indirectly - words to strike fear into the heart of anyone looking for parallelism.

But before we worry about that a quick overview of how constraint solving is implemented in this library. Note that a contact is a particular type of constraint and as contacts are the only type of constraint we’re discussing you can think of those terms as interchangeable.



There are numerous approaches to solving a system of constraints and many papers have been written on just this subject. For our purposes we’ll just say that this library uses the simple approach of iteratively applying impulses to solve constraints. An impulse is a force applied over an infinitely short period of time to effect an immediate change to the velocity of a body. By iterating over all of the constraints in the system (as generated by the collision detection phase) and applying them one at a time and then repeating this process for a number of iterations we gradually adjust the velocities of all bodies and ‘relax’ towards a perfect solution to the constraints. The number of iterations over the entire system we need can be tuned to trade off correctness for speed which is a very desirable property for games – you can even dynamically adjust the iterations as more constraints are activated to keep performance level.

To see how every constraint can influence every other constraint consider how forces pass through the system between iterations. If three bodies (A, B and C) are stacked on top of each other in that order, then A and C are not in direct contact, but an impulse applied between A and B will affect the velocity of B which will then affect the impulses applied between B and C. This simple illustration should convince you that we can’t just solve the constraints between (A, B) and (B, C) in parallel as they do indeed depend on one another because they both affect the velocity of B.



Although a single interconnected group of bodies will all depend on one another we can see that if the system is large enough it’s still possible to make groups of constraints that are independent of other groups of constraints and that’s exactly the approach taken here. Consider the stack of cubes shown below.

In the first image each red line represents a collision constraint between two bodies and it’s clear to see that they’re all interconnected.

In the second image we temporarily ignore certain constraints (the ones in grey) and we can see that it’s possible to form several independent systems of constraints which can be solved in parallel. Applying a constraint affects the velocities of exactly 2 bodies so as long as no red constraint touches a body in two different groups (A, B, or C) then we’re fine and we can apply the constraints affecting the different groups in parallel.

Of course we must still apply those greyed out constraints and the third image shows us doing just that. We don’t have any work to in parallel with this but in a larger more complex system than that shown above we might well do. This does however indicate that we can’t really hope to achieve a perfect speed up for constraint solving (at least not using this method, and this is the best I could come up with.)

We need our groups to contain a reasonable number of constraints to amortize the overhead of managing parallel jobs so we don’t want to make groups containing only a single constraint. How these groups of constraints are determined isn’t really important so we’ll just say that it can be done; and quickly, which is a crucial point because any work we do to make these groups is work that doesn’t need to be done for the serial version of the algorithm and so immediately puts us even further back in terms of saving processor cycles.



So given that we have made the groups of constraints A, B, C and D, what does the C-UP code look like to solve them in parallel?

The groups of constraints described above are comprised of arrays of ConstraintJob structures. The ConstraintJob structure can only be allocated on the stack, which means it can contain local references to the constraint and two bodies. This is important because the only non-const references that can be dereferenced in parallel code are local ones which is how we limit the scope of data that can be written in a parallel function.

       private struct ConstraintJob local


           public Constraint& c;

           public BodyActiveData& a, b;



Given an array of these we can apply them all using this function, which can be called in parallel.

       private static parallel void ApplyBatch(ConstraintJob[] local jobs) : parallel


           foreach (ConstraintJob& job; jobs)

               job.c.Apply(job.a, job.b);



The code that then calls this function looks like this. The ConstraintJobs are stored in one big array and the size of each batch is stored in a separate batchSizes array. The total number of batches is in numBatches.

parallel (numIterations)          // parallel statement with iteration count


           for (int j = 0, s = 0; j < numBatches; j++)


               ApplyBatch(jobs[s .. s + batchSizes[j]]);

               s += batchSizes[j];




This code will enqueue all of the calls to ApplyBatch and at the end of the parallel statement block will invoke them all in parallel with automatic dependency checking. By specifying a number of iterations in the parallel statement we tell the runtime that each parallel function must be called that number of times in the same order as if we placed an outer loop around the loop above. I.e. batch 0, 1, 2, …, N, 0, 1, 2, …, N, etc. Passing an iteration count to the parallel statement is much more efficient than using an extra for loop inside the parallel statement because much of the work of dependency checking can be shared between the iterations.



Part 4: Results


Here’s a screenshot of the dynamics sample program that comes with C-UP. In this scene there are 465 active bodies stacked in a pyramid on a single immovable box for the floor and about 1000 overlapping rigid body pairs.

At the top you can see some timing bars. The green part is the collision detection and the blue is the constraint solver. The topmost bar is the overall timing and the 4 bars underneath are the work being done on the 4 cores of my test machine, with white gaps where no useful work is being done due to dependency checking cost or actual dependencies.

The speedup achieved is as follows

1 core                    2 cores                  3 cores                  4 cores

Collide                    1x                           1.9x                        2.8x                        3.8x

Solve                        1x                           1.6x                        2.1x                        2.5x


As you can see collision detection lives up to its billing as being inherently parallelisable with nearly perfect scaling. Constraint solving is a much harder problem with inherent dependencies limiting the available speed up but it still achieves respectable results. In a real game a better speed up on constraint solving might be expected because all the rigid bodies don’t tend to be in a single big pile (referred to as an island in rigid body circles) and so more opportunities for parallelism are present because separate islands are not inter-dependent.


Part 5: Implementation


If you’ve got this far then you might be interested in how some of this stuff works under the hood and if so you’re in the right place.


Virtual Dispatch

Every pointer in C-UP is comprised of a 47 bit pointer and 16 bits of runtime type information (the remaining bit is used by the garbage collector.) Storing the type info in the pointer means that a pointer doesn’t have to be dereferenced in order to perform virtual dispatch.

Virtual dispatch tables are stored per-function (as opposed to per-type in c++) as a tree representing a sparse N-dimensional matrix, where N is the number of virtual parameters – 2 in this case. In this example where there are 7 shape classes and as we allow more or less any combination of shapes to collide let’s say we have a full matrix of 49 entries. Each entry is a function pointer so the entire table is < 200 bytes in 32 bit code will can quite happily reside in the L1 cache, which matters in this case because we’re calling CollideShapes potentially hundreds of times in quick succession.

This is what the call to CollideShapes looks like on x86 (the arguments have already been pushed onto the stack):

movzx  esi, word ptr [esp+14]     ; load runtime type of 1st Shape*

movzx  edi, word ptr [esp+86]     ; load runtime type of 2nd Shape*

lea    esi, [esi*4+CollideShapes] ; base address of vtable + offset for 1st shape

add    esi, dword ptr [esi]       ; apply offset for 1st shape type

call   dword ptr [esi+edi*4]      ; call address stored in row + 2nd shape column


The first two loads are of data that has just been stored to the stack which will certainly be in the L1 cache even if it’s not still available in a register.

The reason for the add instruction where you might expect a multiply is that we don’t store a full N dimensional matrix as that could require a huge amount of memory although doing so might make sense in this case where we have a densely packed reasonably small matrix. The beauty of doing virtual dispatch per function is that the code can be different at each call site allowing us to optimise for things like that, and even to generate different jump tables for different heterogeneous cores (virtual functions on SPUs, anyone?)




Dependency Checking

It’s already been alluded to that dependencies between queued parallel functions are automatically detected in the C-UP runtime. There are a few rules in language that make this tenable, which are fully detailed in the language document but I’ll summarise here.

The most important rule is that data can only be accessed via function parameters and only via reference parameters if they are local references. Local references (pointers or arrays) can only be used on parameters or local variables, or as members of local-only structs (like the ConstraintJob struct earlier in this document.) Because of these limitations it’s fundamentally impossible to create a large hierarchy of local references so the job runtime is able to follow all local references starting at the function parameters and work out what areas of memory you plan to access.

Given this information it’s very quick and easy to determine if two parallel functions can execute safely in parallel by sorting the accessible memory spans into ascending order and doing a single walk of them checking for overlaps. The act of sorting these spans is actually the slowest part and is the main reason why passing an iteration count to the parallel statement is faster than just looping as it’s only done once per function.

Here’s a screenshot from the profiling tool provided with C-UP showing the above pyramid constraint solver in action. The blue parts are memory spans being found and sorted, the magenta is actual dependency checking happening and the green is real work being done. Any white space is just wasted time where nothing useful could happen. The red lines indicate that a dependency was found between two functions.

You can’t really tell from this screenshot but dependencies are only checked between a job that wants to start running and any that are currently running. That is, no exhaustive check of all possible dependencies occurs as that would be wasteful. Having said that it could turn out to be faster to do an exhaustive check once and then no checks on subsequent iterations if the iteration count is high, but that’s not what currently happens.