Parallel BDDs

Evan Bergeron, Kevin Zheng

Summary

We implemented a parallel binary decision diagram library, focusing on spatial locality and cache coherence. Our two key data structures are a lock-free unique table and a per-variable request table, both described below.

We obtain a 3x speedup on 4 cores on the C3540 circuit, scaling toward a 4x speedup on 8-12 threads. This speedup is competitive with published results.

Background

Binary decision diagrams (BDDs) are directed acyclic graphs that represent boolean functions. Once constructed, these graphs provide constant time equivalence checking. Unfortunately, constructing these graphs can be costly.

The graphs are built incrementally, usually by taking a few existing BDDs at a time and combining them. We focused on parallelizing this combination step.

This combination step is tough to parallelize for a number of reasons: memory accesses are irregular with poor spatial locality, there are data dependencies between a node and its children, and the low arithmetic intensity is somewhat inherent to the problem.

Approach

In all implementations, we maintain a unique table: a key-value store that ensures no duplicate nodes are created.

First approach - DFS

Our original serial implementation DFS’d on the graph using the if-then-else (ite) normal form operation as described in [2]. We parallelized this using Cilk+ in a fashion similar to [9, 10]. Pseudocode is presented below:

// DFS approach
bdd combine(bdd f, bdd g, bdd h) {
  (fv, gv, hv)    = left children of (f, g, h);
  (fvn, gvn, hvn) = right children of (f, g, h);
  result->left  = spawn_task combine(fv, gv, hv);
  result->right = combine(fvn, gvn, hvn);
  sync;
  return result;
}

A constant-size worker pool is maintained (in our case, by Cilk+). There is data parallelism between the left and right children of each node, so we compute each subgraph in parallel.

In this implementation, a lossy memoization cache is shared between workers to avoid duplicate work. Additionally, the DAG and unique table are merged as described in [2] to reduce memory footprint.
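To make the lossy cache concrete, here is a minimal sketch of what such a cache can look like: a fixed-size, direct-mapped table where colliding entries simply overwrite one another. The names, sizes, and hash function are illustrative rather than taken from our code, and a version shared between workers would additionally need to tolerate or guard against torn reads.

#include <stddef.h>

struct bdd_node;                 /* node type; see definition below */

#define CACHE_SIZE (1 << 20)     /* illustrative size, power of two */

struct cache_entry {
  struct bdd_node *f, *g, *h;    /* key: the three operands   */
  struct bdd_node *result;       /* value: the combined BDD   */
};

static struct cache_entry cache[CACHE_SIZE];

static size_t cache_hash(struct bdd_node *f, struct bdd_node *g,
                         struct bdd_node *h) {
  size_t x = (size_t)f * 12582917u + (size_t)g * 4256249u + (size_t)h;
  return x & (CACHE_SIZE - 1);
}

/* Returns the cached result, or NULL on a miss. */
struct bdd_node *cache_lookup(struct bdd_node *f, struct bdd_node *g,
                              struct bdd_node *h) {
  struct cache_entry *e = &cache[cache_hash(f, g, h)];
  if (e->f == f && e->g == g && e->h == h)
    return e->result;
  return NULL;
}

/* Lossy insert: unconditionally overwrite whatever is in the slot. */
void cache_insert(struct bdd_node *f, struct bdd_node *g,
                  struct bdd_node *h, struct bdd_node *r) {
  struct cache_entry *e = &cache[cache_hash(f, g, h)];
  e->f = f; e->g = g; e->h = h; e->result = r;
}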

This implementation provides a nice solution to the data dependency issue: simply compute from the leaves up. This makes certain canonicity checks very straightforward and yields elegant, readable code. Unfortunately, it does very little to address the spatial locality issue.

Improving Spatial Locality

And so begins the long journey to spatial locality.

Open-Addressing

Our initial unique table used separate chaining with linked lists in each bucket. After deciding that we wanted to focus on memory locality, we switched to a linear-probing, open-addressing scheme.

In a parallel context, this hash table must support concurrent reads and writes. Our final implementation is lock-free, heavily based on [11].

Node Managers

Our original graph implementation used a simple recursive formulation:

struct bdd_node {
  int varid;
  bdd_node *left;
  bdd_node *right;
};

with each of these nodes being malloc’d individually. To improve spatial locality, we implemented node managers, as described in [3, 4].

At init time, the node managers pre-allocate large blocks of memory. When the user requests a new node allocation, they must provide a variable id. Nodes with similar variable ids are grouped together.
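A minimal sketch of a node manager along these lines is below, using the original node layout. It hands out nodes bump-allocator style from one pre-allocated pool per variable id; the names are ours, not our actual API, and real code would also need to grow or chain pools when one fills up.

#include <stdint.h>
#include <stdlib.h>

struct bdd_node {            /* original node layout, repeated for completeness */
  int varid;
  struct bdd_node *left;
  struct bdd_node *right;
};

struct node_pool {
  struct bdd_node *nodes;    /* contiguous block for one variable id */
  uint32_t next_free;        /* bump-allocation index                */
  uint32_t capacity;
};

struct node_manager {
  struct node_pool *pools;   /* one pool per variable id */
  uint16_t num_vars;
};

void node_manager_init(struct node_manager *m, uint16_t num_vars,
                       uint32_t nodes_per_var) {
  m->num_vars = num_vars;
  m->pools = malloc(num_vars * sizeof *m->pools);
  for (uint16_t v = 0; v < num_vars; v++) {
    m->pools[v].nodes = calloc(nodes_per_var, sizeof(struct bdd_node));
    m->pools[v].next_free = 0;
    m->pools[v].capacity = nodes_per_var;
  }
}

/* The caller supplies the variable id, so nodes at the same level
 * land next to each other in memory. */
struct bdd_node *node_alloc(struct node_manager *m, uint16_t varid) {
  struct node_pool *p = &m->pools[varid];
  if (p->next_free == p->capacity)
    return NULL;             /* pool exhausted; real code would grow it */
  return &p->nodes[p->next_free++];
}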

A new struct definition

At each step of either a BFS or DFS style traversal, the expansion algorithm must request the variable ids of each child node. [3] presents a nontraditional struct definition that inlines the children variable ids to preserve locality of access. We did something similar, inlining packed pointers to the left and right children.

struct bdd_ptr_packed {
  uint16_t varid;
  uint32_t idx;
} __attribute__((packed));

struct bdd {
  bdd_ptr_packed lo;    // 6 bytes
  bdd_ptr_packed hi;    // 6 bytes
  uint16_t varid;       // 2 bytes
  uint16_t refcount;    // 2 bytes
};

Only 16 bytes in total, matching [3]’s size. Conveniently, this lets us use a 128-bit compare_and_swap.

The packed pointers are used as unique identifiers into our unique table. Since we define our own memory allocation scheme, we can uniquely identify nodes by their variable id and internal position.
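As an illustration, here is a hedged sketch of how such a packed pointer might be resolved, assuming typedefs for the structs above and a hypothetical g_tables array holding each variable id's node storage; the names are illustrative, not our actual API.

#include <stdint.h>

/* Assumes bdd and bdd_ptr_packed (defined above) are typedef'd. */
typedef struct {
  bdd *entries;               /* contiguous node storage for one variable id */
  uint32_t capacity;
} var_table;

extern var_table *g_tables;   /* hypothetical: one table per variable id */

/* A packed pointer (varid, idx) is enough to recover the node's
 * address, since nodes are stored per variable id. */
static inline bdd *bdd_deref(bdd_ptr_packed p) {
  return &g_tables[p.varid].entries[p.idx];
}

/* Reading a child's varid needs no dereference at all, because the
 * varid is inlined in the packed pointer itself. */
static inline uint16_t lo_varid(const bdd *node) {
  return node->lo.varid;
}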

BFS

This was hard; it took us several tries to implement correctly, though it had been our intent for a while. All of the memory locality design decisions described above were made with BFS in mind.

In BDD graphs, the value of a node depends on its children’s values. In a DFS context, we compute the value of the children prior to the parent, making reduction straightforward. In BFS, we need two phases: an expansion phase and a reduction phase.

Our final implementation uses the ite formulation of BFS, similar to [8]. Our understanding is that this incurs more memory accesses than the more modern (op, node, node) implementations as in [4, 5, 6], but we ultimately chose this approach because of its consistency with our DFS implementation, allowing for a good deal of code reuse.

We maintain a mapping from bdd_node triples to a final result node: for each variable, a dynamically-resized list whose elements are 4-tuples of 3 bdd_ptrs (the key) and 1 bdd node (the result).

Notably, our request entries are reused for both the expansion and reduction phases, in contrast to some of the other solutions we’ve seen in the literature. Additionally, we maintain a counter for the size of these arrays, yielding a constant-time clearing operation (just set the counter to zero).
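A minimal sketch of one per-variable request list under these assumptions is below; field names are illustrative, and the bdd and bdd_ptr_packed types are the ones defined earlier.

#include <stdint.h>

typedef struct {
  bdd_ptr_packed f, g, h;  /* the key: this request's three operands */
  bdd result;              /* the in-progress result node            */
} request;

typedef struct {
  request *entries;        /* dynamically-resized array of requests  */
  uint32_t size;           /* number of live entries                 */
  uint32_t capacity;
} request_list;

/* Constant-time clear: entries past `size` are simply treated as dead. */
static inline void request_list_clear(request_list *l) {
  l->size = 0;
}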

The BFS pseudocode is presented below.

void bfs_expand(f, g, h) {
  make new bdd_node r;
  create_request(f, g, h, r);
  for each varid
    parallel for: for each request of this varid
      expand lo and hi, add to appropriate queue
}
void bfs_reduce() {
  for each varid, starting from the bottom
    parallel for: for each request of this varid
      if either child is a forwarding node,
        grab the node they forward to
      if lo == hi, make yourself a forwarding node
        to one of them
      if your value is already in the unique table,
        forward to that
      otherwise, insert into the unique_table
}

Final Implementation

Our implementation has two key data structures:

The Unique Table

Our unique table is an array of hash tables. We have a num_var parameter; for each variable id, there is a separate hash table. This helps ensure spatial locality.

The hash tables use open-addressing with linear probing. A compare_and_swap is used while probing across. These hash tables are filled with the following struct:

struct bdd {
  bdd_ptr_packed lo;  // 6 bytes
  bdd_ptr_packed hi;  // 6 bytes
  uint16_t varid;     // 2 bytes
  uint16_t refcount;  // 2 bytes
};

This struct represents a node in our BDDs. The lo and hi pointers represent edges to other nodes. The varid identifies which boolean variable this node is associated with. The refcount aids in garbage collection.

Since the pointers to lo and hi lead to another location in this data structure, we’ve effectively embedded the directed graphs directly into the hash tables.

Note that the bdd_ptr_packed structs are inlined. This is done with the intent of reducing memory accesses. Our bdd_apply function needs to know the id of the two children nodes. These structs contain the varids, so we avoid an additional dereference of a value that’s likely to be a cache miss.

These hash tables support essentially one operation: lookup_or_insert, a combined find and put. We hash the key and linearly probe using compare_and_swap. If we find the key we’re looking for, we atomically increment the refcount; if we find an empty slot, we insert there.
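The following is a minimal sketch of such a lookup_or_insert in the spirit of [11], using GCC’s __atomic builtins. For clarity the key is collapsed into a single 64-bit value with 0 reserved for “empty”, and the returned slot index stands in for the node’s position within its per-variable table; the real code operates on the packed 16-byte node.

#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE (1 << 20)        /* power of two, illustrative */

struct slot {
  uint64_t key;                     /* 0 means empty */
  uint32_t refcount;
};

static struct slot table[TABLE_SIZE];

/* Returns the slot index holding `key`, inserting it if necessary. */
size_t lookup_or_insert(uint64_t key) {
  size_t i = (key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
  for (;;) {
    uint64_t seen = __atomic_load_n(&table[i].key, __ATOMIC_ACQUIRE);
    if (seen == key) {
      /* Already present: bump the reference count. */
      __atomic_fetch_add(&table[i].refcount, 1, __ATOMIC_RELAXED);
      return i;
    }
    if (seen == 0) {
      /* Empty slot: try to claim it with a compare-and-swap. */
      uint64_t expected = 0;
      if (__atomic_compare_exchange_n(&table[i].key, &expected, key,
                                      /* weak */ 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
        __atomic_fetch_add(&table[i].refcount, 1, __ATOMIC_RELAXED);
        return i;
      }
      continue;   /* lost the race; re-examine what the winner wrote */
    }
    i = (i + 1) & (TABLE_SIZE - 1);   /* linear probe to the next slot */
  }
}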

This being said, these design choices weren’t without their tradeoffs. In our current implementation, we cannot support a dynamically-resizing unique table, as the array and hash tables are merged: rehashing the keys would invalidate all the existing pointers. Given further implementation time, this would be the first order of business to address. Also, since BDD nodes are rarely distributed uniformly across variable ids, we end up allocating significantly more memory than needed.

The Request Table

Designing an efficient request table turned out to be pretty tricky. In a typical use case, our BDD combination function is called repeatedly in quick succession. Each of these invocations needs a fresh request table. Calling malloc and free to create a new request table each time is prohibitively expensive, so we needed a way to get a constant-time delete_all operation.

To support this operation, we introduced a version counter. Each element in the table stores its version number. Our delete_all operation simply increments the version counter.
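A sketch of this versioning trick, with illustrative names:

#include <stdbool.h>
#include <stdint.h>

struct versioned_entry {
  uint64_t key;
  uint32_t value;
  uint32_t version;    /* the table version this entry was written under */
};

struct versioned_table {
  struct versioned_entry *entries;
  uint32_t capacity;
  uint32_t version;    /* current version of the whole table */
};

/* O(1) delete_all: every existing entry becomes stale at once. */
static inline void delete_all(struct versioned_table *t) {
  t->version++;
}

/* An entry only counts as present if it was written under the
 * current version; anything older is treated as empty. */
static inline bool entry_live(const struct versioned_table *t,
                              const struct versioned_entry *e) {
  return e->version == t->version;
}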

Our final implementation maintains an array and a hash table for each variable id. The array stores result nodes, all of which will eventually be forwarded into the unique table. The hash table’s keys are (f, g, h) triples; the values are indices into the array.

All of these data structures are accessed in parallel by multiple threads. The array supports lock-free atomic updates and resizing. The hash table uses fine-grained locking with an open-addressing, linear probing scheme. Notably, we implemented these locks ourselves using GCC test-and-set primitives.
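As an illustration of the kind of lock involved, here is a minimal test-and-set spinlock built on GCC’s atomic builtins; it is a sketch, not our exact lock, which may differ in details such as backoff.

#include <stdbool.h>

typedef volatile bool spinlock;   /* false = unlocked */

static inline void spin_lock(spinlock *l) {
  /* __atomic_test_and_set returns the previous value, so loop until
   * we are the thread that flipped it from false to true. */
  while (__atomic_test_and_set(l, __ATOMIC_ACQUIRE))
    ;                             /* busy-wait; real code might back off */
}

static inline void spin_unlock(spinlock *l) {
  __atomic_clear(l, __ATOMIC_RELEASE);
}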

Technologies

We used the OpenMP parallel for construct for loop-level parallelism and GCC __atomic operations for everything else.
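As a rough illustration of this pattern, the per-level loops look something like the sketch below, reusing the request_list type sketched earlier; handle_request stands in for the expand or reduce body and is not a real function of ours.

#include <stdint.h>

void handle_request(request *r);    /* hypothetical expand/reduce step */

void process_level(request_list *l) {
  /* One OpenMP parallel for per variable id; shared counters inside
   * handle_request use GCC __atomic operations rather than locks. */
  #pragma omp parallel for schedule(dynamic)
  for (uint32_t i = 0; i < l->size; i++)
    handle_request(&l->entries[i]);
}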

Results

The above is a plot of the speedup we obtain when running the C3540 circuit. These tests were run on Xeon E5645 CPUs. These machines have 6 cores, with 12 total execution contexts.

Notably, this speedup matches that achieved by [5] on the same data set.

The above graph compares the number of memory accesses made by our BFS and DFS implementations, both on a single thread. This test was run on the 7x7 n-queens board. The y-axis is on a logarithmic scale: an increment of one is a factor of 10 increase.

Here, shorter is better: the lower the bar, the fewer memory accesses / cache misses.

We can see that the number of memory accesses our BFS implementation makes is several orders of magnitude smaller than that of our DFS implementation.

Future Work

Despite the compelling speedup and memory access data, our implementation is not competitive in raw walltime with established BDD libraries. There are a handful of sequential optimizations that we didn’t get the chance to implement; complement edges and standard triples as described in [2] would be the first step.

To our knowledge, this would be the first time these ite-style optimizations would be combined with a BFS traversal.

Additionally, the size of the resulting BDD can be highly dependent on the ordering of variables. For this reason, some BDD libraries provide heuristics for dynamic variable reordering.

We’d also be interested in implementing a hybrid DFS/BFS traversal in the spirit of [4, 5, 6].

Lessons Learned

In choosing this project, we unwittingly stumbled into one of the most studied areas of computer science. It was a bit overwhelming getting our feet under us in understanding how to implement sequential BDDs. We spent a great deal of time reviewing existing literature, oftentimes having to follow a trail of references to understand the original papers we were interested in.

Because of this, we wound up having to implement a good deal of the code in the final week of the project. Fortunately, we both happily embraced immersing ourselves in the project for that final week. BDDs are a very cool data structure, and it was a lot of fun reasoning through the various design decisions. It was also very satisfying to finally get something working :)

References

[1] Randal E. Bryant. Graph-Based Algorithms for Boolean Function Manipulation.

[2] Karl S. Brace, Richard L. Rudell, Randal E. Bryant. Efficient Implementation of a BDD Package.

[3] Jagesh V. Sanghavi, Rajeev K. Ranjan, Robert K. Brayton, Alberto Sangiovanni-Vincentelli. High Performance BDD Package By Exploiting Memory Hierarchy.

[4] Bwolen Yang, Yirng-An Chen, Randal E. Bryant, David R. O’Hallaron. Space-and-Time-Efficient BDD Construction via Working Set Control.

[5] Bwolen Yang and David R. O’Hallaron. Parallel Breadth-First BDD Construction.

[6] Yirng-An Chen, Bwolen Yang, and Randal E. Bryant. Breadth-First with Depth-First BDD Construction: A Hybrid Approach.

[7] Tony Stornetta and Forrest Brewer. Implementation of an Efficient Parallel BDD Package.

[8] Pranav Ashar and Matthew Cheong. Efficient Breadth-First Manipulation of Binary Decision Diagrams.

[9] Tom Van Dijk, Alfons Laarman, and Jaco van de Pol. Multi-Core BDD Operations for Symbolic Reachability.

[10] Yuxiong He. Multicore-enabling a Binary Decision Diagram algorithm.

[11] Jeff Preshing. The World’s Simplest Lock-Free Hash Table.

Equal work was performed by both project members.