Optimizing many-collection, high-throughput workloads with WiredTiger

Identified bottlenecks and solutions

Cache management overhead with many collections

To find eviction candidates more efficiently, we implemented an algorithm that improves how the eviction process reviews content in cache. The algorithm considers how much content a tree (a collection or an index) holds in cache and how effective eviction has historically been in that tree, and weighs that against the other trees that are currently open. This work was primarily done in WT-3148 and WT-3329.
We updated the eviction algorithm to increase randomization when looking for candidates. Without this randomization, very large caches could have content that the eviction thread rarely processed. This work was primarily done in WT-3149.
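The sketch below is illustrative only and is not WiredTiger's actual implementation; the structure and function names in it are invented for the example. It shows the two ideas described above: weighting each open tree by how much cache it holds and by how productive past eviction passes in it have been, and starting each walk at a random page so that even very large caches are covered over time.

/* Illustrative sketch only -- not WiredTiger code. */
#include <stdint.h>
#include <stdlib.h>

struct tree {
    uint64_t cache_bytes;   /* bytes this collection/index holds in cache */
    uint64_t pages_walked;  /* pages examined by previous eviction walks */
    uint64_t pages_evicted; /* pages those walks actually evicted */
    uint64_t page_count;    /* pages this tree currently has in cache */
};

/* Higher score: this tree deserves more eviction attention. */
static double
eviction_score(const struct tree *t)
{
    /* How productive eviction has historically been in this tree. */
    double efficiency = t->pages_walked == 0 ?
        1.0 : (double)t->pages_evicted / (double)t->pages_walked;

    /* Weight the tree's cache footprint by that efficiency. */
    return ((double)t->cache_bytes * efficiency);
}

/* Choose the open tree with the highest score for the next walk. */
static struct tree *
choose_tree(struct tree **open_trees, size_t count)
{
    struct tree *best = NULL;
    for (size_t i = 0; i < count; i++)
        if (best == NULL ||
            eviction_score(open_trees[i]) > eviction_score(best))
            best = open_trees[i];
    return (best);
}

/* Start each walk at a random page so large trees are covered evenly. */
static uint64_t
walk_start(const struct tree *t)
{
    return (t->page_count == 0 ? 0 : (uint64_t)rand() % t->page_count);
}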

Lock contention due to management of collection handles

We updated how we maintain and traverse sets of handles. In MongoDB, each collection and index requires a handle in WiredTiger once it has been accessed. For workloads that access many collections, the set of open handles can grow very large, sometimes to more than a million handles. We improved the code that interacts with those sets to ensure there is minimal interference between different consumers of the handles.
We also changed how access to those sets is controlled. We originally used a spinlock, which is ideal when a reasonable number of threads coordinate access and each thread holds the lock for only a short period of time. However, MongoDB deployments often have many threads, and most users of the handles do not modify the structures; they only need to know that others are not making changes while they read them. By coordinating access with efficient read/write locks instead of a spinlock, we increased concurrency significantly for high-throughput workloads. The work was primarily done in WT-3115 and WT-3152.
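As a minimal illustration of that change, using POSIX primitives rather than WiredTiger's internal lock implementation (the handle structure and function names here are invented), read-only lookups of the handle list can proceed concurrently under a read/write lock, while the comparatively rare operations that modify the list still take exclusive access:

#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Illustrative handle list; WiredTiger's real structures differ. */
struct handle {
    const char *uri;
    struct handle *next;
};

static struct handle *handle_list;
static pthread_rwlock_t handle_lock = PTHREAD_RWLOCK_INITIALIZER;

/*
 * Most threads only need to find an existing handle. Under a read/write
 * lock any number of them can search the list at the same time; under a
 * spinlock they would serialize even though none of them modify the list.
 */
struct handle *
handle_find(const char *uri)
{
    struct handle *h;

    pthread_rwlock_rdlock(&handle_lock);
    for (h = handle_list; h != NULL; h = h->next)
        if (strcmp(h->uri, uri) == 0)
            break;
    pthread_rwlock_unlock(&handle_lock);
    return (h);
}

/* Only operations that change the list need to exclude everyone else. */
void
handle_insert(struct handle *new_handle)
{
    pthread_rwlock_wrlock(&handle_lock);
    new_handle->next = handle_list;
    handle_list = new_handle;
    pthread_rwlock_unlock(&handle_lock);
}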

Checkpoint creation interfering with throughput

For high-throughput MongoDB deployments, checkpoint creation can alter the throughput characteristics of a workload. This makes performance difficult to predict and leads to over-provisioning of hardware to cope with the additional load generated by checkpoints. MongoDB 3.6 introduces several improvements to reduce this interference.
We changed our checkpoint algorithm to spread the IO load more evenly across the duration of the checkpoint. We also introduced a more sophisticated pre-checkpoint phase that spreads the IO load further and reduces the amount of work a checkpoint needs to do while holding resources shared with other operations.
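The pacing idea can be sketched roughly as follows; this is not the actual checkpoint code, and write_one_page and the 10ms pause are placeholders for illustration. The writer compares how far it has progressed with how much of the checkpoint window has elapsed, and pauses briefly whenever it is ahead of schedule:

#include <time.h>
#include <unistd.h>

/* Placeholder for the real block-write call. */
extern void write_one_page(unsigned long pageno);

/*
 * Illustrative pacing loop: spread total_pages of checkpoint writes
 * across target_secs instead of issuing them in one burst.
 */
void
checkpoint_write_paced(unsigned long total_pages, unsigned int target_secs)
{
    time_t start = time(NULL);

    for (unsigned long written = 0; written < total_pages; written++) {
        write_one_page(written);

        /* How many pages we should have written by now. */
        double elapsed = difftime(time(NULL), start);
        double expected =
            (double)total_pages * (elapsed / (double)target_secs);

        /* Ahead of schedule: yield the disk briefly to other work. */
        if ((double)(written + 1) > expected)
            usleep(10000); /* 10ms */
    }
}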
We optimized the interactions that checkpoints have with other WiredTiger utility threads, for example the eviction server. This ensures that utility threads are not blocked for significant amounts of time by checkpoints. Examples of this improvement are captured in detail in WT-3150 and WT-3261.
We optimized the interactions between checkpoints and user-level schema operations (collection and index create and drop operations). Since checkpoints capture a view of the entire mongod instance at a point in time, checkpoint creation and schema operations need to coordinate. With these improved interactions, schema operations no longer need to wait for any phase of a checkpoint before proceeding. An example of work that helped with this is in WT-3207.

Future proofing

We added diagnostics that capture information about how the algorithms within WiredTiger are operating, which makes it easier to diagnose the root cause of problems. See WT-3138 for more information.
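For example, the statistics settings used throughout this work (the same settings that appear in the wtperf configurations below) can also be enabled when opening a WiredTiger connection directly. A minimal sketch, assuming a database home directory named WT_HOME:

#include <stdio.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    int ret;

    /*
     * Collect the lightweight statistics and log them as JSON every five
     * seconds; the log files are written under the database home.
     */
    ret = wiredtiger_open("WT_HOME", NULL,
        "create,statistics=(fast),statistics_log=(wait=5,json)", &conn);
    if (ret != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return (1);
    }

    /* ... run the workload; then close the connection cleanly. */
    return (conn->close(conn, NULL));
}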
While tuning performance we created tests that use synthetic workloads, and a number of those workloads have been added to the MongoDB continuous integration testing suite. This increases the visibility of this type of workload, encourages continued improvements in throughput, and provides a safeguard against introducing performance regressions. See WT-3074 for more information.

Real world gains

Workload description

The goal of a synthetic workload is to create the simplest configuration that demonstrates the unacceptable performance characteristics. In this case, it needed to capture the experience of enterprise customers running workloads with thousands of collections spread across many databases, where data was distributed and accessed unevenly across those collections. These users had large WiredTiger caches configured (in excess of 100GB), but their data sets were always an order of magnitude larger than the available cache. During steady-state use the workloads employed many MongoDB connections, which required many threads operating on the database in parallel.
We used WiredTiger's wtperf utility to simulate the application workload. The workload is split into two phases: write and read. The first phase populates the data using the following configuration:
# Create a set of tables with uneven distribution of data
conn_config="cache_size=1G,eviction=(threads_max=8),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),
             statistics=(fast),statistics_log=(wait=5,json),session_max=1000"
table_config="type=file"
table_count=1000
icount=0
random_range=1000000000
pareto=10
range_partition=true
report_interval=5

run_ops=10000000
populate_threads=0
icount=0
threads=((count=20,inserts=1))

# Warn if a latency over 1 second is seen
max_latency=1000
sample_interval=5
sample_rate=1
The second phase reads data using the following configuration:
# Read from a set of tables with uneven distribution of data
conn_config="cache_size=1G,eviction=(threads_max=8),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),
             statistics=(fast),statistics_log=(wait=5,json),session_max=1000"
table_config="type=file"
table_count=1000
icount=0
random_range=1000000000
pareto=10
range_partition=true
report_interval=5
create=false

run_time=600
threads=((count=20,reads=1))

# Warn if a latency over 1 second is seen
max_latency=1000
sample_interval=5
sample_rate=1

Throughput comparison

The tests were run on identical hardware multiple times. The averaged results are reported below.
MongoDB version | Load time (seconds) | Load latencies > 1 second | Query throughput (ops/sec) | Max throughput in any 5-second period (ops/sec) | Min throughput in any 5-second period (ops/sec)
3.4.1 | 2956 | 2786 | 75285 | 672248 | 698
3.5.10 | 1051 | 5 | 517845 | 836075 | 25906

Comments