Adding more mongos routers to my deployment

In the past, we often recommended placing a mongos router on each application server. This meant that large deployments could have several hundred routers. This recommendation has evolved, and the preferred approach is now usually a small number of mongos routers. The reasons for this change are explained below.

Fewer connections

Every mongos process maintains at least one connection to each replica set member in order to determine which members are primary and which are secondary. Each of these connections consumes roughly 1MB of memory on the receiving member, so hundreds of mongos routers, each connecting to every member of every shard, add up to a significant amount of memory usage across the cluster.
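As a rough back-of-the-envelope illustration (the figures below are assumptions for the sake of the arithmetic, not measurements from any particular deployment):

    # Illustrative topology; adjust the numbers for your own cluster.
    routers = 300            # mongos processes, e.g. one per application server
    shards = 10              # shards in the cluster
    members_per_shard = 3    # replica set members per shard
    mb_per_connection = 1    # approximate memory cost of one incoming connection

    # Monitoring connections alone, before any application traffic:
    connections = routers * shards * members_per_shard
    memory_mb = connections * mb_per_connection
    print(connections, "connections, ~", memory_mb, "MB across the cluster")
    # 9000 connections, ~9000 MB (about 9 GB) spent just keeping routers informed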

Improved availability and monitoring

Hard-coding a single mongos router into an application creates a single point of failure: if that router becomes unavailable, the application cannot function. Instead, we recommend that applications use a pool of routers. This adds some complexity, since you need to determine a list of mongos routers to populate the seed list in the driver, but the added reliability outweighs that cost. In addition, keeping the number of routers small reduces log traffic, making issues easier to investigate.
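For example, the driver's seed list can name several mongos routers so the application is not tied to any single one. A minimal sketch with PyMongo, using hypothetical hostnames:

    from pymongo import MongoClient

    # Hypothetical mongos hostnames. Listing several routers in the seed list
    # lets the driver fail over if one becomes unavailable and spread
    # operations across the healthy ones.
    client = MongoClient(
        "mongodb://mongos1.example.net:27017,"
        "mongos2.example.net:27017,"
        "mongos3.example.net:27017/mydb"
    )
    print(client.admin.command("ping"))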

Improved config server performance

A large number of mongos routers can overload the config servers. With mirrored config servers (MongoDB 3.0 and earlier), mongos reads shard metadata from the first server specified in the --configdb string. This concentrates load on that config server, resulting in slower moveChunk commands (chunk migrations) and frequent timeouts.
Replica set config servers (MongoDB 3.2 and later) improve this situation by allowing mongos to read from the nearest config server. Nonetheless, load issues may still occur with a very large number of mongos routers.
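The metadata in question lives in the config database. A small PyMongo sketch, run through a mongos with a hypothetical hostname, shows the shard and chunk metadata that every router must keep reading from the config servers:

    from pymongo import MongoClient

    # Connect through a mongos (hypothetical hostname); the router serves
    # cluster metadata that it reads from the config servers.
    client = MongoClient("mongodb://mongos1.example.net:27017/")
    config = client["config"]

    # Shard membership and chunk metadata that every mongos depends on.
    print(list(config["shards"].find()))
    print(config["chunks"].count_documents({}))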

Chunk splitting and migration

In MongoDB 3.2 and earlier, mongos routers perform all chunk splitting. Each router splits a chunk automatically once the writes it has observed for that chunk amount to roughly 20% of the chunk size. This naive algorithm works well with a small number of mongos routers, but the likelihood of a split is inversely proportional to the number of routers present. For example, if a cluster has 100 mongos routers receiving a uniform distribution of inserts, each router receives 1% of them; by the time 20% of a chunk's size has been written across the entire cluster, each individual router has observed only 0.2% of the chunk size, far below its split trigger.
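A back-of-the-envelope sketch of that dilution, using illustrative numbers:

    # Illustrative arithmetic for the pre-3.4 auto-split behaviour described above.
    chunk_size_mb = 64
    split_trigger_mb = 0.20 * chunk_size_mb   # each mongos splits after observing ~20% of chunk size
    routers = 100

    # With a uniform write distribution each router sees only 1/routers of the
    # inserts into a given chunk, so in the worst case the chunk can absorb
    # roughly routers * split_trigger_mb of data before any single router's
    # local counter reaches its threshold.
    worst_case_growth_mb = routers * split_trigger_mb
    print(worst_case_growth_mb)   # 1280.0 MB, i.e. the chunk can approach ~1.25 GB before splitting
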
When each mongos observes only a small percentage of the write traffic for a collection, chunks may grow well past the configured chunk size before a migration attempt finally triggers a split. Until the oversized chunks are split, data remains unevenly distributed across the shards, and when a single oversized chunk is eventually split into many smaller ones, it can cause a burst of migrations.
For example, assuming the default chunk size of 64MB, a chunk range that has grown to 1GB, and a migration threshold of 8 chunks for a large cluster (a rough sketch of this arithmetic follows the list):
  • the donor shard is selected because it has the most chunks (8 more than the shard with the lowest number of chunks)
  • a 1GB chunk (not flagged jumbo) is chosen for migration and successfully split into approximately 32 half-full chunks (each representing about 32MB of documents)
  • the imbalance is now 40 chunks instead of 8 (assuming only one chunk has grown larger than expected)
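A quick sketch of that arithmetic, using the same illustrative numbers as the example above:

    import math

    chunk_size_mb = 64
    oversized_chunk_mb = 1024          # the 1GB range from the example
    migration_threshold = 8            # chunk-count difference that triggers balancing

    # Splitting aims for roughly half-full chunks, so a 1GB range yields ~32 chunks.
    half_full_mb = chunk_size_mb / 2
    resulting_chunks = math.ceil(oversized_chunk_mb / half_full_mb)
    print(resulting_chunks)            # 32

    # The donor's surplus jumps from 8 chunks to roughly 40, far past the
    # threshold, so the balancer now has dozens of migrations to perform.
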
Furthermore, as there is a single, globally coordinated lock to move a chunk, the number of failed migrations due to the lock is directly proportional to the number of mongos routers. While all migrations will eventually complete, the large number of log messages generated in these scenarios can cause concern for operators.
In MongoDB 3.4 and later, chunk splitting is still affected by the percentage of write traffic each mongos observes. However, the balancer now runs on the primary of the config server replica set, so balancing is not directly affected by the number of mongos routers, although heavy traffic between the routers and the config servers may still slow chunk migrations. MongoDB limits the impact of this workload by restricting each shard to at most one migration at a time while allowing migrations between disjoint pairs of shards to run in parallel. For more information, see Cluster Balancer.
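To check whether balancing is enabled and currently running, the balancerStatus command (MongoDB 3.4 and later) can be issued through any mongos. A minimal PyMongo sketch, assuming a hypothetical mongos hostname:

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos1.example.net:27017/")

    # balancerStatus reports the balancer mode and whether a balancing round is
    # in progress; since 3.4 the balancer itself runs on the config server
    # replica set primary, not on any mongos.
    status = client.admin.command("balancerStatus")
    print(status.get("mode"), status.get("inBalancerRound"))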

Query performance

The best sharded query performance is obtained when a mongos router is close to the cluster members in terms of network latency. Queries that cannot be targeted by the shard key must be broadcast to multiple or all shards (scatter/gather), and the router must then wait for every targeted shard to return results; placing the router close to the shards shortens that wait. Optimizing mongos placement for network latency therefore yields faster responses and more predictable query times.
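A small PyMongo sketch of the difference between a targeted query and a scatter/gather one, assuming a hypothetical "orders" collection sharded on "customer_id":

    from pymongo import MongoClient

    # Hypothetical deployment: "orders" is sharded on "customer_id".
    client = MongoClient("mongodb://mongos1.example.net:27017/")
    orders = client["mydb"]["orders"]

    # A query on the shard key can be routed to a single shard; one without it
    # is broadcast to all shards and merged by the mongos (scatter/gather).
    targeted = orders.find({"customer_id": 12345}).explain()
    broadcast = orders.find({"status": "pending"}).explain()

    # In sharded explain output the winning plan stage is typically
    # SINGLE_SHARD for targeted queries and SHARD_MERGE for scatter/gather
    # (the exact field layout varies by server version).
    for name, plan in (("targeted", targeted), ("broadcast", broadcast)):
        stage = plan.get("queryPlanner", {}).get("winningPlan", {}).get("stage")
        print(name, stage)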
