What is the oplog in a MongoDB replica set and how does it work internally

A capped collection is similar to a first-in-first-out system. It stays at a set size, with the oldest documents being aged out to make space for new documents. The implementation of capped collections will vary by storage engine.
The most common example of a capped collection is a replica set oplog. Unlike a regular capped collection, the oplog does not have an _id index, and you cannot create indexes on it. Instead, the oplog is accessed using the timestamps given in the ts field.
This article describes the capped collection implementation in WiredTiger and a specific optimization for the replication oplog.

Oplog capped collections

In a regular capped collection using the WiredTiger storage engine, once the collection size limit is reached, some of the oldest documents are deleted synchronously as new documents are inserted into the collection.
An oplog uses a dedicated background thread to perform capped collection deletion. The thread is woken up as required rather than running on a schedule, so it is not time-based like a time-to-live index. There is additional logic that enables fast truncation of the oplog and efficient removal of old records by recording milestones, also known as oplog stones.
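To make the synchronous behaviour concrete, here is a minimal sketch of a size-capped FIFO collection. It is only a toy model: the class name `CappedCollection` and its fields are made up for illustration and are not MongoDB or WiredTiger APIs.

```python
from collections import deque


class CappedCollection:
    """Toy FIFO collection capped by total byte size (illustrative only)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.docs = deque()       # (doc, size) pairs, oldest on the left
        self.total_bytes = 0

    def insert(self, doc, size):
        self.docs.append((doc, size))
        self.total_bytes += size
        # Synchronous cleanup: the inserting caller itself evicts the oldest
        # documents until the collection fits within its limit again.
        # The newest document is always kept, even if it alone is too big.
        while self.total_bytes > self.max_bytes and len(self.docs) > 1:
            _, old_size = self.docs.popleft()
            self.total_bytes -= old_size
```

The point to notice is that the eviction loop runs inside `insert`, so the writer pays the cost of the delete. The oplog avoids this by handing the work to a separate thread.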
The stones represent logical markers against the oplog that are used as truncation points. When a record is inserted, its size is added to the stone being filled. If the size of the stone exceeds the threshold, then a new stone is created. If the number of stones exceeds the threshold (between 10 and 100, based on the size of the oplog), then a background thread truncates the records that are contained within the oldest stone.
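The bookkeeping described above can be sketched roughly as follows. This is a simplified model, not the server's implementation: `Stone`, `OplogStones`, `on_insert` and `truncate_oldest` are hypothetical names, the oplog is modelled as a deque of `(ts, size)` pairs, and the truncation runs inline instead of on a real background thread.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Stone:
    last_ts: int = 0       # timestamp of the newest record covered by the stone
    records: int = 0
    size_bytes: int = 0


class OplogStones:
    def __init__(self, min_bytes_per_stone, max_stones):
        self.min_bytes_per_stone = min_bytes_per_stone
        self.max_stones = max_stones
        self.stones = deque()      # completed stones, oldest first
        self.filling = Stone()     # the stone currently being filled

    def on_insert(self, ts, size):
        """Account for a new record; return True if truncation is due."""
        self.filling.last_ts = ts
        self.filling.records += 1
        self.filling.size_bytes += size
        if self.filling.size_bytes > self.min_bytes_per_stone:
            # The stone exceeded its threshold: close it and start a new one.
            self.stones.append(self.filling)
            self.filling = Stone()
        return len(self.stones) > self.max_stones

    def truncate_oldest(self, oplog):
        """Drop every record up to and including the oldest stone's boundary."""
        stone = self.stones.popleft()
        while oplog and oplog[0][0] <= stone.last_ts:
            oplog.popleft()
        return stone
```

In this model a caller would append each record to the oplog, call `on_insert`, and signal the truncation step whenever it returns `True`. Because a whole stone is removed at once, the delete is a single range operation rather than one delete per document.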
Oplog stones are not persisted, so new stones are chosen at startup based on the records in the oplog. For small oplogs or those containing few records, the entire oplog is scanned and the number of stones required is computed by packing records into a stone until the threshold is exceeded.
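For the small-oplog case, the packing step might look like the sketch below. The function name `stones_from_scan` and the tuple layout are assumptions made for this example; the only thing taken from the description is the rule "keep adding records until the threshold is exceeded, then start a new stone".

```python
def stones_from_scan(records, min_bytes_per_stone):
    """Rebuild stones by scanning every record in oplog order.

    records: iterable of (ts, size) pairs.
    Returns (stones, partial), where each stone is a tuple
    (last_ts, record_count, size_bytes) and partial is the
    (record_count, size_bytes) of the stone still being filled.
    """
    stones = []
    cur_records = 0
    cur_bytes = 0
    for ts, size in records:
        cur_records += 1
        cur_bytes += size
        if cur_bytes > min_bytes_per_stone:
            # Threshold exceeded: this record closes the current stone.
            stones.append((ts, cur_records, cur_bytes))
            cur_records = 0
            cur_bytes = 0
    return stones, (cur_records, cur_bytes)
```

A full scan gives exact stone sizes, but its cost grows with the number of records, which is why it is only used when the oplog is small.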
For larger oplogs or those with many records (>20,000), records are oversampled (by a factor of 10) at random from the oplog. Samples are then chosen such that they are expected to be near the right boundary of the logical section. As the oplog is truncated, the error in this estimation is reduced because the actual size of newly created stones is known with greater certainty.
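One way to picture the sampling approach is the sketch below. It assumes that "oversampled by a factor of 10" means drawing 10 random samples per stone, sorting them, and keeping every 10th sample as a stone boundary; that reading, the function name `sample_stone_boundaries`, and the in-memory list of timestamps are all assumptions for illustration, not the server's code.

```python
import random

SAMPLES_PER_STONE = 10  # the oversampling factor mentioned above


def sample_stone_boundaries(timestamps, num_stones, rng):
    """Estimate stone boundaries from random samples instead of a full scan.

    timestamps: sequence of record timestamps in the oplog.
    rng: a random.Random instance (stands in for a random cursor).
    """
    n_samples = SAMPLES_PER_STONE * num_stones
    # Sample with replacement, then order the samples by oplog position.
    samples = sorted(rng.choice(timestamps) for _ in range(n_samples))
    # The last sample in each group of 10 is expected to lie near the
    # right-hand boundary of that stone's slice of the oplog.
    return [
        samples[i * SAMPLES_PER_STONE + SAMPLES_PER_STONE - 1]
        for i in range(num_stones)
    ]
```

The boundaries found this way are only estimates, so the sizes attributed to these initial stones are approximate. Stones created later from live inserts are measured exactly, which is why the error shrinks as the sampled stones are truncated away.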

Removing old documents

The oplog keeps "milestones", also known as oplog stones, so that old records can be removed efficiently with WT_SESSION::truncate() when the collection grows beyond its desired maximum size.
The stones represent logical markers against the oplog that are used as truncation points. When a record is inserted, its size is added to the stone being filled. If the size of the stone exceeds the threshold, then a new stone is cut. If the number of stones exceeds its threshold (between 10 and 100), then the background thread for the oplog is signaled to delete the records represented by the oldest stone. The thresholds are determined based on the size of the oplog.
The stones are not persisted, so new stones are chosen at startup based on the records in the oplog. For small-sized oplogs or those not containing many records, the entire oplog is scanned to compute the stones to use. This is done simply by packing records into the stone until the threshold is exceeded.
For larger oplogs or those with many records (>20,000), records are oversampled (by a factor of 10) from the oplog at random using a WiredTigerRecordStore::RandomCursor. Samples are then chosen such that they are expected to be near the right boundary of the logical section. As the oplog is truncated, the error in this estimation is reduced because the actual size of newly created stones is known with greater certainty.
Changing the size of a record in the live oplog is no longer supported.

