How can I work around the 1024 byte limit on index keys?

MongoDB imposes a 1024 byte limit on index keys, which may cause inserts to fail. This is especially noticeable after upgrading from version 2.4, because version 2.6 changed how MongoDB handles oversized index keys. How can I work around this limit?

Answer

Summary of Index Key Limitations

The size of an index key is the sum of the sizes of the BSON representations of all indexed fields for a given document, plus some internal overhead. This total cannot exceed 1024 bytes.

Prior to version 2.6:

When creating indexes for an existing collection, if one or more documents in the collection cause the limitation to be exceeded:
  • the index is created but the offending documents are omitted from the index
  • the first offending document is logged in the mongod log file.
When an insert or update would result in an index key that exceeds the limitation for an existing index:
  • the insert or update proceeds without error, but the document is omitted from the index

From version 2.6 forward:

When creating indexes for an existing collection, if one or more documents in the collection cause the limitation to be exceeded:
  • an error is raised and the index is not created. If re-creating an index, for example with the reIndex or repairDatabase commands, the index is dropped but is not re-created.
  • the first offending document is logged in the mongod log file.
When an insert or update would result in an index key that exceeds the limitation for an existing index:
  • an error is raised and the insert or update fails
While the pre-2.6 behavior is more permissive, it can result in indexes that do not include some documents: queries using those indexes will silently fail to return documents whose keys violate the 1024 byte limitation. As a result, a document may appear in collection scans and in some indexes, yet be missing from others.
To demonstrate, this example uses a version 2.4 database and loads it with a very large key field and a smaller alt_key field:
db.foo.drop();
var data="";
for (var i = 0; i < 1025; i++) data = data + 'x';
db.foo.insert({key: data, alt_key: 'y'});
db.foo.insert({key: data, alt_key: 'y'});
db.foo.insert({key: data, alt_key: 'y'});
db.foo.insert({key: 'xxx', alt_key: 'y'});
There are four documents in the foo collection, three with 1025-byte values for key, and one with a 3-byte value.
Next, create two indexes, one on key and one on alt_key. Because this is a version 2.4 database, both indexes are created without errors, but the documents with large values for key do not have entries in the first index.
db.foo.createIndex({key: 1});
db.foo.createIndex({alt_key: 1});
Now query the collection three times: first using .hint({$natural: 1}) to request a collection scan, then using a hint for each index:
db.foo.find({key: {$gt: 'x'}, alt_key: 'y'}).hint({$natural: 1}).itcount();
4
db.foo.find({key: {$gt: 'x'}, alt_key: 'y'}).hint("alt_key_1").itcount();
4
db.foo.find({key: {$gt: 'x'}, alt_key: 'y'}).hint("key_1").itcount();
1
In all cases, the predicates match all documents with:
  • a key value greater than 'x'
  • an alt_key value equal to 'y'
Thus, all three queries should return every document in the collection.
The actual behavior illustrates the problem with an index that silently omits documents whose keys violate the limit:
  • The collection scan finds all 4 documents as expected.
  • The query using the index on alt_key also finds all 4 documents.
  • The query using the index on key, however, finds only 1 document because there are no entries in the index for the large values of key.

Workarounds

Restore the old behavior

If you prefer the pre-2.6 behavior for indexes, you can restore it with an optional parameter when starting the mongod process.
Warning
The pre-2.6 behavior for data exceeding the 1024 byte limit on index keys is to silently omit the offending keys from the index. This can result in situations where queries using different indexes may return different result sets with differing document counts.
This behavior is usually undesirable.
All versions of MongoDB from 2.6 to the latest release (currently 3.4) support the failIndexKeyTooLong parameter. Setting this parameter to false with the command-line option --setParameter failIndexKeyTooLong=false restores the pre-2.6 behavior.
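For example, to start mongod with the pre-2.6 behavior:
mongod --setParameter failIndexKeyTooLong=false
The parameter can also be changed at runtime on a running mongod; this sketch assumes you are connected with the mongo shell as a user with sufficient privileges:
db.adminCommand({setParameter: 1, failIndexKeyTooLong: false});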

Use hashed indexes

Because hashed indexes store a short hash of the data in a document's indexed field, they can be used to overcome the 1024 byte key size limitation.
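As a brief sketch, continuing with the foo collection and the data variable from the demonstration above, a hashed index on key can be created, and an equality match on the full 1025-byte value then returns all three matching documents:
db.foo.createIndex({key: "hashed"});
db.foo.find({key: data}).itcount();
3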
Hashed indexes are subject to several restrictions that may limit their utility:
  • A hashed index can be used only for equality comparisons.
    For example, a hashed index on the field A in collection foo can be used for queries such as
    db.foo.find({A: 'some very very long but quite precise piece of data ..........'})
    
    or
    db.foo.find({A: {$eq: 63}});
    
  • A hashed index cannot be used for range-based comparisons.
    For example, you cannot use a hashed index for this query:
    db.foo.find({A: {$gte: 'my search text starts with this'}});
    
  • A hashed index may contain only a single field. Compound hashed indexes are not supported.

Index a computed field

Consider a collection describing foreign books:
foreignBooks -->
{ title: 'Les Miserables, Volume 1',
  author: 'Hugo, V.',
  textEnglish: 'So long as there shall exist, by virtue of law and custom ...'
}
You might want to index and later search on the exact textEnglish field with a query like:
db.foreignBooks.find({$and: [{textEnglish: {$gte: 'So long as there shall exist'}},
                             {textEnglish: {$lte: 'So long as there shall not exist'}}]});
The English language translation of a book is likely to exceed 1024 bytes, making it difficult to create a BTREE index on that field.
Instead, you could add a computed field with the first 32 bytes of the text, then index and query on that.
For example:
foreignBooks -->
{ title: 'Les Miserables, Volume 1',
  author: 'Hugo, V.',
  textEnglish: 'So long as there shall exist, by virtue of law and custom ...',
  textEnglishFirst32: 'So long as there shall exist, by'
}
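One way to populate such a computed field for existing documents is a short mongo shell loop. This is a sketch assuming the foreignBooks collection above; note that substr counts characters, which equals bytes only for ASCII text:
db.foreignBooks.find({textEnglishFirst32: {$exists: false}}).forEach(function(doc) {
    db.foreignBooks.update({_id: doc._id},
                           {$set: {textEnglishFirst32: doc.textEnglish.substr(0, 32)}});
});
db.foreignBooks.createIndex({textEnglishFirst32: 1});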
You can then search the indexed textEnglishFirst32 field with a query like:
db.foreignBooks.find({$and: [ {textEnglishFirst32: {$gte: 'So long as there shall exist'}},
                              {textEnglishFirst32: {$lte: 'So long as there shall not exist'}}]});
The details of the computed field vary depending on the actual use case. In this example, the first 32 bytes of the textEnglish field are distinctive enough to perform the required search.
In more complex cases, you can use computed fields to narrow searches, even if they cannot completely substitute for the full-size field. For example, to look up a foreign book based on the full text of its English translation, you can use a query such as:
db.foreignBooks.find({textEnglishFirst32: 'So long as there shall exist, by',
                      textEnglish: 'So long as there shall exist, by virtue of law and custom ...'});
The first predicate allows the use of an index on the computed field textEnglishFirst32.
The second predicate, on the original full-text field, recognizes that the results obtained through the index contain the documents we want but may also contain extraneous ones, and filters the results down to the exact set required.

Use text indexes

Text indexes in MongoDB are not subject to the 1024 byte limit, and may be suitable substitutes for BTREE indexes on fields containing large amounts of text.
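For example, a text index on the textEnglish field from the earlier example could be created and queried as follows. Note that $text performs stemmed, term-based matching rather than the exact or range comparisons shown above, so it is not a drop-in replacement:
db.foreignBooks.createIndex({textEnglish: "text"});
db.foreignBooks.find({$text: {$search: "virtue law custom"}});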

Modify existing indexes

Remove fields with very large values from your indexes:
  • Drop indexes entirely when the data cannot be indexed effectively, and rely instead on other options, such as hashed indexes or indexing computed values.
  • For compound indexes, it may be practical to re-create the index with fewer fields to avoid the 1024 byte limit on index entries.
In some cases, indexes may be almost as effective in the absence of the large fields, while in other cases they are at least more effective than collection scans.
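For example, a hypothetical compound index named key_1_alt_key_1 that includes the oversized key field from the demonstration could be replaced with a narrower index that stays within the limit:
db.foo.dropIndex("key_1_alt_key_1");
db.foo.createIndex({alt_key: 1});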

Refactor

Redesign your schema and your queries, avoiding the need for BTREE indexes on very large fields.
