Recently we embarked on a project that required storing a relatively large dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such a dictionary, accounting for all the time spent extending, accessing and writing to it, AND it eventually crashed. So I turned to C++ for help.
In C++, `map` is one of the ways you can store a string key mapped to an integer value. Since we are concerned about data storage and access, I compared `map` and `unordered_map`. An `unordered_map` stores the keys in a hash table alongside the mapped values, while a `map` keeps its keys ordered. The important considerations here are:
- Memory: a `map` does not maintain a hash table and is therefore smaller than an `unordered_map`.
- Access: looking up a key in an `unordered_map` takes O(1) on average, while looking it up in a `map` takes O(log n).
I eventually chose `map`, because it is more memory-efficient given the small amount of RAM I have access to. However, each object still takes up about 8GB of RAM during its runtime (and I have 1800 objects to run through, each building a different dictionary). Saving these opened another can of worms.
In Python, we could easily use pickle or JSON to serialise the dictionary. In C++, it's common to use the Boost library. Boost.Serialization offers two archive formats: text and binary. Text archives are human-readable, but since I don't really expect to open and read 10M+ lines of key-value pairs, I opted for binary archives, which are machine-readable and smaller. (Read more: https://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format .)
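As a rough sketch (not the exact code from the project), saving and loading a `map` with Boost's binary archives looks roughly like this; `save_dict`, `load_dict` and the file path are just illustrative names:

```cpp
#include <fstream>
#include <map>
#include <string>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>

// Write the dictionary to a binary archive on disk.
void save_dict(const std::map<std::string, long long>& dict,
               const std::string& path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << dict;
}

// Read it back into a map.
void load_dict(std::map<std::string, long long>& dict,
               const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    ia >> dict;
}
```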
To further shrink the files when saving the maps, I used zlib compression. Conveniently, other people published ready-to-use code for this about half a year ago, which saved me some debugging:
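I won't reproduce their code here, but the pattern is roughly the following sketch: Boost.Iostreams provides a `zlib_compressor` filter that can be pushed in front of the output file stream, so the binary archive is compressed on the fly. The function name and path are again placeholders, not the project's actual code.

```cpp
#include <fstream>
#include <map>
#include <string>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/string.hpp>

// Serialise a map into a zlib-compressed binary archive on disk.
void save_compressed(const std::map<std::string, long long>& dict,
                     const std::string& path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::iostreams::filtering_ostream out;
    out.push(boost::iostreams::zlib_compressor()); // compress bytes on the fly
    out.push(ofs);                                 // ...and write them to the file
    boost::archive::binary_oarchive oa(out);
    oa << dict;
    // oa and out are destroyed before ofs, flushing the compressed stream.
}
```

This needs linking against boost_serialization, boost_iostreams and zlib.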
Ultimately this gets the total down to 96GB across 1800 files, all done within 6 hours.