Sdílet prostřednictvím


Graph Engine vs. C++ unordered_map: memory consumption test, Part 2

In my previous post, I demonstrated the memory management efficiency of Graph Engine against C++ unordered_map with MSVC's built-in memory allocator. Note that the test is not about unordered_map alone, but a combination with buffer allocation and deallocation. I would like to measure how does the program consume memory with all these parts coming together. The goal is not to beat C++, not to beat unordered_map, but rather to show that, even the Graph Engine key-value store is designed to manage billions of objects across multiple machines, it does not get bloated with too many abstractions when it comes down to the simplest task, and is even comparable to a bare-metal C++ built-in data structure, which is not performing bad at all. There are a lot of discussions and I'd like to thank all for the comments and feedback. Here are a few important points collected from the discussion:

1. The C++ part has memory leak concerning the overwritten entries.
2. The rand()*rand() is a bad idea. Also, using different random generators(C routine vs. C# Random class) does not produce precisely aligned semantics.
3. There are overheads invoking the random generators.
4. For the C++ part, people demand better coding style :-)
5. Plot the memory consumption chart with an origin.
6. For better performance, call reserve() on the unordered_map.
7. For better performance, embed content buffer into the cell entries.
8. Why does the C# program only allocate buffer once outside the loop?

Let me first explain why some of these proposed changes to the test are ruled out. For reserve(), it does improve the performance on the C++ side. But please note that I did not make any optimizations on the C# side either. And for the purpose of a key-value store, there should not be pre-assumed space reservations hard coded into the program, because generally if you are going to put it online, instead of doing planned experiments, you cannot know the usage patterns in advance. For similar reasons, we cannot embed fixed-length std::array into cell entries, because the value blobs stored into a key-value store may vary in size. Plus, variable length entries do not fit into a map, or we will end up storing references to these content objects, which, logically does not make a difference than a void*. The program only allocate a single C# byte array, because the Graph Engine K/V store has its own memory management subsystem, which is not allocating from the CLR managed heap.

To make the random number generation fair to both the C++ and C# test programs, I switched to a dedicated C++ program for generating test instructions for both programs. It randomly generates 64M store operations, each containing a cell id and a cell size. The maximum cell size is twice as large as in the last test so that the expected total size should be the same.

 #include "stdafx.h"
#include <random>
#include <functional>
#include <unordered_map>
#include <algorithm>
#include <cstdint>
#include <cstdio>

#define ENTRY_COUNT        (64<<20)
#define ENTRY_MAX_ID       (ENTRY_COUNT * 4)
#define ENTRY_MAX_SIZE     (314 * 2)
#define BUCKET_COUNT       (64)
#define BUCKET_GRANULARITY (ENTRY_MAX_SIZE / BUCKET_COUNT)

struct param_entry
{
    int64_t cell_id;
    size_t cell_size;
};

int main(int argc, char* argv[])
{
    std::default_random_engine             rnd_gen;
    std::uniform_int_distribution<int64_t> cell_id_dist(0, ENTRY_MAX_ID);
    std::uniform_int_distribution<int32_t> cell_size_dist(1, ENTRY_MAX_SIZE);
    std::vector<param_entry>               param_entries(ENTRY_COUNT);
    std::unordered_map<int32_t, int32_t>   size_buckets;
    auto cell_id_gen   = std::bind(cell_id_dist, rnd_gen);
    auto cell_size_gen = std::bind(cell_size_dist, rnd_gen);

    for (auto& param : param_entries)
    {
        param.cell_id   = cell_id_gen();
        param.cell_size = cell_size_gen();
    }

    FILE* fp;
    fopen_s(&fp, "params.dat", "wb");
    fwrite(param_entries.data(), sizeof(param_entry), ENTRY_COUNT, fp);
    fclose(fp);

    /* statictical analysis */

    for (int i = 0; i < BUCKET_COUNT; ++i)
    {
        size_buckets[i] = 0;
    }

    for (auto& param : param_entries)
    {
        int bucket_idx = param.cell_size / BUCKET_GRANULARITY;
        ++size_buckets[bucket_idx];
    }

    std::sort(param_entries.begin(), param_entries.end(), [](param_entry &a, param_entry &b) {return a.cell_id < b.cell_id; });
    auto new_end               = std::unique(param_entries.begin(), param_entries.end() , [](param_entry &a, param_entry &b){ return a.cell_id == b.cell_id; });
    int32_t unique_entry_count = new_end - param_entries.begin();

    printf(">>>");
    printf("%d total entries\n", ENTRY_COUNT);
    printf(">>>");
    printf("%d unique entries\n", unique_entry_count);
    printf(">>>");
    printf("Size histogram:\n");
    for (int i = 0; i < BUCKET_COUNT; ++i)
    {
        printf("%-2d\t\t%d\n", i * BUCKET_GRANULARITY, size_buckets[i]);
    }

    return 0;
}
 

The resulting data file params.dat takes exactly 1GB of space, with the following statistics information:

 >>>67108864 total entries
>>>59378342 unique entries
>>>Size histogram:
0               855411
9               962141
18              960462
27              961976
36              961641
45              961725
54              963224
63              961521
72              961336
81              961785
90              961053
99              964048
108             962261
117             961443
126             960895
135             962208
144             960792
153             960925
162             961486
171             961390
180             961762
189             961682
198             962320
207             962201
216             960415
225             962173
234             963833
243             960419
252             960049
261             963305
270             959831
279             959656
288             961434
297             961652
306             964231
315             963157
324             961786
333             960566
342             961645
351             961816
360             961421
369             961435
378             962095
387             960651
396             962341
405             961820
414             961355
423             961368
432             960645
441             961942
450             960869
459             962132
468             962618
477             960927
486             962067
495             960434
504             962018
513             962531
522             962315
531             961781
540             962139
549             961675
558             963721
567             961252
 

Uniform enough. I limited the range of cell ids to [0, 256M] to introduce some duplicated entries. And the test code, updated, here:

 #include "stdafx.h"
#include <vector>
#include <iostream>
#include <unordered_map>
#include <random>
#include <functional>
#include <chrono>
#include <cstdint>
#include <ctime>
using namespace std;

#define ENTRY_COUNT (1<<20)*64

typedef int64_t cellid_t;

struct CellEntry
{
    void*    ptr;
    int32_t  len;
    uint16_t cellType;
};

struct param_entry
{
    int64_t cell_id;
    size_t cell_size;
};

param_entry                        param_entries[ENTRY_COUNT];
unordered_map<cellid_t, CellEntry> LocalStorage;

int main(int argc, char** argv)
{
    srand(time(NULL));
    std::chrono::steady_clock::time_point start_time, end_time;
    std::cout << "Preparing keys" << std::endl;

    FILE* fp;
    fopen_s(&fp, "params.dat", "rb");
    fread(param_entries, sizeof(param_entry), ENTRY_COUNT, fp);
    fclose(fp);

    std::cout << "Test started." << std::endl;
    start_time = std::chrono::steady_clock::now();

    for (auto entry : param_entries)
    {
        void* cell_buf      = new char[entry.cell_size];
        auto  upsert_result = LocalStorage.emplace(entry.cell_id, CellEntry{ cell_buf, entry.cell_size, 0 });

        if (!upsert_result.second)
        {
            std::swap(cell_buf, upsert_result.first->second.ptr);
            delete[] cell_buf;
        }

        if (rand() % 3 == 0)
        {
            delete[] upsert_result.first->second.ptr;
            LocalStorage.erase(entry.cell_id);
        }
    }

    end_time = std::chrono::steady_clock::now();

    std::cout << "done." << std::endl;
    std::cout << "Time = " << chrono::duration_cast<chrono::seconds>(end_time - start_time).count() << std::endl;

    getchar();

    return 0;
}
 

C#:

 using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;
using Trinity;
using Trinity.Core.Lib;
namespace MemoryConsumptionTest_CSharp
{
    unsafe internal class Program
    {
        const int  ENTRY_COUNT    = (64 << 20);
        const int  ENTRY_MAX_SIZE = (314 * 2);
        [StructLayout(LayoutKind.Explicit)]
        struct param_entry
        {
            [FieldOffset(0)]
            public long cell_id;
            [FieldOffset(8)]
            public int cell_size;
            [FieldOffset(12)]
            private int __pad;
        };
        [DllImport("msvcrt.dll")]
        static extern void* fopen(string fname, string mode);
        [DllImport("msvcrt.dll")]
        static extern long fread(void* ptr, long size, long count, void* fp);
        [DllImport("msvcrt.dll")]
        static extern void fclose(void* fp);
        static void Main(string[] args)
        {
            // SaveCell will alloc memory inside GE,
            // so we just allocate a single buffer from C#
            byte* buf                        = (byte*)Memory.malloc(ENTRY_MAX_SIZE);
            param_entry* entries             = (param_entry*)Memory.malloc((ulong)(ENTRY_COUNT * sizeof(param_entry)));
            TrinityConfig.CurrentRunningMode = RunningMode.Embedded;
            Stopwatch stopwach               = new Stopwatch();
            Random dice                      = new Random();

            Console.WriteLine("Preparing keys");

            var fp = fopen("params.dat", "rb");
            fread(entries, sizeof(param_entry), ENTRY_COUNT, fp);
            fclose(fp);

            Console.WriteLine("Test started.");
            stopwach.Start();

            for(param_entry* p = entries, e = entries + ENTRY_COUNT; p != e; ++p)
            {
                Global.LocalStorage.SaveCell(p->cell_id, buf, p->cell_size, cellType: 0);

                if (dice.Next(3) == 0)
                {
                    Global.LocalStorage.RemoveCell(p->cell_id);
                }
            }

            stopwach.Stop();
            Console.WriteLine("Done.");
            Console.WriteLine("Time = {0}", stopwach.ElapsedMilliseconds / 1000);

            Console.ReadKey();
        }
    }
}
 

Updated performance result:

Looking forward to your comments!

Comments

  • Anonymous
    June 07, 2015
    This is a fine result, but for completeness, you should profile both tests and verify that the time being spent is actually being spent on the operation you are attempting to time.

  • Anonymous
    June 07, 2015
    Out of curiosity, does the final C# memory consumption number include the drop at the end or no?  I notice that both programs use 2GB of memory for param_entries.  In the C++ one though, the memory is allocated in the global space.  In the C# program, it's allocated on the heap.  I'm not sure if the C# program will  GC that memory or not (I'm not a C# expert).  What was the peak memory use by both apps? I'm just having a hard time understanding how the last comparison with so many memory leaks (3.5GB) consumed only 1.9 GB more memory than GraphEngine, but now with leaks fixed still uses 1.8 GB more memory. I do agree that GraphEngine has excellent (near perfect) memory consumption characteristics, which is the ultimate point of the article, but I'm concerned the C++ app isn't getting measured correctly.