Title:   A Comprehensive Study of Main Memory Partitioning and
         its Application to Large-Scale Comparison- and Radix-Sort

Appeared in:  SIGMOD 2014

Authors:   Orestis Polychroniou
           Kenneth A. Ross


1) In the related work section, the sentence ...
"Balkesen et al. [4] claimed that non-partitioning hash joins are
competitive, but Balkesen et al. [3] improved over Blanas et al. [4]
and concluded that partitioning joins are generally faster, even
without using fast partitioning [14, 15]."
... contains a typo that attributes [4] to Balkesen et al.
The correct author of [4] is Blanas et al., as correctly stated later
in the same sentence.


2) There is an important error in the code for vertical register-resident
range comparison in Section 3.5.1. The code shown uses the keys to blend
the delimiters, whereas the comparison results should be used instead.
The algorithm is described correctly in the text, but the example code
is wrong. The corrected code is:

    // load 4x 32-bit keys from the input
    keys = _mm_load_si128(input_keys);
    input_keys += 4;
    // perform 4x 3-level binary tree comparisons
    cmp_L1 = _mm_cmpgt_epi32(keys, del_4);
    del_15 = _mm_blendv_epi8(del_1, del_5, cmp_L1);
    del_26 = _mm_blendv_epi8(del_2, del_6, cmp_L1);
    del_37 = _mm_blendv_epi8(del_3, del_7, cmp_L1);
    cmp_L2 = _mm_cmpgt_epi32(keys, del_26);
    del_1357 = _mm_blendv_epi8(del_15, del_37, cmp_L2);
    cmp_L3 = _mm_cmpgt_epi32(keys, del_1357);
    // bit-interleave 4x the 3 binary comparison results
    res = _mm_sub_epi32(_mm_xor_si128(res, res), cmp_L1);
    res = _mm_sub_epi32(_mm_add_epi32(res, res), cmp_L2);
    res = _mm_sub_epi32(_mm_add_epi32(res, res), cmp_L3);

Thanks to Nathan Kurz (nate@verse.com) for pointing out this error.
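
The corrected listing can be cross-checked against a scalar model. The
sketch below is not from the paper. It mirrors the listing one key at a
time, assuming the seven delimiters are sorted in ascending order and that
each of del_1 ... del_7 holds one delimiter broadcast to all four lanes.
The function name range_index and the array layout (index 0 unused) are
hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Scalar model of the 3-level binary tree comparison for one key.
 * del[1..7] are the delimiters (assumed ascending); del[0] is unused.
 * Returns a value in 0..7: the number of delimiters strictly below key. */
static int range_index(int32_t key, const int32_t del[8])
{
    /* level 1: compare against the middle delimiter */
    int c1 = key > del[4];
    /* the comparison result (not the key) selects the next delimiters */
    int32_t d15 = c1 ? del[5] : del[1];
    int32_t d26 = c1 ? del[6] : del[2];
    int32_t d37 = c1 ? del[7] : del[3];
    /* level 2 */
    int c2 = key > d26;
    int32_t d1357 = c2 ? d37 : d15;
    /* level 3 */
    int c3 = key > d1357;
    /* interleave the three comparison bits, most significant first */
    return (c1 << 2) | (c2 << 1) | c3;
}
```

With delimiters 10, 20, ..., 70, the key 35 maps to range 3 and the key 70
maps to range 6, because every comparison is strict.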


3) There is a minor error in Section 3.5.2. The code listing assigns the
results of the movemask instructions to the vector variables msk_1 and msk_2.
Both should be replaced by the scalar variables r_1 and r_2 respectively,
which are the variables used in the lines that follow.
The corrected code of Section 3.5.2 is:

    // access level 1 (non-root) of the index (5-way)
    lvl_1 = _mm_load_si128(&index_L1[r_0 << 2]);
    cmp_1 = _mm_cmpgt_epi32(lvl_1, key);
    r_1 = _mm_movemask_ps(cmp_1);   // ps: epi32
    r_1 = _bit_scan_forward(r_1 ^ 0x1FF);
    r_1 += (r_0 << 2) + r_0;
    // access level 2 of the index (9-way)
    lvl_2_A = _mm_load_si128(&index_L2[ r_1 << 3]);
    lvl_2_B = _mm_load_si128(&index_L2[(r_1 << 3) + 4]);
    cmp_2_A = _mm_cmpgt_epi32(lvl_2_A, key);
    cmp_2_B = _mm_cmpgt_epi32(lvl_2_B, key);
    cmp_2 = _mm_packs_epi32(cmp_2_A, cmp_2_B);
    cmp_2 = _mm_packs_epi16(cmp_2, _mm_setzero_si128());
    r_2 = _mm_movemask_epi8(cmp_2);
    r_2 = _bit_scan_forward(r_2 ^ 0x1FFFF);
    r_2 += (r_1 << 3) + r_1;


4) In the experimental section, the X axis values of Figure 12 are wrong.
They are shown as "1 2.5 5 10 25 50" but should be half of that,
namely "0.5 1.25 2.5 5 12.5 25". The array sizes in GB are the same as in
Figure 9, which has 64-bit (32-bit key + 32-bit payload) tuples.
The error can also be seen by noting that the platform RAM capacity
is 512 GB and thus cannot fit an array of 50 billion 128-bit (64-bit key +
64-bit payload) tuples, which would need approximately 800 GB of RAM.
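
The capacity argument is simple arithmetic and can be checked as follows.
This is only an illustrative sketch: the helper names are hypothetical, the
X axis is assumed to be in billions of tuples, and GB is taken as 10^9 bytes.

```c
#include <stdint.h>

/* Bytes needed for an array of `tuples` tuples of `tuple_bits` bits each. */
static uint64_t array_bytes(uint64_t tuples, unsigned tuple_bits)
{
    return tuples * (uint64_t)(tuple_bits / 8u);
}

/* Returns 1 if the array fits in `ram_gb` decimal gigabytes, else 0. */
static int fits_in_ram(uint64_t tuples, unsigned tuple_bits, uint64_t ram_gb)
{
    return array_bytes(tuples, tuple_bits) <= ram_gb * 1000000000ULL;
}
```

Under these assumptions, 50 billion 128-bit tuples need 800 GB and do not
fit in 512 GB, while the corrected maximum of 25 billion 128-bit tuples
needs 400 GB and does fit.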