Title: A Comprehensive Study of Main Memory Partitioning and its Application to Large-Scale Comparison- and Radix-Sort
Appeared in: SIGMOD 2014
Authors: Orestis Polychroniou, Kenneth A. Ross

1) In the related work section, the sentence

    ... "Balkesen et al. [4] claimed that non-partitioning hash joins are competitive, but Balkesen et al. [3] improved over Blanas et al. [4] and concluded that partitioning joins are generally faster, even without using fast partitioning [14, 15]." ...

contains a typo that identifies the authors of [4] as Balkesen et al. The correct authors of [4] are Blanas et al., as stated later in the same sentence.

2) There is an important error in the code for the vertical register-resident range comparison in Section 3.5.1. The code shown uses the keys to blend the delimiters, whereas the comparison results should be used instead. The algorithm is described correctly in the text, but the example code has errors. The correct code is:

    // load 4x 32-bit keys from the input
    keys = _mm_load_si128(input_keys);
    input_keys += 4;
    // perform 4x 3-level binary tree comparisons
    cmp_L1 = _mm_cmpgt_epi32(keys, del_4);
    del_15 = _mm_blendv_epi8(del_1, del_5, cmp_L1);
    del_26 = _mm_blendv_epi8(del_2, del_6, cmp_L1);
    del_37 = _mm_blendv_epi8(del_3, del_7, cmp_L1);
    cmp_L2 = _mm_cmpgt_epi32(keys, del_26);
    del_1357 = _mm_blendv_epi8(del_15, del_37, cmp_L2);
    cmp_L3 = _mm_cmpgt_epi32(keys, del_1357);
    // bit-interleave 4x the 3 binary comparison results
    res = _mm_sub_epi32(_mm_xor_si128(res, res), cmp_L1);
    res = _mm_sub_epi32(_mm_add_epi32(res, res), cmp_L2);
    res = _mm_sub_epi32(_mm_add_epi32(res, res), cmp_L3);

Thanks to Nathan Kurz (nate@verse.com) for pointing out this error.

3) There is a minor typo in Section 3.5.2. The code listing shows the vector variables msk_1 and msk_2 as the results of the movemask instructions. Both should be replaced by the scalar variables r_1 and r_2, respectively.
Variables r_1 and r_2 are the ones used in the subsequent lines. The corrected code of Section 3.5.2 is:

    // access level 1 (non-root) of the index (5-way)
    lvl_1 = _mm_load_si128(&index_L1[r_0 << 2]);
    cmp_1 = _mm_cmpgt_epi32(lvl_1, key);
    r_1 = _mm_movemask_ps(cmp_1); // ps: epi32
    r_1 = _bit_scan_forward(r_1 ^ 0x1FF);
    r_1 += (r_0 << 2) + r_0;
    // access level 2 of the index (9-way)
    lvl_2_A = _mm_load_si128(&index_L2[ r_1 << 3]);
    lvl_2_B = _mm_load_si128(&index_L2[(r_1 << 3) + 4]);
    cmp_2_A = _mm_cmpgt_epi32(lvl_2_A, key);
    cmp_2_B = _mm_cmpgt_epi32(lvl_2_B, key);
    cmp_2 = _mm_packs_epi32(cmp_2_A, cmp_2_B);
    cmp_2 = _mm_packs_epi16(cmp_2, _mm_setzero_si128());
    r_2 = _mm_movemask_epi8(cmp_2);
    r_2 = _bit_scan_forward(r_2 ^ 0x1FFFF);
    r_2 += (r_1 << 3) + r_1;

4) In the experimental section, the X-axis values of Figure 12 are wrong. They read "1 2.5 5 10 25 50" but should be half of that, namely "0.5 1.25 2.5 5 12.5 25". The array sizes in GB are the same as in Figure 9, which uses 64-bit (32-bit key + 32-bit payload) tuples. The error can also be recognized by noting that the platform's RAM capacity is 512 GB, which cannot fit an array of 50 billion 128-bit (64-bit key + 64-bit payload) tuples, since such an array would need approximately 800 GB of RAM.
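
The memory argument for the Figure 12 correction can be checked with a few lines of arithmetic. The sketch below is only an illustration and not part of the paper's code: the helper names array_bytes and array_gb are hypothetical, and it assumes that the X axis counts billions of tuples and that 1 GB means 10^9 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for an array of `tuples` tuples, each `tuple_bits` wide.
   Hypothetical helper for this check; not taken from the paper. */
static uint64_t array_bytes(uint64_t tuples, unsigned tuple_bits)
{
    assert(tuple_bits % 8 == 0);  /* tuples are a whole number of bytes */
    return tuples * (uint64_t)(tuple_bits / 8);
}

/* Array size in GB for an X-axis value given in billions of tuples.
   Assumption: 1 GB = 10^9 bytes. */
static double array_gb(double billions_of_tuples, unsigned tuple_bits)
{
    uint64_t tuples = (uint64_t)(billions_of_tuples * 1e9);
    return (double)array_bytes(tuples, tuple_bits) / 1e9;
}
```

Under these assumptions, array_gb(50, 128) evaluates to 800, which exceeds the 512 GB of RAM, while the corrected largest value gives array_gb(25, 128) = 400. That is the same size as 50 billion 64-bit tuples, array_gb(50, 64) = 400.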