Sort by Regex

Posted on Posted in Coding

Rax string reduction, /cat(), on the IMPALA backend was not trivial to implement. The IMPALA documentation states:

GROUP_CONCAT ... does not support the OVER clause, ...

Effectively this means no ORDER BY on GROUP_CONCAT in IMPALA. This must be implemented eventually, because the whole GROUP_CONCAT has limited use without it. Meantime, regexes can be used to put things back in order like so:

regex_replace('annadaveeve','(anna)(dave)(eve)','\3\1\2')

Which would place eve first, anna second, and dave third. The rank number rnum of each name can be expressed like this:

ROW_NUMBER() OVER(ORDER BY age DESC) AS rnum

The GROUP_CONCAT from rnum belonging to (anna)(dave)(eve) would be 231, which can be translated to \3\1\2 using regexp_replace and translate like so:

regexp_replace(translate('123', '231', '123'), '.', '\\\0')

Giving us the third argument of the sorting regex. The first and second argument are simple. Likewise an SQL SELECT statement like this:

SELECT 
    parent,
    GROUP_CONCAT(child, '; ') OVER(ORDER BY age DESC)
  FROM table
  GROUP BY parent

could be rewritten in IMPALA like (untested) so:

SELECT 
    parent,
    REGEXP_REPLACE(
      GROUP_CONCAT(child, ''),
      GROUP_CONCAT(CONCAT('(',child,')'), ''),
      STRLEFT(
        REGEXP_REPLACE(
          TRANSLATE(
            '123456789', 
            GROUP_CONCAT(CAST(rnum AS STRING), ''),
            '123456789'),
          '.', 
          '\\\0; '),
        MAX(rnum) * (2+2) - 2))
  FROM (
    SELECT 
        *, 
        ROW_NUMBER() OVER(PARTITION BY parent ORDER BY age DESC) AS rnum,
      FROM table) please 
  GROUP BY parent

Your milage may vary using this code, but if you, like Rax, need a solution right now, try a variant of this code. (You might need to double up the \\.) Also note that it might be safer to use an escaped version of child to use in the second argument of the outer regexp_replace like so:

  regexp_replace(child, '[[:punct:]]', '\\\0') AS esc_child

Finally, note, this only works with group sizes below ten. To up that, replace 123456789 by abcdefghijklmnop (or longer), replace CAST(rnum AS STRING) with something like substr('abcdefghijklmnop', rnum, 1) and add an extra regexp_replace in front of strleft to transform \a\b\c...\o\p into \1\2\3..\15\16.