Warm tip: This article is reproduced from stackoverflow.com, please click
google-bigquery

Removing duplicates and selecting distinct values with struct array in Bigquery

发布于 2020-04-08 15:39:48

I'm getting started with BigQuery. I have a database that looks like this which can be generated as

    WITH T AS (
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(20, "dot"), (20, "dot"), (1, "string")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (21, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (2, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 1 AS id, 'blue' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(4, "cube"), (4, "cube"), (4, "cube")], DATE(2020, 01, 31) AS date UNION ALL
  SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,31) AS date UNION ALL
  SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,30) AS date
)
SELECT *
FROM T;

I want to select each distinct date, and for each date take each shape and maximum count for each id and each colour. For example for 2020-01-31 for red 0 it'd be 2 dot 1 string, for 2020-01-30 for 0 black it'd be 296 dot 21 plane 156 cube. There may be repetition in rows, in dates and within the struct array in the data.

More precisely, I'd like the result of the query to look like this , which can be generated by

WITH T AS (
  SELECT DATE(2020,01,31) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(2, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (212, "plane", 0, "black"), (156, "cube", 0, "black"), (4, "cube", 1, "blue"), (5, "string", 2, "orange")] AS res UNION ALL
  SELECT DATE(2020,01,30) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(20, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (21, "plane", 0, "black"), (156, "cube", 0, "black"), (5, "string", 2, "orange")] AS res
)
SELECT *
FROM T;

I'm struggling with two issues: removing duplicates and selecting id and shape for each row of the array. For instance the query

SELECT date, ARRAY_CONCAT_AGG(ARRAY((SELECT AS STRUCT MAX(count), shape FROM UNNEST(arr) GROUP BY shape)))
FROM T
GROUP BY date

returns me duplicates. And then I'd need to assign to each nested row the id and colour. Any suggestions would be much appreciated.

Thanks!

Questioner
John
Viewed
763
rtenha 2020-02-01 01:11
with T AS (),
unnested_and_unique as (
  select distinct id, colour, count, shape, date
  from T left join unnest(arr) x
)
select date,array_agg(struct(count,shape,id,colour)) as res
from unnested_and_unique
group by 1