Tags: postgresql, sql, data-warehouse, postgresql-9.3

Extrapolate daily historical values from a table that only records when a value changes (PostgreSQL)

Published on 2020-03-27 10:31:25

I have a table that records a row for each time a score for a location has changed.

score_history:

  • id int PK (auto-incrementing int)
  • happened_at timestamp (when the score changed)
  • location_id int FK (the location that the value is for)
  • score float (the new score)
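
To make the examples below concrete, here is a minimal DDL sketch of the table as described; the serial default and NOT NULL constraints are assumptions, not taken from the original post:

-- Sketch only: names follow the description above, types are assumed.
CREATE TABLE score_history (
    id          serial PRIMARY KEY,   -- auto-incrementing int
    happened_at timestamp NOT NULL,   -- when the score changed
    location_id int NOT NULL,         -- the location the value is for
    score       float NOT NULL        -- the new score
);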

This was done with an eye on efficiency and on being able to simply retrieve the list of changes for a given location, and it serves that purpose nicely.

I'm trying to output the data in a very redundant format to help load it into a rigid external system. The external system expects one row for every location for every date. The goal is to represent the last score value for each location on each date, so if the score changed 3 times on a given date, only the change closest to midnight counts as that location's closing score for the day. I imagine this is similar to the challenge of building a close-of-business inventory-level fact table.

I have a handy star-schema-style date dimension table, which has a row for every date, fully covering this sample period and extending well into the future.

That table looks like

dw_dim_date:

  • date date PK
  • a bunch of other columns like week number, is_us_holiday etc.
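
A dimension like this is easy to seed with generate_series; a minimal sketch, assuming only the date column, eliding the descriptive attributes, and picking an arbitrary future end date:

-- Sketch only: attribute columns and the end date are assumptions.
CREATE TABLE dw_dim_date (
    date date PRIMARY KEY
    -- week_number, is_us_holiday, ... (other descriptive columns)
);

INSERT INTO dw_dim_date (date)
SELECT d::date
FROM generate_series('2019-01-01'::date, '2030-12-31'::date,
                     interval '1 day') AS d;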

So, if I had only 3 records in the score_history table...

1, 2019-01-01 10:13:01, 100, 5.0
2, 2019-01-05 20:00:01, 100, 5.8
3, 2019-01-05 23:01:22, 100, 6.2
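
These rows could be loaded as follows, assuming the sketch DDL above:

INSERT INTO score_history (id, happened_at, location_id, score) VALUES
    (1, '2019-01-01 10:13:01', 100, 5.0),
    (2, '2019-01-05 20:00:01', 100, 5.8),
    (3, '2019-01-05 23:01:22', 100, 6.2);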

The desired output would be:

2019-01-01, 100, 5.0 
2019-01-02, 100, 5.0 
2019-01-03, 100, 5.0
2019-01-04, 100, 5.0 
2019-01-05, 100, 6.2

3 Requirements:

  1. One row per day per location even if there are no score records for that day.
  2. If there are score records for that day, the last one before midnight should be the score value for the row. In the event of a timestamp tie, the greater of the two scores should "win" (see the DISTINCT ON sketch after this list).
  3. If there are zero score records for that day the score should be the most recent previous score.
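
Requirement 2 on its own maps naturally onto PostgreSQL's DISTINCT ON; a sketch, assuming the table above:

-- Last score per location per day (requirement 2 only).
-- DISTINCT ON keeps the first row of each (location_id, day) group
-- under the ORDER BY; score DESC settles exact-timestamp ties in
-- favour of the greater score.
SELECT DISTINCT ON (location_id, happened_at::date)
       happened_at::date AS date,
       location_id,
       score
FROM score_history
ORDER BY location_id, happened_at::date, happened_at DESC, score DESC;

The hard part is requirements 1 and 3: producing a row for every date and carrying the last known score forward.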

I've been chasing my tail through subqueries and window functions.

Because I'm hesitant to post without showing something I tried, I'll share this trainwreck, which produces output, but of no meaning...

SELECT dw_dim_date.date,
       (SELECT score
        FROM score_history
        WHERE score_history.happened_at::DATE < dw_dim_date.date
           OR score_history.happened_at::DATE = dw_dim_date.date
        ORDER BY score_history.id DESC LIMIT 1) AS last_score
FROM dw_dim_date
WHERE dw_dim_date.date > '2019-06-01';

Grateful for guidance or pointers to other questions to read.

Asked by Nick · Viewed 89 times
Lukasz Szozda 2019-07-06 16:23

You could achieve this with correlated subqueries and LATERAL:

SELECT sub.date, sub.location_id, score
FROM (SELECT *
      FROM dw_dim_date
      CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
      WHERE date >= '2019-01-01'::date) sub
-- latest score on or before each date; score DESC breaks
-- exact-timestamp ties in favour of the greater score (requirement 2)
,LATERAL (SELECT score
          FROM score_history sc
          WHERE sc.happened_at::date <= sub.date
            AND sc.location_id = sub.location_id
          ORDER BY happened_at DESC, score DESC
          LIMIT 1) l
-- each location's earliest/latest score dates, for range filtering
,LATERAL (SELECT MIN(happened_at::date) m1, MAX(happened_at::date) m2
          FROM score_history sc
          WHERE sc.location_id = sub.location_id) lm
WHERE sub.date BETWEEN lm.m1 AND lm.m2
ORDER BY location_id, date;

db<>fiddle demo

How it works:

1) sub — the cross join of every date (from 2019-01-01 onward) with every distinct location_id

2) l — a LATERAL subquery selecting, for each (date, location) pair, the most recent score on or before that date

3) lm — a LATERAL subquery finding each location's earliest and latest score dates, for filtering

4) the WHERE clause restricts output to the date range for which each location actually has data; it could be relaxed if needed

On the three sample rows this returns exactly the five desired rows shown in the question, since lm limits location 100 to 2019-01-01 through 2019-01-05.
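
An alternative sketch using window functions (the approach the question was circling), assuming the same table names. Unlike the query above, it emits NULL for dates before a location's first record, and it carries the last score forward through every date in dw_dim_date, including future ones, unless bounded further:

-- 1) daily: the closing score per location per day.
-- 2) grid: every date crossed with every location.
-- 3) Forward-fill: COUNT(dy.score) OVER (...) increments at each
--    non-null score, so every grp group contains exactly one known
--    score, which MAX() spreads across the group.
WITH daily AS (
    SELECT DISTINCT ON (location_id, happened_at::date)
           happened_at::date AS date, location_id, score
    FROM score_history
    ORDER BY location_id, happened_at::date, happened_at DESC, score DESC
), grid AS (
    SELECT d.date, s.location_id
    FROM dw_dim_date d
    CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
    WHERE d.date >= '2019-01-01'
)
SELECT date, location_id,
       MAX(score) OVER (PARTITION BY location_id, grp) AS score
FROM (SELECT g.date, g.location_id, dy.score,
             COUNT(dy.score) OVER (PARTITION BY g.location_id
                                   ORDER BY g.date) AS grp
      FROM grid g
      LEFT JOIN daily dy ON dy.location_id = g.location_id
                        AND dy.date = g.date) x
ORDER BY location_id, date;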