Warm tip: This article is reproduced from serverfault.com, please click

Why "LAST_VALUE(birthdate)" window function clause necessary compare to "FIRST_VALUE(birthdate)"?

发布于 2020-11-28 00:41:06
SELECT 
    first_name + ' ' + last_name AS name,
    country,
    birthdate,
    -- Retrieve the birthdate of the oldest voter per country
    FIRST_VALUE(birthdate) 
    OVER (PARTITION BY country ORDER BY birthdate) AS oldest_voter,
    -- Retrieve the birthdate of the youngest voter per country
    LAST_VALUE(birthdate) 
        OVER (PARTITION BY country ORDER BY birthdate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS youngest_voter
FROM voters
WHERE country IN ('Spain', 'USA');

The above query results in the following data:

name                country birthdate   oldest_vote youngest_voter
Caroline Griffin    Spain   1981-03-20  1981-03-20  1988-03-21
Christopher Jackson Spain   1981-04-15  1981-03-20  1988-03-21
Raul Raji           Spain   1981-04-25  1981-03-20  1988-03-21
Karen Cai           Spain   1981-05-03  1981-03-20  1988-03-21

If we remove the window function clause (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) of the "LAST_VALUE(birthdate)" the result changes as below:

SELECT 
    first_name + ' ' + last_name AS name,
    country,
    birthdate,
    -- Retrieve the birthdate of the oldest voter per country
    FIRST_VALUE(birthdate) 
    OVER (PARTITION BY country ORDER BY birthdate) AS oldest_voter,
    -- Retrieve the birthdate of the youngest voter per country
    LAST_VALUE(birthdate) 
        OVER (PARTITION BY country ORDER BY birthdate) AS youngest_voter
FROM voters
WHERE country IN ('Spain', 'USA');
name                country birthdate   oldest_voter youngest_voter
Caroline Griffin    Spain   1981-03-20  1981-03-20   1981-03-20
Christopher Jackson Spain   1981-04-15  1981-03-20   1981-04-15
Raul Raji           Spain   1981-04-25  1981-03-20   1981-04-25
Karen Cai           Spain   1981-05-03  1981-03-20   1981-05-03

The question is

  • FIRST_VALUE(birthdate) is giving the first value of the ordered partition by country which we use in the oldest_voter.
  • Why do we need a window function clause for the LAST_VALUE(birthdate) for the similar result which we need for the youngest_voter?
  • When I remove the clause the youngest_voter results copies the birthdate column and not the LAST_VALUE(birthdate) similar to FIRST_VALUE(birthdate).
Questioner
N.S.
Viewed
0
seanb 2020-11-28 13:51:10

Last_Value (and First_Value) are somewhat strange because they're analytic functions.

Analytic functions deal with windows differently than normal aggregate functions do.

To demonstrate this, I'll take a detour and use a running total using SUM as a first example of the difference between aggregate functions and analytic functions.

  • SUM is normally an aggregate function (e.g., when not using windowed functions)
  • However, it becomes an analytic function when you do include an ORDER BY window function.

Say you have the following table

id   num_items
1    5
2    8
3    3
4    5

If you then ran SELECT SUM(num_items) AS Total FROM mytable the result is 21, as expected. This is the typical 'aggregate' version of the SUM function.

However, it you add ORDER BY to the SUM, it becomes an analytic function.

Running SELECT SUM(Num_items) OVER (ORDER BY id) AS Total FROM mytable; gives you the following - a running total.

Total
5
13
16
21

With analytic functions, window functions operate on the data to the current row only unless specified otherwise with the ROWS BETWEEN clause.

Now, in your example (birthdates) without the ROWS BETWEEN clause, we can run through the processing.

Let's take the first row to start.

  • The analytic functions will operate on that one row only
  • So both the first_value (ordered by birthdate) and last_value (ordered by birthdate) will be this row's value

Let's take the second row

  • The analytic functions will operate on the first two rows
  • The first_value will be from the first row, but the last_value will be from the second row

Only at the last row will the results be as you expect. For first_value, it is typically not a problem (as you have demonstrated) but it is a 'gotcha!' for last_value.

UPDATE: To overcome the issue, instead of specifying the ROWS BETWEEN component, you can sort it the other way and use First_Value e.g.,

  • instead of LAST_VALUE(birthdate) OVER (PARTITION BY country ORDER BY birthdate) AS youngest_voter
  • use FIRST_VALUE(birthdate) OVER (PARTITION BY country ORDER BY birthdate DESC) AS youngest_voter