江 Blog

Jiang in Thinking

CTE & Window Function: MySQL 8.0的新特性

2026-04-26


MySQL 8.0补齐了不少现代SQL能力, 其中最常用的两个是:

以前在MySQL 5.7里写Top N, 分组排名, 中位数这类题目, 经常需要用户变量或多层子查询. 到了MySQL 8.0, 写法终于接近PostgreSQL这类数据库.

这篇用LeetCode的Median Employee Salary作为例子.

测试数据

CREATE TABLE IF NOT EXISTS Employee (
    id INT PRIMARY KEY,
    company VARCHAR(255),
    salary INT
);

TRUNCATE TABLE Employee;

INSERT INTO Employee (id, company, salary) VALUES (1, 'A', 2341);
INSERT INTO Employee (id, company, salary) VALUES (2, 'A', 341);
INSERT INTO Employee (id, company, salary) VALUES (3, 'A', 15);
INSERT INTO Employee (id, company, salary) VALUES (4, 'A', 15314);
INSERT INTO Employee (id, company, salary) VALUES (5, 'A', 451);
INSERT INTO Employee (id, company, salary) VALUES (6, 'A', 513);
INSERT INTO Employee (id, company, salary) VALUES (7, 'B', 15);
INSERT INTO Employee (id, company, salary) VALUES (8, 'B', 13);
INSERT INTO Employee (id, company, salary) VALUES (9, 'B', 1154);
INSERT INTO Employee (id, company, salary) VALUES (10, 'B', 1345);
INSERT INTO Employee (id, company, salary) VALUES (11, 'B', 1221);
INSERT INTO Employee (id, company, salary) VALUES (12, 'B', 234);
INSERT INTO Employee (id, company, salary) VALUES (13, 'C', 2345);
INSERT INTO Employee (id, company, salary) VALUES (14, 'C', 2645);
INSERT INTO Employee (id, company, salary) VALUES (15, 'C', 2645);
INSERT INTO Employee (id, company, salary) VALUES (16, 'C', 2652);
INSERT INTO Employee (id, company, salary) VALUES (17, 'C', 65);

CTE是什么

CTE可以理解为"只在当前SQL语句里可见的临时命名结果集".

WITH ranked_employee AS (
    SELECT
        id,
        company,
        salary,
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn
    FROM Employee
)
SELECT *
FROM ranked_employee
WHERE company = 'A';

CTE不一定等价于临时表, 优化器可能会选择合并、物化或重写. 日常写SQL时, 它最大的价值是把复杂查询拆成几个可读的步骤.

递归CTE也很有用, 例如生成序列、处理树形结构. 本文先只讨论非递归CTE.

Window Function是什么

窗口函数和GROUP BY的差异在于:

例如按公司排序员工工资:

SELECT
    id,
    company,
    salary,
    ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn
FROM Employee
ORDER BY company, rn;

PARTITION BY company表示每个公司单独计算, ORDER BY salary, id表示公司内部先按工资排序, 工资相同再按id排序.

这里的id很重要. 如果只写ORDER BY salary, C公司有两个人工资都是2645, 它们的先后顺序在SQL语义上是不稳定的.

三个排名函数

排名函数里最常用的是ROW_NUMBER(), RANK(), DENSE_RANK().

用C公司数据看最清楚:

SELECT
    id,
    company,
    salary,
    ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS row_number_,
    RANK() OVER (PARTITION BY company ORDER BY salary) AS rank_,
    DENSE_RANK() OVER (PARTITION BY company ORDER BY salary) AS dense_rank_
FROM Employee
WHERE company = 'C'
ORDER BY salary, id;

结果:

+----+---------+--------+-------------+-------+-------------+
| id | company | salary | row_number_ | rank_ | dense_rank_ |
+----+---------+--------+-------------+-------+-------------+
| 17 | C       |     65 |           1 |     1 |           1 |
| 13 | C       |   2345 |           2 |     2 |           2 |
| 14 | C       |   2645 |           3 |     3 |           3 |
| 15 | C       |   2645 |           4 |     3 |           3 |
| 16 | C       |   2652 |           5 |     5 |           4 |
+----+---------+--------+-------------+-------+-------------+

区别:

如果要"每组取前3行", 用ROW_NUMBER(). 如果要"工资前三名, 并列也算", 用RANK()DENSE_RANK(), 取决于是否允许跳号.

中位数这题要取排序后第N/2附近的行, 所以用ROW_NUMBER()最直接.

解法

每个公司内部按工资升序排序, 同时算出公司总人数:

WITH ordered_employee AS (
    SELECT
        id,
        company,
        salary,
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn,
        COUNT(*) OVER (PARTITION BY company) AS cnt
    FROM Employee
)
SELECT
    id,
    company,
    salary
FROM ordered_employee
WHERE rn >= cnt / 2
  AND rn <= cnt / 2 + 1
ORDER BY company, salary, id;

为什么这个条件能处理奇偶?

输出:

+----+---------+--------+
| id | company | salary |
+----+---------+--------+
|  5 | A       |    451 |
|  6 | A       |    513 |
| 12 | B       |    234 |
|  9 | B       |   1154 |
| 14 | C       |   2645 |
+----+---------+--------+

注意C公司有两条2645. 因为SQL里写了ORDER BY salary, id, 所以中位数员工稳定为id = 14. 如果只按salary排序, MySQL, PostgreSQL, TiDB都不承诺两条2645的内部顺序, 结果可能变成id = 15.

另一种写法

也可以把中位数位置直接算出来:

WITH ordered_employee AS (
    SELECT
        id,
        company,
        salary,
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn,
        COUNT(*) OVER (PARTITION BY company) AS cnt
    FROM Employee
),
median_pos AS (
    SELECT
        id,
        company,
        salary,
        rn,
        FLOOR((cnt + 1) / 2) AS left_pos,
        FLOOR((cnt + 2) / 2) AS right_pos
    FROM ordered_employee
)
SELECT
    id,
    company,
    salary
FROM median_pos
WHERE rn IN (left_pos, right_pos)
ORDER BY company, salary, id;

这种写法更直观:

MySQL 8.0, PostgreSQL, TiDB的差异

这道题的核心SQL在三者上都能跑:

WITH ordered_employee AS (
    SELECT
        id,
        company,
        salary,
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn,
        COUNT(*) OVER (PARTITION BY company) AS cnt
    FROM Employee
)
SELECT id, company, salary
FROM ordered_employee
WHERE rn >= cnt / 2
  AND rn <= cnt / 2 + 1
ORDER BY company, salary, id;

但有几个细节值得注意.

1. MySQL 5.7不能这样写

CTE和窗口函数都是MySQL 8.0时代才正式进入MySQL主线的常用能力. MySQL 5.7没有WITH, 也没有ROW_NUMBER() OVER (...).

在MySQL 5.7里, 通常要用用户变量模拟排名:

SELECT
    id,
    company,
    salary
FROM (
    SELECT
        e.*,
        @rn := IF(@company = company, @rn + 1, 1) AS rn,
        @company := company
    FROM Employee e
    CROSS JOIN (SELECT @rn := 0, @company := '') vars
    ORDER BY company, salary, id
) ranked;

这类写法依赖MySQL用户变量求值顺序, 可读性和可移植性都差很多.

2. 窗口函数不能直接写在WHERE里

下面这个SQL不应该写:

SELECT
    id,
    company,
    salary
FROM Employee
WHERE ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) = 1;

原因是SQL的逻辑执行顺序里, WHERE早于窗口函数计算. MySQL官方文档也明确说窗口函数只能出现在select list和ORDER BY里.

所以通常做法是:

WITH ranked AS (
    SELECT
        id,
        company,
        salary,
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn
    FROM Employee
)
SELECT *
FROM ranked
WHERE rn = 1;

PostgreSQL和TiDB也是同样的思路. 如果用的是支持QUALIFY的数据库, 可以少一层CTE, 但MySQL 8.0和PostgreSQL原生语法里没有QUALIFY.

3. CTE物化行为不同

CTE主要是可读性工具, 不要天然假设它一定更快.

真实业务里, 复杂SQL写完后还是应该看EXPLAIN.

4. 返回类型可能不一样

PostgreSQL文档里row_number(), rank(), dense_rank()返回bigint.

MySQL和TiDB里一般也会把窗口函数结果当作整数列来处理, 但客户端驱动展示的类型细节可能不同. 如果应用层强依赖字段类型, 最好显式转换:

CAST(ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS SIGNED) AS rn

PostgreSQL里对应写法是:

CAST(ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS INTEGER) AS rn

5. 排序必须写完整

这是跨数据库最容易忽略的问题.

ROW_NUMBER()遇到并列值时, 如果ORDER BY不能唯一决定顺序, 数据库可以用任意顺序返回这些peer rows. 同一条SQL在不同数据库, 不同执行计划, 甚至同一数据库不同版本下, 都可能返回不同的ROW_NUMBER().

所以中位数员工这题应该写:

ORDER BY salary, id

而不是:

ORDER BY salary

这不是为了好看, 是为了结果稳定.

6. 这道题的输出没有本质差异

如果使用本文这条带ORDER BY salary, id的SQL, MySQL 8.0, PostgreSQL, TiDB的结果行应该一致:

+----+---------+--------+
| id | company | salary |
+----+---------+--------+
|  5 | A       |    451 |
|  6 | A       |    513 |
| 12 | B       |    234 |
|  9 | B       |   1154 |
| 14 | C       |   2645 |
+----+---------+--------+

差异主要在客户端展示格式和执行计划上. MySQL/TiDB命令行默认是上面这种表格样式, PostgreSQL的psql默认也是表格, 但边框样式不同. 真正影响结果的是排序条件是否完整, 而不是这三个数据库对ROW_NUMBER()语义的差异.

小结

CTE和Window Function让MySQL 8.0的SQL表达能力提升很多. 对应用开发来说, 最常见的收益是:

排名函数的选择可以简单记:

这道中位数题的关键是ROW_NUMBER()COUNT(*) OVER, 再用CTE在外层过滤目标行.

参考资料