sql - Performance of subqueries vs. the MAX aggregate function

MAX aggregate vs. subquery:

This keeps coming up in the queries I have been writing lately, and I would like to find out which query style is:

> Most efficient (time and resources)
> More reliable and easier to maintain
> The one that makes the most sense to use

More information:

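The queries I write almost always pull from one base table and join to several other tables. The joined tables, however, usually have a vertical orientation: the foreign key is referenced many times, each time with a unique "descriptor" and "response". (See the #MovieDescriptions table for an example.)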

Please use the following SQL queries as a test scenario:

 -- Drop temp tables if exist

IF OBJECT_ID('TempDB..#Movies','U') IS NOT NULL
     DROP TABLE #Movies

IF OBJECT_ID('TempDB..#MovieDescriptions','U') IS NOT NULL
     DROP TABLE #MovieDescriptions

-- Creating temp tables

CREATE TABLE #Movies
(
     MovieID int IDENTITY(1,1),
     MovieName varchar (100),
     ReleaseYear datetime,
     Director varchar (100)
)

CREATE TABLE #MovieDescriptions
(
     MovieDescID int IDENTITY(1,1),
     FK_MovieID varchar(100),
     DescriptionType varchar(100),
     DescriptionResponse varchar(100)
)

-- Inserting test data

INSERT INTO #Movies (MovieName, ReleaseYear, Director) VALUES ('Gone With the Wind', CONVERT(datetime,'12/15/1939'), 'Victor Fleming')
INSERT INTO #Movies (MovieName, ReleaseYear, Director) VALUES ('2001: A Space Odyssey', CONVERT(datetime,'01/01/1968'), 'Stanley Kubrick')


INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Written By', 'Sidney Howard')
INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Genre', 'Drama')
INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Rating', 'G')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Written By', 'Stanley Kubrick')
INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Genre', 'Sci-Fi')
INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Rating', 'G')

-- Using subqueries

SELECT
     MovieName,
     ReleaseYear,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Genre'
      ) AS Genre,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Rating'
      ) AS Rating
FROM #Movies

-- Using aggregate functions

SELECT
     MovieName,
     ReleaseYear,
     MAX(CASE WHEN md.DescriptionType = 'Genre' THEN DescriptionResponse END) AS Genre,
     MAX(CASE WHEN md.DescriptionType = 'Rating' THEN DescriptionResponse END) AS Rating
FROM #Movies m
     INNER JOIN #MovieDescriptions md
     ON m.MovieID = md.FK_MovieID
GROUP BY MovieName, ReleaseYear

Also, if there is a better way to select this data, that would be helpful too.

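Best answer:

Assuming a more normal setup, where your tables are properly indexed and the foreign key relationship columns have matching data types (hint, hint: they currently do not match, int vs. varchar), you should generally find that your second query (join with aggregation) outperforms the first (subqueries in the SELECT clause). With a small amount of data the difference may not be noticeable, but the more data your base table (#Movies) holds, the more pronounced it becomes.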
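As a rough sketch of that "more normal setup", the two temp tables could be declared with matching int key columns and an index that supports the lookup by movie and description type. The PRIMARY KEY constraints, the index name IX_MovieDescriptions_Movie_Type and the choice of key and included columns are assumptions made for illustration; none of them appear in the original question.

```sql
-- Sketch only: same tables as in the question, but with matching key types.
IF OBJECT_ID('TempDB..#MovieDescriptions','U') IS NOT NULL
     DROP TABLE #MovieDescriptions

IF OBJECT_ID('TempDB..#Movies','U') IS NOT NULL
     DROP TABLE #Movies

CREATE TABLE #Movies
(
     MovieID int IDENTITY(1,1) PRIMARY KEY,      -- assumption: clustered PK on the base table
     MovieName varchar(100),
     ReleaseYear datetime,
     Director varchar(100)
)

CREATE TABLE #MovieDescriptions
(
     MovieDescID int IDENTITY(1,1) PRIMARY KEY,  -- assumption: clustered PK
     FK_MovieID int,                             -- int now, matching #Movies.MovieID
     DescriptionType varchar(100),
     DescriptionResponse varchar(100)
)

-- Hypothetical index: lets both query styles find a movie's 'Genre' or
-- 'Rating' row without scanning the whole table.
CREATE NONCLUSTERED INDEX IX_MovieDescriptions_Movie_Type
     ON #MovieDescriptions (FK_MovieID, DescriptionType)
     INCLUDE (DescriptionResponse)
```

The INSERT statements from the question still work against this definition, because the string literals '1' and '2' are implicitly converted to int.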

The reason is simple. In your first query:

SELECT
     MovieName,
     ReleaseYear,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Genre'
      ) AS Genre,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Rating'
      ) AS Rating
FROM #Movies

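If #Movies contains 1000 rows, the query as written asks SQL Server to perform one full scan of #Movies and, for each of those 1000 rows, 2 additional lookups against #MovieDescriptions. In effect, you are executing 2001 queries in total (1 + 1000 × 2). Because your subqueries sit in the SELECT clause, SQL Server has very little room to execute the query any other way.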

Your second query, on the other hand:

SELECT
     MovieName,
     ReleaseYear,
     MAX(CASE WHEN md.DescriptionType = 'Genre' THEN DescriptionResponse END) AS Genre,
     MAX(CASE WHEN md.DescriptionType = 'Rating' THEN DescriptionResponse END) AS Rating
FROM #Movies m
     INNER JOIN #MovieDescriptions md
     ON m.MovieID = md.FK_MovieID
GROUP BY MovieName, ReleaseYear

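Because you are using a join here, SQL Server has the flexibility to work out the most efficient way to combine the data from #Movies and #MovieDescriptions. Depending on your indexes, filters, row counts and so on, it may choose a hash join, or it may use nested loops, etc. The point is that SQL Server has more options and can now work out the best way to reduce the number of data block reads across the 2 tables (and their indexes).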
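One way to check this against your own data is to run both queries with I/O and timing statistics switched on, then compare the logical reads reported per table in the Messages output. The sketch below assumes the #Movies and #MovieDescriptions temp tables from the question already exist in the current session. The table aliases in the first query are added here for readability, and the figures you see will depend entirely on your data volume and indexes.

```sql
-- Report logical reads per table and CPU/elapsed time per statement.
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Query 1: correlated subqueries in the SELECT list
SELECT
     m.MovieName,
     m.ReleaseYear,
     (SELECT md.DescriptionResponse
      FROM #MovieDescriptions md
      WHERE md.FK_MovieID = m.MovieID AND md.DescriptionType = 'Genre'
      ) AS Genre,
     (SELECT md.DescriptionResponse
      FROM #MovieDescriptions md
      WHERE md.FK_MovieID = m.MovieID AND md.DescriptionType = 'Rating'
      ) AS Rating
FROM #Movies m

-- Query 2: join plus conditional aggregation
SELECT
     m.MovieName,
     m.ReleaseYear,
     MAX(CASE WHEN md.DescriptionType = 'Genre' THEN md.DescriptionResponse END) AS Genre,
     MAX(CASE WHEN md.DescriptionType = 'Rating' THEN md.DescriptionResponse END) AS Rating
FROM #Movies m
     INNER JOIN #MovieDescriptions md
     ON m.MovieID = md.FK_MovieID
GROUP BY m.MovieName, m.ReleaseYear

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
```

With only the two sample movies the difference will be negligible, so it is worth loading a realistic number of rows before drawing conclusions. Also note that the two queries are not strictly equivalent: the INNER JOIN drops movies that have no description rows at all, while the subquery version returns them with NULL values.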

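Edit: I should also add that the above assumes you are fetching every row the query returns. If the query returns thousands of rows but you only fetch the first 10, then in some cases the first query may actually outperform the second. That is because the subqueries are only executed for a row when it is selected or fetched. If you never fetch certain rows, you may never incur the cost of executing the subqueries for those unfetched rows. Something to consider.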
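To make the partial-fetch case more concrete, here is a sketch that asks for only ten rows using the subquery style. The TOP (10) and the ORDER BY m.MovieID are assumptions added for illustration, and limiting rows on the server is only an approximation of a client that stops fetching early. Whether the subqueries really run for just those ten rows depends on the plan SQL Server chooses, so it is worth confirming in the actual execution plan.

```sql
-- Sketch: limit the result to ten movies, so the correlated subqueries
-- may only need to be evaluated for the rows that are actually returned.
SELECT TOP (10)
     m.MovieName,
     m.ReleaseYear,
     (SELECT md.DescriptionResponse
      FROM #MovieDescriptions md
      WHERE md.FK_MovieID = m.MovieID AND md.DescriptionType = 'Genre'
      ) AS Genre,
     (SELECT md.DescriptionResponse
      FROM #MovieDescriptions md
      WHERE md.FK_MovieID = m.MovieID AND md.DescriptionType = 'Rating'
      ) AS Rating
FROM #Movies m
ORDER BY m.MovieID
```

The join-and-aggregate version, by contrast, generally has to group the joined rows before it can return anything, unless an index happens to supply the rows in a suitable order.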
