sql-server – Recursively grouping similar items

I have been reading the following Microsoft article on recursive queries using CTEs,
and I can't work out how to use it to group common items.

I have a table with the following columns:

> ID
> FirstName
> LastName
> DateOfBirth
> BirthCountry
> GroupID

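What I need to do is start with the first person in the table and walk through the table, finding everyone who has the same (LastName and BirthCountry) or the same (DateOfBirth and BirthCountry).

The tricky part is that I have to assign them all the same GroupID, and then, for each person in that GroupID, check whether anyone else shares the same information and put those people into the same GroupID as well.

I think I could do this with multiple cursors, but it gets tricky.

Here is the sample data and the expected output.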

ID          FirstName  LastName   DateOfBirth BirthCountry GroupID
----------- ---------- ---------- ----------- ------------ -----------
1           John       Doe        1983-01-01  Grand        100
2           Jack       Stone      1976-06-08  Grand        100
3           Jane       Doe        1982-02-08  Grand        100
4           Adam       Wayne      1983-01-01  Grand        100
5           Kay        Wayne      1976-06-08  Grand        100
6           Matt       Knox       1983-01-01  Hay          101

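> John Doe and Jane Doe belong to the same group (100) because they have the same (LastName and BirthCountry).
> Adam Wayne is in group (100) because he has the same (DateOfBirth and BirthCountry) as John Doe.
> Kay Wayne is in group (100) because she has the same (LastName and BirthCountry) as Adam Wayne, who is already in group (100).
> Matt Knox is in a new group (101) because he does not match anyone in the previous groups.
> Jack Stone is in group (100) because he has the same (DateOfBirth and BirthCountry) as Kay Wayne, who is already in group (100).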

Data script:

CREATE TABLE #Tbl(
    ID              INT,
    FirstName       VARCHAR(50),
    LastName        VARCHAR(50),
    DateOfBirth     DATE,
    BirthCountry    VARCHAR(50),
    GroupID         INT NULL
);

INSERT INTO #Tbl VALUES
(1, 'John', 'Doe',      '1983-01-01',   'Grand',    NULL),
(2, 'Jack', 'Stone',    '1976-06-08',   'Grand',    NULL),
(3, 'Jane', 'Doe',      '1982-02-08',   'Grand',    NULL),
(4, 'Adam', 'Wayne',    '1983-01-01',   'Grand',    NULL),
(5, 'Kay',  'Wayne',    '1976-06-08',   'Grand',    NULL),
(6, 'Matt', 'Knox',     '1983-01-01',   'Hay',      NULL);
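
The matching rule described above is transitive: a person joins a group as soon as they match anyone already in it. As a rough illustration of that rule (not the recursive-CTE approach the question asks about), here is a minimal sketch that repeatedly pushes the smallest label across direct matches until nothing changes. It assumes the `#Tbl` table from the script above is loaded and that `ID` is unique; the variable `@changed` and the alias `MinGroupID` are illustrative names. Note that it overwrites the `GroupID` column.

```sql
-- Start every row in its own group, labelled by its own ID.
update #Tbl set GroupID = ID;

declare @changed int = 1;

while @changed > 0
begin
    -- Lower each row's label to the smallest label among its direct matches
    -- (same BirthCountry plus same LastName or same DateOfBirth).
    update t
    set    GroupID = m.MinGroupID
    from   #Tbl as t
    cross apply (
        select min(o.GroupID) as MinGroupID
        from   #Tbl as o
        where  o.BirthCountry = t.BirthCountry
          and (o.LastName = t.LastName or o.DateOfBirth = t.DateOfBirth)
    ) as m
    where  m.MinGroupID < t.GroupID;

    -- Stop once a full pass changes no rows.
    set @changed = @@rowcount;
end;

-- Renumber the surviving labels as 100, 101, ...
select ID, FirstName, LastName, DateOfBirth, BirthCountry,
       dense_rank() over (order by GroupID) + 99 as GroupID
from   #Tbl
order by ID;
```

On the sample rows this should put IDs 1 through 5 in group 100 and ID 6 in group 101. Labels only ever decrease, so the loop terminates; the number of passes grows with the length of the longest chain of matches.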

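Best answer: This is what I came up with. I rarely write recursive queries, so this was good practice for me. By the way, Kay and Adam do not share a birth country in your sample data.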

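-- #Tbl is the temp table created by the question's data script.
with data as (
    -- One row per distinct (LastName, DateOfBirth, BirthCountry), numbered arbitrarily.
    select
        LastName, DateOfBirth, BirthCountry,
        row_number() over (order by LastName, DateOfBirth, BirthCountry) as grpNum
    from #Tbl group by LastName, DateOfBirth, BirthCountry
), r as (
    -- Anchor: every row starts its own chain; equ lists the grpNums visited so far.
    select
        d.LastName, d.DateOfBirth, d.BirthCountry, d.grpNum,
        cast('|'  + cast(d.grpNum as varchar(8)) + '|' as varchar(1024)) as equ
    from data as d
    union all
    -- Recursive step: extend the chain to a matching, higher-numbered row that is
    -- not yet in equ, while keeping the grpNum of the chain's starting row.
    select
        d.LastName, d.DateOfBirth, d.BirthCountry, r.grpNum,
        cast(r.equ + cast(d.grpNum as varchar(8)) + '|' as varchar(1024))
    from r inner join data as d
            on      d.grpNum > r.grpNum
               and charindex('|' + cast(d.grpNum as varchar(8)) + '|', r.equ) = 0
               and (d.LastName = r.LastName or d.DateOfBirth = r.DateOfBirth)
               and  d.BirthCountry = r.BirthCountry
), g as (
    -- Each row takes the smallest starting grpNum that reaches it.
    select LastName, DateOfBirth, BirthCountry, min(grpNum) as grpNum
    from r group by LastName, DateOfBirth, BirthCountry
)
-- dense_rank() starts at 1, so "+ 100" numbers the groups 101, 102, ...;
-- use "+ 99" instead to start at 100 as in the question's expected output.
select t.*, dense_rank() over (order by g.grpNum) + 100 as GroupID
from #Tbl as t
    inner join g
        on      g.LastName = t.LastName
            and g.DateOfBirth = t.DateOfBirth
            and g.BirthCountry = t.BirthCountry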

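For the recursion to terminate, you have to keep track of the equivalences already found (via string concatenation), so that each level only needs to consider newly discovered equivalences (or connections, transitivities, etc.). Note that I have avoided using the word "group" so that it does not bleed into the GROUP BY concept.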
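
To make the bookkeeping concrete, here is a small standalone sketch of the membership test that the recursive step performs against the concatenated string. The variable `@equ` and its sample value are made up for illustration; only the `charindex` pattern comes from the query above.

```sql
-- A chain that has already visited rows 1, 2 and 12.
declare @equ varchar(1024) = '|1|2|12|';

select
    -- Returns 5: '|12|' starts at position 5, so row 12 was already visited and is skipped.
    charindex('|' + cast(12 as varchar(8)) + '|', @equ) as pos_of_12,
    -- Returns 0: row 5 is not in the chain yet, so the recursion may extend to it.
    charindex('|' + cast(5 as varchar(8)) + '|', @equ)  as pos_of_5,
    -- Returns 1: row 1 is found as itself; the delimiters keep it from matching inside '12'.
    charindex('|' + cast(1 as varchar(8)) + '|', @equ)  as pos_of_1;
```

Wrapping every number in `|` on both sides is what prevents false positives such as finding `1` inside `12`.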

http://rextester.com/edit/TVRVZ10193

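EDIT: I used an almost arbitrary numbering for the equivalences, but if you want them to appear in a sequence based on the lowest ID within each block, that is easy to do. Instead of using row_number(), use min(ID) as grpNum, assuming of course that the IDs are unique.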
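
A minimal sketch of that change, shown here as a standalone query over the question's `#Tbl` temp table so the numbering can be inspected on its own. It assumes `ID` is unique, as stated above; in the full query only the `data` CTE would change.

```sql
-- Number each distinct (LastName, DateOfBirth, BirthCountry) combination
-- by the lowest ID that carries it, instead of by row_number().
with data as (
    select
        LastName, DateOfBirth, BirthCountry,
        min(ID) as grpNum
    from #Tbl
    group by LastName, DateOfBirth, BirthCountry
)
select LastName, DateOfBirth, BirthCountry, grpNum
from data
order by grpNum;
```

Because every ID belongs to exactly one combination, the resulting grpNum values are still distinct, which the `d.grpNum > r.grpNum` comparison and the `equ` string both rely on.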
