## Hive Data Deduplication

March 4, 2023 · Source: 葡萄喃喃呓语

Reference: Hive Data Deduplication – 菠萝大数据梦工厂 (Free World) – CSDN.NET blog channel: http://blog.csdn.net/jiangshouzhuang/article/details/49401469

-- Keep one row per ta_id: the one with the most recent ta_date.
insert overwrite table ta_customers
select t.ta_id, t.ta_date
from (
  select ta_id,
    ta_date,
    row_number() over (distribute by ta_id sort by ta_date desc) as rn
  from ta_customers
) t
where t.rn = 1;
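
Because insert overwrite replaces the existing contents of ta_customers, it may be worth checking which keys are actually duplicated before running the statement above. A minimal sketch, assuming the same table and ta_id column as in the query (the alias row_count is made up for illustration):

```sql
-- List every ta_id that occurs more than once, together with its row count.
select ta_id,
       count(*) as row_count
from ta_customers
group by ta_id
having count(*) > 1;
```

If this returns no rows, the table has no duplicate ta_id values and the overwrite would leave the same set of keys in place.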

Notes on the deduplication query:

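ta_id is the key used for deduplication. ta_date determines the order of the rows that share the same ta_id, and therefore which of those rows is kept.
t.rn = 1 keeps only the first row in each group of duplicates. Because the rows are sorted by ta_date in descending order, this example keeps the row with the most recent date for each ta_id.
distribute by specifies the distribution key: all rows with the same key are sent to the same reducer.
sort by sorts only within a single reducer. Since distribute by has already placed every row for a given key on the same reducer, the two together give a complete ordering of the rows within each key (not a global ordering of the whole table).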
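
To make the numbering concrete, here is a small sketch that runs the same window function over three hand-written rows instead of the real table. The CTE name sample_rows and its values are invented for illustration, and it assumes a Hive version that supports CTEs and select without a from clause. In a window specification Hive also accepts partition by ... order by, which should behave the same way as distribute by ... sort by here.

```sql
-- Hypothetical sample data: ta_id 1 appears twice, ta_id 2 once.
with sample_rows as (
  select 1 as ta_id, '2023-01-01' as ta_date
  union all
  select 1 as ta_id, '2023-02-01' as ta_date
  union all
  select 2 as ta_id, '2023-01-15' as ta_date
)
-- Number the rows of each ta_id from newest to oldest.
select ta_id,
       ta_date,
       row_number() over (distribute by ta_id sort by ta_date desc) as rn
from sample_rows;

-- Expected rows (the order of the output rows is not guaranteed):
--   ta_id=1  ta_date=2023-02-01  rn=1   kept by "where rn = 1"
--   ta_id=1  ta_date=2023-01-01  rn=2   dropped
--   ta_id=2  ta_date=2023-01-15  rn=1   kept by "where rn = 1"
```

Filtering this result on rn = 1 leaves one row per ta_id, which is what the insert overwrite statement writes back to the table.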

    Original author: 葡萄喃喃呓语
    Original URL: https://www.jianshu.com/p/abaf6f54a1fc