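1. Requirements
Using e-commerce data and a Hive data warehouse, carry out a big-data analysis.
The source data can be collected from logs, then cleaned, transformed, and loaded into the data warehouse. Analysing the warehouse data yields summaries that support business decisions. This project analyses an e-commerce warehouse built on four tables: user information, user order logs, product information, and product categories.
User information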
1,jake,男,15390809998,24
2,tom,男,15279975648,22
3,rose,女,14590809887,18
4,mike,男,18978872134,24
5,lili,女,17568949931,21
6,john,男,19198578874,22
Fields:
Field 1: user ID
Field 2: name
Field 3: sex (男 = male, 女 = female)
Field 4: mobile phone number
Field 5: age
User order logs
2539329,1,1,2020-12-11,36.7.255.255,0,10001#10002#10003
2539330,1,2,2020-12-11,36.7.255.255,1,20001#10002
2539331,1,3,2020-12-21,36.7.255.255,0,10001#30001#20001#10003
2539332,2,1,2021-01-01,183.217.24.116,0,20003#30001
2539333,2,2,2021-01-06,183.217.24.116,0,10001#20001#20004
2539334,2,3,2021-01-10,183.217.24.116,1,30001#30002#20003#10002
2539335,2,4,2021-01-22,183.217.24.116,0,20004#10003#30001
2539336,3,1,2020-12-19,43.255.18.67,1,10002#20001
2539337,3,2,2021-01-22,43.255.18.67,0,10001#10003#30001
2539338,3,3,2021-01-27,43.255.18.67,0,30001#20002#10003
2539339,3,4,2021-02-05,43.255.18.67,0,20001#20002#20003#30001#30002
2539340,4,1,2020-11-28,110.76.159.255,0,10001#10002#20001#20002
2539341,4,2,2021-01-01,110.76.159.255,1,20001#20002
2539342,5,1,2021-02-11,122.200.47.255,0,10001#20001#30002#10003
2539343,5,2,2021-03-13,122.200.47.255,0,20001#20002#30001
2539344,5,3,2021-03-22,122.200.47.255,0,30001#30003#10002
2539345,5,4,2021-03-25,122.200.47.255,1,10001#30001
2539346,6,1,2020-12-30,153.119.255.255,0,20001#30003
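Fields:
Field 1: log ID
Field 2: user ID
Field 3: sequence number of the order for that user
Field 4: date the order was created
Field 5: IP address the order was placed from
Field 6: order status (0 = completed, 1 = cancelled)
Field 7: IDs of the products bought in the order, separated by #
Product information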
10001,苹果,6,1
10002,香蕉,3,1
10003,雪梨,5,1
20001,白菜,2,2
20002,青菜,1.5,2
20003,萝卜,3,2
30001,牙刷,8,3
30002,牙膏,15,3
30003,毛巾,18,3
Fields:
Field 1: product ID
Field 2: product name
Field 3: product price
Field 4: product category ID
Product categories
1,水果
2,蔬菜
3,洗漱用品
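Fields:
Field 1: category ID
Field 2: category name
2. Build the data warehouse and load the data into the tables
1. Create the user information table: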
create table user_information(
id int,
name string,
sex string,
phoneNumber string,
age int
)
comment 'user information table'
row format delimited fields terminated by ','
stored as textfile;
Load the user information data into the table:
load data local inpath '/home/hdfs/hive_exam/user_information.txt' into table user_information;
2. Create the user order log table:
create table user_order_logs(
logs_id int,
user_id int,
purchase_order int,
order_Date date,
ipAddress string,
state int,
goods_id string)
comment 'user order log table'
row format delimited fields terminated by ','
stored as textfile;
Load the user order log data:
load data local inpath '/home/hdfs/hive_exam/user_order_logs.txt' into table user_order_logs;
3. Create the product information table:
create table goods(
id int,
name string,
price double,
type int)
comment 'product information table'
row format delimited fields terminated by ','
stored as textfile;
Load the data into the product information table:
load data local inpath '/home/hdfs/hive_exam/goods.txt' into table goods;
4. Create the product category table:
create table goods_items(
type int,
classification string)
comment 'product category table'
row format delimited fields terminated by ','
stored as textfile;
Load the product category data:
load data local inpath '/home/hdfs/hive_exam/goods_items.txt' into table goods_items;
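The goods_id column of user_order_logs holds a '#'-separated list in a single string, which is why the queries later in this document call split(). As an alternative sketch, Hive can parse that list at load time into an array column. The table name user_order_logs_arr and the column name goods_ids are hypothetical and not part of the original project.

```sql
-- Hypothetical variant of user_order_logs: the product list is an array<int>
-- instead of a '#'-joined string.
create table user_order_logs_arr(
  logs_id int,
  user_id int,
  purchase_order int,
  order_date date,
  ipaddress string,
  state int,
  goods_ids array<int>)
row format delimited
  fields terminated by ','
  collection items terminated by '#'
stored as textfile;

load data local inpath '/home/hdfs/hive_exam/user_order_logs.txt' into table user_order_logs_arr;

-- Number of products per order, without calling split().
select logs_id, size(goods_ids) as item_cnt
from user_order_logs_arr;
```

With this layout, lateral view explode(goods_ids) can be applied directly, and the exploded values are already integers when joined with goods.id.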
3. Answer the following requirements from the data warehouse
- Count the completed orders per day and rank the days
select count(*) as sum,order_Date
from user_order_logs
where state = 0
group by order_Date
order by sum desc;
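The query above only sorts the days. If an explicit rank number is wanted, a window function can be layered on top. The following is a sketch that assumes a Hive version with windowing support; the aliases order_cnt and rk are illustrative.

```sql
-- Completed orders per day with an explicit rank column.
-- Days with the same count share the same rank.
select order_date,
       order_cnt,
       rank() over (order by order_cnt desc) as rk
from (
  select order_date, count(*) as order_cnt
  from user_order_logs
  where state = 0
  group by order_date
) t;
```

dense_rank() could be used instead if gaps after tied ranks are not wanted.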
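- Compute the average number of completed orders per user
Approach: number of completed orders / number of users with at least one completed order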
select count(*)/count(distinct user_id)
from user_order_logs
where state=0;
- Count the completed orders per region and rank the regions (the IP address stands in for the region)
select count(*) as sum,ipAddress
from user_order_logs
where state=0
group by ipAddress
order by sum desc;
- Compute cancelled orders as a percentage of all orders
Approach: cancelled orders / total orders * 100
select a.sum/b.sum*100 as cancel_pct
from
(
select count(*) as sum
from user_order_logs
where state=1
) as a,(
select count(*) as sum
from user_order_logs
) as b;
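The same figure can be obtained in one pass over the table with conditional aggregation, which avoids the cross join of two subqueries. This is a sketch; the alias cancel_pct and the rounding to two decimals are choices made here.

```sql
-- Cancelled orders as a percentage of all orders, in a single scan.
select round(sum(case when state = 1 then 1 else 0 end) / count(*) * 100, 2) as cancel_pct
from user_order_logs;
```

For the sample data listed in section 1, 5 of the 18 orders have state 1, so the result is 27.78.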
- Compute, for each user, the average number of products per order
Completed orders per user:
select user_id,count(user_id) as sum
from user_order_logs
where state=0
group by user_id;
Products bought per user:
select user_id,sum(size(split(goods_id,'#'))) as sum
from user_order_logs
where state=0
group by user_id;
Combined:
select a.user_id,b.sum/a.sum as aver
from (
select user_id,count(user_id) as sum
from user_order_logs
where state=0
group by user_id
) as a,(
select user_id,sum(size(split(goods_id,'#'))) as sum
from user_order_logs
where state=0
group by user_id
) as b
where a.user_id=b.user_id;
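Because both subqueries group the same filtered rows by user_id, the combined result can also be written as a single aggregation. A minimal sketch, reusing the alias aver from the query above:

```sql
-- Average number of products per completed order, per user.
-- avg() over the per-order item count equals sum(items) / count(orders).
select user_id,
       avg(size(split(goods_id, '#'))) as aver
from user_order_logs
where state = 0
group by user_id;
```

For example, user 1 has two completed orders with 3 and 4 products, which gives 3.5.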
- Count the orders per age group and compute the average spend per person in each age group
1. Orders per age group (tab1)
select a.age,count(*) sum
from user_information as a,user_order_logs as b
where a.id=b.user_id
group by a.age
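Result:
18 4
21 4
22 5
24 5
Average spend per person in an age group: total spend of the age group / number of people in it
Get each age together with its product list: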
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
Split the product list (giving the products bought by each age group):
select c.age,col
from (
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col;
Result (excerpt):
24 10001
24 10002
24 10003
24 10001
24 30001
Total spend per age group (tab2):
select a.age,sum(b.price) as money
from (
select c.age as age,col
from (
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col
)as a,goods b
where a.col=b.id
group by a.age
Result:
18 63.0
21 68.5
22 52.0
24 47.5
Number of people per age group (tab3):
select age,count(*) s
from user_information
group by age;
Result:
18 1
21 1
22 2
24 2
Orders per age group and average spend per person in each age group:
select x.age,x.cou,y.money/z.s avg_money
from
(select a.age,count(*) cou
from user_information as a,user_order_logs as b
where a.id=b.user_id
group by a.age) as x
,
(select a.age,sum(b.price) as money
from (
select c.age as age,col
from (
select a.age age,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col
)as a,goods b
where a.col=b.id
group by a.age) as y
,
(select age,count(*) s
from user_information
group by age) as z
where x.age=y.age and y.age=z.age
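The three building blocks (tab1, tab2, tab3) can be named with common table expressions so that the final join is easier to read. This is a sketch that assumes a Hive version supporting WITH. The CTE names order_cnt, spend and headcount are illustrative, and the exploded product ID is cast explicitly rather than relying on implicit string-to-int conversion.

```sql
with order_cnt as (          -- tab1: all orders per age group
  select u.age, count(*) as cou
  from user_information u
  join user_order_logs o on u.id = o.user_id
  group by u.age
),
spend as (                   -- tab2: total spend on completed orders per age group
  select u.age, sum(g.price) as money
  from (
    select user_id, item_id
    from user_order_logs
    lateral view explode(split(goods_id, '#')) t as item_id
    where state = 0
  ) i
  join user_information u on u.id = i.user_id
  join goods g on g.id = cast(i.item_id as int)
  group by u.age
),
headcount as (               -- tab3: number of people per age group
  select age, count(*) as people
  from user_information
  group by age
)
select c.age, c.cou, s.money / h.people as avg_money
from order_cnt c
join spend s on c.age = s.age
join headcount h on c.age = h.age;
```

From the intermediate results listed above, the expected rows are (18, 4, 63.0), (21, 4, 68.5), (22, 5, 26.0) and (24, 5, 23.75).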
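- Compute the male-to-female ratio of completed orders and the average spend of men and of women
Completed orders placed by men (tab1):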
select count(*) as sum_man
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='男' and state=0
Completed orders placed by women (tab2):
select count(*) as sum_woman
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='女' and state=0
Male-to-female ratio: tab1/tab2
Split the product lists (giving the products bought by men and by women)
Products bought by men:
select c.sex,col
from (
select a.sex,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col;
Products bought by women:
select c.sex,col
from (
select a.sex,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col;
Total price of the products bought by men and by women:
Total price of the products bought by men (tab3):
select a.sex,sum(b.price) as man_money
from
(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,
goods as b
where a.col=b.id
group by a.sex;
Total price of the products bought by women (tab4):
select a.sex,sum(b.price) as woman_money
from
(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,
goods as b
where a.col=b.id
group by a.sex;
Male-to-female ratio of completed orders, and average spend per completed order for men and for women:
select tab1.sum_man/tab2.sum_woman as order_ratio,
tab3.man_money/tab1.sum_man as man_avg,
tab4.woman_money/tab2.sum_woman as woman_avg
from
(select count(*) as sum_man
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='男' and state=0
) as tab1,
(select count(*) as sum_woman
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='女' and state=0
) as tab2,
(select a.sex,sum(b.price) as man_money
from
(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,
goods as b
where a.col=b.id
group by a.sex
) as tab3,
(select a.sex,sum(b.price) as woman_money
from
(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,
goods as b
where a.col=b.id
group by a.sex
) as tab4;
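The four subqueries differ only in the sex filter, so the same figures can be sketched as one query grouped by sex. The sketch assumes that "average spend" means spend per completed order. The aliases completed_orders, total_spend and avg_per_order are illustrative.

```sql
-- One row per sex: completed orders, total spend, and spend per completed order.
select u.sex,
       count(distinct o.logs_id)                as completed_orders,
       sum(g.price)                             as total_spend,
       sum(g.price) / count(distinct o.logs_id) as avg_per_order
from (
  select logs_id, user_id, item_id
  from user_order_logs
  lateral view explode(split(goods_id, '#')) t as item_id
  where state = 0
) o
join user_information u on u.id = o.user_id
join goods g on g.id = cast(o.item_id as int)
group by u.sex;
```

On the sample data this gives 7 completed orders and a total of 99.5 for 男, and 6 completed orders and a total of 131.5 for 女. The male-to-female order ratio is then 7/6. An order whose products are all missing from goods would not be counted here, but no such order exists in the sample data.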
- Rank the products by how often they are bought
1. Split out all products in the user order table (tab1)
select logs_id,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state=0
2. Join tab1 with the product table (goods) on product ID, group by product ID, and sort to get the final result.
Product purchase ranking:
select b.id,count(b.id) as buy_count
from
(select logs_id,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state=0
) as a,goods as b
where a.col=b.id
group by b.id
order by buy_count desc;
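The inner join with goods silently drops any product ID that appears in an order but has no row in the product table. In the sample data, product 20004 is ordered in logs 2539333 and 2539335 but is not listed under product information. A left join can surface such IDs; the following is a sketch with illustrative aliases.

```sql
-- Product IDs that occur in orders but are missing from goods.
select a.col as goods_id, count(*) as occurrences
from (
  select col
  from user_order_logs
  lateral view explode(split(goods_id, '#')) t as col
) a
left join goods b on cast(a.col as int) = b.id
where b.id is null
group by a.col;
```

On the sample data this returns a single row: 20004 with 2 occurrences. The same gap affects every price-based total in this document, since an unmatched product contributes nothing to the sum.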
- Compute the sales amount per week
1. Split out all products in the user order table (tab1)
select order_date,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0;
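2. Join tab1 with the product table (goods) on product ID, and identify each week by the year plus the week number within that year
Sales amount per week (year and week-of-year together identify a week):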
select year(a.order_date) year,weekofyear(a.order_date) week,sum(b.price) sum
from
(select order_date,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0
) as a,goods b
where a.col=b.id
group by year(a.order_date),weekofyear(a.order_date)
order by sum;
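One caveat with combining year() and weekofyear(): weekofyear() returns the ISO week number, so 2021-01-01 falls in week 53, while year() returns 2021. The orders of 2020-12-30 and 2021-01-01 belong to the same calendar week but end up in two groups, (2020, 53) and (2021, 53). A sketch that groups by the Monday starting each week instead, assuming a Hive version that provides next_day(); the aliases week_start and sales are illustrative.

```sql
-- Weekly sales keyed by the Monday that starts each week.
-- next_day(d, 'MO') is the first Monday strictly after d, so subtracting
-- 7 days yields the Monday on or before d.
select date_sub(next_day(a.order_date, 'MO'), 7) as week_start,
       sum(b.price) as sales
from (
  select order_date, col
  from user_order_logs
  lateral view explode(split(goods_id, '#')) t as col
  where state = 0
) a
join goods b on cast(a.col as int) = b.id
group by date_sub(next_day(a.order_date, 'MO'), 7)
order by week_start;
```

With this key, the two orders mentioned above are both counted in the week starting 2020-12-28.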
- Rank product categories by sales within each region
1. Split out all products in the user order table (tab1)
select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0;
Result (excerpt):
36.7.255.255 10001
36.7.255.255 10002
36.7.255.255 10003
36.7.255.255 10001
183.217.24.116 20003
183.217.24.116 30001
183.217.24.116 10001
2. Join tab1 with the product table (goods) and query the region, product category, and total price (tab2)
select ipaddress,b.type,sum(price)
from
(select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0) as a,
goods as b
where a.col=b.id
group by ipaddress,type
Result (excerpt):
110.76.159.255 1 9.0
110.76.159.255 2 3.5
122.200.47.255 1 14.0
122.200.47.255 2 5.5
122.200.47.255 3 49.0
3. Join tab2 with the product category table to get the final result
Product category sales ranking per region:
select ipaddress,b.classification,sum_money
from
(select ipaddress,b.type type,sum(price) sum_money
from
(select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0) as a,
goods as b
where a.col=b.id
group by ipaddress,type
) as a,
goods_items as b
where a.type=b.type
order by sum_money desc;
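The final query sorts all (region, category) rows in one global list. To rank categories within each region, a window function partitioned by IP address can be used. This is a sketch assuming a Hive version with windowing support; the alias rk is illustrative.

```sql
-- Category sales per region, ranked inside each region.
select ipaddress,
       classification,
       sum_money,
       rank() over (partition by ipaddress order by sum_money desc) as rk
from (
  select a.ipaddress, c.classification, sum(b.price) as sum_money
  from (
    select ipaddress, col
    from user_order_logs
    lateral view explode(split(goods_id, '#')) t as col
    where state = 0
  ) a
  join goods b on cast(a.col as int) = b.id
  join goods_items c on b.type = c.type
  group by a.ipaddress, c.classification
) s
order by ipaddress, rk;
```

Filtering on rk = 1 in an outer query would give the best-selling category of each region.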
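4. Possible extension
The IP addresses used in the requirements could be converted to actual regions. The idea is to process the user order log data with a MapReduce program and then load the processed data into the table.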
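As a lighter-weight alternative to the MapReduce step, and not the approach described above, the mapping could be kept in a lookup table and joined inside Hive. The table ip_region, its columns and its contents are hypothetical: the original data set has no such table, and the sketch assumes one row per exact IP address.

```sql
-- Hypothetical lookup table mapping an IP address to a region name.
create table ip_region(
  ip string,
  region string)
row format delimited fields terminated by ','
stored as textfile;

-- Completed orders per region, ranked, once ip_region has been populated.
select r.region, count(*) as order_cnt
from user_order_logs o
join ip_region r on o.ipaddress = r.ip
where o.state = 0
group by r.region
order by order_cnt desc;
```

A real deployment would more likely map IP ranges than individual addresses, which this exact-match join does not cover.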