E-commerce Data Analysis: Big Data Analytics on a Hive Data Warehouse

1. Requirements

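This project uses e-commerce data and a Hive data warehouse to carry out big data analysis.

The source data can be collected from logs, then cleaned, transformed, and loaded into the data warehouse. Analyzing the warehouse data produces summaries that support business decisions. The analysis here is based on four tables: user information, user order logs, product information, and product categories.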

User information

1,jake,男,15390809998,24
2,tom,男,15279975648,22
3,rose,女,14590809887,18
4,mike,男,18978872134,24
5,lili,女,17568949931,21
6,john,男,19198578874,22

Notes:

Field 1: user ID

Field 2: name

Field 3: gender (男 = male, 女 = female)

Field 4: mobile phone number

Field 5: age

User order logs

2539329,1,1,2020-12-11,36.7.255.255,0,10001#10002#10003
2539330,1,2,2020-12-11,36.7.255.255,1,20001#10002
2539331,1,3,2020-12-21,36.7.255.255,0,10001#30001#20001#10003
2539332,2,1,2021-01-01,183.217.24.116,0,20003#30001
2539333,2,2,2021-01-06,183.217.24.116,0,10001#20001#20004
2539334,2,3,2021-01-10,183.217.24.116,1,30001#30002#20003#10002
2539335,2,4,2021-01-22,183.217.24.116,0,20004#10003#30001
2539336,3,1,2020-12-19,43.255.18.67,1,10002#20001
2539337,3,2,2021-01-22,43.255.18.67,0,10001#10003#30001
2539338,3,3,2021-01-27,43.255.18.67,0,30001#20002#10003
2539339,3,4,2021-02-05,43.255.18.67,0,20001#20002#20003#30001#30002
2539340,4,1,2020-11-28,110.76.159.255,0,10001#10002#20001#20002
2539341,4,2,2021-01-01,110.76.159.255,1,20001#20002
2539342,5,1,2021-02-11,122.200.47.255,0,10001#20001#30002#10003
2539343,5,2,2021-03-13,122.200.47.255,0,20001#20002#30001
2539344,5,3,2021-03-22,122.200.47.255,0,30001#30003#10002
2539345,5,4,2021-03-25,122.200.47.255,1,10001#30001
2539346,6,1,2020-12-30,153.119.255.255,0,20001#30003

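Notes:

Field 1: log ID

Field 2: user ID

Field 3: sequence number of the order for that user

Field 4: date the order was created

Field 5: IP address of the user who placed the order

Field 6: order status, 0 = completed, 1 = canceled

Field 7: IDs of the products purchased in the order, separated by #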

Product information

10001,苹果,6,1
10002,香蕉,3,1
10003,雪梨,5,1
20001,白菜,2,2
20002,青菜,1.5,2
20003,萝卜,3,2
30001,牙刷,8,3
30002,牙膏,15,3
30003,毛巾,18,3

Notes:

Field 1: product ID

Field 2: product name

Field 3: product price

Field 4: product category ID

Product categories

1,水果
2,蔬菜
3,洗漱用品

Notes:

Field 1: category ID

Field 2: category name (水果 = fruit, 蔬菜 = vegetables, 洗漱用品 = toiletries)

2. Build the data warehouse and load data into the tables

1. Create the user information table:

create table user_information(
id int,
name string,
sex string,
phoneNumber string,
age int
)
comment 'User information table'
row format delimited fields terminated by ','
stored as textfile;

Load the user information data into the table:

load data local inpath '/home/hdfs/hive_exam/user_information.txt' into table user_information;

2. Create the user order log table:

create table user_order_logs(
logs_id int,
user_id int,
purchase_order int,
order_Date date,
ipAddress string,
state int,
goods_id string)
comment 'User order log table'
row format delimited fields terminated by ','
stored as textfile;

Load the user order log data:

load data local inpath '/home/hdfs/hive_exam/user_order_logs.txt' into table user_order_logs;

3. Create the product information table:

create table goods(
id int,
name string,
price double,
type int)
comment 'Product information table'
row format delimited fields terminated by ','
stored as textfile;

Load the data into the product information table:

load data local inpath '/home/hdfs/hive_exam/goods.txt' into table goods;

4. Create the product category table:

create table goods_items(
type int,
classification string)
comment 'Product category table'
row format delimited fields terminated by ','
stored as textfile;

Load the product category data:

load data local inpath '/home/hdfs/hive_exam/goods_items.txt' into table goods_items;
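
After the four load statements it can help to confirm that every table received the expected number of rows. The following is a minimal sketch of such a check. The expected counts in the comment are simply the number of sample lines listed in section 1, and the aliases `table_name` and `row_count` are illustrative.

```sql
-- Row counts per table after loading.
-- For the sample data in section 1: 6 users, 18 order logs,
-- 9 products and 3 categories.
select t.table_name, t.row_count
from (
    select 'user_information' as table_name, count(*) as row_count from user_information
    union all
    select 'user_order_logs' as table_name, count(*) as row_count from user_order_logs
    union all
    select 'goods' as table_name, count(*) as row_count from goods
    union all
    select 'goods_items' as table_name, count(*) as row_count from goods_items
) t;
```

If a column shows NULL for every row when the table is queried, the field delimiter or a column type in the table definition probably does not match the file.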

3. Requirements implemented on the data warehouse

  1. Count the completed orders for each day and rank the days
select count(*) as sum,order_Date
from user_order_logs
where state = 0
group by order_Date
order by sum desc;
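
The query above sorts the days but does not output a rank number. One possible way to add it is a window function, sketched below. The aliases `order_count` and `rk` are illustrative names, not part of the original.

```sql
-- Daily completed-order counts with an explicit rank column.
-- Days with the same count share the same rank.
select order_date,
       order_count,
       rank() over (order by order_count desc) as rk
from (
    select order_date, count(*) as order_count
    from user_order_logs
    where state = 0
    group by order_date
) t
order by rk, order_date;
```

`rank()` leaves gaps after ties; `dense_rank()` could be used instead if consecutive rank numbers are preferred.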
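  1. Compute the average number of completed orders per user
Approach: number of completed orders / number of distinct users with at least one completed order
select count(*)/count(distinct user_id)
from user_order_logs
where state=0;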
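
The query above only counts users who appear in a completed order. If the average should be taken over all registered users, a left join from user_information is one option. This is a sketch under that assumption; the alias `completed_orders` is illustrative.

```sql
-- Average completed orders per registered user.
-- Users without any completed order contribute 0 instead of being skipped.
select avg(coalesce(o.completed_orders, 0)) as avg_completed_orders
from user_information u
left join (
    select user_id, count(*) as completed_orders
    from user_order_logs
    where state = 0
    group by user_id
) o on u.id = o.user_id;
```

In the sample data every user has at least one completed order, so both queries give the same value (13 completed orders across 6 users).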
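  1. Count the completed orders for each region and rank the regions
Here the client IP address stands in for the region.
select count(*) as sum,ipaddress
from user_order_logs
where state=0
group by ipAddress
order by sum desc;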
  1. Compute canceled orders as a percentage of all orders
Approach: number of canceled orders / total number of orders * 100
select a.sum/b.sum*100
from
(
    select count(*) as sum
    from user_order_logs
    where state=1
) as a,(
    select count(*) as sum
    from user_order_logs
) as b;
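
The same figure can also be obtained in a single scan of the table, without joining two subqueries. A minimal sketch, with `canceled_pct` as an illustrative alias:

```sql
-- Share of canceled orders in percent, computed in a single pass.
-- state = 1 marks a canceled order.
select round(sum(if(state = 1, 1, 0)) / count(*) * 100, 2) as canceled_pct
from user_order_logs;
```

For the sample data, 5 of the 18 orders are canceled, so this should return 27.78.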
  1. Compute, for each user, the average number of products per order
Number of completed orders per user:
select user_id,count(user_id) as sum
from user_order_logs
where state=0
group by user_id;

Number of products purchased per user:
select user_id,sum(size(split(goods_id,'#'))) as sum
from user_order_logs
where state=0
group by user_id;

Combined:
select a.user_id,b.sum/a.sum as aver
from (
    select user_id,count(user_id) as sum
    from user_order_logs
    where state=0
    group by user_id
) as a,(
    select user_id,sum(size(split(goods_id,'#'))) as sum
    from user_order_logs
    where state=0
    group by user_id
) as b
where a.user_id=b.user_id;
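
Because both subqueries group the same rows by user_id, the division of the two sums is the same as averaging the item count of each order. A shorter sketch of that idea, with `avg_items_per_order` as an illustrative alias:

```sql
-- Average number of products per completed order, per user.
-- size(split(goods_id, '#')) is the number of items in one order.
select user_id,
       avg(size(split(goods_id, '#'))) as avg_items_per_order
from user_order_logs
where state = 0
group by user_id;
```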
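  1. Compute the total number of orders for each age group and the average spending per person in each age group
1. Total number of orders per age group (tab1). This count includes canceled orders, because the query does not filter on state.
select a.age,count(*) sum
from user_information as a,user_order_logs as b
where a.id=b.user_id
group by a.age;
Result: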
18      4
21      4
22      5
24      5

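Average spending per person in an age group: total spending of the age group / number of people in the age group

Get the age and the product list of each completed order:
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0;

Split the product list (this gives the products bought by each age group):
select c.age,col
from (
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col;
Result (excerpt):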
24      10001
24      10002
24      10003
24      10001
24      30001
Total spending per age group (tab2):
select a.age,sum(b.price) as money
from (
select c.age as age,col
from (
select a.age,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col
)as a,goods b
where a.col=b.id
group by a.age;
Result:
18      63.0
21      68.5
22      52.0
24      47.5
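
The join with goods in tab2 is an inner join, so purchased items whose ID does not exist in goods silently drop out of the totals. In the sample data, product 20004 appears in two completed orders of user 2 but is not listed in the product information, so it adds nothing to the total for age 22. A hedged sketch for finding such IDs (the aliases `item_id` and `occurrences` are illustrative):

```sql
-- List product IDs that occur in completed orders but are missing from goods.
select i.item_id, count(*) as occurrences
from (
    select cast(col as int) as item_id
    from user_order_logs
    lateral view explode(split(goods_id, '#')) t as col
    where state = 0
) i
left join goods g on i.item_id = g.id
where g.id is null
group by i.item_id;
```

Whether missing products should be ignored, priced at zero, or treated as a data error is a decision for the cleaning step; the queries in this article ignore them.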

Number of people in each age group (tab3):
select age,count(*) s
from user_information
group by age;
Result:
18      1
21      1
22      2
24      2

Total number of orders and average spending per person for each age group:
select x.age,x.cou,y.money/z.s avg_money
from 
(select a.age,count(*) cou
from user_information as a,user_order_logs as b
where a.id=b.user_id
group by a.age) as x
,
(select a.age,sum(b.price) as money
from (
select c.age as age,col
from (
select a.age age,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0
) as c
lateral view explode(split(c.goods_id,'#')) t as col
)as a,goods b
where a.col=b.id
group by a.age) as y
,
(select age,count(*) s
from user_information
group by age) as z

where x.age=y.age and y.age=z.age;
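
The nested query above can be hard to follow. One possible restructuring uses common table expressions and explicit joins, so that each intermediate table (tab1, tab2, tab3) gets a name. This is a sketch with illustrative names (`items`, `orders_per_age`, `spend_per_age`, `people_per_age`), not the original author's query.

```sql
with items as (
    -- one row per purchased product in a completed order
    select o.user_id, cast(col as int) as item_id
    from user_order_logs o
    lateral view explode(split(o.goods_id, '#')) t as col
    where o.state = 0
),
orders_per_age as (
    -- tab1: all orders per age group, canceled ones included
    select u.age, count(*) as order_count
    from user_information u
    join user_order_logs o on u.id = o.user_id
    group by u.age
),
spend_per_age as (
    -- tab2: total spending per age group
    select u.age, sum(g.price) as money
    from items i
    join user_information u on i.user_id = u.id
    join goods g on i.item_id = g.id
    group by u.age
),
people_per_age as (
    -- tab3: number of users per age group
    select age, count(*) as people
    from user_information
    group by age
)
select o.age, o.order_count, s.money / p.people as avg_money
from orders_per_age o
join spend_per_age s on o.age = s.age
join people_per_age p on o.age = p.age;
```

Dividing the tab2 totals shown above by the tab3 head counts gives 63.0, 68.5, 26.0 and 23.75 for ages 18, 21, 22 and 24.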
  1. Compute the male-to-female ratio of completed orders and the average spending of men and women
Completed orders placed by men (tab1):
select count(*) as sum_man
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='男' and state=0;

Completed orders placed by women (tab2):
select count(*) as sum_weman
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='女' and state=0;

Male-to-female ratio: tab1/tab2

Split the product lists (this gives the products bought by men and by women)
Products bought by men:
select c.sex,col
from (
select a.sex,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col;

Products bought by women:
select c.sex,col
from (
select a.sex,b.goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col;

Total price of the products bought by men and by women:
Total price of the products bought by men (tab3):
select a.sex,sum(b.price) as man_money

from 

(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,

goods as b
where a.col=b.id
group by a.sex;

Total price of the products bought by women (tab4):
select a.sex,sum(b.price) as weman_money

from 

(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,

goods as b
where a.col=b.id
group by a.sex;

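Male-to-female ratio of completed orders and male-to-female ratio of total spending (note that the second column is the ratio of the two totals, not a per-person average):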
select tab1.sum_man/tab2.sum_weman,tab3.man_money/tab4.weman_money
from

(select count(*) as sum_man
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='男' and state=0
) as tab1,

(select count(*) as sum_weman
from user_information a,user_order_logs b
where a.id=b.user_id and a.sex='女' and state=0
) as tab2,

(select a.sex,sum(b.price) as man_money

from 

(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='男'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,

goods as b
where a.col=b.id
group by a.sex
) as tab3,

(select a.sex,sum(b.price) as weman_money

from 

(select c.sex sex,col
from (
select a.sex sex,b.goods_id goods_id
from user_information as a,user_order_logs as b
where a.id=b.user_id and state=0 and a.sex='女'
) as c
lateral view explode(split(c.goods_id,'#')) t as col) as a,

goods as b
where a.col=b.id
group by a.sex
) as tab4;
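
The requirement asks for the average spending of men and women, while the combined query above divides the male total by the female total. If an average per completed order is what is wanted (an assumption; a per-person average would divide by the number of users instead), it could be computed per gender as sketched below. All CTE names and aliases are illustrative.

```sql
with items as (
    -- one row per purchased product in a completed order
    select o.user_id, cast(col as int) as item_id
    from user_order_logs o
    lateral view explode(split(o.goods_id, '#')) t as col
    where o.state = 0
),
orders_per_sex as (
    -- completed orders per gender
    select u.sex, count(*) as completed_orders
    from user_information u
    join user_order_logs o on u.id = o.user_id
    where o.state = 0
    group by u.sex
),
spend_per_sex as (
    -- total spending per gender
    select u.sex, sum(g.price) as money
    from items i
    join user_information u on i.user_id = u.id
    join goods g on i.item_id = g.id
    group by u.sex
)
select o.sex,
       o.completed_orders,
       s.money,
       s.money / o.completed_orders as avg_money_per_order
from orders_per_sex o
join spend_per_sex s on o.sex = s.sex;
```

This returns one row per gender, so the male-to-female order ratio can be read off the `completed_orders` column.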
  1. Rank the products by how often they were purchased
1. Split out every product from the completed orders in the user order log table (tab1):
select logs_id,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state=0;

2. Join tab1 with the product table (goods) on the product ID, group by product ID, and sort to get the final result.

Product purchase ranking:
select b.id,count(b.id) count
from
(select logs_id,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state=0
) as a,goods as b
where a.col=b.id
group by b.id
order by count desc;
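
The ranking is easier to read with the product name and an explicit rank number. A possible variant is sketched below; `purchase_count` and `rk` are illustrative aliases.

```sql
-- Purchase count per product in completed orders, with name and rank.
select id, name, purchase_count,
       rank() over (order by purchase_count desc) as rk
from (
    select g.id, g.name, count(*) as purchase_count
    from (
        select cast(col as int) as item_id
        from user_order_logs
        lateral view explode(split(goods_id, '#')) t as col
        where state = 0
    ) i
    join goods g on i.item_id = g.id
    group by g.id, g.name
) r
order by rk, id;
```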
  1. Compute the sales revenue for each week
1. Split out every product from the completed orders in the user order log table (tab1):
select order_date,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0;

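2. Join tab1 with the product table (goods) on the product ID. A week is identified by the year together with the week number within that year.

Weekly sales revenue (year plus week of year identifies a week):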
select year(a.order_date) year,weekofyear(a.order_date) week,sum(b.price) sum
from 
(select order_date,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0
) as a,goods b
where a.col=b.id
group by year(a.order_date),weekofyear(a.order_date)
order by sum;
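
Combining year() with weekofyear() can split a week that spans New Year. Assuming weekofyear() uses ISO-style numbering with weeks starting on Monday, 2021-01-01 falls in week 53, so it is grouped as (2021, 53), while 2020-12-30 from the same calendar week is grouped as (2020, 53). One possible alternative is to key each week by the Monday it starts on. The sketch below assumes a Hive version that provides next_day() (1.2.0 or later); `week_start` and `revenue` are illustrative aliases.

```sql
-- Weekly revenue keyed by the Monday that starts each week.
-- next_day(d - 7, 'MO') is the latest Monday on or before d.
select week_start, sum(price) as revenue
from (
    select next_day(date_sub(o.order_date, 7), 'MO') as week_start,
           g.price
    from (
        select order_date, cast(col as int) as item_id
        from user_order_logs
        lateral view explode(split(goods_id, '#')) t as col
        where state = 0
    ) o
    join goods g on o.item_id = g.id
) w
group by week_start
order by week_start;
```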
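  1. Rank the sales of product categories for each region
1. Split out every product from the completed orders in the user order log table (tab1):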
select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0;
Result (excerpt):
36.7.255.255    10001
36.7.255.255    10002
36.7.255.255    10003
36.7.255.255    10001
183.217.24.116  20003
183.217.24.116  30001
183.217.24.116  10001

2. Join tab1 with the product table (goods) and query the region, product category, and total price (tab2):

select ipaddress,b.type,sum(price)
from

(select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0) as a,
goods as b
where a.col=b.id
group by ipaddress,type;
Result (excerpt):
110.76.159.255  1       9.0
110.76.159.255  2       3.5
122.200.47.255  1       14.0
122.200.47.255  2       5.5
122.200.47.255  3       49.0

3. Join tab2 with the product category table to get the final result.

Sales ranking of product categories by region:
select ipaddress,b.classification,sum_money
from 

(select ipaddress,b.type type,sum(price) sum_money
from

(select ipaddress,col
from user_order_logs
lateral view explode(split(goods_id,'#')) t as col
where state = 0) as a,
goods as b
where a.col=b.id
group by ipaddress,type
) as a,
goods_items as b
where a.type=b.type
order by sum_money desc;
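
The final query sorts all rows globally by revenue. If the ranking should restart within each region, a window function partitioned by IP address is one way to express that. A sketch, with `rk` as an illustrative alias:

```sql
-- Category revenue per region with a rank that restarts for every region.
select ipaddress, classification, sum_money,
       rank() over (partition by ipaddress order by sum_money desc) as rk
from (
    select o.ipaddress, c.classification, sum(g.price) as sum_money
    from (
        select ipaddress, cast(col as int) as item_id
        from user_order_logs
        lateral view explode(split(goods_id, '#')) t as col
        where state = 0
    ) o
    join goods g on o.item_id = g.id
    join goods_items c on g.type = c.type
    group by o.ipaddress, c.classification
) r
order by ipaddress, rk;
```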

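4. Possible extension

The IP addresses in the requirements can be converted into concrete regions. The idea is to process the user order log data with a MapReduce program and then load the processed data into the table.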
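
As a rough illustration of how the region queries could look after such preprocessing, the sketch below assumes a lookup table named ip_region with the columns ipaddress and region. Both the table and its columns are hypothetical and do not appear in the original project; the preprocessing job would have to produce the data for it.

```sql
-- Hypothetical lookup table, assumed to be filled by the preprocessing job.
create table if not exists ip_region(
    ipaddress string,
    region string
)
row format delimited fields terminated by ','
stored as textfile;

-- Completed orders per region instead of per IP address.
-- IP addresses without a mapping are reported as 'unknown'.
select coalesce(r.region, 'unknown') as region, count(*) as order_count
from user_order_logs o
left join ip_region r on o.ipaddress = r.ipaddress
where o.state = 0
group by coalesce(r.region, 'unknown')
order by order_count desc;
```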

    Original author: A_Zhong20
    Original URL: https://blog.csdn.net/A_Zhong20/article/details/116324387