kudu hdfs 数据分层 灵活存储
转载自:海底苍鹰
网上购物,很多人只会看最近三个月,或者半年的订单,更早的历史订单就很少有人关注了。这种场景还是挺适合kudu+hdfs.
Apache Kudu旨在
快速分析、快速变化
的数据。Kudu提供快速插入/更新和高效列扫描
的组合,以在单个存储层上实现多个实时分析工作负载。因此,Kudu非常适合作为存储需要实时查询的数据的仓库。此外,Kudu支持实时更新和删除行,以支持延迟到达的数据和数据更正。Apache HDFS旨在
以低成本实现无限的可扩展性
。它针对数据不可变的面向批处理的场景进行了优化,与Apache Parquet文件格式配合使用时,可以以极高的吞吐量和效率访问结构化数据。kudu存最近半年的订单数据,hdfs存历史的订单数据。
1.创建kudu测试表
[bigserver2:21000] order_test> CREATE TABLE tank_order_kudu
> (
> id BIGINT,
> month STRING,
> name STRING,
> order_time TIMESTAMP,
> PRIMARY KEY(id,month)
> )
> PARTITION BY HASH(id) PARTITIONS 5,
> RANGE(month) (
> PARTITION VALUE="2019-07",
> PARTITION VALUE="2019-08",
> PARTITION VALUE="2019-09",
> PARTITION VALUE="2019-10",
> PARTITION VALUE="2019-11",
> PARTITION VALUE="2019-12",
> PARTITION VALUE="2020-01"
> ) STORED AS KUDU;
Query: CREATE TABLE tank_order_kudu
(
id BIGINT,
month STRING,
name STRING,
order_time TIMESTAMP,
PRIMARY KEY(id,month)
)
PARTITION BY HASH(id) PARTITIONS 5,
RANGE(month) (
PARTITION VALUE="2019-07",
PARTITION VALUE="2019-08",
PARTITION VALUE="2019-09",
PARTITION VALUE="2019-10",
PARTITION VALUE="2019-11",
PARTITION VALUE="2019-12",
PARTITION VALUE="2020-01"
) STORED AS KUDU
+-------------------------+
| summary |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 1.18s
2.创建hdfs测试表
[bigserver2:21000] order_test> CREATE TABLE tank_order_hdfs
> (
> id BIGINT,
> name STRING,
> order_time TIMESTAMP
> )
> PARTITIONED BY (month STRING)
> STORED AS PARQUET;
Query: CREATE TABLE tank_order_hdfs
(
id BIGINT,
name STRING,
order_time TIMESTAMP
)
PARTITIONED BY (month STRING)
STORED AS PARQUET
+-------------------------+
| summary |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.14s
3.kudu表插入数据
[bigserver2:21000] order_test> INSERT INTO tank_order_kudu VALUES
> (100,"2019-07",'tank','2019-07-01'),
> (101,"2019-07",'test','2019-07-21'),
> (102,"2019-08",'zhang','2019-08-12'),
> (103,"2019-09",'tankzhang','2019-09-10'),
> (104,"2019-10",'hao', '2019-10-15'),
> (105,"2019-11",'tank1', '2019-11-09'),
> (106,"2019-12",'tank2', '2019-12-23'),
> (107,"2020-01",'tank3', '2020-01-11');
Query: INSERT INTO tank_order_kudu VALUES
(100,"2019-07",'tank','2019-07-01'),
(101,"2019-07",'test','2019-07-21'),
(102,"2019-08",'zhang','2019-08-12'),
(103,"2019-09",'tankzhang','2019-09-10'),
(104,"2019-10",'hao', '2019-10-15'),
(105,"2019-11",'tank1', '2019-11-09'),
(106,"2019-12",'tank2', '2019-12-23'),
(107,"2020-01",'tank3', '2020-01-11')
Query submitted at: 2020-01-14 17:42:28 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=db4c3023789860ec:195019ee00000000
Modified 8 row(s), 0 row error(s) in 3.90s
4.hdfs表插入数据
[bigserver2:21000] order_test> INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES
> (11,'test','2019-01-21',"2019-01"),
> (12,'zhang','2019-02-12',"2019-02"),
> (13,'tankzhang','2019-03-10',"2019-03"),
> (14,'hao', '2019-04-15',"2019-04"),
> (15,'tank1', '2019-05-09',"2019-05"),
> (16,'tank2', '2019-06-23',"2019-06");
Query: INSERT INTO tank_order_hdfs (id,name,order_time,month) VALUES
(11,'test','2019-01-21',"2019-01"),
(12,'zhang','2019-02-12',"2019-02"),
(13,'tankzhang','2019-03-10',"2019-03"),
(14,'hao', '2019-04-15',"2019-04"),
(15,'tank1', '2019-05-09',"2019-05"),
(16,'tank2', '2019-06-23',"2019-06")
Query submitted at: 2020-01-14 17:43:20 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=85464b9cc23eb494:23e5752e00000000
Modified 6 row(s) in 3.54s
kudu表的数据,根hdfs表的数据,没有月份上重复。
5.创建视图
[bigserver2:21000] order_test> CREATE VIEW tank_order_view AS
> SELECT id,month,name,order_time
> FROM tank_order_kudu
> WHERE month >= "2019-07"
> UNION ALL
> SELECT id,month,name,order_time
> FROM tank_order_hdfs
> WHERE month < "2019-07";
Query: CREATE VIEW tank_order_view AS
SELECT id,month,name,order_time
FROM tank_order_kudu WHERE month >= "2019-07"
UNION ALL
SELECT id,month,name,order_time
FROM tank_order_hdfs
WHERE month < "2019-07"
+------------------------+
| summary |
+------------------------+
| View has been created. |
+------------------------+
Fetched 1 row(s) in 0.05s
6.查看view和表数据
[bigserver2:21000] order_test> select * from tank_order_view order by id;
Query: select * from tank_order_view order by id
Query submitted at: 2020-01-14 17:44:39 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=7d4a82f5ff1a610c:2c92cb8d00000000
+-----+---------+-----------+---------------------+
| id | month | name | order_time |
+-----+---------+-----------+---------------------+
| 11 | 2019-01 | test | 2019-01-21 00:00:00 |
| 12 | 2019-02 | zhang | 2019-02-12 00:00:00 |
| 13 | 2019-03 | tankzhang | 2019-03-10 00:00:00 |
| 14 | 2019-04 | hao | 2019-04-15 00:00:00 |
| 15 | 2019-05 | tank1 | 2019-05-09 00:00:00 |
| 16 | 2019-06 | tank2 | 2019-06-23 00:00:00 |
| 100 | 2019-07 | tank | 2019-07-01 00:00:00 |
| 101 | 2019-07 | test | 2019-07-21 00:00:00 |
| 102 | 2019-08 | zhang | 2019-08-12 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 104 | 2019-10 | hao | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 14 row(s) in 3.73s
[bigserver2:21000] order_test> select * from tank_order_kudu;
Query: select * from tank_order_kudu
Query submitted at: 2020-01-14 17:45:04 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0642df8221628792:066e868100000000
+-----+---------+-----------+---------------------+
| id | month | name | order_time |
+-----+---------+-----------+---------------------+
| 100 | 2019-07 | tank | 2019-07-01 00:00:00 |
| 102 | 2019-08 | zhang | 2019-08-12 00:00:00 |
| 101 | 2019-07 | test | 2019-07-21 00:00:00 |
| 104 | 2019-10 | hao | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 8 row(s) in 0.14s
[bigserver2:21000] order_test> select * from tank_order_hdfs;
Query: select * from tank_order_hdfs
Query submitted at: 2020-01-14 17:45:09 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=45489970a4fece53:13c264f200000000
+----+-----------+---------------------+---------+
| id | name | order_time | month |
+----+-----------+---------------------+---------+
| 12 | zhang | 2019-02-12 00:00:00 | 2019-02 |
| 16 | tank2 | 2019-06-23 00:00:00 | 2019-06 |
| 15 | tank1 | 2019-05-09 00:00:00 | 2019-05 |
| 13 | tankzhang | 2019-03-10 00:00:00 | 2019-03 |
| 11 | test | 2019-01-21 00:00:00 | 2019-01 |
| 14 | hao | 2019-04-15 00:00:00 | 2019-04 |
+----+-----------+---------------------+---------+
Fetched 6 row(s) in 0.12s
二张表的数据,都能在视图中体现
7.查看表分区
[bigserver2:21000] order_test> show range partitions tank_order_kudu;
Query: show range partitions tank_order_kudu
+-------------------+
| RANGE (month) |
+-------------------+
| VALUE = "2019-07" |
| VALUE = "2019-08" |
| VALUE = "2019-09" |
| VALUE = "2019-10" |
| VALUE = "2019-11" |
| VALUE = "2019-12" |
| VALUE = "2020-01" |
+-------------------+
Fetched 7 row(s) in 0.03s
[bigserver2:21000] order_test> show partitions tank_order_hdfs;
Query: show partitions tank_order_hdfs
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| 2019-01 | -1 | 1 | 774B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 |
| 2019-02 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 |
| 2019-03 | -1 | 1 | 799B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 |
| 2019-04 | -1 | 1 | 769B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 |
| 2019-05 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 |
| 2019-06 | -1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 |
| Total | -1 | 6 | 4.57KB | 0B | | | | |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
Fetched 7 row(s) in 0.02s
8.准备impala sql脚本
# cat tank_order.sql
use order_test; //选择数据库
//将kudu表某月数据插入到hdfs表,这个时候,二表有重复数据
INSERT INTO ${var:hdfs_table} PARTITION (month)
SELECT id,name,order_time,month
FROM ${var:kudu_table}
WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM');
//优化表,impala查询更快
COMPUTE INCREMENTAL STATS ${var:hdfs_table};
//修改视图,这样解决数据重复问题
ALTER VIEW ${var:view_name} AS
SELECT id,month,name,order_time
FROM ${var:kudu_table}
WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
UNION ALL
SELECT id,month,name,order_time
FROM ${var:hdfs_table}
WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM');
//删除kudu表旧分区,等于删除了kudu表中,在hdfs表中的数据
ALTER TABLE ${var:kudu_table}
DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM');
//增加kudu表新分区
ALTER TABLE ${var:kudu_table}
ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM');
9.执行脚本
[root@bigserver2 tanktest]# impala-shell -i bigserver2:21000 -f tank_order.sql --var=view_name=tank_order_view --var=kudu_table=tank_order_kudu --var=hdfs_table=tank_order_hdfs
Starting Impala Shell without Kerberos authentication
Opened TCP connection to bigserver2:21000
Connected to bigserver2:21000
Server version: impalad version 3.2.0-cdh6.3.1 RELEASE (build 3d5de689e9b9cfde0c01e1c64f4b4e830cee69c3)
Query: use order_test
Query: INSERT INTO tank_order_hdfs PARTITION (month)
SELECT id,name,order_time,month
FROM tank_order_kudu
WHERE month = from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM')
Query submitted at: 2020-01-14 17:49:22 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=44439e6c6e711ac4:0105b2f900000000
Modified 2 row(s) in 0.67s
Query: COMPUTE INCREMENTAL STATS tank_order_hdfs
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 7 partition(s) and 3 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 1.32s
Query: ALTER VIEW tank_order_view AS
SELECT id,month,name,order_time
FROM tank_order_kudu
WHERE month >= from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
UNION ALL
SELECT id,month,name,order_time
FROM tank_order_hdfs
WHERE month < from_unixtime(unix_timestamp(add_months(now(), -5)),'yyyy-MM')
+------------------------+
| summary |
+------------------------+
| View has been altered. |
+------------------------+
Fetched 1 row(s) in 0.07s
Query: ALTER TABLE tank_order_kudu
DROP RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), -6)),'yyyy-MM')
+-----------------------------------+
| summary |
+-----------------------------------+
| Range partition has been dropped. |
+-----------------------------------+
Fetched 1 row(s) in 0.13s
Query: ALTER TABLE tank_order_kudu
ADD RANGE PARTITION VALUE=from_unixtime(unix_timestamp(add_months(now(), +1)),'yyyy-MM')
+---------------------------------+
| summary |
+---------------------------------+
| Range partition has been added. |
+---------------------------------+
Fetched 1 row(s) in 0.17s
10.执行后,查看数据
[bigserver2:21000] order_test> select * from tank_order_view order by id;
Query: select * from tank_order_view order by id
Query submitted at: 2020-01-14 17:50:50 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=0042af35d2363548:5e6046a400000000
+-----+---------+-----------+---------------------+
| id | month | name | order_time |
+-----+---------+-----------+---------------------+
| 11 | 2019-01 | test | 2019-01-21 00:00:00 |
| 12 | 2019-02 | zhang | 2019-02-12 00:00:00 |
| 13 | 2019-03 | tankzhang | 2019-03-10 00:00:00 |
| 14 | 2019-04 | hao | 2019-04-15 00:00:00 |
| 15 | 2019-05 | tank1 | 2019-05-09 00:00:00 |
| 16 | 2019-06 | tank2 | 2019-06-23 00:00:00 |
| 100 | 2019-07 | tank | 2019-07-01 00:00:00 |
| 101 | 2019-07 | test | 2019-07-21 00:00:00 |
| 102 | 2019-08 | zhang | 2019-08-12 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 104 | 2019-10 | hao | 2019-10-15 00:00:00 |
| 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 |
| 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 14 row(s) in 0.29s
[bigserver2:21000] order_test> select * from tank_order_kudu;
Query: select * from tank_order_kudu
Query submitted at: 2020-01-14 17:51:06 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=a14e07360ee03232:d8be8fe000000000
+-----+---------+-----------+---------------------+
| id | month | name | order_time |
+-----+---------+-----------+---------------------+
| 107 | 2020-01 | tank3 | 2020-01-11 00:00:00 |
| 104 | 2019-10 | hao | 2019-10-15 00:00:00 |
| 103 | 2019-09 | tankzhang | 2019-09-10 00:00:00 |
| 102 | 2019-08 | zhang | 2019-08-12 00:00:00 |
| 105 | 2019-11 | tank1 | 2019-11-09 00:00:00 |
| 106 | 2019-12 | tank2 | 2019-12-23 00:00:00 |
+-----+---------+-----------+---------------------+
Fetched 6 row(s) in 0.14s
[bigserver2:21000] order_test> select * from tank_order_hdfs;
Query: select * from tank_order_hdfs
Query submitted at: 2020-01-14 17:51:53 (Coordinator: http://bigserver2:25000)
Query progress can be monitored at: http://bigserver2:25000/query_plan?query_id=204219b3ecc74213:8bbc30bc00000000
+-----+-----------+---------------------+---------+
| id | name | order_time | month |
+-----+-----------+---------------------+---------+
| 12 | zhang | 2019-02-12 00:00:00 | 2019-02 |
| 13 | tankzhang | 2019-03-10 00:00:00 | 2019-03 |
| 14 | hao | 2019-04-15 00:00:00 | 2019-04 |
| 15 | tank1 | 2019-05-09 00:00:00 | 2019-05 |
| 16 | tank2 | 2019-06-23 00:00:00 | 2019-06 |
| 101 | test | 2019-07-21 00:00:00 | 2019-07 |
| 100 | tank | 2019-07-01 00:00:00 | 2019-07 |
| 11 | test | 2019-01-21 00:00:00 | 2019-01 |
+-----+-----------+---------------------+---------+
Fetched 8 row(s) in 0.14s
从执行后的数据可以看出,视图数据是没有变的。kudu表的部分数据迁到了hdfs表中了。
11.查看分区
[bigserver2:21000] order_test> show range partitions tank_order_kudu;
Query: show range partitions tank_order_kudu
+-------------------+
| RANGE (month) |
+-------------------+
| VALUE = "2019-08" |
| VALUE = "2019-09" |
| VALUE = "2019-10" |
| VALUE = "2019-11" |
| VALUE = "2019-12" |
| VALUE = "2020-01" |
| VALUE = "2020-02" |
+-------------------+
Fetched 7 row(s) in 0.05s
[bigserver2:21000] order_test> show partitions tank_order_hdfs;
Query: show partitions tank_order_hdfs
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| month | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
| 2019-01 | 1 | 1 | 774B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-01 |
| 2019-02 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-02 |
| 2019-03 | 1 | 1 | 799B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-03 |
| 2019-04 | 1 | 1 | 769B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-04 |
| 2019-05 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-05 |
| 2019-06 | 1 | 1 | 779B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-06 |
| 2019-07 | 2 | 1 | 798B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://bigdata1/home/cdh6/hive/warehouse/order_test.db/tank_order_hdfs/month=2019-07 |
| Total | 8 | 7 | 5.35KB | 0B | | | | |
+---------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------------------------------+
kudu表,少了一个旧分区,多了一个新分区。
hdfs表,多了一个分区。
kudu hdfs 数据分层 灵活存储
https://www.gmtgo.com/13996.html