Bulk-importing large data volumes into Oracle with Python at ~10k rows/s

On a project I needed to import a CSV with over 800,000 rows (200+ MB) into an Oracle database. At first I used the import tool in Navicat for Oracle. After five or six minutes I gave up in despair: it had only reached 2%, and at that rate it would not finish in half a day.
Thinking back to my MySQL roots, I figured Oracle must also have an efficient bulk-insert idiom. So I dug into Oracle's official documentation: Mastering Oracle+Python, Part 1: Querying Best Practices.

Many rows at a time

Large insert operations do not require many separate inserts, because Python fully supports inserting many rows at once via the cx_Oracle.Cursor.executemany method.
Limiting the number of execute operations improves program performance enormously, so this should be the first thing to consider when writing an application with a large number of INSERT operations.
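As a minimal sketch of the idea (the table name, bind variables, and chunk size here are illustrative, not taken from the original post): rows are grouped into fixed-size batches, and each batch is sent to the server in a single executemany round trip instead of one round trip per row.

```python
# Sketch: batching rows for cx_Oracle's executemany.
# The chunking helper is plain Python; the commented-out section shows
# where a real cx_Oracle cursor would consume each batch.

def chunks(rows, size):
    """Yield successive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = [(i, 'name-%d' % i) for i in range(25)]
batches = list(chunks(rows, 10))
print([len(b) for b in batches])  # [10, 10, 5]

# With a real connection (illustrative):
#   cursor.prepare("INSERT INTO MY_TABLE (ID, NAME) VALUES (:1, :2)")
#   for batch in chunks(rows, 10000):
#       cursor.executemany(None, batch)
#   conn.commit()
```

Each executemany call binds one batch of tuples to the prepared statement, which is exactly the pattern the demo script at the end of this post uses.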

Yes! That is exactly what I wanted.
I quickly adapted the official demo into a simple import script and ran it: 138.7 s, a little over two minutes, to load the 200+ MB file of roughly 880,000 rows. The performance log is attached below.

Total rows: 885640
2017-05-17 14:11:24
===begin===
2017-05-17 14:11:24
prepare end
2017-05-17 14:11:26 => 1 - 10000 ( 9999 ) finish
2017-05-17 14:11:27 => 10001 - 20000 ( 9999 ) finish
2017-05-17 14:11:28 => 20001 - 30000 ( 9999 ) finish
2017-05-17 14:11:30 => 30001 - 40000 ( 9999 ) finish
2017-05-17 14:11:31 => 40001 - 50000 ( 9999 ) finish
2017-05-17 14:11:33 => 50001 - 60000 ( 9999 ) finish
2017-05-17 14:11:34 => 60001 - 70000 ( 9999 ) finish
2017-05-17 14:11:35 => 70001 - 80000 ( 9999 ) finish
2017-05-17 14:11:37 => 80001 - 90000 ( 9999 ) finish
2017-05-17 14:11:38 => 90001 - 100000 ( 9999 ) finish
2017-05-17 14:11:39 => 100001 - 110000 ( 9999 ) finish
2017-05-17 14:11:42 => 110001 - 120000 ( 9999 ) finish
2017-05-17 14:11:44 => 120001 - 130000 ( 9999 ) finish
2017-05-17 14:11:45 => 130001 - 140000 ( 9999 ) finish
2017-05-17 14:11:46 => 140001 - 150000 ( 9999 ) finish
2017-05-17 14:11:47 => 150001 - 160000 ( 9999 ) finish
2017-05-17 14:11:49 => 160001 - 170000 ( 9999 ) finish
2017-05-17 14:11:50 => 170001 - 180000 ( 9999 ) finish
2017-05-17 14:11:51 => 180001 - 190000 ( 9999 ) finish
2017-05-17 14:11:53 => 190001 - 200000 ( 9999 ) finish
2017-05-17 14:11:56 => 200001 - 210000 ( 9999 ) finish
2017-05-17 14:11:58 => 210001 - 220000 ( 9999 ) finish
2017-05-17 14:12:00 => 220001 - 230000 ( 9999 ) finish
2017-05-17 14:12:02 => 230001 - 240000 ( 9999 ) finish
2017-05-17 14:12:04 => 240001 - 250000 ( 9999 ) finish
2017-05-17 14:12:06 => 250001 - 260000 ( 9999 ) finish
2017-05-17 14:12:10 => 260001 - 270000 ( 9999 ) finish
2017-05-17 14:12:14 => 270001 - 280000 ( 9999 ) finish
2017-05-17 14:12:16 => 280001 - 290000 ( 9999 ) finish
2017-05-17 14:12:17 => 290001 - 300000 ( 9999 ) finish
2017-05-17 14:12:19 => 300001 - 310000 ( 9999 ) finish
2017-05-17 14:12:25 => 310001 - 320000 ( 9999 ) finish
2017-05-17 14:12:26 => 320001 - 330000 ( 9999 ) finish
2017-05-17 14:12:28 => 330001 - 340000 ( 9999 ) finish
2017-05-17 14:12:29 => 340001 - 350000 ( 9999 ) finish
2017-05-17 14:12:30 => 350001 - 360000 ( 9999 ) finish
2017-05-17 14:12:51 => 360001 - 370000 ( 9999 ) finish
2017-05-17 14:12:53 => 370001 - 380000 ( 9999 ) finish
2017-05-17 14:12:54 => 380001 - 390000 ( 9999 ) finish
2017-05-17 14:12:56 => 390001 - 400000 ( 9999 ) finish
2017-05-17 14:12:58 => 400001 - 410000 ( 9999 ) finish
2017-05-17 14:13:36 => 410001 - 420000 ( 9999 ) finish
2017-05-17 14:13:37 => 420001 - 430000 ( 9999 ) finish
2017-05-17 14:13:39 => 430001 - 440000 ( 9999 ) finish
2017-05-17 14:13:40 => 440001 - 450000 ( 9999 ) finish
2017-05-17 14:13:42 => 450001 - 460000 ( 9999 ) finish
2017-05-17 14:13:43 => 460001 - 470000 ( 9999 ) finish
2017-05-17 14:13:45 => 470001 - 480000 ( 9999 ) finish
2017-05-17 14:13:47 => 480001 - 490000 ( 9999 ) finish
2017-05-17 14:13:48 => 490001 - 500000 ( 9999 ) finish
2017-05-17 14:13:50 => 500001 - 510000 ( 9999 ) finish
2017-05-17 14:13:51 => 510001 - 520000 ( 9999 ) finish
2017-05-17 14:13:52 => 520001 - 530000 ( 9999 ) finish
2017-05-17 14:13:54 => 530001 - 540000 ( 9999 ) finish
2017-05-17 14:13:56 => 540001 - 550000 ( 9999 ) finish
2017-05-17 14:13:57 => 550001 - 560000 ( 9999 ) finish
2017-05-17 14:13:59 => 560001 - 570000 ( 9999 ) finish
2017-05-17 14:14:01 => 570001 - 580000 ( 9999 ) finish
2017-05-17 14:14:03 => 580001 - 590000 ( 9999 ) finish
2017-05-17 14:14:05 => 590001 - 600000 ( 9999 ) finish
2017-05-17 14:14:06 => 600001 - 610000 ( 9999 ) finish
2017-05-17 14:14:07 => 610001 - 620000 ( 9999 ) finish
2017-05-17 14:14:09 => 620001 - 630000 ( 9999 ) finish
2017-05-17 14:14:10 => 630001 - 640000 ( 9999 ) finish
2017-05-17 14:14:11 => 640001 - 650000 ( 9999 ) finish
2017-05-17 14:14:13 => 650001 - 660000 ( 9999 ) finish
2017-05-17 14:14:14 => 660001 - 670000 ( 9999 ) finish
2017-05-17 14:14:16 => 670001 - 680000 ( 9999 ) finish
2017-05-17 14:14:17 => 680001 - 690000 ( 9999 ) finish
2017-05-17 14:14:19 => 690001 - 700000 ( 9999 ) finish
2017-05-17 14:14:20 => 700001 - 710000 ( 9999 ) finish
2017-05-17 14:14:23 => 710001 - 720000 ( 9999 ) finish
2017-05-17 14:14:28 => 720001 - 730000 ( 9999 ) finish
2017-05-17 14:14:34 => 730001 - 740000 ( 9999 ) finish
2017-05-17 14:14:37 => 740001 - 750000 ( 9999 ) finish
2017-05-17 14:14:40 => 750001 - 760000 ( 9999 ) finish
2017-05-17 14:14:43 => 760001 - 770000 ( 9999 ) finish
2017-05-17 14:14:46 => 770001 - 780000 ( 9999 ) finish
2017-05-17 14:14:49 => 780001 - 790000 ( 9999 ) finish
2017-05-17 14:14:51 => 790001 - 800000 ( 9999 ) finish
2017-05-17 14:14:55 => 800001 - 810000 ( 9999 ) finish
2017-05-17 14:15:15 => 810001 - 820000 ( 9999 ) finish
2017-05-17 14:15:17 => 820001 - 830000 ( 9999 ) finish
2017-05-17 14:15:19 => 830001 - 840000 ( 9999 ) finish
2017-05-17 14:15:20 => 840001 - 850000 ( 9999 ) finish
2017-05-17 14:15:22 => 850001 - 860000 ( 9999 ) finish
2017-05-17 14:15:24 => 860001 - 870000 ( 9999 ) finish
2017-05-17 14:15:27 => 870001 - 880000 ( 9999 ) finish
2017-05-17 14:15:28 => 880001 - 890000 ( 5639 ) finish
2017-05-17 14:15:28
execute end

Press any key to continue . . .
[Finished in 138.7s]

As the log shows, most 10,000-row batches complete in about a second, which is solid throughput. The rough Python import script is attached at the end of this post.

Problems encountered

In practice you may run into the following issues.

1. ORA-12899: value too large for column

This means a value exceeded the column's size. The original database had no such problem, so I guessed it was a character-set issue: my script used UTF-8, which did not match the database character set, so Chinese characters were re-encoded into more bytes than the column allowed.

Fix: add os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8' to the script so that the database performs the character-set conversion itself.
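The size mismatch is easy to see by counting bytes: the same Chinese string occupies more bytes in UTF-8 than in GBK, so a VARCHAR2 column whose byte length was sized for one encoding can overflow under the other. A quick illustration (the specific string is just an example):

```python
# Each CJK character is 3 bytes in UTF-8 but 2 bytes in GBK, so a
# VARCHAR2 column sized in bytes for GBK data can overflow when the
# client sends UTF-8.
s = u'汉字'                      # two Chinese characters
print(len(s.encode('utf-8')))   # 6
print(len(s.encode('gbk')))     # 4
```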

2. ORA-03135: connection lost contact

This means the connection to the database was dropped. In my case the script ran long enough that the idle connection exceeded the time limit allowed by the Oracle setup.

Fix: locate sqlnet.ora under the Oracle home's network/admin directory, e.g. /app/oracle/product/11.2.0/dbhome_1/network/admin/sqlnet.ora on my machine, and append the line
sqlnet.expire_time = 60
Unit: minutes
Range: greater than 0
Default: not set (disabled)
This makes the server periodically probe the client (dead connection detection), which also keeps long-running sessions from being dropped as idle.

Here is the Python demo:

# -*- coding: utf8 -*-

import cx_Oracle
import csv
import time
import os

# Let the Oracle client handle the character-set conversion
os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'

M = []
csvfile = open(r'C:\Users\Administrator\Desktop\Table.csv', 'rb')
reader = csv.reader(csvfile)


for line in reader:
    try:
        M.append((line[0], line[1], line[2], line[3]))
    except IndexError:
        # skip rows that do not have all four columns
        pass

csvfile.close()



# Create the database connection
conn = cx_Oracle.connect('oracle', '123456', '192.168.102.126:1521/ORCL')

# Get a cursor
cursor = conn.cursor()

print 'Total rows:', len(M)

print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print '===begin==='

cursor.prepare("INSERT INTO MY_TABLE (ID, COMPANY, DEPARTMENT, NAME) VALUES (:1,:2,:3,:4)")

print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print 'prepare end'

# Insert in fixed-size batches instead of hard-coding the batch count
BATCH = 10000
for begin in range(0, len(M), BATCH):
    end = min(begin + BATCH, len(M))
    cursor.executemany(None, M[begin:end])
    print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())), '=>', begin + 1, '-', end, '(', end - begin, ')', 'finish'

print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print 'execute end'

conn.commit()
# 885640
print time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print 'end'

cursor.execute("SELECT COUNT(*) FROM MY_TABLE")
print cursor.fetchone()

# Close the cursor and connection, releasing resources
cursor.close()
conn.close()
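One caveat with the demo: it loads all ~880,000 rows into a list before inserting. A generator-based reader (a sketch, not part of the original script) can feed executemany in fixed-size batches without holding the whole file in memory. The helper below is plain Python and runs on its own; the commented-out part shows where the cursor would fit.

```python
import csv
import io
import itertools

def batched_rows(fileobj, batch_size):
    """Yield lists of up to `batch_size` parsed CSV rows from `fileobj`."""
    reader = csv.reader(fileobj)
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        yield batch

# Small self-contained demonstration with an in-memory CSV:
data = io.StringIO(u"1,a\n2,b\n3,c\n4,d\n5,e\n")
sizes = [len(b) for b in batched_rows(data, 2)]
print(sizes)  # [2, 2, 1]

# With a real file and cursor (illustrative):
#   with open('Table.csv') as f:
#       for batch in batched_rows(f, 10000):
#           cursor.executemany(None, batch)
```

For a 200 MB file this keeps peak memory at one batch of tuples instead of the entire table.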

I hear that Data Pump import/export is even more efficient; I will give it a try when I get the chance.

    Original author: HBLong
    Original post: https://www.jianshu.com/p/005869ac722d
    This article is reposted from the web to share knowledge; if it infringes any rights, please contact the blogger for removal.