Fuzzy matching on individual fields between two large tables

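Requirement:

There are two tables, A and B, with 3 million and 4 million rows respectively. The path field in B holds directory information and has to be matched against the path field in A. That is, B.path in A.path: A.path must actually contain the directory string from B.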
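To make the requirement concrete, here is a minimal sketch of the matching rule in plain Python. The function name and the sample paths are made up for illustration; only the rule "B.path in A.path" comes from the requirement above. With 3 million and 4 million rows, this brute-force comparison would need about 1.2 × 10^13 substring checks, which is why a faster lookup is needed.

```python
def naive_match(a_paths, b_paths):
    """Return (b_path, a_path) pairs where b_path occurs inside a_path."""
    matches = []
    for b_path in b_paths:
        if not b_path:            # skip empty directory values
            continue
        for a_path in a_paths:
            if b_path in a_path:  # the "B.path in A.path" rule
                matches.append((b_path, a_path))
    return matches


# Hypothetical sample rows, for illustration only.
a_paths = ['/mnt/disk1/data/run1/sample.fq', '/mnt/disk2/other/file.txt']
b_paths = ['/data/run1', '/data/run2']
print(naive_match(a_paths, b_paths))
```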

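Problems:

1 A direct join using like or instr is extremely slow: the query ran for a whole day without producing a result. The execution plan cannot use an index and falls back to a full table scan, so performance is very poor.

2 With a full text index, every search is a fuzzy match that returns a large pile of results, so it cannot be used directly in a join.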
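Problem 2 can be illustrated without a database. The sketch below imitates a word-based search by splitting paths on "/" (an assumption made for this illustration; MySQL's full text parser tokenizes differently) and shows that a path can contain every word of a directory without containing the directory itself. All names and paths here are hypothetical.

```python
def words(path):
    """Split a path into its non-empty components."""
    return [part for part in path.split('/') if part]


def word_match(a_path, b_path):
    """True when every component of b_path appears somewhere in a_path."""
    a_words = set(words(a_path))
    return all(word in a_words for word in words(b_path))


# Hypothetical paths, for illustration only.
b_path = '/data/run1'
exact = '/mnt/data/run1/sample.fq'    # really contains b_path
shuffled = '/mnt/run1/archive/data'   # same words, different order

for a_path in (exact, shuffled):
    # Columns: path, word-level match, substring match
    print(a_path, word_match(a_path, b_path), b_path in a_path)
```

Both paths pass the word-level test, but only the first passes the substring test, so word-level hits need a second check before they count as matches.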

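Solution:

1 Change the table engine to MyISAM, create a full text index on the A.path field, and query it with "IN BOOLEAN MODE".

2 Use a producer/consumer pattern and look up the rows of B one by one.

Noting this down for reference. I don't know whether there is a better way; it is still very slow at the moment.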
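
The producer/consumer pattern in step 2 can be sketched roughly as follows, written for Python 3 with only the standard library. The names run_pipeline, lookup and STOP are hypothetical, and lookup stands in for whatever per-row query is used (for example the full text search from step 1). This sketch queues one sentinel per worker so that every worker thread terminates.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker that no more items will arrive


def run_pipeline(items, lookup, workers=4, maxsize=1000):
    """Feed items through a bounded queue to worker threads and collect lookup results."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    lock = threading.Lock()

    def producer():
        for item in items:
            q.put(item)      # blocks while the queue is full
        for _ in range(workers):
            q.put(STOP)      # one sentinel per worker

    def consumer():
        while True:
            item = q.get()   # blocks until an item is available
            if item is STOP:
                break
            found = lookup(item)
            with lock:       # protect the shared result list
                results.extend(found)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical in-memory lookup standing in for the database query.
a_paths = ['/mnt/data/run1/sample.fq', '/mnt/data/run2/sample.fq']
matches = run_pipeline(['/data/run1', '/data/run3'],
                       lambda b: [(b, a) for a in a_paths if b in a])
print(sorted(matches))
```

Because the queue is bounded, the producer automatically waits when the workers fall behind, so no explicit sleep or polling loop is required.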

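# encoding: utf-8
# Python 2 script: match each B row's directory against A through a full text index.
import sys
import time
import threading
import Queue

# Make the project-local modules (setting, util) importable.
sys.path.append('/home/fastqweb/dyh/script')
# Python 2 only: force the default string encoding to UTF-8.
reload(sys)
sys.setdefaultencoding('utf-8')

from setting import *  # expected to provide log_path and mysql_config
from util.mysqlclient import mysqlClient
from util.log import Log

loghander = Log(log_path, 'matchinfo.log')
mutex = threading.Lock()  # lock shared by the worker threads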


def mysql_connect(loghander):
    """Open a MySQL connection from mysql_config; log the failure and exit if it cannot connect."""
    mysql_db = mysqlClient(mysql_config['IP'], mysql_config['USER'], mysql_config['PASSWORD'], mysql_config['DB'], mysql_config['PORT'])
    try:
        mysql_db.open()
    except Exception as e:
        print(e)
        # The password is deliberately left out of the log message.
        loghander.error('mysql can not connect, the ip is {}, the user is {}, '
                        'the db is {}, the port is {}'.format(mysql_config['IP'], mysql_config['USER'],
                          mysql_config['DB'], mysql_config['PORT']))
        sys.exit(1)  # non-zero exit status signals the failure
    return mysql_db
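def get_gf_histroy_data(q):
    """Producer: read every row of B, put it on the queue, then signal the end."""
    print('I am here')
    mysql_db = mysql_connect(loghander)
    sql = 'select * from B'
    path_cursor = mysql_db.select(sql)
    while True:
        text = path_cursor.fetchone()
        if text:
            # put() blocks while the bounded queue is full, so no polling is needed.
            q.put(text)
        else:
            # No more rows: tell the consumers to stop.
            q.put('FINISH')
            break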

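def generate_searchword(path):
    """Turn '/a/b/c' into the boolean-mode search string '+a +b +c'."""
    data = []
    for item in path.split('/'):
        if item:
            # A leading '+' marks the word as required in BOOLEAN MODE.
            data.append('+' + item)
    return ' '.join(data)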

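def get_sample_machine(q):
    """Consumer: search A for each B row's directory and log the real substring matches."""
    print('I am in')
    mysql_db = mysql_connect(loghander)
    while True:
        # Queue.get() is thread-safe and blocks until an item is available,
        # so no extra lock or empty() polling is needed.
        text = q.get()
        if text == 'FINISH':
            # Only one sentinel is produced; put it back so the other consumers stop too.
            q.put('FINISH')
            break
        if not text:
            continue
        backup_dir = text[8]
        if not backup_dir:
            continue
        new_dir = generate_searchword(backup_dir)
        # NOTE: the search string is interpolated as-is; a quote in the path would break the statement.
        path_sql = "select * from A where match(machine_path) against('{}' IN BOOLEAN MODE)".format(new_dir)
        path_cursor = mysql_db.select(path_sql)
        result_text = path_cursor.fetchall()
        for item in result_text or []:
            # The full text search only narrows the candidates; keep rows that really contain the directory.
            if backup_dir in item[12]:
                with mutex:
                    loghander.info('path match result:' + str(item))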



if __name__ == '__main__':
    loghander.info('begin')
    q = Queue.Queue(maxsize=1000)
    # One producer thread reads B; 100 consumer threads query A.
    thread_list = [threading.Thread(target=get_gf_histroy_data, args=(q,))]
    for i in range(100):
        thread_list.append(threading.Thread(target=get_sample_machine, args=(q,)))
    for item in thread_list:
        item.start()
    for item in thread_list:
        item.join()
    loghander.info('finish')
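
As a quick check of what the script sends to MySQL, the snippet below repeats the search-string logic of generate_searchword on a made-up directory and prints the resulting statement. The sample path and the helper name build_query are illustrative assumptions; the SQL text follows the query used in get_sample_machine.

```python
def generate_searchword(path):
    """Turn '/a/b/c' into the boolean-mode search string '+a +b +c'."""
    return ' '.join('+' + part for part in path.split('/') if part)


def build_query(path):
    """Build the full text query for one directory (hypothetical helper)."""
    return ("select * from A where match(machine_path) "
            "against('{}' IN BOOLEAN MODE)".format(generate_searchword(path)))


# Hypothetical directory, for illustration only.
sample_dir = '/data/backup/2018'
print(generate_searchword(sample_dir))  # +data +backup +2018
print(build_query(sample_dir))
```

In boolean mode a leading + marks a word as required, so the query returns rows whose machine_path contains all of these words; the script then keeps only the rows in which the original directory string actually occurs. Depending on the server's minimum word length setting (ft_min_word_len for MyISAM), very short path components may be ignored by the full text index.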

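Original author: dyh4201
Original URL: https://blog.csdn.net/dyh4201/article/details/79759364
This article is reposted from the web for knowledge sharing only; in case of infringement, please contact the blogger to have it removed.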