[Computer Fundamentals] Report on Generating and Matching Massive Character Data

February 11, 2024

Preface: this report grew out of a problem I was given in 2012. I ran tests and experiments around it, and this write-up is the result.

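The original problem is as follows:

1. A queue has n units, each k bytes long. Given a string of m bytes with m ≤ k, design a query program that finds every unit in the queue matching the string. Requirements:

1) Describe a test method and report the program's execution efficiency (time) for n = 1,000,000,000, k = 20, m = 17.

Hint: the queue can be filled with English letters in random order, and a fixed string such as abcd... can serve as the m-byte pattern to search for.

2) Measure the efficiency of your own program.

3) Explore other methods and write a summary article.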

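First, the chance of this problem yielding a definite hit is very small. If 17-character strings are formed at random, without repetition, from the 26 English letters (ignoring case for now), there are 26*25*24*23*22*21*20*19*18*17*16*15*14*13*12*11*10 possibilities (17 factors, that is 26!/9!, roughly 1.1×10^21). This is far more than the specified n, so most searches should find no match.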

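Next, I use k = 20 and m = 10, forming 10-character strings at random from the first 10 letters of the alphabet, to test the program's execution efficiency.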
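The counting argument above can be checked with a few lines of Java. This is a minimal sketch, assuming every unit is generated uniformly and independently; the class and method names (MatchOdds, withoutRepetition, withRepetition, expectedMatches) are illustrative and not part of the original programs.

```java
import java.math.BigInteger;

class MatchOdds {
    // Number of ordered strings of `length` distinct symbols from an alphabet of `alphabet` symbols.
    static BigInteger withoutRepetition(int alphabet, int length) {
        BigInteger count = BigInteger.ONE;
        for (int i = 0; i < length; i++) {
            count = count.multiply(BigInteger.valueOf(alphabet - i));
        }
        return count;
    }

    // Number of strings of `length` symbols when symbols may repeat.
    static BigInteger withRepetition(int alphabet, int length) {
        return BigInteger.valueOf(alphabet).pow(length);
    }

    // Expected number of units equal to one fixed pattern among n uniformly random units.
    static double expectedMatches(long n, BigInteger possibilities) {
        return n / possibilities.doubleValue();
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;
        // Original task: 17 letters out of 26, without and with repetition.
        System.out.println(withoutRepetition(26, 17));
        System.out.println(withRepetition(26, 17));
        // Reduced test: 10 letters out of 10, repetition allowed.
        System.out.println(expectedMatches(n, withRepetition(10, 10))); // 0.1
    }
}
```

For the reduced test there are 10^10 possible strings, so a random pattern is expected to match about 0.1 of the 10^9 units on average, which means a search for a random pattern will often come back empty.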

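I. Generating 1 billion records

Two approaches are summarized here: the first inserts all data into a database (see Method 1 below), and the second saves all data to files (see Method 2 below).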

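1. Generate 1 billion records (each record is 17 letters chosen at random from the 26 lowercase English letters). The random string generator is in the appendix: Program 1.

1) Write a well-structured, concise program to produce the random data.

2) Program 1 produces both the required random character group and its "index". The "index" here is built from the alphabet positions of the group's first three random characters, combined into a single int, and is used later to filter out groups that cannot match.

3) One "index" may correspond to more than one string unit. For example, 1123 can be split into 1, 12, 3 or into 11, 2, 3, which map to different letters of the alphabet, so different units can share that index. This does not affect matching: the index still rules out the vast majority of units that cannot match.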
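The ambiguity described in point 3) can be made concrete with a short sketch. The first method reproduces the concatenated-decimal index from the example above; the second is an alternative I am assuming purely for comparison (base-26 packing), which is not what the original program does. All names are illustrative.

```java
class IndexEncoding {
    // Concatenates three 1-based alphabet positions as decimal text, as in the 1123 example.
    static int concatenated(int a, int b, int c) {
        return Integer.parseInt("" + a + b + c);
    }

    // Alternative (assumption): packs three lowercase letters into one int in base 26.
    // Every triple gets a distinct value between 0 and 26^3 - 1 = 17575.
    static int packed(char x, char y, char z) {
        return ((x - 'a') * 26 + (y - 'a')) * 26 + (z - 'a');
    }

    public static void main(String[] args) {
        // 'a','l','c' have positions 1,12,3 and 'k','b','c' have positions 11,2,3.
        System.out.println(concatenated(1, 12, 3) == concatenated(11, 2, 3)); // true: both are 1123
        System.out.println(packed('a', 'l', 'c') == packed('k', 'b', 'c'));   // false: 288 vs 6788
    }
}
```

Either encoding works as a prefilter, because the full string comparison still decides the final match; the packed form simply lets fewer non-matching rows through.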

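2. Insert the data into the database

The database layout is one management database and one data database. The management database keeps track of the tables in the data database, including their names. Table names can be up to 64 characters long; mine use the prefix "table" followed by an increasing number, so in theory there are 999...99 (59 nines) distinct table names. The management database contains a manage table: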

drop table if exists manage;

create table manage(
    id int auto_increment primary key comment '',
    tableName varchar(250) not null comment ''
) engine=myisam;

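The data database holds all the data; the storage engine is still to be decided (InnoDB or MyISAM). A table in the data database looks like this:

# each table stores 1,000,000 rows; the index is built after all rows have been inserted

drop table if exists table1;

create table table1(
    ind int comment 'int index of the basic unit, at most 8 digits',
    unit char(20) comment 'basic unit, 20 bytes, holds a random combination of up to 20 of the 26 English letters'
);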

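This raised several questions:

1) What determines the maximum length of table names and column names, the maximum number of columns in a table, and the maximum number of rows in a table?

2) How do the different storage engines store and access their table files (data files, structure files, and so on), and how do these differ from ordinary operating system files?

3) How can this much data be inserted into the database quickly (the trade-off among batch size, number of database connections, and number of threads)?

4) How can the program be kept from dying unexpectedly (sensible use of try-catch-finally, loops, and breaks)?

5) How many tables should there be, how many rows per table, which storage engine, and how should the tables be managed?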

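The source program is in the appendix: Program 2.

After reading up and running some tests, I found:

1) Database names and table names are limited to 64 characters, while aliases for databases or tables may be up to 256 characters (characters here means English characters, including the various ASCII punctuation marks). These name length limits are unrelated to the operating system's file naming rules. The maximum number of columns is harder to pin down: my tests showed that it depends on both the storage engine and the column data types. For example, with the basic numeric types (tinyint, smallint, mediumint, int, bigint, decimal, float, double, bit) a table could have only 1000 columns under InnoDB but up to 2599 under MyISAM. Partial test results are shown in the table below: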

Column type              InnoDB        MyISAM
Basic numeric types      1000          2599
Basic character types    456           1200
Other                    not tested    not tested

The maximum number of rows in a table depends on the storage engine, the operating system, and the available disk space.

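2) Every MySQL table corresponds to at least one .frm file on disk, which contains the description of the table's structure.

Beyond that, each storage engine stores its data differently. MyISAM has a .MYD file (data) and a .MYI file (indexes).

InnoDB, besides the .frm file, has an .ibd file that holds all of the table's data and indexes. There is also the concept of a shared tablespace here, which I do not yet understand well. Access to all of these files is managed by mysqld: the data directory is organized as a tree of databases and tables, and external applications connect through the interface that mysqld provides.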

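3) I did not come up with a particularly good approach here. The program submits rows to the database in batches of 2000 and holds 50 database connections, one per thread, so 50 threads in total. The main bottlenecks were insufficient memory and a weak CPU.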

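4) After testing, the program created 1000 tables of 1,000,000 rows each, using the InnoDB storage engine; one reason is that InnoDB supports transactions. I created a separate management database to manage the tables in the data database.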
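The batching idea from point 3) can be sketched independently of any database. The class below is a hypothetical helper, not part of the original programs: it collects rows and hands them to a sink in fixed-size batches. In a real loader the sink would typically pass each batch to JDBC (for example through PreparedStatement.addBatch() and executeBatch()); here it only records batch sizes so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // receives each full batch (hypothetical hook)
    private final List<T> pending;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    // Buffers one row and flushes automatically once the batch is full.
    void add(T row) {
        pending.add(row);
        if (pending.size() == batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; call once at the end so a partial last batch is not lost.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        BatchBuffer<String> buffer = new BatchBuffer<>(2000, batch -> sizes.add(batch.size()));
        for (int i = 0; i < 5000; i++) {
            buffer.add("row" + i);
        }
        buffer.flush();
        System.out.println(sizes); // [2000, 2000, 1000]
    }
}
```

Giving each worker thread its own buffer and its own connection keeps the threads from contending on shared state, which matches the one-connection-per-thread layout described above.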

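Notes:

1. While running, this program repeatedly hit java.lang.OutOfMemoryError: GC overhead limit exceeded. From what I read, the JVM raises this error when it spends almost all of its time in garbage collection while reclaiming very little heap, which typically happens when the allocated memory is nearly used up. A likely contributor is that my program is multithreaded, and garbage collection itself consumes memory and CPU. Some things remain unclear to me: what triggers the GC, and how does it relate to the main thread and the running program?

2. A power cut occurred once while this program was running. Because the program has no ability to resume after power is restored, that run was lost.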

The program's output was:

Data generation finished!

Program running time: 6 hour 50 min

The database contains 1000000000 records in total

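II. Building indexes on the data tables

Build an index on the "ind" column of every table. Queries first filter rows by this column; only rows that pass the filter go on to the full string comparison.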
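The two-stage lookup described above (filter by "ind" first, then compare the full string) can be sketched in memory. This is only an illustration under stated assumptions: a HashMap stands in for the database index, units use the letters a to j, and the index is the last three letters read as decimal digits, which is how Program 1 in the appendix derives it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageLookup {
    // Stand-in for the indexed "ind" column: index value -> units that carry it.
    private final Map<Integer, List<String>> byIndex = new HashMap<>();

    // Reads the last three letters as decimal digits (a=0 ... j=9).
    static int indexOf(String unit) {
        int idx = 0;
        for (int i = unit.length() - 3; i < unit.length(); i++) {
            idx = idx * 10 + (unit.charAt(i) - 'a');
        }
        return idx;
    }

    void add(String unit) {
        byIndex.computeIfAbsent(indexOf(unit), k -> new ArrayList<>()).add(unit);
    }

    // Stage 1: fetch only the candidates sharing the pattern's index.
    // Stage 2: compare each candidate with the full pattern.
    int countMatches(String pattern) {
        int count = 0;
        List<String> candidates = byIndex.getOrDefault(indexOf(pattern), Collections.emptyList());
        for (String candidate : candidates) {
            if (candidate.equals(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        TwoStageLookup lookup = new TwoStageLookup();
        lookup.add("abcdefghij");
        lookup.add("aaaaaaahij"); // same index (789), different string
        lookup.add("jihgfedcba");
        System.out.println(lookup.countMatches("abcdefghij")); // 1
    }
}
```

With 1000 possible index values, the first stage is expected to discard about 99.9% of uniformly random units before any string comparison takes place.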

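III. Searching the data and generating the result file

Searching is fast. I ran it several times without finding a match; I then picked a row from the database as the pattern, and the program found it quickly. The average time was about 40 seconds. Because the program has since been modified, the result file is not included in this document for now.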

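IV. Some optimizations I made to the program:

1. Each record is set to 10 characters, a random combination (repetition allowed) of the first 10 letters of the alphabet, stored in a unit that can hold 20 characters. The generator is kept as lean as possible: no if statements, no memory-hungry containers such as List or Map, and as few loops as possible. Because this code runs 1 billion times, it needs to be as optimized as it can be.

2. An "index" column was added to the table structure. It nearly doubles the space used, but query speed improves markedly.

3. When an OutOfMemoryError occurs, catch it and handle it. Using try-catch-finally with appropriate loops and breaks makes the program somewhat more robust.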
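As an illustration of the first optimization, here is one way a leaner generator might look. It is a sketch, not the author's Program 1: it reuses a single Random and a single char buffer instead of building strings by repeated concatenation, and it takes a seed (my assumption, for reproducible runs). The index is still the last three letters read as decimal digits.

```java
import java.util.Random;

class LeanRandomFactory {
    private static final char[] BASE = "abcdefghij".toCharArray();

    private final Random rand;                    // one generator for the whole run
    private final char[] buffer = new char[10];   // reused for every unit
    private int index;

    LeanRandomFactory(long seed) {
        this.rand = new Random(seed);
    }

    // Fills the buffer with 10 random letters and derives the index from the last three.
    String next() {
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = BASE[rand.nextInt(BASE.length)];
        }
        int n = buffer.length;
        index = (buffer[n - 3] - 'a') * 100 + (buffer[n - 2] - 'a') * 10 + (buffer[n - 1] - 'a');
        return new String(buffer);
    }

    // Index of the unit returned by the most recent call to next().
    int index() {
        return index;
    }

    public static void main(String[] args) {
        LeanRandomFactory factory = new LeanRandomFactory(42L);
        for (int i = 0; i < 3; i++) {
            System.out.println(factory.next() + " " + factory.index());
        }
    }
}
```

Each call allocates only the returned String, so the per-unit garbage is much smaller than with repeated string concatenation, which may also ease the GC pressure mentioned in the notes.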

Appendix: program code

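Program 1: random string generator, RandomFactory.java

package task1;

import java.util.Random;

/**
 * Random character factory: produces random 10-character groups drawn from
 * the first 10 lowercase English letters.
 */
public class RandomFactory {

    private int index; // every character group gets an "index"

    // One shared generator instead of a new Random for every character.
    private final Random rand = new Random();

    private final char[] base = {
        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'
    };

    /**
     * Produces one random character group; for now only 10 lowercase letters are tested.
     * The first three letters drawn end up as the last three letters of the result,
     * and their digits (a=0 ... j=9) form the "index".
     */
    public String factoryWork() {
        String integer = "";
        String str = "";
        for (int m = 0; m < 3; m++) {
            int num = rand.nextInt(10);
            integer = Integer.toString(num) + integer;
            str = base[num] + str;
        }
        for (int m = 0; m < 7; m++) {
            int num = rand.nextInt(10);
            str = base[num] + str;
        }
        index = Integer.parseInt(integer);
        return str;
    }

    /** Returns the "index" of the most recently generated character group. */
    public int getCharatorSetIndex() {
        return index;
    }

    public static void main(String[] args) {
        RandomFactory f = new RandomFactory();
        f.factoryWork();
    }
}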

Program 2: create the data tables and insert the data, DBDataLoader.java

import java.util.concurrent.*;

import wqm.dboperation.DBConn;

/**
 * Database data loader.
 * Uses multiple threads, creates tables dynamically according to the table
 * naming rule, and inserts the "index" with every row.
 */
public class DBDataLoader {

    static final int BATCH_SIZE = 20000; // statements sent to the database per batch

    final int recordPerTable = 1000000; // rows per table; must be a multiple of BATCH_SIZE

    private int tableCount = 0; // total number of tables, initially 0

    // Cached thread pool: it grows on demand, so the 50 tasks below get 50 threads.
    private ExecutorService exec = Executors.newCachedThreadPool();

    private long time;

    DBDataLoader() {
        doShutDownWork(Runtime.getRuntime());
        time = System.currentTimeMillis();
    }

    // Reserves the next table number. Incrementing and reading in one synchronized
    // step keeps two threads from picking the same table name.
    private synchronized int addTableCount() {
        tableCount = tableCount + 1;
        return tableCount;
    }

    // Start the worker threads that load the data.
    private void load() {
        for (int i = 0; i < 50; i++) {
            exec.execute(new Loader());
        }
        exec.shutdown(); // accept no new tasks; the pool ends once all tasks have finished
    }

    class Loader implements Runnable {

        private DBConn con = new DBConn();

        private void runLoader() {
            String table = "table" + addTableCount();
            String sql = "create table " + table + "(ind int,unit char(20));";
            try {
                con.insert(sql);
                System.out.println("Table created: " + table);
                // insert the data
                RandomFactory factory = new RandomFactory();
                String str[] = new String[BATCH_SIZE]; // statements of the current batch
                // keep the checks and the work inside this loop to a minimum
                for (int i = 1; i <= recordPerTable; i++) {
                    String charSet = factory.factoryWork(); // the character group
                    int index = factory.getCharatorSetIndex(); // its integer index
                    str[i % BATCH_SIZE] = "insert into " + table + " values ('" + index + "','" + charSet + "')";
                    if (i % BATCH_SIZE == 0) {
                        con.executeSQLBatch(str);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println(Thread.currentThread() + " " + table);
            }
        }

        public void run() {
            try {
                con.connectDB();
                for (int i = 0; i < 20; i++) { // each loader fills 20 tables
                    runLoader();
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                con.closeDB();
            }
        }
    }

    // Work done when the JVM shuts down.
    public void doShutDownWork(Runtime run) {
        run.addShutdownHook(new Thread() {
            @Override
            public void run() {
                System.out.println("Data generation finished!");
                System.out.println("Program running time (ms): " + (System.currentTimeMillis() - time));
                System.out.println("The database contains " + ((long) recordPerTable * tableCount) + " records in total!");
            }
        });
    }

    public static void main(String[] args) {
        DBDataLoader loader = new DBDataLoader();
        loader.load();
    }
}

Program 3: build the indexes on the tables of the data database

import java.sql.ResultSet;
import java.util.Stack;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import wqm.dboperation.*;

/**
 * Finishes the remaining work,
 * including setting up the management table and building the indexes.
 */
public class IndexCreator {

    private Stack<String> list = new Stack<String>(); // all table names from the management table

    private int error = 5;

    private long time;

    IndexCreator() {
        doShutDownWork(Runtime.getRuntime());
        time = System.currentTimeMillis();
        DBConn con = new DBConn();
        while (true) {
            try {
                con.connectOtherDB();
                String sql = "select * from manage1"; // fetch the names of all data tables from the management table
                ResultSet rs = con.select(sql);
                while (rs.next()) {
                    list.add(rs.getString("tableName")); // store in the list
                }
                ExecutorService exec = Executors.newCachedThreadPool(); // build the indexes on multiple threads
                for (int i = 0; i < 100; i++) { // start 100 worker threads
                    exec.execute(new Creator());
                    System.out.println("Started thread " + i + " --" + Thread.currentThread());
                }
                exec.shutdown();
                break;
            } catch (Exception e) {
                e.printStackTrace();
                if (error < 0) {
                    break;
                }
                System.out.println("Cannot read the management table, retrying... " + (5 - error) + " times");
                error -= 1;
            } finally {
                con.closeDB();
            }
        }
    }

    private synchronized String getFromList() {
        return list.pop(); // throws EmptyStackException once every table has been handed out
    }

    // create index for each table
    class Creator implements Runnable {

        public void run() {
            boolean unfinish = true;
            DBConn con = new DBConn();
            while (unfinish) {
                try {
                    con.connectDB();
                    while (true) {
                        String tableName = null;
                        try {
                            tableName = getFromList();
                            System.out.println("creating index for " + tableName);
                        } catch (Exception e) {
                            // the list is empty: no more work for this thread
                            e.printStackTrace();
                            unfinish = false;
                            break;
                        }
                        String sql = "create index ind_" + tableName + " on " + tableName + " (ind);";
                        con.insert(sql);
                        System.out.println("Table " + tableName + " has finished!");
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    con.closeDB();
                }
            }
        }
    }

    // Work done when the JVM shuts down.
    public void doShutDownWork(Runtime run) {
        run.addShutdownHook(new Thread() {
            @Override
            public void run() {
                // on exit, report the elapsed time
                System.out.println("Program finished, time used (ms): " + (System.currentTimeMillis() - time));
            }
        });
    }

    public static void main(String[] args) {
        IndexCreator f = new IndexCreator();
    }
}

Program 4: generate a random string, search the database for records matching it, and record any hits, StringFounder.java

import wqm.dboperation.DBConn;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.sql.*;
import java.util.EmptyStackException;
import java.util.Stack;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * String finder:
 * searches the 1 billion records for one randomly generated string.
 */
public class StringFounder {

    private Stack<String> list = new Stack<String>(); // all table names from the management table

    private RandomFactory random = new RandomFactory();

    private String string; // the randomly generated string

    private int index; // index of the random string produced above

    private int error = 5; // retries allowed after errors, limited to 5

    private long time; // start time, used to measure the running time

    private Infor infor = new Infor(); // bean that stores the search results

    StringFounder() {
        doShutDownWork(Runtime.getRuntime());
        string = random.factoryWork();
        index = random.getCharatorSetIndex();
        time = System.currentTimeMillis();
    }

    private void executeFounder() {
        DBConn con = new DBConn();
        while (true) {
            try {
                con.connectOtherDB();
                String sql = "select * from manage"; // fetch the names of all data tables from the management table
                ResultSet rs = con.select(sql);
                while (rs.next()) {
                    list.add(rs.getString("name")); // store in the list
                }
                ExecutorService exec = Executors.newCachedThreadPool(); // search on multiple threads
                for (int i = 0; i < 2; i++) { // start 2 worker threads
                    exec.execute(new Founder());
                }
                exec.shutdown();
                System.out.println("shutdown");
                break;
            } catch (Exception e) {
                e.printStackTrace();
                if (error < 0) {
                    break;
                }
                System.out.println("cannot reach the manage table records, retrying... " + (5 - error) + " times");
                error -= 1;
            } finally {
                con.closeDB();
            }
        }
    }

    private synchronized String getFromList() {
        return list.pop(); // throws EmptyStackException when no tables are left
    }

    class Founder implements Runnable {

        public void run() {
            DBConn con = new DBConn(); // one database connection per thread
            try {
                con.connectDB();
                while (true) {
                    try {
                        int count = 0; // number of matches in this table
                        String tableName = getFromList();
                        String sql = "select unit from " + tableName + " where ind='" + index + "';"; // the "index" filters out most rows
                        System.out.println(Thread.currentThread() + " " + sql + " " + string);
                        ResultSet rs = con.select(sql);
                        while (rs.next()) {
                            if (rs.getString("unit").equals(string)) { // found one
                                System.out.println("Found a matching record");
                                count += 1;
                            }
                        }
                        if (count != 0) { // matches found: store the result in the bean
                            infor.setResult(tableName, count);
                        }
                    } catch (EmptyStackException out) { // the list is empty: leave the loop
                        break;
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                con.closeDB();
            }
        }
    }

    // Work done when the JVM shuts down.
    public void doShutDownWork(Runtime run) {
        run.addShutdownHook(new Thread() {
            @Override
            public void run() {
                // on exit, save the results to a txt file
                for (int i = 0; i < 5; i++) { // retry on failure, up to 5 attempts
                    try {
                        fileResult();
                        break;
                    } catch (Exception e) {
                        System.out.println("cannot write the result file, retrying...");
                    }
                }
            }
        });
    }

    private void fileResult() throws IOException {
        String result = infor.getResult(); // the collected results
        File file = new File("D://result.txt");
        long a = (System.currentTimeMillis() - time) / 1000L; // elapsed seconds
        long b = a / 60L; // elapsed minutes
        int hour = (int) (b / 60L);
        int second = (int) (a % 60L);
        int min = (int) (b % 60L);
        result = result + "\r\nSearch finished, time consumed: " + hour + " h " + min + " min " + second + " s\r\n" +
                "Tables containing matches: " + infor.getCountTable() + "\r\n" +
                "Matching records found: " + infor.getCountField() + "\r\n";
        // try-with-resources closes the writer even if write() fails
        try (BufferedWriter output = new BufferedWriter(new FileWriter(file))) {
            output.write(result);
        }
    }

    public static void main(String[] args) {
        StringFounder f = new StringFounder();
        f.executeFounder();
    }
}