Building an Index with Lucene

January 25, 2024. Source: jiang617325814

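I have recently been reading the original English edition of Lucene in Action. The first step is to build an index over a set of files. The files have to be in one uniform format, since Lucene does not handle heterogeneous file formats directly.
I worked through Listing 1.1 (Indexer) from the book. I am using Lucene 3.5 while the book uses 3.0, and there are some differences between the two.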

package org.apache.lucene.indexer;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

	private IndexWriter writer = null;
	
	/**
	 * Builds the index.
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		if(args.length != 2)
		{
			throw new Exception("Usage: java "+Indexer.class.getName()
					+" <index dir> <data dir>");
		}
		// Directory where the index is stored
		String indexDir = args[0];
		// Directory containing the files to be indexed
		String dataDir = args[1];
		// Measure how long Lucene takes to build the index
		long start = System.currentTimeMillis();
		Indexer indexer = new Indexer(indexDir);
		int numIndexed = indexer.index(dataDir);
		indexer.close();
		long end = System.currentTimeMillis();
		// Elapsed time is end - start; subtracting the other way round prints a negative value
		System.out.println("Indexing "+numIndexed+" use "+(end-start)+"milliseconds");
	}

	/**
	 * Constructor: opens the index directory and creates the writer.
	 * @param indexDir
	 * @throws IOException 
	 */
	public Indexer(String indexDir) throws IOException {
		//  Creates an FSDirectory instance, trying to pick the best implementation given the current environment.
		Directory dir = FSDirectory.open(new File(indexDir));
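		// IndexWriterConfig (introduced in Lucene 3.1) replaces the IndexWriter constructor used in the book's 3.0 listing; arguments: version, standard analyzer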
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
		writer = new IndexWriter(dir, iwc);
	}
	
	/**
	 * Indexes the files found in the data directory.
	 * @param dataDir
	 * @return the number of documents in the index (writer.numDocs())
	 * @throws IOException 
	 */
	private int index(String dataDir) throws IOException {
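		// listFiles() returns the entries of the directory, or null if dataDir is not a readable directory
		File[] files = new File(dataDir).listFiles();
		if(files == null)
		{
			throw new IOException("Not a readable directory: "+dataDir);
		}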
		for(int i=0; i<files.length; i++)
		{
			File f = files[i];
			if(!f.isDirectory() && !f.isHidden()
				&& f.canRead() && acceptFile(f))
			{
				indexFile(f);
			}
		}
		return writer.numDocs();
	}

	/**
	 * Adds a file to the Lucene index as a document.
	 * @param f
	 * @throws IOException 
	 */
	private void indexFile(File f) throws IOException {
		// getCanonicalPath() returns the canonical pathname string of this abstract pathname:
		// an absolute, unique path with "." and ".." resolved, which works across platforms.
		System.out.println("Indexing"+f.getCanonicalPath());
		Document doc = getDocument(f);
		if(doc != null)
		{
			writer.addDocument(doc);
		}
	}

	/**
	 * Builds the Document for a file. Only the contents and filename fields are indexed here.
	 * @param f
	 * @return doc
	 * @throws IOException 
	 */
	private Document getDocument(File f) throws IOException {
		Document doc = new Document();
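		// The contents are indexed but do not need to be stored
		doc.add(new Field("contents", new FileReader(f)));
		// The file path is stored and indexed; it is matched as a whole, so there is no need to analyze it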
		doc.add(new Field("filename", f.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED));
		return doc;
	}

	/**
	 * Accepts only .txt files for indexing.
	 * @param f
	 * @return true if the file name ends with ".txt"
	 */
	private boolean acceptFile(File f) {
		return f.getName().endsWith(".txt");
	}

	/**
	 * Closes the writer once indexing has finished; remember to call this.
	 * @throws IOException 
	 * @throws CorruptIndexException 
	 */
	private void close() throws CorruptIndexException, IOException {
		writer.close();
	}
}

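Running it with the arguments D:\abc\lucene\index03 D:\abc\lucene produced the following output (the elapsed time is negative because this run computed start - end instead of end - start):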

IndexingD:\abc\lucene\abc.txt
IndexingD:\abc\lucene\car.txt
IndexingD:\abc\lucene\hello.txt
Indexing 3 use -1156milliseconds
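
The file selection done by index() and acceptFile() in the listing can also be written with a java.io.FileFilter. The following is a minimal sketch using only the JDK, not part of the book's listing; the class name TxtFileLister and the method list are hypothetical names chosen for illustration. It applies the same rule (a visible, readable, regular file whose name ends with ".txt") and also handles the case where listFiles() returns null.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists the files that the Indexer would accept.
class TxtFileLister {

    // Same acceptance rule as index()/acceptFile() in the listing.
    static final FileFilter TXT_FILES = new FileFilter() {
        @Override
        public boolean accept(File f) {
            return !f.isDirectory() && !f.isHidden()
                    && f.canRead() && f.getName().endsWith(".txt");
        }
    };

    // Returns the names of the accepted files in dataDir, sorted alphabetically.
    static List<String> list(File dataDir) throws IOException {
        File[] files = dataDir.listFiles(TXT_FILES);
        if (files == null) {
            // listFiles returns null when dataDir is not a directory or cannot be read
            throw new IOException("Not a readable directory: " + dataDir);
        }
        List<String> names = new ArrayList<String>();
        for (File f : files) {
            names.add(f.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TxtFileLister <data dir>
        for (String name : list(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Subdirectories are skipped by this rule, which is why an index directory placed inside the data directory (as in the run above) is not itself indexed.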

A brief introduction to the classes used in the program above:

IndexWriter

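This class creates a new index or opens an existing one.

It can add, delete, and update documents in the index, but it cannot be used to search or read the index.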
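
To make "index" more concrete: conceptually, what gets written is an inverted index that maps each term to the documents containing it. The sketch below is a toy illustration in plain Java and is not Lucene's data structure or API; the class TinyInvertedIndex and all of its methods are hypothetical. The lookup method is included only so the effect of adding and deleting can be observed. In Lucene itself, reading and searching go through separate classes (IndexReader and IndexSearcher), not through IndexWriter.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents that contain the term.
class TinyInvertedIndex {

    private final Map<String, SortedSet<Integer>> postings =
            new HashMap<String, SortedSet<Integer>>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int addDocument(String text) {
        int docId = nextDocId++;
        // Naive tokenization: lowercase, then split on non-word characters.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Removes the document id from every posting list.
    void deleteDocument(int docId) {
        for (SortedSet<Integer> ids : postings.values()) {
            ids.remove(docId);
        }
    }

    // Returns a copy of the ids of the documents containing the term.
    SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : new TreeSet<Integer>(ids);
    }
}
```

An update can be modeled in this sketch as a delete followed by an add.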

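Directory

Directory is where IndexWriter stores the index.

FSDirectory keeps the index in the file system, while RAMDirectory keeps it in memory. A RAMDirectory is fast and suits small indexes, but it is destroyed when the application exits and cannot be persisted. It fits cases that need fast access to the index, for both indexing and searching.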

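Analyzer

IndexWriter cannot index text until it has been split into individual tokens.

The text to be indexed is first passed through an analyzer, which extracts those tokens.

Analyzer is an abstract class, and Lucene ships with several implementations.

Some analyzers drop stop words (words such as "a" and "the" that do not help distinguish one document from another).

Some convert tokens to lowercase so that searches are case-insensitive.

Choosing a suitable analyzer makes a big difference to search accuracy.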
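
The steps described above (split into tokens, lowercase, drop stop words) can be sketched in plain Java. This is only an illustration of the idea and not how StandardAnalyzer is implemented; the class name AnalyzerSketch and the short stop-word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: tokenizes, lowercases, and removes stop words.
class AnalyzerSketch {

    // A tiny illustrative stop-word list (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercase so that matching is case-insensitive.
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and a Lazy Dog"));
    }
}
```

The same analysis has to be applied to the query text at search time; otherwise a query for "Fox" would not match the indexed token "fox".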

Document

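Indexing works on a Document, which is made up of individual fields.

A document can have many fields, and each of them can be stored in the index.

Lucene only handles text, so the text has to be extracted from other file formats before they can be indexed.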

Field

Each field has a name and a corresponding value, and a document can contain several fields.

    Original author: jiang617325814
    Original URL: https://blog.csdn.net/jiang617325814/article/details/7677151