Project 2: Configuring Lucene to Build an Index and Search System for CCER Data


Step 1 Read all files under a folder

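// sb is assumed to be a static StringBuilder field declared elsewhere in the class
public static String getFiles(File f) {
    if (f.isDirectory()) {
        File[] fs = f.listFiles();
        // listFiles() returns null if the directory cannot be read
        if (fs != null) {
            for (int i = 0; i < fs.length; i++) {
                getFiles(fs[i]); // recurse into each entry
            }
        }
    } else {
        String name = f.getName();
        // keep only web page files
        if (name.endsWith("shtml") || name.endsWith("html") || name.endsWith("htm")
                || name.endsWith("asp") || name.endsWith("php") || name.endsWith("aspx")) {
            sb.append(f.getPath());
            sb.append("/"); // "/" separates the entries
        }
    }
    return sb.toString();
}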

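The method recursively collects the path of every matching file into a single String, with "/" separating the entries. The list of file names can then be obtained with

String[] list = s.split("/");

Note that this only works when the paths themselves contain no "/", as on Windows, where File.getPath() uses "\" as the separator. On a Unix-like system the split would break each path into its components.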

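Because the index is built over text, only web page files are needed for now. During the traversal every file is checked, and only files ending in shtml, html, htm, asp, php, or aspx are kept.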
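As a variation on the getFiles method above, the paths could be gathered into a List instead of a separator-joined String, which avoids any clash between the separator and the path characters. The following is only a minimal sketch: the class name PageFileCollector and its methods are hypothetical, and requiring a dot before the suffix and ignoring case are assumptions about the intent, not something the original code does.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original project.
class PageFileCollector {
    // Suffixes taken from the check in getFiles.
    private static final List<String> SUFFIXES =
            Arrays.asList("shtml", "html", "htm", "asp", "php", "aspx");

    /** Recursively adds the paths of all web page files under f to out. */
    static void collect(File f, List<String> out) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) {
                return; // the directory could not be read
            }
            for (File child : children) {
                collect(child, out);
            }
        } else if (isPage(f.getName())) {
            out.add(f.getPath());
        }
    }

    /** Assumption: the suffix must follow a dot, and case is ignored. */
    static boolean isPage(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith("." + suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would create an empty ArrayList<String>, pass it to PageFileCollector.collect together with the root folder, and then iterate over the list directly, with no split step.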

Step 2 Build the index

Create a File from the location where the data crawled from CCER is stored, then build an index over all web page files beneath it.

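// "true" creates a new index, overwriting any existing one at indexPath;
// MaxFieldLength.LIMITED caps the number of terms indexed per field
writer = new IndexWriter(FSDirectory.open(new File(indexPath)),
        analyzer, true, IndexWriter.MaxFieldLength.LIMITED);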

The file name, title, and body text of each web page are indexed as separate fields:

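// list[i] is the path of the current file; pp appears to be the author's page parser for it
// The file name is stored and indexed as a single, unanalyzed term
Field field = new Field("name", list[i], Store.YES, Field.Index.NOT_ANALYZED);
doc.add(field);
// Title and body are stored and run through the analyzer
String title = pp.getTitle();
field = new Field("title", title, Store.YES, Field.Index.ANALYZED);
doc.add(field);
String content = pp.getBody();
field = new Field("content", content, Store.YES, Field.Index.ANALYZED);
doc.add(field);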

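To illustrate why the name field uses Field.Index.NOT_ANALYZED while title and content use Field.Index.ANALYZED, here is a toy inverted index in plain Java. It is not how Lucene is implemented; ToyIndex and its methods are hypothetical, and the simple split on non-letter, non-digit characters merely stands in for a real analyzer (which would, for instance, treat Chinese text differently).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model, for illustration only.
class ToyIndex {
    // "field:term" -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings =
            new HashMap<String, Set<Integer>>();

    /** Indexes the whole value as one term, like a NOT_ANALYZED field. */
    void addNotAnalyzed(int docId, String field, String value) {
        put(field + ":" + value, docId);
    }

    /** Splits the value into lowercase tokens, standing in for an analyzer. */
    void addAnalyzed(int docId, String field, String value) {
        String lower = value.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                put(field + ":" + token, docId);
            }
        }
    }

    /** Returns the ids of documents whose field contains exactly this term. */
    Set<Integer> lookup(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term);
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    private void put(String key, int docId) {
        Set<Integer> ids = postings.get(key);
        if (ids == null) {
            ids = new TreeSet<Integer>();
            postings.put(key, ids);
        }
        ids.add(docId);
    }
}
```

In this model, a file name added with addNotAnalyzed can only be found by its complete value, whereas a title added with addAnalyzed can be found by any single word in it. That matches the usual reason for leaving identifiers such as paths unanalyzed.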

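The body text of each page is extracted with HTMLParser and regular expressions. HTMLParser reads the title and body nodes of the page, and regular expressions then strip nodes such as div or script from the body, leaving the main text.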
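The regular-expression step might look roughly like the sketch below. The original patterns are not shown in the post, so everything here is an assumption: BodyTextCleaner is a hypothetical name, the input is taken to be the body HTML already extracted as a String, script and style elements are dropped together with their contents, and all other tags are removed while their text is kept.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the author's actual patterns are not given.
class BodyTextCleaner {
    // (?is): ignore case, and let '.' match line breaks as well.
    // \1 makes the closing tag match whichever name was opened.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag, e.g. <div class="x"> or </p>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the visible text of an HTML fragment as a single line. */
    static String clean(String bodyHtml) {
        String text = SCRIPT_OR_STYLE.matcher(bodyHtml).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Regular expressions are fragile on malformed HTML (for example, a ">" inside an attribute value), so a sketch like this is best applied only after a parser has isolated the body node, as the text describes.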

The figure below shows the index-building process:

[Figure: CreateIndex]