lucene – What is the difference between Fuzzy Like This and More Like This?

February 5, 2023, 284 views

What is the difference between Lucene's FuzzyLikeThis (flt) and MoreLikeThis (mlt)?

I am evaluating both query types through Elasticsearch (ES), and they look conceptually very similar to me:

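> mlt: compares the fields of an existing document against the fields of other documents
> flt: compares a string against the fields of other documents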

However, flt queries appear to be an order of magnitude slower than mlt queries.

I am using the latest ES, which in turn uses Lucene 4.5.

Fuzzifies ALL terms provided as strings and then picks the best n differentiating terms. In effect this mixes the behaviour of FuzzyQuery and MoreLikeThis but with special consideration of fuzzy scoring factors. This generally produces good results for queries where users may provide details in a number of fields and have no knowledge of boolean query syntax and also want a degree of fuzzy matching and a fast query.
For each source term the fuzzy variants are held in a BooleanQuery with no coord factor (because we are not looking for matches on multiple variants in any one doc). Additionally, a specialized TermQuery is used for variants and does not use that variant term’s IDF because this would favor rarer terms, such as misspellings. Instead, all variants use the same IDF ranking (the one for the source query term) and this is factored into the variant’s boost. If the source query term does not exist in the index the average IDF of the variants is used.
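
The shared-IDF rule described above can be illustrated with a small sketch. This is a toy model that follows the wording of the description, not Lucene's actual code; the function names and the document frequencies are made up for illustration, and the IDF formula is the classic `1 + ln(numDocs / (docFreq + 1))` form.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shared_variant_idf(source_term, variants, doc_freqs, num_docs):
    """Return the one IDF value applied to every fuzzy variant of source_term.

    Hypothetical helper: if the source term is in the index, its IDF is used
    for all variants; otherwise the IDFs of the indexed variants are averaged.
    """
    source_df = doc_freqs.get(source_term, 0)
    if source_df > 0:
        return idf(source_df, num_docs)
    indexed = [v for v in variants if doc_freqs.get(v, 0) > 0]
    if not indexed:
        return 0.0
    return sum(idf(doc_freqs[v], num_docs) for v in indexed) / len(indexed)


# Made-up index statistics: "lucenne" is a rare misspelling.
NUM_DOCS = 1000
DOC_FREQS = {"lucene": 99, "lucent": 9, "lucenne": 1}

shared = shared_variant_idf("lucene", ["lucent", "lucenne"], DOC_FREQS, NUM_DOCS)
own = idf(DOC_FREQS["lucenne"], NUM_DOCS)
print(f"shared IDF for all variants: {shared:.3f}")
print(f"IDF the misspelling would get on its own: {own:.3f}")
```

On its own the misspelling would receive a much higher IDF than the correctly spelled source term, which is the bias the description says the specialized TermQuery avoids.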

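Best answer: You are comparing the more like this query with the fuzzy like this query. Although the latter adds some fuzziness on top of the more like this query, it is not the same as the fuzzy query, which it uses under the hood.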

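The more like this query lets you specify a like_text and a list of fields. Documents that contain that text in the specified fields are returned. You can tune the term frequency thresholds to control when a document is returned or ignored, so that you get back similar and interesting documents based on your requirements.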
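
As a rough sketch, a request body for such a query could be built like this. The field names `title` and `body` and the helper function are hypothetical; `like_text`, `fields`, `min_term_freq`, `min_doc_freq` and `max_query_terms` are the parameter names as I recall them from the Elasticsearch query DSL of that era, so check them against the documentation for your version.

```python
import json


def more_like_this_body(like_text, fields, min_term_freq=2,
                        min_doc_freq=5, max_query_terms=25):
    """Build a search request body for a more_like_this query (illustrative)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                # Terms occurring less often than this in like_text are ignored.
                "min_term_freq": min_term_freq,
                # Terms appearing in fewer documents than this are ignored.
                "min_doc_freq": min_doc_freq,
                # Upper bound on the number of terms kept for the query.
                "max_query_terms": max_query_terms,
            }
        }
    }


mlt_body = more_like_this_body(
    "open source search engine library",
    fields=["title", "body"],  # hypothetical field names
    min_term_freq=1,
)
print(json.dumps(mlt_body, indent=2))
```

Lowering `min_term_freq` and `min_doc_freq` lets more terms through, which tends to widen the set of matching documents; raising them narrows the query to the more distinctive terms.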

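The fuzzy like this query has a similar structure. It is effectively a more like this query that also uses the fuzzy query internally to find similar documents. This means the returned documents contain not only the terms you asked for in like_text but also similar terms, with some fuzziness applied to them. The reason it is slower is the fuzzy query itself, which is more expensive, although it was greatly improved in Lucene 4.x.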
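
To make the extra work concrete, here is a toy sketch of fuzzy term expansion next to an illustrative fuzzy_like_this request body. The brute-force edit-distance scan is only a stand-in for the idea; Lucene 4.x uses Levenshtein automata rather than comparing against every term. The term dictionary, the field names and the helper functions are made up, and the request parameters (`fields`, `like_text`, `max_query_terms`, `prefix_length`) are given as I recall them, so treat them as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def expand(term, dictionary, max_edits=2):
    """Return every indexed term within max_edits of term (toy version)."""
    return sorted(t for t in dictionary if edit_distance(term, t) <= max_edits)


# Made-up term dictionary standing in for the indexed terms of one field.
TERMS = {"lucene", "lucent", "lucerne", "search", "serach", "engine"}
like_text = "lucene search"

# An exact query only looks up the terms as typed ...
exact_terms = [t for t in like_text.split() if t in TERMS]
# ... while a fuzzy query first has to find all close variants of each term.
fuzzy_terms = {t: expand(t, TERMS) for t in like_text.split()}

flt_body = {
    "query": {
        "fuzzy_like_this": {
            "fields": ["title", "body"],  # hypothetical field names
            "like_text": like_text,
            "max_query_terms": 12,
            "prefix_length": 0,
        }
    }
}

print(exact_terms)
print(fuzzy_terms)
```

Each source term fans out into several variant terms that all have to be matched and scored, which is where the additional cost compared with a plain more like this query comes from.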