neo4j – Finding the most commonly used sets of distinct terms

July 28, 2019 · 160 views

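Imagine a graph database made up of URLs and the tags used to describe them. From this we want to find the sets of tags that are most often used together, and determine which URLs belong to each identified set.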

I have tried to create a dataset in Cypher that simplifies the problem:

CREATE
  (tech:Tag { name: "tech" }),
  (comp:Tag { name: "computers" }),
  (programming:Tag { name: "programming" }),
  (cat:Tag { name: "cats" }),
  (mice:Tag { name: "mice" }),
  (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
  (u1)-[:IS_ABOUT]->(comp),
  (u1)-[:IS_ABOUT]->(mice),
  (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
  (u2)-[:IS_ABOUT]->(cat),
  (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
  (u3)-[:IS_ABOUT]->(programming),
  (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
  (u4)-[:IS_ABOUT]->(mice),
  (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

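Using this as a reference (neo4j console example here), we can look at it and see that the most commonly used tags are "tech" and "mice" (the query for this is trivial), each referenced by 3 URLs. The most commonly used tag pair is [tech, mice], because in this example it is the only pair shared by 2 URLs (u4 and u1). It is important to note that this pair is a subset of the tags on each matching URL, not the complete tag set of either one. No combination of 3 tags is shared by more than one URL.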
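
As a cross-check of the counts stated above, here is a minimal Python sketch that mirrors the sample data as a plain dictionary and counts how many URLs contain each tag combination of a given size. The names `URL_TAGS` and `count_combinations` are hypothetical and do not appear in the original post. It only illustrates the expected result and is not a substitute for a Cypher query.

```python
from collections import Counter
from itertools import combinations

# Sample data copied from the CREATE statement (URL -> set of tag names).
# URL_TAGS and count_combinations are illustrative names, not from the post.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def count_combinations(url_tags, size):
    """Count, for every tag combination of the given size, how many URLs contain it."""
    counts = Counter()
    for tags in url_tags.values():
        # Sort the tags so the same combination always yields the same tuple.
        for combo in combinations(sorted(tags), size):
            counts[combo] += 1
    return counts


singles = count_combinations(URL_TAGS, 1)
pairs = count_combinations(URL_TAGS, 2)
triples = count_combinations(URL_TAGS, 3)

print(singles[("tech",)], singles[("mice",)])  # 3 3
print(pairs.most_common(1))                    # [(('mice', 'tech'), 2)]
print(max(triples.values()))                   # 1
```

The last line confirms that every 3-tag combination in the sample belongs to exactly one URL.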

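How can I write a Cypher query that determines which tag combinations are used most often (in pairs or in groups of size N)? Is there perhaps a better way to structure this data that would make the analysis easier? Or is this problem simply not a good fit for a graph database? I have been struggling with this, and any help or ideas would be much appreciated!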

Best answer: It looks like a combinatorics problem.

// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U, 
     collect(distinct T) as TAGS 

// Calculate the number of tag combinations for each URL,
// independent of the order of the tags.
// The ^ operator returns a float, so the result is converted to an integer.
//
WITH U, TAGS, 
     toInteger(2 ^ size(TAGS)) as numberOfCombinations

// Iterate through all non-empty combinations (index 0 is the empty set;
// range() includes its upper bound, so stop at numberOfCombinations - 1)
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex

// And check for each tag whether it is present in the combination.
// Cypher has no bitwise operators,
// therefore we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
// (depending on the APOC version, apoc.bitwise.op may be exposed as a
// function rather than a procedure, in which case the CALL must be adapted)
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex, 
     toInteger(2 ^ tagIndex) as pw2
     call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,  
     value WHERE value > 0

// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex, 
     collect(TAGS[tagIndex]) as combination

// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls 
       ORDER BY freq DESC
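
To make the bitmask idea in the query easier to follow, here is a rough Python sketch of the same enumeration, run against an in-memory copy of the sample data. The names `URL_TAGS`, `tag_combinations` and `combination_frequencies` are hypothetical, and tags are ordered by name rather than by node id, which is an assumption made only to get a stable order.

```python
from collections import defaultdict

# In-memory copy of the sample data (URL -> tag names); illustrative only.
URL_TAGS = {
    "http://u1.com": ["tech", "computers", "mice"],
    "http://u2.com": ["mice", "cats"],
    "http://u3.com": ["tech", "programming"],
    "http://u4.com": ["tech", "mice", "accessories"],
}


def tag_combinations(tags):
    """Yield every non-empty subset of tags, using the bits of an integer as a mask."""
    ordered = sorted(tags)  # fixed order, playing the role of ORDER BY id(T)
    # Mask 0 would be the empty set, so start at 1 and stop at 2^n - 1.
    for mask in range(1, 2 ** len(ordered)):
        # Bit i of the mask decides whether tag i is part of this combination.
        yield tuple(tag for i, tag in enumerate(ordered) if mask & (1 << i))


def combination_frequencies(url_tags):
    """Group URLs by tag combination and sort by number of URLs, descending."""
    urls_by_combo = defaultdict(list)
    for url, tags in url_tags.items():
        for combo in tag_combinations(tags):
            urls_by_combo[combo].append(url)
    return sorted(urls_by_combo.items(), key=lambda item: len(item[1]), reverse=True)


ranked = combination_frequencies(URL_TAGS)
for combo, urls in ranked[:3]:
    print(combo, len(urls), urls)
```

Each URL with n tags contributes 2^n - 1 combinations, so this approach becomes expensive quickly for heavily tagged URLs.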

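I think it is better to compute and store the tag combinations with this algorithm at tagging time. The query would then look like this: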

MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
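
The idea of computing combinations at tagging time can be sketched in Python as a small in-memory index. This is only an illustration under assumptions: the class `TagCombinationIndex` and its methods are hypothetical, tags are only ever added (never removed), and a real implementation would write `TagsCombination` nodes to the database instead of using dictionaries.

```python
from collections import defaultdict
from itertools import combinations


class TagCombinationIndex:
    """Hypothetical index that records tag combinations whenever a URL is tagged."""

    def __init__(self):
        self._tags_by_url = defaultdict(set)    # url -> tags seen so far
        self._urls_by_combo = defaultdict(set)  # frozenset of tags -> urls

    def add_tag(self, url, tag):
        existing = self._tags_by_url[url]
        if tag in existing:
            return
        # Every new combination is the new tag plus some subset
        # (possibly empty) of the tags the URL already had.
        for size in range(len(existing) + 1):
            for subset in combinations(sorted(existing), size):
                self._urls_by_combo[frozenset(subset) | {tag}].add(url)
        existing.add(tag)

    def most_common(self, min_size=1):
        """Return (tags, urls) rows, most widely shared combinations first."""
        rows = [
            (sorted(combo), sorted(urls))
            for combo, urls in self._urls_by_combo.items()
            if len(combo) >= min_size
        ]
        return sorted(rows, key=lambda row: (-len(row[1]), row[0]))


index = TagCombinationIndex()
for url, tag in [
    ("http://u1.com", "tech"), ("http://u1.com", "computers"), ("http://u1.com", "mice"),
    ("http://u2.com", "mice"), ("http://u2.com", "cats"),
    ("http://u3.com", "tech"), ("http://u3.com", "programming"),
    ("http://u4.com", "tech"), ("http://u4.com", "mice"), ("http://u4.com", "accessories"),
]:
    index.add_tag(url, tag)

print(index.most_common(min_size=2)[0])
# (['mice', 'tech'], ['http://u1.com', 'http://u4.com'])
```

The trade-off is that the work moves from query time to write time: reading the ranking is a simple lookup, while each new tag on a URL that already has n tags adds 2^n combinations.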