Neo4j indexing and legacy data

August 4, 2019 · 276 views

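I have a legacy data set (ENRON data represented as GraphML) that I would like to query. In a comment on a related question, @StefanArmbruster suggested that I use Cypher to query the database. My query use case is simple: given a message id (a property of the Message node), retrieve the node that has that id, and also retrieve the sender and recipient nodes of that message.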

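It seems that to do this in Cypher, I first have to create an index of the nodes. Is there a way to do this automatically when the data is loaded from the GraphML file? (I had used Gremlin to load the data and create the database.)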

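I also have an external Lucene index of the data (I need it for other purposes). Does it make sense to have two indexes? For example, I could store the Neo4j node ids in my external index and then query the graph by those ids. My concern is whether these ids are persistent. (By analogy, Lucene document ids should not be treated as persistent.)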

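So, should I:

1. Index the Neo4j graph internally so that I can query on message id with Cypher? (If so, what is the best way to do that: regenerate the database with some suitable incantation so that the index is built, or build the index on the database that already exists?)
2. Store the Neo4j node ids in my external Lucene index and retrieve nodes via those stored ids?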

UPDATE

I have been trying to get auto-indexing to work with Gremlin and an embedded server, but with no luck. The documentation says:

The underlying database is auto-indexed, see Section 14.12, “Automatic Indexing” so the script can return the imported node by index lookup.

But when I examine the graph after loading a new database, no indexes seem to exist.

The Neo4j documentation on auto indexing says that a fair amount of configuration is required. In addition to setting node_auto_indexing = true, you also have to tell it what to index:

To actually auto index something, you have to set which properties
should get indexed. You do this by listing the property keys to index
on. In the configuration file, use the node_keys_indexable and
relationship_keys_indexable configuration keys. When using embedded
mode, use the GraphDatabaseSettings.node_keys_indexable and
GraphDatabaseSettings.relationship_keys_indexable configuration keys.
In all cases, the value should be a comma separated list of property
keys to index on.

So is Gremlin supposed to set the GraphDatabaseSettings parameters? I tried passing a map into the Neo4jGraph constructor, like this:

    Map<String,String> config = [
        'node_auto_indexing':'true',
        'node_keys_indexable': 'emailID'
        ]
    Neo4jGraph g = new Neo4jGraph(graphDB, config);
    g.loadGraphML("../databases/data.graphml");

But this had no apparent effect on index creation.

UPDATE 2

Instead of configuring the database through Gremlin, I used the example given in the Neo4j documentation, so that my database creation now looks like this (in Groovy):

    // Opens the embedded database at graphDBName; if that directory does not
    // exist yet, creates it and populates it from the GraphML file databaseName.
    protected Neo4jGraph getGraph(String graphDBName, String databaseName) {
        boolean populateDB = !new File(graphDBName).exists();
        if (populateDB)
            println "creating database";
        else
            println "opening database";

        // Auto-indexing is configured here, before any data is loaded.
        GraphDatabaseService graphDB = new GraphDatabaseFactory().
            newEmbeddedDatabaseBuilder( graphDBName ).
            setConfig( GraphDatabaseSettings.node_keys_indexable, "emailID" ).
            setConfig( GraphDatabaseSettings.node_auto_indexing, "true" ).
            setConfig( GraphDatabaseSettings.dump_configuration, "true").
            newGraphDatabase();
        Neo4jGraph g = new Neo4jGraph(graphDB);

        if (populateDB) {
            println "Populating graph"
            g.loadGraphML(databaseName);
        }

        return g;
    }

My retrieval is done like this:

    ReadableIndex<Node> autoNodeIndex = graph.rawGraph.index()
        .getNodeAutoIndexer()
        .getAutoIndex();
    def node = autoNodeIndex.get( "emailID", "<2614099.1075839927264.JavaMail.evans@thyme>" ).getSingle();

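This seems to work. Note, however, that the getIndices() call on the Neo4jGraph object still returns an empty list. So the upshot is that I can use the Neo4j API correctly, but the Gremlin wrapper does not seem to reflect the indexing state. The expression g.idx('node_auto_index') (documented in Gremlin Methods) returns null.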

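Best answer

The auto indexes are created lazily. That is, when you have enabled auto-indexing, the actual index is only created when the first property is indexed. Make sure you insert data before checking whether the index exists, otherwise it might not show up.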
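To make the lazy-creation behaviour concrete, here is a toy model in plain Java. It is not the Neo4j API: the class `LazyAutoIndexSketch` and all of its methods are invented for illustration. It only mimics two behaviours, under the assumption that only configured keys are indexed and that the index comes into existence on the first indexed write.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: this is NOT the Neo4j API. It imitates an auto indexer that
// creates its backing index lazily, on the first write of an indexable key.
class LazyAutoIndexSketch {
    // Plays the role of the node_keys_indexable setting.
    private final Set<String> indexableKeys;
    // Stays null until the first indexable property is written.
    private Map<String, Map<Object, Long>> index;

    LazyAutoIndexSketch(Set<String> indexableKeys) {
        this.indexableKeys = new HashSet<>(indexableKeys);
    }

    // True only once at least one indexable property has been written.
    boolean indexExists() {
        return index != null;
    }

    // Called whenever a property is set on a (hypothetical) node.
    void setProperty(long nodeId, String key, Object value) {
        if (!indexableKeys.contains(key)) {
            return; // keys that were not configured are never indexed
        }
        if (index == null) {
            index = new HashMap<>(); // the index is created here, not earlier
        }
        index.computeIfAbsent(key, k -> new HashMap<>()).put(value, nodeId);
    }

    // Exact-match lookup; returns null when nothing is stored under key/value.
    Long lookup(String key, Object value) {
        if (index == null) {
            return null;
        }
        Map<Object, Long> byValue = index.get(key);
        return byValue == null ? null : byValue.get(value);
    }

    public static void main(String[] args) {
        LazyAutoIndexSketch indexer =
            new LazyAutoIndexSketch(Collections.singleton("emailID"));
        System.out.println(indexer.indexExists());    // checked before any data
        indexer.setProperty(1L, "subject", "hello");   // not an indexable key
        System.out.println(indexer.indexExists());
        indexer.setProperty(1L, "emailID", "<msg-1@example>");
        System.out.println(indexer.indexExists());
        System.out.println(indexer.lookup("emailID", "<msg-1@example>"));
    }
}
```

In this model, an existence check made before the first emailID value is written reports no index, which is consistent with seeing no indexes right after opening a fresh database.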

For some auto-indexing code (using programmatic configuration), see e.g. https://github.com/neo4j-contrib/rabbithole/blob/master/src/test/java/org/neo4j/community/console/IndexTest.java (this works with Neo4j 1.8).

/Peter