mongoimport is very slow when using --jsonArray

October 29, 2023 · 350 views

I have a 15 GB file with more than 25 million rows in the following JSON format, which MongoDB accepts for import:

[
    {"_id": 1, "value": "\u041c\..."}
    {"_id": 2, "value": "\u041d\..."}
    ...
]

When I try to import it into MongoDB with the following command, I get only about 50 rows per second, which is really slow for me.

mongoimport --db wordbase --collection sentences --type json --file C:\Users\Aleksandar\PycharmProjects\NLPSeminarska\my_file.json --jsonArray

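When I tried inserting the data into the collection with Python and pymongo, the speed was even worse. I also tried raising the priority of the process, but it made no difference.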

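The next thing I tried was the same command without --jsonArray. The speed improved a lot (~4000 documents/second), but mongoimport then reported that the BSON representation of the supplied JSON is too large.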

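I also tried splitting the file into 5 separate files and importing them from separate consoles into the same collection, but the speed of every import dropped to about 20 documents/second.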

Searching online, I see people reaching more than 8K documents/second, and I can't see what I am doing wrong.

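Is there a way to speed this up, or should I convert the whole JSON file to BSON and import it that way? If so, what is the correct way to do the conversion and the import?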

Many thanks.

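Best answer

I had exactly the same problem with a 160 GB dump file. It took me two days to load 3% of the original file with --jsonArray, and 15 minutes with these changes.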

First, remove the initial [ and trailing ] characters:

sed 's/^\[//; s/\]$//' -i filename.json

Then import without the --jsonArray option:

mongoimport --db "dbname" --collection "collectionname" --file filename.json
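
Where sed is not available (the question uses a Windows path, for example), the bracket-stripping step above could be sketched in Python. This is only a minimal sketch under stated assumptions: every document sits on its own line, as in the sample in the question, and the name `strip_array_brackets` and both file paths are hypothetical. Unlike `sed -i`, it writes a second file, so it needs free disk space for a full copy. It also drops a trailing comma after each document if one is present, so the output is one JSON document per line.

```python
def strip_array_brackets(src_path, dst_path):
    """Copy a JSON-array file to dst_path as one JSON document per line.

    Assumption: each document occupies exactly one line of the source file,
    and the opening '[' and closing ']' are on lines of their own.
    Returns the number of document lines written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # streams line by line, never loads the whole file
            text = line.strip()
            if text in ("", "[", "]"):
                continue  # skip blank lines and the array brackets
            if text.endswith(","):
                text = text[:-1]  # drop the separator between array elements
            dst.write(text + "\n")
            written += 1
    return written
```

The resulting file can then be passed to the mongoimport command above in place of filename.json.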

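If the file is huge, sed will take a really long time and you may run into storage problems. You can use this C program instead (not written by me, all the glory goes to @guillermobox):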

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* ftruncate */

int main(int argc, char *argv[])
{
    FILE * f;
    const size_t buffersize = 2048;
    size_t length, filesize, position;
    char buffer[buffersize + 1];

    if (argc < 2) {
        fprintf(stderr, "Please provide file to mongofix!\n");
        exit(EXIT_FAILURE);
    }

    f = fopen(argv[1], "r+");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    /* get the full filesize */
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    /* Ignore the first character */
    fseek(f, 1, SEEK_SET);

    while (1) {
        /* read chunks of buffersize size */
        length = fread(buffer, 1, buffersize, f);
        position = ftell(f);

        /* write the same chunk, one character before */
        fseek(f, position - length - 1, SEEK_SET);
        fwrite(buffer, 1, length, f);

        /* return to the reading position */
        fseek(f, position, SEEK_SET);

        /* we have finished when not all the buffer is read */
        if (length != buffersize)
            break;
    }

    /* flush buffered writes, then truncate the file by two characters */
    fflush(f);
    ftruncate(fileno(f), filesize - 2);

    fclose(f);

    return 0;
}
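
For readers without a C compiler, the same in-place shift could be sketched in Python. This is a rough equivalent, not @guillermobox's code. It assumes the first byte of the file is `[` and the very last byte is `]` (no trailing newline), which is also what the C program relies on when it truncates by two characters. The name `strip_brackets_in_place` and the `chunk_size` parameter are hypothetical. Unlike the C version, the sketch checks both bytes before touching the file.

```python
import os


def strip_brackets_in_place(path, chunk_size=1 << 20):
    """Remove the first and last byte of a file without making a copy.

    Assumption: the first byte is '[' and the last byte is ']'.
    """
    size = os.path.getsize(path)
    if size < 2:
        raise ValueError("file is too short to hold '[' and ']'")
    with open(path, "r+b") as f:
        first = f.read(1)
        f.seek(size - 1)
        last = f.read(1)
        if first != b"[" or last != b"]":
            raise ValueError("file does not start with '[' and end with ']'")

        read_pos = 1  # skip the leading '['
        while read_pos < size:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(read_pos - 1)  # write the same chunk one byte earlier
            f.write(chunk)
            read_pos += len(chunk)

        # After the shift the last two bytes are stale: cut them off.
        f.truncate(size - 2)
```

As with the C program, an interruption midway leaves the file partly shifted, so it is worth trying on a small copy first.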

P.S.: I don't have the privileges to suggest migrating this question, but I think this could be helpful.