Implementing a Spell Checker

March 19, 2024 · Source: Bowen

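This article is also published on my GitHub blog; stars are welcome.

When searching on Baidu or Google, a slip of the finger sometimes produces a typo. Say we want to search for apple but type appel instead. Oddly enough, even after we press Enter, the search engine searches for apple rather than appel. How does that work? This article implements a spell checker in JavaScript from scratch.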

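Basic Theory

First we need a way to quantify the probability of mistyping a word. Call the word the user meant to type origin (O) and the word actually typed error (E).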

By Bayes' theorem: P(O|E) = P(O) * P(E|O) / P(E)

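P(O|E) is the result we want: the probability that the intended word was O, given that the mistyped word E was entered.

P(O) is the probability that O occurs at all. It is the prior probability, and it can be estimated from a large corpus.

P(E|O) is the probability of typing E when O was intended. It can be approximated with the minimum edit distance: if the intended word is apple, typing applee (edit distance 1) is naturally more likely than typing appleee (edit distance 2).

P(E) is fixed because E is known. We only need to compare P(O1|E), P(O2|E), ..., P(On|E) with each other, not compute their exact values, so this term can be ignored.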
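To make the comparison concrete, here is a minimal sketch that ranks candidates by P(O) * P(E|O). The word counts, the `likelihood` error model with its factor of 0.1, and the name `rankCandidates` are all assumptions made up for illustration. The code later in this article does not multiply probabilities; it orders candidates by edit distance first and by corpus count second.

```javascript
// Hypothetical word counts standing in for the prior P(O).
const counts = { apple: 50, apply: 30, ample: 5 };

// Assumed error model standing in for P(E|O): each extra edit makes the
// typo ten times less likely. The factor 0.1 is an illustrative guess.
function likelihood(distance) {
    return Math.pow(0.1, distance);
}

// Rank candidates by P(O) * P(E|O). P(E) is the same for every candidate,
// so it is left out of the score.
function rankCandidates(candidates) {
    const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
    return candidates
        .map(({ word, distance }) => ({
            word,
            score: (counts[word] / total) * likelihood(distance)
        }))
        .sort((a, b) => b.score - a.score)
        .map(item => item.word);
}

// Candidates for the typo "appel", with their edit distances given as input.
const ranked = rankCandidates([
    { word: "ample", distance: 2 },
    { word: "apply", distance: 2 },
    { word: "apple", distance: 1 }
]);
console.log(ranked); // [ 'apple', 'apply', 'ample' ]
```

Because P(E) divides every score by the same amount, dropping it never changes the order.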

Implementation

This part follows the code of the natural library.

First, the constructor:

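function SpellCheck(priorList) {
    // TODO: store the corpus in a trie
    this.priorList = priorList;
    // Prototype-less object, so words such as "constructor" are not
    // mistaken for inherited properties.
    this.priorHash = Object.create(null);
    // Count how often each word occurs in the corpus.
    priorList.forEach(item => {
        !this.priorHash[item] && (this.priorHash[item] = 0);
        this.priorHash[item]++;
    });
}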

priorList is the corpus. The constructor counts how many times each word occurs in priorList, and these counts serve as the prior probability P(O).
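The article passes priorList in as a ready-made array. As a rough sketch of where such an array could come from, the snippet below splits raw text into words and counts them. The helpers `tokenize` and `countWords` are hypothetical names, and restricting words to a-z is an assumption chosen to match the alphabet used by the candidate generator in this article.

```javascript
// Hypothetical helper: split raw text into lowercase words made of a-z.
function tokenize(text) {
    return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Count occurrences of each word, like the constructor does.
function countWords(words) {
    const counts = Object.create(null);
    words.forEach(word => {
        counts[word] = (counts[word] || 0) + 1;
    });
    return counts;
}

const words = tokenize("An apple a day. Apples, APPLE pie and a pear!");
const counts = countWords(words);
console.log(words.length); // 10
console.log(counts.apple); // 2
console.log(counts.apples); // 1
```

The resulting `words` array could be handed to the constructor as priorList.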

Next comes the check function, which tests whether a word appears in the corpus:

SpellCheck.prototype.check = function(word) {
    return this.priorList.indexOf(word) !== -1;
};

Then we need all the strings within a given edit distance of a word:

SpellCheck.prototype.getWordsByMaxDistance = function(wordList, maxDistance) {
    if (maxDistance === 0) {
        return wordList;
    }
    const listLength = wordList.length;
    wordList[listLength] = [];
    wordList[listLength - 1].forEach(item => {
        wordList[listLength].push(...this.getWordsByOneDistance(item));
    });
    return this.getWordsByMaxDistance(wordList, maxDistance - 1);
};
SpellCheck.prototype.getWordsByOneDistance = function(word) {
    const alphabet = "abcdefghijklmnopqrstuvwxyz";
    let result = [];
    for (let i = 0; i < word.length + 1; i++) {
        for (let j = 0; j < alphabet.length; j++) {
            // insertion
            result.push(
                word.slice(0, i) + alphabet[j] + word.slice(i, word.length)
            );
            // substitution
            if (i > 0) {
                result.push(
                    word.slice(0, i - 1) +
                        alphabet[j] +
                        word.slice(i, word.length)
                );
            }
        }
        if (i > 0) {
            // deletion
            result.push(word.slice(0, i - 1) + word.slice(i, word.length));
            // transposition of adjacent letters
            if (i < word.length) {
                result.push(
                    word.slice(0, i - 1) +
                        word[i] +
                        word[i - 1] +
                        word.slice(i + 1, word.length)
                );
            }
        }
    }
    return result.filter((item, index) => {
        return index === result.indexOf(item);
    });
};

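wordList is an array of arrays. Its first element is an array holding only the original word, its second element holds the words at edit distance 1 from the original word, and so on, until the specified maximum edit distance maxDistance is reached.

The following four operations each count as an edit distance of 1:

Insert a character, e.g. ab -> abc
Replace a character, e.g. ab -> ac
Delete a character, e.g. ab -> a
Swap two adjacent characters, e.g. ab -> ba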
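The four operations can be seen side by side in the following standalone sketch, which groups the one-edit variants of a word by operation. The function name `editsByKind` and the grouping are illustrative assumptions; the article's own getWordsByOneDistance returns a single flat, de-duplicated list instead.

```javascript
// Standalone sketch: group the one-edit variants of a word by operation.
function editsByKind(word, alphabet = "abcdefghijklmnopqrstuvwxyz") {
    const kinds = { insert: [], replace: [], remove: [], swap: [] };
    for (let i = 0; i <= word.length; i++) {
        const head = word.slice(0, i);
        const tail = word.slice(i);
        for (const letter of alphabet) {
            // Put a new letter at position i.
            kinds.insert.push(head + letter + tail);
            // Overwrite the letter at position i with a different one.
            if (i < word.length && letter !== word[i]) {
                kinds.replace.push(head + letter + tail.slice(1));
            }
        }
        // Drop the letter at position i.
        if (i < word.length) {
            kinds.remove.push(head + tail.slice(1));
        }
        // Exchange the letters at positions i and i + 1.
        if (i + 1 < word.length) {
            kinds.swap.push(head + word[i + 1] + word[i] + tail.slice(2));
        }
    }
    return kinds;
}

const edits = editsByKind("ab");
console.log(edits.insert.includes("abc")); // true
console.log(edits.replace.includes("ac")); // true
console.log(edits.remove); // [ 'b', 'a' ]
console.log(edits.swap); // [ 'ba' ]
```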

Having collected all candidate words within the specified edit distance, we then compare their prior probabilities:

SpellCheck.prototype.getCorrections = function(word, maxDistance = 1) {
    const candidate = this.getWordsByMaxDistance([[word]], maxDistance);
    let result = [];
    candidate
        .map(candidateList => {
            return candidateList
                .filter(item => this.check(item))
                .map(item => {
                    return [item, this.priorHash[item]];
                })
                .sort((item1, item2) => item2[1] - item1[1])
                .map(item => item[0]);
        })
        .forEach(item => {
            result.push(...item);
        });
    return result.filter((item, index) => {
        return index === result.indexOf(item);
    });
};

The result is the list of suggested corrections.

Let's test it:

const spellCheck = new SpellCheck([
    "apple",
    "apples",
    "pear",
    "grape",
    "banana"
]);
spellCheck.getCorrections("appel", 1); //[ 'apple' ]
spellCheck.getCorrections("appel", 2); //[ 'apple', 'apples' ]

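In the first test we set the maximum edit distance to 1 and entered the misspelled word appel; the only correction returned is apple. In the second test the maximum edit distance is 2, and two corrections are returned.

When the Corpus Is Small

The implementation above first generates every candidate within the given edit distance of the word. When the corpus contains few words, this is wasteful, so we can instead go through the corpus and keep the words that fall within the given minimum edit distance.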
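A quick count shows why generating candidates is expensive. The sketch below assumes a 26-letter alphabet and counts the raw strings produced by one round of edits, before duplicates are removed; the function name `rawOneEditCount` is made up for this illustration.

```javascript
// Raw number of strings produced by one round of edits on a word of
// length n over a 26-letter alphabet, before duplicates are removed:
// 26 * (n + 1) insertions + 26 * n substitutions + n deletions
// + (n - 1) transpositions.
function rawOneEditCount(n) {
    return 26 * (n + 1) + 26 * n + n + Math.max(n - 1, 0);
}

const perRound = rawOneEditCount(5); // "appel" has 5 letters
console.log(perRound); // 295
```

Every one of those strings is expanded again for a maximum distance of 2, and each result is then looked up in the corpus. With a five-word corpus, computing five edit distances directly is far less work.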

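Computing the minimum edit distance is a classic dynamic programming problem (LeetCode 72), so a DP table does the job. The distance needed here differs slightly from the LeetCode version: it also has to account for swapping two adjacent letters.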

The state transition equations for the LeetCode version:

dp[i][j]=0 i===0,j===0
dp[i][j]=j i===0,j>0
dp[i][j]=i j===0,i>0
dp[i][j]=min(dp[i-1][j-1]+cost,dp[i-1][j]+1,dp[i][j-1]+1) i,j>0

Here cost is 0 when word1[i-1]===word2[j-1], and 1 otherwise.

To account for swapping adjacent letters, when i>1, j>1 and word1[i - 2] === word2[j - 1] && word1[i - 1] === word2[j - 2] holds, take dp[i][j]=min(dp[i-1][j-1]+cost,dp[i-1][j]+1,dp[i][j-1]+1,dp[i-2][j-2]+1) instead.
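The effect of the extra term is easiest to see by computing the same pair of words with and without it. The standalone sketch below is written for that comparison only; the function name `editDistance` and the `allowSwap` flag are assumptions for illustration. With the swap term enabled, the recurrence is the one usually called optimal string alignment distance.

```javascript
// Edit distance with an optional adjacent-transposition step.
function editDistance(word1, word2, allowSwap) {
    const dp = [];
    for (let i = 0; i <= word1.length; i++) {
        dp[i] = [];
        for (let j = 0; j <= word2.length; j++) {
            if (i === 0 || j === 0) {
                // Against an empty prefix, every character costs one edit.
                dp[i][j] = i + j;
                continue;
            }
            const cost = word1[i - 1] === word2[j - 1] ? 0 : 1;
            dp[i][j] = Math.min(
                dp[i - 1][j - 1] + cost, // substitution or match
                dp[i - 1][j] + 1, // deletion
                dp[i][j - 1] + 1 // insertion
            );
            if (
                allowSwap &&
                i > 1 &&
                j > 1 &&
                word1[i - 2] === word2[j - 1] &&
                word1[i - 1] === word2[j - 2]
            ) {
                // The last two letters are swapped: one edit on top of the
                // prefixes that end just before them.
                dp[i][j] = Math.min(dp[i][j], dp[i - 2][j - 2] + 1);
            }
        }
    }
    return dp[word1.length][word2.length];
}

console.log(editDistance("apple", "appel", false)); // 2
console.log(editDistance("apple", "appel", true)); // 1
console.log(editDistance("apples", "appel", true)); // 2
```

Without the swap term, appel is two substitutions away from apple, so a maximum distance of 1 would miss it.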

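Once we have the corpus words within the specified minimum edit distance, we compare their prior probabilities. The code is as follows: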

SpellCheck.prototype.getCorrectionsByCalcDistance = function(
    word,
    maxDistance = 1
) {
    const candidate = [];
    for (let key in this.priorHash) {
        this.calcDistance(key, word) <= maxDistance && candidate.push(key);
    }
    return candidate
        .map(item => {
            return [item, this.priorHash[item]];
        })
        .sort((item1, item2) => item2[1] - item1[1])
        .map(item => item[0]);
};
SpellCheck.prototype.calcDistance = function(word1, word2) {
    const length1 = word1.length;
    const length2 = word2.length;
    let dp = [];
    for (let i = 0; i <= length1; i++) {
        dp[i] = [];
        for (let j = 0; j <= length2; j++) {
            if (i === 0) {
                dp[i][j] = j;
                continue;
            }
            if (j === 0) {
                dp[i][j] = i;
                continue;
            }
            const replaceCost =
                dp[i - 1][j - 1] + (word1[i - 1] === word2[j - 1] ? 0 : 1);
            let transposeCost = Infinity;
            if (
                i > 1 &&
                j > 1 &&
                word1[i - 2] === word2[j - 1] &&
                word1[i - 1] === word2[j - 2]
            ) {
                transposeCost = dp[i - 2][j - 2] + 1;
            }
            dp[i][j] = Math.min(
                replaceCost,
                transposeCost,
                dp[i - 1][j] + 1,
                dp[i][j - 1] + 1
            );
        }
    }
    return dp[length1][length2];
};

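Finally

There is still plenty of room to optimize this code. For example, check uses indexOf to decide whether a word appears in the corpus; a trie or a hash lookup would make that query faster.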
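As one possible direction, here is a minimal trie sketch. It is not part of the original code; the `Trie` constructor and its `insert` and `has` methods are hypothetical names. A lookup walks one node per letter, so its cost depends on the length of the word rather than on the size of the corpus.

```javascript
// Minimal trie: each node maps a letter to a child node, and isWord marks
// the end of a stored word.
function Trie() {
    this.root = { children: Object.create(null), isWord: false };
}

Trie.prototype.insert = function(word) {
    let node = this.root;
    for (const letter of word) {
        if (!node.children[letter]) {
            node.children[letter] = {
                children: Object.create(null),
                isWord: false
            };
        }
        node = node.children[letter];
    }
    node.isWord = true;
};

Trie.prototype.has = function(word) {
    let node = this.root;
    for (const letter of word) {
        node = node.children[letter];
        if (!node) {
            return false;
        }
    }
    return node.isWord;
};

const trie = new Trie();
["apple", "apples", "pear"].forEach(word => trie.insert(word));
console.log(trie.has("apple")); // true
console.log(trie.has("appl")); // false, only a prefix of stored words
console.log(trie.has("appel")); // false
```

The hash route is even shorter, since the constructor already builds priorHash: check could test whether the word is a key of that object.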

    Original author: Bowen
    Original URL: https://segmentfault.com/a/1190000018357143