bash - How can I match numbers on one list (e.g. 2 and 3) to an approximate value on another list (e.g. 5)?

October 30, 2023, 276 views

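I am trying to match a set of audio files to passages of written text.

I started with a single audio file of one person reading a typed passage. I then used sox to split the audio file at each period of silence, and split the typed text in a similar way so that each sentence sits on its own line.

However, the audio was not split exactly at each full stop, but wherever the speaker paused. I need to create a list showing which audio files correspond to which typed sentences, for example: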

0001.wav This is a sentence.
0002.wav This is another sentence.

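Note that sometimes two or more audio files correspond to a single sentence, for example:

> 0001.wav ("This is a") 0002.wav ("sentence.") = "This is a sentence."

To help with the matching, I used software to count the syllables in the audio and the syllables in the typed text.

I have two files with this data. The first, "sentences.txt", lists every sentence in the text, one per line, preceded by its syllable count, for example: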

5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.

I can strip the sentence text with awk -F" " '{print $1}' sentences.txt to get this syllables_in_text.txt:

5
7
8
9
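For anyone who wants to reproduce this step, here is a minimal sketch of the extraction. It assumes the sample sentences.txt from the question and writes the counts to syllables_in_text.txt in the current directory. Since awk splits fields on whitespace by default, the -F option can be left out.

```shell
# Recreate the sample sentences.txt from the question.
cat > sentences.txt <<'EOF'
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.
EOF

# Keep only the first whitespace-separated field (the syllable count).
awk '{ print $1 }' sentences.txt > syllables_in_text.txt

cat syllables_in_text.txt
```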

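The second file, syllables_in_audio.txt, lists the audio files in the same order, each with an approximate syllable count. The count is sometimes slightly lower than the actual number in the text, because the syllable-counting software is not perfect: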

0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3

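How can I print a list of the audio files ("output.txt") so that the audio filenames appear on the same line number as the matching sentence in "sentences.txt", for example: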

0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0009.wav

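Below is a table of the two files, showing how they line up side by side. The files "0001.wav" and "0002.wav" are both needed to make the sentence "This is a sentence." These filenames are listed on line 1 of "output.txt", and the corresponding sentence is written as text on line 1 of "sentences.txt":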

Contents of "output.txt":    | Contents of "sentences.txt":
0001.wav 0002.wav            | 5 This is a sentence.
0003.wav 0004.wav            | 7 This is another sentence.
0005.wav                     | 8 This is yet another sentence.
0006.wav 0007.wav 0009.wav   | 9 This is still yet another sentence.
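A side-by-side view like the table above could be produced with paste, which joins the files line by line. This is only a sketch for checking the result by eye: the filename side_by_side.txt and the column width of 28 are assumptions for illustration, and the sample files are copied from the question as written.

```shell
# Sample files copied from the question.
cat > output.txt <<'EOF'
0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0009.wav
EOF

cat > sentences.txt <<'EOF'
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.
EOF

# Join line N of output.txt with line N of sentences.txt, then pad the
# first column so the separator lines up.
paste -d '|' output.txt sentences.txt |
    awk -F'|' '{ printf "%-28s | %s\n", $1, $2 }' > side_by_side.txt

cat side_by_side.txt
```

If the two files have different line counts, paste still prints every line and leaves the shorter side empty, which makes a mismatch easy to spot.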

Best answer: You can create an awk script along the following lines. Pseudocode:

BEGIN {
        init counter=1
        read your first file (syllables_in_text.txt) with getline till the end (while...)
            store its value in firstfile[counter]
            counter++
        # when you have finished reading your first file
        init another_counter=1 and total=0
        read your second file (syllables_in_audio.txt) with getline till the end (while...)
            if total + $2 (second col from your file) <= firstfile[another_counter]
                 append $1 like o[another_counter]=o[another_counter] " " $1
               else
                 another_counter++
                 total=0
                 store $1 like o[another_counter]=" " $1
            total += $2
        finally loop over the o array in index order
            print its contents after removing the leading space
}

But there are other solutions...