bash - How to parse the contents of a file using sed/awk?

March 12, 2023

The contents of my input file are in the following format, with each column separated by a single space:

string1<space>string2<space>string3<space>YYYY-mm-dd<space>hh:mm:ss.SSS<space>string4<space>10:1234567890<space>0e:Apple 1.2.3.4<space><space>string5<space>HEX

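There are two spaces after "0e:Apple 1.2.3.4" because the value in this field/column has no 14th character. The whole "0e:Apple 1.2.3.4<space>" is treated as a single value for that column.

In column 7, 10: gives the number of characters in the string that follows.

In column 8, 0e: is the hexadecimal value 14. So the hex prefix gives the number of characters in the string that follows it.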

Like this:

"0e:Apple 1.2.3.4 " --> this is the actual value in the 8th column, without the quotes
    (I've added the quotes to show that the 14th character is an empty space)

It's counted as
0e: A  p  p  l  e     1  .  2  .  3  .  4
    |  |  |  |  |  |  |  |  |  |  |  |  |  |
    1  2  3  4  5  6  7  8  9  10 11 12 13 14
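To make the counting rule concrete, here is a minimal Python sketch. The helper name `split_counted` is hypothetical, and the base of the count is passed in explicitly because the example uses a hex count in column 8 (`0e`) while the `10:` in column 7 only matches its 10 digits if read as decimal.

```python
from typing import Tuple

def split_counted(field: str, base: int) -> Tuple[int, str]:
    """Split 'NN:payload' into (declared length, payload)."""
    prefix, _, payload = field.partition(":")
    return int(prefix, base), payload

# Column 8 from the question, including its trailing pad space.
count, payload = split_counted("0e:Apple 1.2.3.4 ", 16)
print(count, len(payload), repr(payload.rstrip()))
```

The declared count (14) equals the payload length only when the pad space is included, which is why the column is followed by two spaces in the file.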

Let's consider the first line from the input file:

string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 10:1234567890 0e:Apple 1.2.3.4  string5 001e

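Where:

> string1 is the value in the 1st column
> string2 is the value in the 2nd column
> string3 is the value in the 3rd column
> yyyy-mm-dd is in the 4th
> 23:50:45.999 is in the 5th
> string4 is in the 6th
> 10:1234567890 is in the 7th // no space at the end, because it has all 10 digits
> 0e:Apple 1.2.3.4 is in the 8th, with a space at the end
> string5 is in the 9th
> 001e is in the 10th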

Expected output:

string1,string2,string3,yyyy-mm-dd,23:50:45.999,string4,1234567890,Apple_1.2.3.4,string5,30

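Requirements:

> Remove the counts (10: and 0e:) from the 7th and 8th columns
> The space between Apple and 1.2.3.4 should be replaced with "_"
> The hex value in the last column should be converted to decimal.
> Replace the spaces between columns with ","
> Here I have a hex value only in the 10th column. What if hex values appear in several columns? Is there a way to convert only specific columns?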

I tried this:

$ cat input.txt | sed 's/[a-z0-9].*://g'

The output is:

string1,string2,string3,yyyy-mm-dd,45.999,string4,1234567890,Apple,1.2.3.4,,string5,001e

Best answer: This will do what you want on your example input:

awk -F "[ ]" '{sub(/.*:/, "", $7); sub(/.*:/, "", $8); printf "%s,%s,%s,%s,%s,%s,%s,%s_%s,%s,%d\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $11, "0x"$12}' input.txt

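Explanation of the parts:

awk's printf lets you specify the output format, so you can decide by hand which fields are separated by "," and which are joined with "_".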

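-F "[ ]" forces the field separator to be a single space, so awk knows there is an empty field between two consecutive spaces. The default behavior treats a run of spaces as one separator, which, according to the question, is not what you want.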
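The same distinction can be seen in Python, shown here only as an illustration of the two splitting modes on the tail of the sample line: `str.split(" ")` behaves like `-F "[ ]"`, while `str.split()` collapses runs of whitespace like awk's default.

```python
line = "0e:Apple 1.2.3.4  string5 001e"

# Single-space separator: the two consecutive spaces yield an empty field.
print(line.split(" "))

# No argument: runs of whitespace act as one separator, so no empty field.
print(line.split())
```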

The sub function performs a regular-expression substitution; here it removes the "..:" prefix from fields 7 and 8.

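For field 12, we tell printf to output a number (%d) and pass it the string prefixed with 0x, so that it is interpreted as hexadecimal. Whether the string-to-number conversion honors the 0x prefix depends on the awk implementation; GNU awk does not do so by default and provides strtonum() for this purpose.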
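On the question's last point (hex values in several columns), one possible approach is to list the columns to convert explicitly. The sketch below is in Python rather than awk; the function name `hex_columns_to_decimal` and the sample row are made up for illustration.

```python
def hex_columns_to_decimal(fields, hex_columns):
    """Return a copy of fields with the given 1-based columns converted from hex to decimal."""
    out = list(fields)
    for col in hex_columns:
        out[col - 1] = str(int(out[col - 1], 16))
    return out

# Hypothetical row in which columns 2 and 3 both hold hex values.
row = ["string5", "001e", "ff"]
print(",".join(hex_columns_to_decimal(row, {2, 3})))
```

`int(text, 16)` accepts leading zeros, so a value such as 001e needs no extra cleanup.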

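Note: if the output is not always meant to be $8_$9 (for example, if the value can contain a different number of spaces), then you really do need to parse the hex prefix and count characters to determine where the field ends. If that's the case, I would personally prefer to write the whole thing in something else, e.g. Python.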
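A minimal Python sketch of that count-based approach might look like the following. It rests on several assumptions not confirmed by the question: the first six columns never contain spaces, the count in column 7 is decimal while the one in column 8 is hex, and padding consists of trailing spaces. The names `read_counted` and `parse_line` are hypothetical.

```python
def read_counted(line, pos, base):
    """Read 'NN:payload' starting at pos; return (payload, index just past it)."""
    colon = line.index(":", pos)
    count = int(line[pos:colon], base)
    start = colon + 1
    return line[start:start + count], start + count

def parse_line(line):
    # Assumption: the first six columns contain no spaces, so a bounded split is safe.
    *head, rest = line.rstrip("\n").split(" ", 6)
    col7, pos = read_counted(rest, 0, 10)        # assumed decimal count
    col8, pos = read_counted(rest, pos + 1, 16)  # hex count; +1 skips the separator
    col9, col10 = rest[pos + 1:].split(" ")
    col8 = col8.rstrip().replace(" ", "_")       # drop padding, then join words with "_"
    return ",".join(head + [col7, col8, col9, str(int(col10, 16))])

sample = ("string1 string2 string3 yyyy-mm-dd 23:50:45.999 string4 "
          "10:1234567890 0e:Apple 1.2.3.4  string5 001e")
print(parse_line(sample))
```

Because the payload is sliced by its declared length, spaces inside column 8 never confuse the column boundaries, whatever their number.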