bash – 生成二元选择的不同方法

2023年9月20日 226次阅读

我目前正在阅读Advanced
Bash-Scripting Guide,并发现以下内容：

   # Generate binary choice, that is, "true" or "false" value.
   BINARY=2
   T=1
   number=$RANDOM

   let "number %= $BINARY"
   #  Note that    let "number >>= 14"    gives a better random distribution
   #+ (right shifts out everything except last binary digit).
   if [ "$number" -eq $T ]
   then
       echo "TRUE"
   else
       echo "FALSE"
   fi  

   echo

为什么建议采用第15位而不是第1位？一些二元决策的运行显示两者之间没有显着差异.

//更新
既然我被问到如何计算分布,我们就去了.我生成了几个$RANDOM数字,取了每个数字的第15位和第1位并创建了两个二进制序列.之后我循环查看这些序列,检查1和0链(运行),计算出最大长度序列将生成多少个(供参考)并将所有内容打印到一个令人困惑的表中.这是所有它的荣耀中的代码(对于脏代码感到抱歉……)：

#! /bin/bash
COUNT=10000
RUN=1

# generate 2 sequences based on the same $RANDOM numbers
# seq1 = modulo 2, seq2 = bitshift 14
while [ $RUN -le $COUNT ]
do
    number=$RANDOM 
    let 'var1=number%2'
    var2=$number 
    let 'var2 >>= 14'
    seq1="${seq1}${var1}"
    seq2="${seq2}${var2}"
    (( RUN+=1 ))
done

# loop through sequences and check for chains of 1 and 0 (runs)
length=${#seq1}
prevSym=${seq1:0:1}
currRun="${prevSym}"
for (( i=1; i<length; i++ )); do
    currSym=${seq1:$i:1}
    if (( currSym==prevSym )); then
        currRun="${currRun}${currSym}"
        (( i!=length-1 )) && continue
        (( runStat1[${#currRun}]++ ))               #case: ends with run length > 1
        break
    fi
    (( runStat1[${#currRun}]++ ))
    (( prevSym=currSym ))
    (( i==length-1 )) && (( runStat1[1]++ ))             #case: ends with run length = 1
    currRun="${currSym}"
done

length=${#seq2}
prevSym=${seq2:0:1}
currRun="${prevSym}"
for (( i=1; i<length; i++ )); do
    currSym=${seq2:$i:1}
    if (( currSym==prevSym )); then
        currRun="${currRun}${currSym}"
        (( i!=length-1 )) && continue
        (( runStat2[${#currRun}]++ ))               #case: ends with run length > 1
        break
    fi
    (( runStat2[${#currRun}]++ ))
    (( prevSym=currSym ))
    (( i==length-1 )) && (( runStat2[1]++ ))             #case: ends with run length = 1
    currRun="${currSym}"
done

# print results and expected frequency
# number of expected runs with runlength k:
# 1/2**k if k<n, 1/2**(k-1) if k=n  
# $RANDOM generates random numbers in the range 0 to 32768 thus n=15
n=15
echo -e "Length L of run | # of runs with %2 | # of runs with >>14 | # of runs with MLS (calculated)\n "
echo -e "L\t|%2\t|>>14\t|MLS"
echo -e "-----------------------------------\n"
sorted="${!runStat1[*]} ${!runStat2[*]}" 
sorted=$(echo $sorted | tr ' ' '\n' | sort -n | uniq)
for a in $sorted; do
    k=${a}
    (( ${a}==${n} )) && (( k=a-1 ))
    prob=$(awk -v k=${a} -v c=${COUNT} 'BEGIN { print (((1/2)**k)*c)/k}')
    echo -e "${a} \t| ${runStat1[$a]} \t| ${runStat2[$a]} \t| ${prob} "
done

运行它会打印出这些内容：

Length L of run | # of runs with %2 | # of runs with >>14 | # of runs with MLS (calculated)
L   |%2 |>>14   |MLS
-----------------------------------

1   | 2495  | 2450  | 5000 
2   | 1219  | 1212  | 1250 
3   | 638   | 621   | 416.667 
4   | 300   | 329   | 156.25 
5   | 162   | 166   | 62.5 
6   | 75    | 81    | 26.0417 
7   | 46    | 34    | 11.1607 
8   | 23    | 26    | 4.88281 
9   | 13    | 7     | 2.17014 
10  | 2     | 6     | 0.976562 
11  | 1     | 1     | 0.443892 
13  | 3     |   | 0.0939002 
15  |   | 2     | 0.0203451 
21  |   | 1     | 0.000227065

这让我得出的结论是,毫无疑问并且在所有bash引用中也提到过,$RANDOM是随机性的可怕来源……但“number>> = 14”也没有比“number”更好的随机分布对于二元选择,％= 2“.

……或者在这个愚蠢的计算中,我犯了一个大错误.你告诉我.

最佳答案使用高阶位的建议是因为许多随机数发生器被实现为
linear congruential generators,这在低阶位中产生差的随机性.

例如,以下RNG实施过去非常常见. (我相信它是作为C89标准中的一个例子.)

unsigned old_rand() {
  next = next * 1103515245 + 12345;
  return next;
}

现在看看这会产生什么样的数字.

2140733074   // even
3902869603   // odd
4012135520   // even
2255314201   // odd
3913576926   // even
2626310079   // odd
4159329932   // even
1903014357   // odd

位1根本不是随机的.

即使是像Java中使用的更高质量的LCG,也会受到这种影响,正如nice graphical demonstration所示.所以不要相信未知RNG的低阶位.