Best way to find out whether the text in a Java String contains UTF-8 encoded characters

August 3, 2019 · 236 views

Is there any other way to find out whether a Java String contains UTF-8 encoded characters, for example Arabic words?

I tried this code, but is it accurate, and does it do the job?

char c = 'أ';
int num = (int) c;

if(num> 128)
// then UTF-8 characters exists

Best answer (assuming "UTF-8" == non-ASCII)

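What you can do is encode the string to ASCII, decode it back, and compare the result with the original string. If the two are not equal, the string contains non-ASCII characters.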
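A minimal sketch of that round trip, assuming the standard `String.getBytes(Charset)` behaviour of replacing unmappable characters with `?`. The class and method names (`AsciiRoundTrip`, `isPureAscii`) are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class AsciiRoundTrip {
    // Hypothetical helper: true when the string survives an
    // ASCII encode/decode round trip unchanged.
    static boolean isPureAscii(String s) {
        // Characters that US-ASCII cannot represent are replaced with '?'.
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        String decoded = new String(bytes, StandardCharsets.US_ASCII);
        // Any replacement makes the decoded string differ from the original.
        return s.equals(decoded);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("hello"));        // plain ASCII text
        System.out.println(isPureAscii("\u0623hello"));  // starts with Arabic alef with hamza (U+0623)
    }
}
```

The Arabic letter is written as a `\u0623` escape so that the example does not depend on the encoding of the source file.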

However, your own sample would work as well (almost: the test should be >= 128), because the following shows that every char < 128 is indeed ASCII:

To allow backward compatibility, the 128 ASCII and 256 ISO-8859-1 (Latin 1) characters are assigned Unicode/UCS code points that are the same as their codes in the earlier standards.

The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. Both UTF-16 and UCS-2 encode valid code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.

("UTF-16" and "ASCII", Wikipedia)

A Java char is a UTF-16 "code unit".
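The per-character check from the question, with the comparison corrected to >= 128, could be sketched like this. The names `NonAsciiScan` and `containsNonAscii` are hypothetical. Because each char is a UTF-16 code unit, both halves of a surrogate pair are also >= 128 and count as non-ASCII:

```java
class NonAsciiScan {
    // Hypothetical helper: true if any UTF-16 code unit lies outside
    // the ASCII range 0..127.
    static boolean containsNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 128) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsNonAscii("hello"));   // ASCII only
        System.out.println(containsNonAscii("\u0623"));  // Arabic alef with hamza (U+0623)
    }
}
```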

However, judging from the question as a whole, you would probably do best to first read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).