.net – Identify and remove characters that will break FOR XML

Getting an error when creating XML:

Msg 6841, Level 16, State 1, Line 26 FOR XML could not serialize the
data for node ‘value’ because it contains a character (0x000C) which
is not allowed in XML. To retrieve this data using FOR XML, convert it
to binary, varbinary or image data type and use the BINARY BASE64
directive.

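I figured out how to fix this with TSQL.

My question is how to prevent it.

The data is loaded via .NET C#.
Some cleanup is already done:
– remove leading and trailing whitespace
– collapse multiple spaces into a single space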

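Which characters will break FOR XML?

How can I identify and remove those characters in .NET C#?
That is, on input, before the data even gets into SQL.

The XML is generated with TSQL FOR XML (not via .NET).

Found this link:
Valid characters in XML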

Unicode code points in the following code point ranges are always
valid in XML 1.1 documents:[2] U+0001–U+D7FF, U+E000–U+FFFD: this
includes most C0 and C1 control characters, but excludes some (not
all) non-characters in the BMP (surrogates, U+FFFE and U+FFFF are
forbidden); U+10000–U+10FFFF: this includes all code points in
supplementary planes, including non-characters.

I don't know how to test for U+0001–U+D7FF.

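This answer covers more than the question asked.
As stated in the question, I was already doing other input filtering.
I just wanted to add the XML part.
In the real application all control characters will be filtered out, since this user data should not contain any control characters.
The win1252 part is there to align with data stored in SQL char (single byte).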

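This uses the XML 1.0 character set, since that is what my FOR XML allows.
It also only handles single 16-bit values, since a char in .NET is a 16-bit UTF-16 code unit.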

// Requires: using System.Globalization; using System.Text; using System.Text.RegularExpressions;
public static string RemoveDiatricsXMLsafe(string unicodeString, bool toLower, bool toWin1252)
{
    // clearly could just create the Regex once in a static field
    unicodeString = Regex.Replace(unicodeString, @"\s{2,}", " ");
    unicodeString = unicodeString.Trim();
    if (toLower) unicodeString = unicodeString.ToLowerInvariant();

    StringBuilder sb = new StringBuilder();
    foreach (char c in unicodeString.Normalize(NormalizationForm.FormD)) // NormalizationForm.FormKD breaks
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                // combining marks (the diacritics) are dropped
                break;
            default:
                // XML 1.0, single code units only:
                // U+0009, U+000A, U+000D are the only C0 controls accepted;
                // U+0020–U+D7FF, U+E000–U+FFFD
                // Compare as char (unsigned 16-bit): a cast to Int16 turns
                // U+8000 and above negative, so those characters would be dropped.
                bool validXML = c == '\u0009' || c == '\u000A' || c == '\u000D'
                    || (c >= '\u0020' && c <= '\uD7FF')
                    || (c >= '\uE000' && c <= '\uFFFD');
                if (validXML) sb.Append(c);
                break;
        }
    }
    if (!toWin1252)
    {
        return sb.ToString();
    }

    // Round-trip through Windows-1252 so the result matches what a SQL char column can store.
    // Same result as Encoding.Convert from Encoding.Unicode followed by GetChars.
    Encoding win1252 = Encoding.GetEncoding("Windows-1252");
    byte[] win1252Bytes = win1252.GetBytes(sb.ToString());
    return win1252.GetString(win1252Bytes);
}
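
As an alternative to hand-written range checks, the framework already exposes `System.Xml.XmlConvert.IsXmlChar` and `XmlConvert.IsXmlSurrogatePair` (available since .NET Framework 4.0), which test against the XML 1.0 character rules. The following is only a minimal sketch of how they could be used for the filtering step; the class name `XmlCharFilter` and the method name `StripInvalidXmlChars` are made up for illustration and are not part of the original post.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper (not from the original post): keeps only characters
    // that XML 1.0 allows, including well-formed surrogate pairs.
    public static string StripInvalidXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Single UTF-16 code unit that is legal in XML 1.0.
                sb.Append(c);
            }
            else if (i + 1 < input.Length
                     && XmlConvert.IsXmlSurrogatePair(input[i + 1], c))
            {
                // High surrogate followed by its low surrogate: keep both.
                // Note the argument order: low surrogate first, then high.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            // Anything else (U+000C, lone surrogates, U+FFFE, ...) is dropped.
        }
        return sb.ToString();
    }
}
```

Unlike the function above, this sketch keeps supplementary-plane characters, which may or may not be wanted if the data ends up in a single-byte SQL char column.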

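Best answer: On the .NET side, you should be able to use a regular expression to see whether you have an odd bird:

// Note: this class is the XML 1.1 range quoted in the question; XML 1.0 is narrower and excludes U+000C.
var reg = new Regex("[^\u0001-\ud7ff\ue000-\ufffd]");
if (reg.IsMatch(...))
{
    // do what you want if you find something you don't want
}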
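
The character class in the answer follows the XML 1.1 ranges quoted in the question, and those still admit U+000C, the character named in the error message. A minimal sketch for the XML 1.0 ranges instead, assuming the goal is to strip offending characters rather than only detect them, might look like this. The names `ForXmlSanitizer`, `HasInvalidChars` and `Clean` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ForXmlSanitizer
{
    // Matches any single UTF-16 code unit that is NOT one of the XML 1.0
    // ranges: U+0009, U+000A, U+000D, U+0020-U+D7FF, U+E000-U+FFFD.
    // Assumption: supplementary characters are not needed, so surrogates
    // (U+D800-U+DFFF) are treated as invalid and removed as well.
    private static readonly Regex InvalidXml10 = new Regex(
        @"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]",
        RegexOptions.Compiled);

    // True if the input contains at least one character FOR XML would reject.
    public static bool HasInvalidChars(string input)
    {
        return InvalidXml10.IsMatch(input);
    }

    // Returns the input with every offending character removed.
    public static string Clean(string input)
    {
        return InvalidXml10.Replace(input, string.Empty);
    }
}
```

Building the `Regex` once in a static field avoids re-parsing the pattern for every value that is loaded.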