5

While parsing an XML file, StAX produces this error:

Unicode(0xb) error-An invalid XML character (Unicode: 0xb) was found in the element content of the document.

The image below shows the XML line that contains the special character, displayed as "VI". It is not an alphabetic character: if you copy and paste it into Notepad, it shows up as a symbol. I tried parsing the file with StAX and got the error above.

[image: the XML line containing the special character]

Please can somebody give me a solution for this?

Thanks in advance.

CC BY-SA 3.0

3 Answers

8

0xB (vertical tab) is not a valid character in XML 1.0. The only valid characters below ASCII 32 (0x20, space) are 0x9 (tab), 0xA (line feed) and 0xD (carriage return).

In short, what you are trying to parse is NOT XML.

2
  • Sorry for the late reply, and thanks for the answer. The problem is that I don't have control over how the XML file is generated: it is produced by another application, and I still have to parse it. Could you suggest a proper solution? Jan 15, 2013 at 2:39
  • The "proper solution" is to go to the people who write/supply the software and get them to fix it. They're not generating XML. They're generating some stuff with lots of '<' and '>' in that looks quite a lot like XML, but isn't. If this isn't an option, you may be able to get away with filtering the data before giving it to an XML parser, but I'd strongly advise that you try and get the problem fixed at source.
    – dty
    Jan 15, 2013 at 10:23
4

This error occurs whenever an invalid XML character appears in the XML. If you open the file in Notepad++, such characters show up as VT, SOH, FF and so on; all of these are invalid XML characters. I am using XML version 1.0, and I validate text data before storing it in the database with this pattern:

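// Regex escapes (\\x..) are used instead of \uXXXX: javac expands \u000A to a raw newline, and \u10000 reads as \u1000 then '0'.
Pattern p = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]+");
retunContent = p.matcher(retunContent).replaceAll("");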

This ensures that no invalid character ends up in the XML.

3

According to the W3C XML 1.0 Recommendation, 0xb is not allowed in an XML file:

Character Range [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So strictly speaking your input file is not an XML file.
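
The Char production quoted above can be written out as a plain range check. The following is only a sketch: the class and method names (`XmlChars`, `isValidXmlChar`, `firstInvalidIndex`) are made up for illustration and are not part of StAX or any other library.

```java
/** Hypothetical helper that mirrors the XML 1.0 Char production. */
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point is allowed by the Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first disallowed character, or -1 if there is none. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a surrogate pair as one code point
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return -1;
    }
}
```

Running `firstInvalidIndex` over the raw text is one way to locate the offending character before handing the data to a parser.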

2
  • Sorry for the late reply, and thanks for the answer. The problem is that I don't have control over how the XML file is generated: it is produced by another application, and I still have to parse it. Could you please suggest a proper solution? Jan 15, 2013 at 2:40
  • 1
    You could try to sanitize the file before you parse it.
    – Henry
    Jan 15, 2013 at 6:08
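
If fixing the generator is not an option, the sanitizing idea from the comments might look roughly like the sketch below: a `Reader` wrapper drops the forbidden characters before StAX sees them. `XmlSanitizingReader` and `SanitizeDemo` are hypothetical names, and the filter makes one simplifying assumption: surrogate chars are passed through untouched, on the expectation that they are correctly paired.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical filter: silently drops chars that XML 1.0 does not allow. */
class XmlSanitizingReader extends FilterReader {
    XmlSanitizingReader(Reader in) {
        super(in);
    }

    private static boolean isAllowed(char c) {
        // Assumption: surrogates (0xD800-0xDFFF) are kept and expected to be well paired.
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;                        // -1 signals end of stream
            }
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i];        // compact the allowed chars in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // Every char in this chunk was dropped: read again instead of returning 0.
        }
    }
}

/** Hypothetical demo: collects the character data of a document via StAX. */
class SanitizeDemo {
    static String textOf(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new XmlSanitizingReader(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.CHARACTERS) {
                    sb.append(r.getText());
                }
            }
            r.close();
            return sb.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a file, the same wrapper could be placed around an `InputStreamReader` opened with the file's actual encoding. Note that this removes data silently, so it is a workaround rather than a substitute for fixing the generator.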
