We have a huge database where users can create custom fields. Any UTF-8 character is allowed in a field name. Until a few weeks ago, the only invalid characters that showed up in the field names of users exporting their data as XML were the slash / and whitespace characters, and we replaced them with underscores.
Now I see that some users who need an XML export are using characters such as * and ! in their field names. So if a field is named, for example, invalid*name! instead of valid_name, this script will break.
The part of the code that defines the tag name:
$doc = new DOMDocument();
$elementName = is_numeric($key) ? (string)$name : (string)$key;
$elementName = str_replace(array('/', ' '), '_', trim($elementName));
$node = $doc->createElement($elementName); // here I get error "invalid character name"
Sample of valid XML:
<?xml version="1.0"?>
<rows total="621" page="1">
<row>
<valid_name>60E49542D19D16EDB633A40</valid_name>
....
I don't want users to see characters like ! or * in their element names. I need to know which characters are not allowed in an element name, and I will probably replace them with an underscore. I am also open to a better suggestion than replacing them with an underscore.
@Quentin suggests the better way. Using dynamic node names means that you cannot define an XSD/schema, so your XML files can only be checked for well-formedness, and you will not be able to make full use of validators. A <field name="..."/> element is therefore the better solution from a machine readability and maintenance point of view.
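A minimal sketch of that approach, assuming the export loops over rows of field name/value pairs. The `$row` array and the element names `row` and `field` are made up for illustration; only standard DOM calls are used. Because the user-defined name ends up in an attribute value, characters such as * or ! no longer need to be replaced.

```php
<?php
// Hypothetical sample row: keys are user-defined field names.
$row = [
    'valid_name'    => '60E49542D19D16EDB633A40',
    'invalid*name!' => 'some value',
];

$doc = new DOMDocument('1.0', 'UTF-8');
$rowNode = $doc->appendChild($doc->createElement('row'));

foreach ($row as $fieldName => $value) {
    // The element name is fixed; the user-defined name goes into an attribute.
    $field = $rowNode->appendChild($doc->createElement('field'));
    $field->setAttribute('name', (string)$fieldName);
    // Adding the value as a text node lets DOM escape &, < and > in it.
    $field->appendChild($doc->createTextNode((string)$value));
}

echo $doc->saveXML($rowNode);
```

With a fixed element name, a schema can describe `field` once, and consumers read the original field name from the `name` attribute.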
However, NCNames (non-colonized names) allow quite a lot of characters. Here is what I implemented in my library for converting JSON.
$nameStartChar defines letters and several Unicode ranges. $nameChar adds some more characters to that definition (such as the digits).
The first RegExp removes any character that is NOT a name char. The second removes any leading characters that are NOT defined in $nameStartChar. If the result is empty, the function returns a default name.
function normalizeString(string $string, string $default = '_'): string {
    // Characters allowed at the start of an NCName
    $nameStartChar =
        'A-Z_a-z'.
        '\\x{C0}-\\x{D6}\\x{D8}-\\x{F6}\\x{F8}-\\x{2FF}\\x{370}-\\x{37D}'.
        '\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}'.
        '\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}'.
        '\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}';
    // Characters allowed after the first one (the hyphen is added in the pattern below)
    $nameChar =
        $nameStartChar.
        '\\.\\d\\x{B7}\\x{300}-\\x{36F}\\x{203F}-\\x{2040}';
    $result = \preg_replace(
        [
            // 1. remove every character that is not a name char
            '([^'.$nameChar.'-]+)u',
            // 2. remove leading characters that are not name start chars
            '(^[^'.$nameStartChar.']+)u',
        ],
        '',
        $string
    );
    return empty($result) ? $default : $result;
}
A qualified XML node name can consist of two NCNames separated by ':'. The first part is the namespace prefix. That is why the function removes the colon, as the 'foo:bar' example shows.
$examples = [
    '123foo',
    'foo123',
    ' foo ',
    ' ',
    'foo:bar',
    'foo-bar'
];
foreach ($examples as $example) {
    var_dump(normalizeString($example));
}
Output:
string(3) "foo"
string(6) "foo123"
string(3) "foo"
string(1) "_"
string(6) "foobar"
string(7) "foo-bar"
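The question asks for replacing invalid characters with an underscore rather than removing them. The following is a sketch of such a variant, not part of my library: the function name `replaceInvalidNameChars` is made up for illustration. It reuses the same character classes, replaces each run of invalid characters with a single underscore, and prefixes an underscore when the first character is not a valid start character.

```php
<?php
// Hypothetical variant: replace invalid characters instead of removing them.
function replaceInvalidNameChars(string $string, string $default = '_'): string {
    $nameStartChar =
        'A-Z_a-z'.
        '\\x{C0}-\\x{D6}\\x{D8}-\\x{F6}\\x{F8}-\\x{2FF}\\x{370}-\\x{37D}'.
        '\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}'.
        '\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}'.
        '\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}';
    $nameChar =
        $nameStartChar.
        '\\.\\d\\x{B7}\\x{300}-\\x{36F}\\x{203F}-\\x{2040}';

    // Each run of characters that are not name chars becomes one underscore.
    $result = preg_replace('([^'.$nameChar.'-]+)u', '_', trim($string));

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    if ($result === null || $result === '') {
        return $default;
    }
    // A name must not start with a digit, '.' or '-': prefix an underscore.
    if (preg_match('(^[^'.$nameStartChar.'])u', $result) === 1) {
        $result = '_'.$result;
    }
    return $result;
}

$doc = new DOMDocument();
foreach (['invalid*name!', '123foo', 'foo bar/baz', 'foo:bar'] as $example) {
    $name = replaceInvalidNameChars($example);
    $doc->createElement($name); // would throw a DOMException for an invalid name
    var_dump($name);
}
```

Keep in mind that different field names can collapse to the same element name with this kind of replacement (for example a*b and a!b both become a_b), which is one more argument for the attribute-based approach.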