std.uri.decodeComponent decodes invalid UTF-8
kdevel
kdevel at vogtner.de
Tue Aug 5 03:09:39 UTC 2025
```D
import std;
void main ()
{
   writeln (decodeComponent ("%c0%80")); // prints U+0000
   writeln (decodeComponent ("%c0%af")); // prints U+002F (SOLIDUS)
   string s = [0xc0, 0xaf];
   validate (s); // throws Invalid UTF-8 sequence (at index 2)
}
```
Quote from [1]
```
Implementations of the decoding algorithm above MUST protect
against decoding invalid sequences. For instance, a naive
implementation may decode the overlong UTF-8 sequence C0 80 into
the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into
U+233B4. Decoding invalid sequences may have security consequences
or cause other problems. See Security Considerations (Section 10)
below.
```
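One way to meet that requirement is sketched below: turn the percent escapes into raw bytes first, then run `std.utf.validate` over the whole result. The names `strictDecodeComponent` and `hexValue` are hypothetical and not part of Phobos. The sketch does not try to reproduce every detail of `std.uri.decodeComponent` (for example, it throws a plain `Exception` on malformed escapes).

```D
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int hexValue (char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
   if (c >= 'A' && c <= 'F') return c - 'A' + 10;
   return -1;
}

// Hypothetical helper, not a Phobos function: percent-decode into raw
// bytes, then validate the complete byte string as UTF-8.
string strictDecodeComponent (string encoded)
{
   char [] bytes;
   for (size_t i = 0; i < encoded.length; ++i)
   {
      if (encoded [i] != '%')
      {
         bytes ~= encoded [i];
         continue;
      }
      if (encoded.length - i < 3)
         throw new Exception ("truncated percent escape");
      immutable hi = hexValue (encoded [i + 1]);
      immutable lo = hexValue (encoded [i + 2]);
      if (hi < 0 || lo < 0)
         throw new Exception ("malformed percent escape");
      bytes ~= cast (char) ((hi << 4) | lo);
      i += 2; // skip the two hex digits just consumed
   }
   auto result = bytes.idup;
   validate (result); // throws UTFException on invalid UTF-8
   return result;
}

void main ()
{
   assert (strictDecodeComponent ("a%2Fb") == "a/b");
   assert (strictDecodeComponent ("%C3%A9") == "\u00E9");
   // The two overlong sequences from the example above are rejected.
   assertThrown!UTFException (strictDecodeComponent ("%c0%80"));
   assertThrown!UTFException (strictDecodeComponent ("%c0%af"));
}
```

Validating the decoded bytes as a whole, instead of decoding each escape group separately, means the check relies on the same routine that rejects `[0xc0, 0xaf]` in the example above.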
Quote from [2]
```
Perhaps the most famous UTF-8 attack was against unpatched
Microsoft Internet Information Server (IIS) 4 and IIS 5 servers.
If an attacker made a request that looked like this
http://servername/scripts/..%c0%af../winnt/system32/ cmd.exe
the server didn't correctly handle %c0%af in the URL. What do you
think %c0%af means? It's 11000000 10101111 in binary; and if it's
broken up using the UTF-8 mapping rules, we get this: 11000000
10101111. Therefore, the character is 00000101111, or 0x2F, the
slash (/) character! The %c0%af is an invalid UTF-8
representation of the / character. Such an invalid UTF-8 escape
is often referred to as an overlong sequence.
```
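The bit arithmetic in that quote can be reproduced with a few lines of D. This is only an illustration of what a naive two-byte decoder computes; `naiveDecode2` and `isOverlong2` are hypothetical names and do not describe how Phobos does it.

```D
import std.stdio;

// Naive two-byte decode: takes the 5 payload bits of the lead byte and
// the 6 payload bits of the continuation byte, with no further checks.
uint naiveDecode2 (ubyte lead, ubyte cont)
{
   return ((lead & 0x1F) << 6) | (cont & 0x3F);
}

// A two-byte sequence is overlong if its code point fits into one byte.
bool isOverlong2 (ubyte lead, ubyte cont)
{
   return naiveDecode2 (lead, cont) < 0x80;
}

void main ()
{
   writefln ("U+%04X", naiveDecode2 (0xC0, 0xAF)); // prints U+002F
   writefln ("U+%04X", naiveDecode2 (0xC0, 0x80)); // prints U+0000
   assert (isOverlong2 (0xC0, 0xAF));
   assert (!isOverlong2 (0xC3, 0xA9)); // U+00E9, a legitimate two-byte sequence
}
```

Any two-byte sequence with lead byte C0 or C1 decodes to a value below 0x80, so a strict decoder can reject those lead bytes outright.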
Has the UTF-8 decoding been implemented in multiple places? [3]
[1] *UTF-8, a transformation format of ISO 10646* (2003)
https://www.rfc-editor.org/rfc/rfc3629#page-5
[2] *CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic*
https://capec.mitre.org/data/definitions/80.html
[3] *Re: Reducing the cost of autodecoding* (2016)
https://forum.dlang.org/post/htxicoxzningxnpeyzui@forum.dlang.org