Unicode BOM and endianness
Thomas Kuehne
thomas-dloop at kuehne.cn
Fri Aug 4 13:42:05 PDT 2006
Hasan Aljudy wrote on 2006-08-04:
>
>
> Derek Parnell wrote:
>> On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
>>
>>
>>>How do I acquire and determine the BOM and endianness of a file I am
>>>reading?
>>>
>>>Thanks
>>
>>
>> You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark
>>
>
> Are GNU tools really as ignorant of Unicode as that page implies?
>
> [quote]
> While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may
> be used to mark text as UTF-8. Quite a lot of Windows software
> (including Windows Notepad) adds one to UTF-8 files. However in
> Unix-like systems (which make heavy use of text files for configuration)
> this practice is not recommended, as it will interfere with correct
> processing of important codes such as the hash-bang at the start of an
> interpreted script.
Take two UTF-8 files, A and B, that each start with a BOM:
cat A B > C
A's BOM remains a BOM at the start of C, but B's BOM now sits in the middle
of the stream and will be interpreted as U+FEFF "zero-width no-break space".
Thus using BOMs in combination with streaming, concatenating etc. will always
cause problems. In contrast to Windows, Linux (home to the GNU tools) makes no
distinction between "text" and "binary" files: both are treated as "binary".
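To make this concrete, here is a rough sketch in Python (not D, and not taken
from any particular tool). The helper name detect_bom and the sample contents
are made up for illustration; only the BOM constants come from Python's
standard codecs module. It checks the leading bytes for a BOM, which also
gives the endianness for UTF-16 and UTF-32, and then shows what is left in
the text after the equivalent of "cat A B > C".

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Hypothetical helper: return (encoding, bom_length) for the leading
    bytes of a file, or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Stand-ins for the files A and B: UTF-8 text, each with its own BOM.
a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Equivalent of `cat A B > C`: a plain byte-wise concatenation.
c = a + b

# Only the BOM at the very start of C is recognised and skipped.
encoding, bom_len = detect_bom(c)
text = c[bom_len:].decode(encoding)

# B's BOM survives in the middle of the text as U+FEFF.
print(repr(text))
```

A file without any BOM yields (None, 0) here, in which case the encoding has
to be known or guessed by other means.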
Thomas