Unicode BOM and endianness

Thomas Kuehne thomas-dloop at kuehne.cn
Fri Aug 4 13:42:05 PDT 2006


Hasan Aljudy wrote on 2006-08-04:
>
>
> Derek Parnell wrote:
>> On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
>> 
>> 
>>>How do I acquire and determine the BOM and endianness of a file I am
>>>reading?
>>>
>>>Thanks
>> 
>> 
>> You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark
>> 
>
> Are GNU tools really as ignorant of Unicode as that page implies?
>
> [quote]
> While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may 
> be used to mark text as UTF-8. Quite a lot of Windows software 
> (including Windows Notepad) adds one to UTF-8 files. However in 
> Unix-like systems (which make heavy use of text files for configuration) 
> this practice is not recommended, as it will interfere with correct 
> processing of important codes such as the hash-bang at the start of an 
> interpreted script.

Let's have 2 UTF-8 files with BOMs: A and B

cat A B > C

A's BOM will remain a BOM, but B's BOM ends up in the middle of C and is
going to be interpreted as a "zero-width no-break space". Thus using BOMs in
combination with streaming, concatenating, etc. will always cause problems.
In contrast to Windows, Linux (home to the GNU tools) treats both "text" and
"binary" files as "binary" files: the bytes are passed through unchanged.
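
A minimal sketch of the effect, in Python for brevity. The contents of A and
B are made up for illustration, and the byte-wise "+" stands in for cat:

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Hypothetical contents of files A and B, each saved with a UTF-8 BOM.
a = BOM + b"first\n"
b = BOM + b"second\n"

# Equivalent of `cat A B > C`: plain byte-wise concatenation.
c = a + b

# The "utf-8-sig" codec strips a BOM only at the very start of the data.
text = c.decode("utf-8-sig")

# B's BOM survives as U+FEFF in the middle of the decoded text.
stray = text.count("\ufeff")
print(repr(text))
print("stray U+FEFF characters:", stray)
```

The leading BOM is consumed by the decoder, but the one that came from B is
still in the text, so anything that compares or parses the second line sees an
extra invisible character in front of it.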

Thomas




