Parsing D files with non-Unicode characters

Wed Nov 7 18:05:57 UTC 2018

On Wednesday, 7 November 2018 at 17:37:11 UTC, Neia Neutuladh 
wrote:
> On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
>> I hadn't seen that you provided a link to the file.  After I 
>> found it, I played with it a bit.  It looks like if you add a 
>> UTF-8 BOM in the beginning then DMD successfully parses it. 
>> However, from my quick scan of lexer.d, I didn't see anywhere 
>> in the code that actually changes how it decodes the file 
>> based on the presence of the BOM.  Does anyone know if it 
>> does?  Is DMD supposed to allow multi-byte UTF-8 characters if 
>> there is no BOM?  If so, then this is a bug.
>
> It can handle multibyte UTF-8 characters without a byte order 
> mark. Should be straightforward to test this:
>
>   echo '/* ſ™🅻 */' > file.d
>   dmd -c file.d
>
> On dmd 2.081.1, the byte order mark changes nothing for me.

OK, that matches what I saw in the code.  It looks like my editor 
was actually changing the file.  The encoding used in that file 
is not valid UTF-8, so the solution is to re-encode the files as 
UTF-8.
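
For anyone hitting the same problem, here is a rough sketch of how the 
check and the re-encoding could be done with iconv.  The helper names 
are made up for illustration, and ISO-8859-1 as the source encoding is 
only an assumption: the thread never establishes which encoding the 
file actually used, so substitute the right one.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: rewrite a file as UTF-8.
# The second argument is the source encoding; the ISO-8859-1 default
# is an assumption, not something confirmed in this thread.
reencode_to_utf8() {
    iconv -f "${2:-ISO-8859-1}" -t UTF-8 "$1" > "$1.utf8" &&
        mv "$1.utf8" "$1"
}
```

Usage would be something like 
`is_valid_utf8 file.d || reencode_to_utf8 file.d ISO-8859-1`, 
followed by `dmd -c file.d` to confirm the file now parses.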