Parsing D files with non-unicode characters

Wed Nov 7 15:51:13 UTC 2018

On Wednesday, 7 November 2018 at 05:33:20 UTC, Arun 
Chandrasekaran wrote:
> On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler 
> wrote:
>> On Monday, 5 November 2018 at 23:50:46 UTC, Arun 
>> Chandrasekaran wrote:
>>> I'm converting a large amount of header files from C to D 
>>> using DStep and I'm stuck at 
>>> https://github.com/jacob-carlborg/dstep/issues/215
>>>
>>> https://dlang.org/spec/intro.html shows that ASCII and UTF 
>>> char formats are accepted. How do I go about converting a 
>>> large code base like this?
>>>
>>> Is this a bug in D to reject non-unicode chars in comments?
>>>
>>> Arun
>>
>> So you have code that has characters that are neither ascii 
>> nor unicode?  What encoding is it using?  And what characters 
>> does it contain that can't be represented with unicode?
>
> I was not able to find the character encoding. file -i said 
> unknown-8bit. Ultimately https://github.com/BYVoid/uchardet 
> helped me to determine the charset, it was SHIFT-JIS.

I hadn't seen that you provided a link to the file.  After I 
found it, I played with it a bit.  It looks like if you add a 
UTF-8 BOM at the beginning, DMD parses it successfully. 
However, from my quick scan of lexer.d, I didn't see anywhere in 
the code that changes how the file is decoded based on the 
presence of the BOM.  Does anyone know if it does?  Is DMD 
supposed to allow multi-byte UTF-8 characters if there is no 
BOM?  If so, then this is a bug.
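
In case it helps with the conversion itself, here is a rough sketch 
of how one might check a header for a UTF-8 BOM, test whether its 
bytes are valid UTF-8, and re-encode it from Shift-JIS to UTF-8 
before handing it to DStep.  It is written in Python only because 
its standard library ships a shift_jis codec.  The function names 
are made up for illustration, and it assumes the whole file really 
is Shift-JIS, as uchardet reported.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the buffer starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole buffer decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def shift_jis_to_utf8(data: bytes) -> bytes:
    """Re-encode a Shift-JIS buffer as UTF-8 (assumes the input is Shift-JIS)."""
    return data.decode("shift_jis").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical header content with a Japanese comment, encoded as Shift-JIS.
    sample = "// コメント\nint x;\n".encode("shift_jis")
    print(has_utf8_bom(sample), is_valid_utf8(sample))
    print(is_valid_utf8(shift_jis_to_utf8(sample)))
```

Note that prepending a BOM does not turn Shift-JIS bytes into valid 
UTF-8; the bytes after the BOM are unchanged, so re-encoding the 
files is the safer route.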