Announcement and Request: Typesafe Coordinate Systems for High-Throughput Sequencing Applications

Arne Ludwig arne.ludwig at posteo.de
Wed Sep 1 09:01:27 UTC 2021


On Wednesday, 1 September 2021 at 05:36:53 UTC, James Blachly 
wrote:
> In another post, I've just announced our D-based high 
> throughput sequencing library, dhtslib.
>
> One feature that is, AFAIK, novel in the field is leveraging 
> the compiler's type system to enforce correctness regarding 
> different genome/reference sequence coordinate systems. 
> Clearly, the encoding of domain specific knowledge in a 
> language's type system is nothing new, but it is surprising 
> that this has not been done before in bioinformatics, and it is 
> an idea that IMO is long overdue given the trainwreck of 
> different coordinate systems in our field.
>
> You can find dhtslib's develop branch, with Typesafe 
> Coordinates merged and ready to use, here:
>
> https://github.com/blachlylab/dhtslib/
>
>
> **Now the request:**
> We've drafted a manuscript describing Typesafe Coordinates as a 
> sort of low-key endorsement of the D language and our library 
> package `dhtslib`. You can find the manuscript here:
>
> https://github.com/blachlylab/typesafe-coordinates/
>
> We would be very grateful to those of you who would take the 
> time to read the manuscript and post comments (publicly or 
> privately), _especially if we have made any incorrect 
> statements_ or our language regarding type systems is awkward 
> or nonstandard.
>
> We did praise D, and gently criticized Rust and OCaml* somewhat 
> as it appeared to me that they lacked the features required to 
> implement Typesafe Coordinate Systems in as ergonomic a way as 
> we could in D. However, being a true novice at both of these 
> other languages there is the possibility that I've missed 
> something significant, and that the Rust and OCaml 
> implementations could be retooled to match the D 
> implementation. I'd still be glad to hear it if that's the case.
>
> I plan to make a few minor cleanups and submit this to a 
> preprint server as well as a scientific journal in the next 
> week or so.
>
> Kind regards
>
> James S Blachly, MD
> The Ohio State University
>
>
> * as a side note, I actually find the OCaml code quite 
> attractive in its terseness: `let j = cl_interval_of_ho 
> (ob_interval_of_zb i)`

Hi James and Charles,

I am happy to hear of your latest idea of creating type-safe 
coordinate systems. It's a great idea!

After reading the code on GitHub, I have only one major remark: 
IMHO, it would be great to separate the new coordinate systems 
from any `htslib` dependencies ([see lines 47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50)), since only auxiliary functions use both the coordinate systems and `htslib`. The larger goal I have in mind is to provide the coordinate systems in a separate DUB sub-package (e.g. `dhtslib:coordinates`) that requires only a D compiler. That would make integration into existing projects that do not need `htslib` much easier.

Also, I have a short list of minor, technical remarks:

1. The return type in [line 114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114) has a typo: there is an extra 's'.
2. The array of identifiers `CoordSystemLabels` in [line 
203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203) is a bit unsafe and not strictly required for two reasons:
     1. It can be generated by the compiler using `enum CoordSystemLabels = __traits(allMembers, CoordSystem);`.
     2. As far as I can tell, its only use is in [line 376](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L376). The same result can be achieved safely using `cs.stringof.split('.')[$ - 1]` or, without `std.array.split`, using `cs.stringof[CoordSystem.stringof.length + 1 .. $]`.
3. The function `unionImpl` in [line 326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326) actually computes the convex hull of the two intervals, which should be noted in the doc comment for completeness' sake.
4. I noticed that you use operator overloading for union and 
intersection of `Interval`s. You may also add overloads for the 
`offset` function in both `Interval` and `Coordinate` with `auto opBinary(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)` and `auto opBinaryRight(string op, T)(T off) if ((op == "+" || op == "-") && isIntegral!T)`.
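
Remarks 2 to 4 can be sketched roughly as follows. This is a minimal illustration under assumptions, not dhtslib's code: the member names of `CoordSystem`, the simplified `Interval` struct, the `convexHull` method name, and the `label` constant are all made up for the example. It wraps `__traits(allMembers, ...)` in an array literal, uses `std.conv.to` in place of `stringof` slicing to obtain the label, and defines `opBinaryRight` only for `+`, since the meaning of `n - interval` is unclear.

```d
import std.algorithm.comparison : max, min;
import std.conv : to;
import std.traits : isIntegral;

/// Hypothetical stand-in for the library's enum; member names are assumed.
enum CoordSystem { zbho, zbc, obho, obc }

/// Remark 2.1: labels derived from the enum, so the two cannot drift apart.
static immutable string[] coordSystemLabels = [__traits(allMembers, CoordSystem)];

/// Simplified interval type; not the library's actual definition.
struct Interval(CoordSystem cs)
{
    long start;
    long end;

    /// Remark 2.2 (variant): std.conv.to yields the bare member name.
    enum string label = to!string(cs);

    /// Remark 3: smallest single interval covering both inputs.
    /// For disjoint inputs this also covers the gap, unlike a set union.
    Interval convexHull(Interval other) const
    {
        return Interval(min(start, other.start), max(end, other.end));
    }

    /// Remark 4: `interval + n` and `interval - n` shift both ends.
    /// `op` is a string, so it is compared against string literals.
    auto opBinary(string op, T)(T off) const
        if ((op == "+" || op == "-") && isIntegral!T)
    {
        return mixin("Interval(start " ~ op ~ " off, end " ~ op ~ " off)");
    }

    /// Only `n + interval` is given a meaning in this sketch.
    auto opBinaryRight(string op, T)(T off) const
        if (op == "+" && isIntegral!T)
    {
        return this + off;
    }
}

unittest
{
    alias ZBHO = Interval!(CoordSystem.zbho);
    auto a = ZBHO(0, 10);
    auto b = ZBHO(20, 30);

    assert(coordSystemLabels == ["zbho", "zbc", "obho", "obc"]);
    assert(ZBHO.label == "zbho");
    assert(a.convexHull(b) == ZBHO(0, 30)); // includes the gap between a and b
    assert(a + 5 == ZBHO(5, 15));
    assert(5 + a == ZBHO(5, 15));
    assert(b - 20 == ZBHO(0, 10));
}
```

Saved as a single file, the checks should run with `dmd -unittest -main -run` followed by the file name. Because the coordinate system is a template parameter, `Interval!(CoordSystem.zbho)` and `Interval!(CoordSystem.obc)` are distinct types, so the offset operators never mix systems.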

I enjoyed reading the manuscript. It highlights the issue clearly 
and presents the solution without getting lost in details. 
Ignoring typos at this stage, I have no remarks on it – keep 
going!


Cheers!

-- Arne

