Dub, Cargo, Go, Gradle, Maven
H. S. Teoh
hsteoh at quickfur.ath.cx
Fri Feb 16 18:16:12 UTC 2018
On Mon, Feb 12, 2018 at 10:35:06AM +0000, Russel Winder via Digitalmars-d wrote:
> In all the discussion of Dub to date, it hasn't been pointed out that
> JVM building merged dependency management and build a long time ago.
> Historically:
>
> Make → Ant → Maven → Gradle
>
> and Gradle can handle C++ as well as JVM language builds.
>
> So the integration of package management and build as seen in Go,
> Cargo, and Dub is not a group of outliers. Could it be then that it is
> the right thing to do? After all, package management is a dependency
> management activity and build is a dependency management activity, so
> why separate them, just have a single ADG to describe the whole thing.
I have no problem with using a single ADG/DAG to describe the whole
thing. However, a naïve implementation of this raises a few issues:
If a dependent node requires network access, it forces network access
every time the DAG is updated. This is slow, and also unreliable: the
shape of the DAG could, in theory, change arbitrarily at any time
outside the control of the user. If I'm debugging a program, the very
last thing I want to happen is that the act of building the software
also pulls in new library versions that cause the location of the bug to
shift, thereby ruining any progress I may have made on narrowing down
its locus. It would be nice to locally cache such network-dependent
nodes so that they are only refreshed on demand.
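A minimal sketch of what "refreshed only on demand" could look like, in Python. The `CachedResolver` class, its `fetch` callback, and the simulated `remote` registry are all hypothetical names made up for illustration; they are not taken from Dub or any other real tool.

```python
class CachedResolver:
    """Resolve dependencies from a local cache; touch the network only on request."""

    def __init__(self, fetch):
        self._fetch = fetch       # hypothetical callable: name -> version string
        self._cache = {}          # name -> version obtained at the last refresh
        self.network_calls = 0    # counts simulated network round trips

    def refresh(self, name):
        """Explicit, user-initiated network access."""
        self.network_calls += 1
        self._cache[name] = self._fetch(name)
        return self._cache[name]

    def resolve(self, name):
        """Used by ordinary builds: never touches the network."""
        if name not in self._cache:
            raise LookupError(f"{name} is not cached; run refresh({name!r}) first")
        return self._cache[name]


# Simulated remote registry whose contents change outside the user's control.
remote = {"libfoo": "1.0"}
resolver = CachedResolver(lambda name: remote[name])

resolver.refresh("libfoo")            # one deliberate fetch
remote["libfoo"] = "2.0"              # upstream publishes a new version
pinned = resolver.resolve("libfoo")   # the build still sees the cached version
```

The point of splitting `refresh` from `resolve` is that a build in the middle of a debugging session cannot change the shape of the DAG by accident; only an explicit refresh can.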
Furthermore, a malicious external entity can introduce arbitrary changes
into the DAG, e.g., hijack an intermediate DNS server so that network
lookups get redirected to a malicious server which then adds
dependencies on malware to your DAG. The next time you update: boom,
your software now contains a trojan horse. (Even better if you have
integrated package dependencies with builds all the way to deployment:
now all your customers have a copy of the trojan deployed on their
machines, too.) To mitigate this, some kind of security model would be
required (e.g., verifiable server certificates, cryptographically signed
package payloads). This adds to the cost of refreshing network nodes,
and hence is another big reason why refreshing should be done on demand,
NOT automatically every time you ask for a new software build.
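As a rough illustration of payload verification, here is a Python sketch that uses digest pinning, which is a simpler stand-in for the signed payloads mentioned above, not an implementation of them. It only detects that a payload differs from what was recorded at the last deliberate refresh. The function name `verify_payload` and the sample payloads are hypothetical.

```python
import hashlib
import hmac


def verify_payload(payload: bytes, expected_sha256: str) -> bool:
    """Return True only if the payload matches the digest recorded earlier."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, expected_sha256)


good = b"library X v1.0 contents"
# Assumption: this digest was stored locally at the last deliberate refresh.
pinned_digest = hashlib.sha256(good).hexdigest()

tampered = b"library X v1.0 contents + trojan"

accepted = verify_payload(good, pinned_digest)
rejected = verify_payload(tampered, pinned_digest)
```

A real system would still need signatures or a trusted channel to establish the pinned digest in the first place; the sketch only covers the "has this changed under me?" half of the problem.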
Also, if the machine I'm working on happens to be offline, it would
totally suck to be unable to build my project just because of that.
The whole point of having a DAG is reliable builds, and having the graph
depend on remote resources over an inherently unreliable network defeats
the purpose. That is why caching is basically mandatory, as is control
over when the network is accessed.
And furthermore, one always has to be mindful of the occasional need to
roll back. Generally, source code control is used for the local source
code component -- if you need to revert a change, just check out an
earlier revision from your repo. But if a network resource that used to
provide library X v1.0 now has moved on to X v2.0, and has dropped all
support for v1.0 so that it is no longer downloadable from the server,
then rollback is no longer possible. You are now unable to reproduce a
build you made 2 years ago. (Which you might need to, if a customer
environment is still running the old version and you need to debug it.)
IOW, the network is inherently unreliable. Some form of local caching /
cache revision control is required.
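One possible shape for such a cache, sketched in Python under the assumption that every fetched version is kept locally and that each release commits a lock record naming the exact versions it used. `VersionedCache` and `old_lock` are hypothetical names for illustration only.

```python
class VersionedCache:
    """Local archive keyed by (name, version); entries are never evicted."""

    def __init__(self):
        self._blobs = {}

    def store(self, name, version, blob):
        # setdefault keeps the first copy, so an archived version is immutable.
        self._blobs.setdefault((name, version), blob)

    def load(self, name, version):
        return self._blobs[(name, version)]


cache = VersionedCache()
cache.store("X", "1.0", b"X v1.0")

# Lock record committed alongside the old release (hypothetical format).
old_lock = {"X": "1.0"}

# Later, upstream moves on to v2.0 and stops serving v1.0.
cache.store("X", "2.0", b"X v2.0")

# Reproducing the old build needs only the lock record and the local cache.
old_inputs = {name: cache.load(name, version) for name, version in old_lock.items()}
```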
[...]
> Then, in a DevOps world, there is deployment, which is usually a
> dependency management task. Is a totally new tool doing ADG
> manipulation really needed for this?
My answer is: the ADG/DAG manipulation should be a *library*, a reusable
component that can be integrated into diverse systems that require it.
That multiple systems implement functionality X is not necessarily a
valid reason to argue for merging said systems into a single monolithic
monster. Rather, what it *does* suggest is to factor out functionality
X so that it can be reused across said systems.
[...]
> Merging ideas from Dub, Gradle, and Reggae, into a project management
> tool for D (with C) projects is relatively straightforward to plan,
> albeit really quite a complicated project. Creating the core ADG
> processing is the first requirement. It has to deal with external
> dependencies, project build dependencies, and deployment dependencies.
Your last sentence already shows that such a project is ill-advised,
because while all of them in an abstract sense reduce to nothing but DAG
manipulation, that is not an argument for integrating all systems that
happen to use DAGs as a core algorithm into a single monolithic system.
Rather, it's an indication that DAG manipulation code ought to be a
common library that's reused across systems that require such
functionality, i.e., external dependencies, build dependencies, and
deployment dependencies.
It's really very simple. If your code has function X and function Y,
and X and Y have a lot of code in common, it does not mean you should
write function Z that can perform the role of both X and Y. Rather, it
means you should factor out the common parts into function W, and reuse
W from X and Y. (Alas, the former is seen all too often in large
"enterprise" software, where functions start out being straightforward
with a clean API, and end up being a monstrous chimera with 50
non-orthogonal, sometimes mutually-contradictory parameters, that can
nevertheless do everything you want -- if you can only figure out what
exactly each parameter means and which subset of parameters are actually
relevant to what you want.)
Similarly, if you have systems P, Q, and R, and they all have DAG
manipulation as a common functionality, that is an argument for
factoring out said DAG manipulation as a reusable component. It is not
an argument for making a new system S that includes everything that P,
Q, and R can do. (Unless S can also provide new functionality that P, Q,
and R could not have achieved without such integration.)
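The factoring argued for above can be sketched in Python, assuming Python 3.9+ so that the standard `graphlib` module is available. Here `ordered` plays the role of the shared component W, while `build_plan` and `deploy_plan` are hypothetical stand-ins for two otherwise separate systems that both reuse it.

```python
from graphlib import TopologicalSorter


def ordered(dependencies):
    """Shared DAG component: return nodes with every dependency listed first.

    `dependencies` maps each node to the set of nodes it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def build_plan(sources):
    """Build system: knows about compiling, not about graph algorithms."""
    return [f"compile {node}" for node in ordered(sources)]


def deploy_plan(services):
    """Deployment system: a separate tool that reuses the same DAG code."""
    return [f"start {node}" for node in ordered(services)]


build = build_plan({"app": {"lib"}, "lib": {"core"}, "core": set()})
deploy = deploy_plan({"web": {"db"}, "db": set()})
```

Neither client needs to know how the ordering is computed, and neither has to carry the other's options around; that is the difference between sharing W and merging everything into one system.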
[...]
> (*) The O(N) vs. O(n), SCons vs. Tup thing that T raised in another
> thread is important, but actually it is an implementation thing of how
> do you detect change, it isn't an algorithmic issue at a system design
> level. But it is important.
The O(N) vs. O(n) issue is actually very important once you generalize
beyond the specifics of build dependencies, esp. if you start talking
about network-dependent DAGs. If a task has a DAG that depends on, say,
100 network nodes, then I absolutely do NOT want the dependency
resolution tool to be querying all 100 nodes every time I ask for a
refresh. That's just ridiculously inefficient. Rather, the tool should
subscribe to updates from the network servers so that they inform it
when their part of the DAG changes. IOW, the amount of network traffic
should be proportional to the number of *changes* in the remote nodes,
NOT the *total* number of nodes.
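To make the traffic argument concrete, here is a small in-process Python simulation of the subscription model. Nothing here talks to a real network; `RemoteNode`, `Resolver`, and the message counter are hypothetical, and the one-time initial sync is assumed to have already happened.

```python
class RemoteNode:
    """Stand-in for a network server that owns one part of the DAG."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, version):
        """The server pushes a notification only when something changes."""
        self.version = version
        for callback in self._subscribers:
            callback(self.name, version)


class Resolver:
    """Keeps a local view of all nodes and counts incoming update messages."""

    def __init__(self, nodes):
        self.known = {node.name: node.version for node in nodes}  # initial sync
        self.messages = 0
        for node in nodes:
            node.subscribe(self._on_change)

    def _on_change(self, name, version):
        self.messages += 1
        self.known[name] = version


nodes = [RemoteNode(f"node{i}", "1.0") for i in range(100)]
resolver = Resolver(nodes)

# Only two of the 100 remote nodes change.
nodes[7].publish("1.1")
nodes[42].publish("2.0")
```

After the two `publish` calls, `resolver.messages` is 2 rather than 100: the traffic tracks the number of changes, not the number of nodes.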
Similarly, for deployment management, if my project has 100 installation
targets (remote customer machines), each of which has 1000 entities
(let's say files, like data files and executables), then I really do NOT
want to have to scan all 1000 entities on all 100 installation targets,
just to decide that only 50 files on 2 installation targets have
changed. I should be able to push out only the files that have changed,
and not everything else. IOW, the size of the update should be
proportional to the size of the change, NOT the total size of the entire
deployment. Otherwise it is simply not scalable and will quickly become
impractical as project sizes grow.
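One way to get updates proportional to the change is an append-only change journal, sketched below in Python. It assumes every change to a deployed file is recorded in the journal by whatever produces the change, and that each target remembers the journal position it was last brought up to. `ChangeJournal` and the target names are hypothetical.

```python
class ChangeJournal:
    """Append-only record of changed paths."""

    def __init__(self):
        self._entries = []

    def record(self, path):
        self._entries.append(path)

    def head(self):
        """Current journal position (number of entries recorded so far)."""
        return len(self._entries)

    def changed_since(self, position):
        """Paths changed after `position`.

        The cost depends on the number of journal entries after `position`,
        not on the total number of deployed files.
        """
        return sorted(set(self._entries[position:]))


journal = ChangeJournal()
targets = {"customer-a": 0, "customer-b": 0}  # target -> position already deployed

journal.record("bin/app")
journal.record("data/a.dat")
targets["customer-a"] = journal.head()  # customer-a is now up to date

journal.record("data/a.dat")
journal.record("data/b.dat")

# What each target still needs, computed without scanning any target.
pending = {name: journal.changed_since(pos) for name, pos in targets.items()}
```

The obvious limitation is that changes made directly on a target machine bypass the journal; detecting those would need a separate mechanism.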
If such considerations are not integrated into the system design at the
top level, you can be sure that there will be inherent design flaws that
preclude efficient implementation later on. IOW, DAG updates must be
proportional to the size of the DAG change. Nowhere must there be any
algorithm that requires scanning the entire DAG (unless the changeset
covers the entire DAG).
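A minimal Python sketch of an update that visits only the affected part of the DAG, assuming the tool maintains reverse edges (a map from each node to the nodes that depend on it). The function name `affected` and the file names are hypothetical.

```python
from collections import deque


def affected(dependents, changed):
    """Return the changed nodes plus everything that transitively depends on them.

    `dependents` maps a node to the nodes that depend on it. Only changed nodes
    and their transitive dependents are visited; the rest of the DAG is never
    touched.
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for user in dependents.get(node, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


dependents = {
    "util.d": ["util.o"],
    "util.o": ["app"],
    "main.d": ["main.o"],
    "main.o": ["app"],
    "docs.md": ["docs.html"],
}

dirty = affected(dependents, ["util.d"])
```

Here a change to `util.d` marks `util.o` and `app` as needing work, while `main.d`, `main.o`, and the documentation nodes are never examined.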
T
--
Guns don't kill people. Bullets do.