What's holding ~100% D GUI back?

Gregor Mückl gregormueckl at gmx.de
Fri Nov 29 13:27:17 UTC 2019


On Friday, 29 November 2019 at 10:08:59 UTC, Ethan wrote:
> On Friday, 29 November 2019 at 02:42:28 UTC, Gregor Mückl wrote:
>> They don't concern themselves with how the contents of these 
>> quads came to be.
>
> Amazing. Every word of what you just said is wrong.
>

I doubt this, but I am open to discussion. Let's try to remain 
civil and calm.

> What, you think stock Win32 widgets are rendered with CPU code 
> with the Aero and later compositors?
>

Win32? Probably still are. WPF and later? No. That has always had 
a DirectX rendering backend. And at least WPF has a reputation of 
being sluggish. I haven't had performance issues with either so 
far, though.

> You're treating custom user CPU rasterisation on pre-defined 
> bounds as the entire rendering paradigm. And you can be assured 
> that your code is reading to- and writing from- a quarantined 
> section of memory that will be later composited by the layout 
> engine.
>
> If you're going to bring up examples, study WPF and UWP. 
> Entirely GPU driven WIMP APIs.
>
> But I guess we still need homework assignments.
>

OK, I'll indulge you in the interest of a civil discussion.

> 1) What is a Z buffer?
>

OK, back to basics. When rendering a 3D scene with opaque 
surfaces, the resulting image only contains the surfaces nearest 
to the camera; the rest is occluded. Solutions like depth sorting 
the triangles and rendering back to front are possible (see e.g. 
the DOOM engine and its BSP traversal for rendering), but they 
have drawbacks. For example, even a set of three triangles may 
mutually overlap in a way that no consistent z ordering of the 
whole primitives exists. You need to split primitives to make 
that work, and you still need to guarantee sorted input.

A z buffer solves that problem by storing, for each pixel, the 
minimum z value encountered so far. When a new primitive is 
drawn over that pixel, the primitive's z value is first compared 
to the stored value, and if it is further away, the fragment is 
discarded.
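
In code, the core of the test is tiny. Here is a minimal C++ 
sketch; the names and the float buffer are my own choices, and 
real hardware uses fixed point formats and does far more:

#include <vector>
#include <limits>

struct DepthBuffer {
    int width, height;
    std::vector<float> z; // one depth value per pixel

    DepthBuffer(int w, int h)
        : width(w), height(h),
          z(w * h, std::numeric_limits<float>::max()) {}

    // A "less" depth test: smaller z means closer to the camera.
    // Returns true if the fragment at (x, y) with depth d is visible.
    bool testAndWrite(int x, int y, float d) {
        float& stored = z[y * width + x];
        if (d >= stored)
            return false; // occluded by something drawn earlier
        stored = d;       // new nearest surface at this pixel
        return true;
    }
};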

Of course, a hardware z buffer can be configured in various other 
interesting ways. E.g. restricting the z value range to half of 
the NDC space, alternating half spaces and simultaneously 
flipping between min and max tests is an old trick to skip 
clearing the z buffer between frames.

There's still more to this topic: transformation of stored z 
values to retain precision on 24 bit integer z buffers, 
hierarchical z buffers, early z testing... I'll just cut it short 
here.

> 2) What is a frustum? What does "orthographic" mean in relation 
> to that?
>

The view frustum is the volume that is mapped to NDC. For 
perspective projection, it's a truncated four-sided pyramid. For 
orthographic projection, it's a cuboid. Fun fact: for correct 
stereo rendering to a flat display, you need asymmetrical 
perspective frustums; doing it with symmetric frustums rotated 
towards the vergence point leads to distortions.
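
To make the stereo remark concrete, here is a small C++ sketch of 
how an off-axis (asymmetric) frustum for one eye can be derived. 
The helper and its parameters are my own; everything is in view 
space, with the physical screen modeled as a plane at distance 
screenDist:

// Frustum bounds on the near plane, as passed to e.g. glFrustum.
struct Frustum { float left, right, bottom, top, nearZ, farZ; };

// eyeOffset: horizontal shift of the eye from the screen center
// (negative for the left eye, positive for the right eye).
// halfWidth/halfHeight: half extents of the physical screen.
Frustum offAxisFrustum(float eyeOffset, float halfWidth,
                       float halfHeight, float screenDist,
                       float nearZ, float farZ) {
    // Project the screen edges, as seen from this eye, onto the
    // near plane. The horizontal bounds come out asymmetric.
    float scale = nearZ / screenDist;
    return {
        (-halfWidth - eyeOffset) * scale,
        ( halfWidth - eyeOffset) * scale,
        -halfHeight * scale,
         halfHeight * scale,
        nearZ, farZ
    };
}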

> 3) Comparing the traditional and Aero+ desktop compositors, 
> which one has the advantage with redraws of any kind? Why?
>

I'm assuming that by traditional you mean a CPU compositor. In 
that case, the GPU compositor has the full image of every top 
level window cached as a texture. All it needs to do is render 
these to the screen as textured quads. This is fast and, in 
simple terms, it can be done in sync with the vertical scanout 
of the image to the screen to avoid tearing. Because the window 
contents are cached, applications don't need to redraw them when 
the z order changes (goodbye, damage events!), and as a side 
effect, moving and repositioning top level windows is smooth.
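
As a rough sketch of that composition pass (all types and the 
draw function are hypothetical stand-ins; a real compositor holds 
GPU texture handles and issues DirectX or OpenGL calls instead):

#include <vector>

struct Texture {};  // cached contents of one top level window
struct Window { Texture contents; float x, y, w, h; };

void drawTexturedQuad(const Texture&, float x, float y,
                      float w, float h) {
    // stub; imagine two textured triangles being emitted here
}

// One composition pass. No application is asked to repaint
// anything; walking the list back to front reproduces the z order.
void composeFrame(const std::vector<Window>& windowsBackToFront) {
    for (const Window& win : windowsBackToFront)
        drawTexturedQuad(win.contents, win.x, win.y, win.w, win.h);
    // present the frame in sync with vertical scanout to avoid tearing
}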

> 4) Why does ImGui's code get so complicated behind the scenes? 
> And what advantage does this present to a programmer who wishes 
> to use the API?
>

One word: batching. I'll briefly describe the Vulkan rendering 
process of ImGui as far as I remember it off the top of my head: 
it creates a single big vertex buffer for all draw operations, 
with a fairly uniform vertex layout regardless of the primitive 
involved. All drawing state that doesn't require pipeline changes 
goes into the vertex buffer (world space coords, UV coords, 
vertex color...). It also keeps track of the pipeline state 
required to draw the current set of primitives. All high-level 
primitives are broken down into triangles, even lines and Bezier 
curves. This trick reduces the number of draw calls later. The 
renderer retains a list of spans in the vertex buffer along with 
their associated pipeline state. Whenever the higher-level 
drawing code does something that requires a state change, the 
current span is terminated and a new one for the new pipeline 
state is started. As far as I remember, the code has only two 
pipelines: one for solid, untextured primitives, and one for 
textured primitives, which is used for text rendering.

In this model, the higher level rendering code can just emit draw 
calls for individual primitives, but these are only recorded and 
not executed immediately. In a second pass, the vertex buffer is 
uploaded in a single transfer and the list of vertex buffer spans 
is processed, switching pipelines, setting descriptors and 
emitting the draw call for the relevant vertex buffer range for 
each span in order.
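
A condensed C++ sketch of that recording scheme (my own 
reconstruction, not ImGui's actual code; the real thing also 
tracks textures, scissor rectangles and an index buffer):

#include <cstdint>
#include <vector>

struct Vertex { float pos[2]; float uv[2]; uint32_t color; };

// The two pipelines mentioned above.
enum class Pipeline { Solid, Textured };

// A contiguous range of the shared vertex buffer that can be
// drawn with a single call.
struct DrawSpan { Pipeline pipeline; uint32_t first, count; };

struct DrawList {
    std::vector<Vertex> vertices; // one big buffer per frame
    std::vector<DrawSpan> spans;

    void addTriangle(const Vertex& a, const Vertex& b,
                     const Vertex& c, Pipeline p) {
        // Open a new span only when the pipeline state changes.
        if (spans.empty() || spans.back().pipeline != p)
            spans.push_back(
                {p, static_cast<uint32_t>(vertices.size()), 0});
        vertices.push_back(a);
        vertices.push_back(b);
        vertices.push_back(c);
        spans.back().count += 3;
    }
};

// Replay (the second pass): upload `vertices` in one transfer,
// then for each span bind its pipeline and descriptors and issue
// one draw call over [first, first + count), in span order.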

The main reason why this works is a fundamental ordering 
guarantee given by the Vulkan API: primitives listed in a vertex 
buffer must be rendered in such a way that the result is as if 
the primitives were processed in the order given in the buffer. 
For example, when primitives overlap, the last one in the buffer 
is the one that covers the overlap region in the resulting image.

> 5) Using a single untextured quad and a pixel shader, how would 
> you rasterise a curve?
>

I'll let Jim Blinn answer that one for you:

https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch25.html

I'd seriously mess up the math if I were to try to explain in 
detail. Bezier curves aren't my strong suit. I'm solving 
rendering equations for a living.
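
The gist of the chapter, for the curious: you interpolate a 
(u, v) pair across the primitive and fill the pixels where an 
implicit function is negative. A CPU rendition of what the pixel 
shader evaluates for a quadratic Bezier (my sketch, following the 
Loop/Blinn coordinate assignment):

// Loop/Blinn assign "texture" coordinates (0,0), (0.5,0) and
// (1,1) to the three control points of a quadratic Bezier. The
// hardware interpolates them across the primitive; the pixel
// shader then only has to test the sign of the implicit function
// u^2 - v and discard fragments on the outside.
bool insideQuadraticBezier(float u, float v) {
    return u * u - v < 0.0f; // the filled side of the curve
}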

> (I've written UI libraries and 3D scene graphs in my career as 
> a console engine programmer, so you're going to want to be 
> *very* thorough if you attempt to answer all these.)
>
> On Friday, 29 November 2019 at 08:45:30 UTC, Gregor Mückl wrote:
>> GPUs are vector processors, typically 16 wide SIMD. The 
>> shaders and compute kernels for them are written from a 
>> single-"threaded" perspective, but this is converted to SIMD 
>> with one "thread" really being a single value in the 16 wide 
>> register. This has all kinds of implications for things like 
>> branching and memory accesses. This forum is not the place to 
>> go into them.
>
> No, please, continue. Let's see exactly how poorly you 
> understand this.
>

Where is this wrong? Have you looked at CUDA or compute shaders? 
I'm honestly willing to listen and learn.

I've talked about GPUs in these terms with other experts (Intel 
and nVidia R&D guys, among others) and this is a common model for 
how GPUs work. So I'm frankly puzzled by your response.
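
To spell out the branching implication with a toy example (plain 
C++ standing in for what the hardware does implicitly; 16 lanes 
model one SIMD register):

#include <array>

// A toy model of one 16-wide SIMD register: one lane per "thread".
constexpr int kWidth = 16;
using Lanes = std::array<float, kWidth>;

// What `y = (x < 0) ? -x : x;` turns into on such hardware: both
// sides of the branch execute for all lanes, and a per-lane mask
// selects the result. Divergent branches therefore cost roughly
// the sum of both paths.
Lanes absViaMasking(const Lanes& x) {
    std::array<bool, kWidth> mask;
    Lanes negated, result;
    for (int i = 0; i < kWidth; ++i)
        mask[i] = x[i] < 0.0f;                     // the "branch"
    for (int i = 0; i < kWidth; ++i)
        negated[i] = -x[i];                        // then-side, all lanes
    for (int i = 0; i < kWidth; ++i)
        result[i] = mask[i] ? negated[i] : x[i];   // per-lane select
    return result;
}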

> On Friday, 29 November 2019 at 09:00:20 UTC, Gregor Mückl wrote:
>> All of these things can be done on GPUs (most of them have 
>> been), but I highly doubt that this would be that much faster. 
>> You need lots of different shaders for these primitives, and 
>> switching state while rendering is expensive.
>
> When did you last use a GPU API? 1999?
>

Last weekend, in fact. I'm bootstrapping a Vulkan/RTX raytracer 
as a pet project. I want to update an OpenGL-based real-time room 
acoustics rendering method that I published a while ago.

> Top-end gaming engines can output near-photorealistic complex 
> scenes at 60FPS. How many state changes do you think they 
> perform in any given scene?
>

As few as possible. They *do* take time, although they have 
become cheaper. Batching by shader is still a thing. Don't take 
my word for it. See the "Pipelines" section here:

https://devblogs.nvidia.com/vulkan-dos-donts/

And that's with an API that puts pipeline state creation up front!

I don't have hard numbers for state changes and draw calls in 
recent games, unfortunately. The only number I remember is 
roughly 2000 draw calls per frame in Ashes of the Singularity. 
While that game shows masses of units, I don't find the graphics 
particularly impressive. There's next to no animation on the 
units; the glitz is mostly decals and particle effects. There's 
also not a lot of screen space post-processing going on. So I 
don't consider it representative.

> It's all dependent on API, driver, and even operating system. 
> The WDDM introduced in Vista made breaking changes with XP, 
> splitting a whole ton of the stuff that would traditionally be 
> costly with a state change out of kernel space code and into 
> user space code. Modern APIs like DirectX 12, Vulkan, Metal etc 
> go one step further and move that responsibility from the 
> driver into user code.
>

OK, this is some interesting information. I've never had to care 
about where user/kernel mode transitions happen in the driver 
stack. I guess I've been lucky that I could file all of that 
under generic driver overhead so far.

Phew, this has become a long reply, and it has taken me a lot of 
time to write. I hope it shows that I generally know what I'm 
writing about. I could point to my history as additional proof, 
but I'd rather let this response stand on its own.

