Hello everyone!
Now that people are working on porting my renderer over to the Babylon.js engine, I was thinking that this would be a good time and place to give a general overview of how ray/path tracing works inside the WebGL2 system and the browser. That way, all who are interested, but who might not be able to contribute to the source, can still benefit from the details and approaches that I took.
Warning: sorry, this will be an epic-length, multi-part post! I hope everyone is ok with that! 
Part 1 - Background
I’ll begin with some of the differences between traditional rendering and ray tracing. Traditional 3D polygon rendering makes up probably 99.9% of real-time games and graphics, no matter what system or platform you’re on. So whether you’re using DirectX, OpenGL, WebGL2, a software rasterizer, etc., you will probably do it / see it done this way. In a nutshell, the vertices of 3D triangles are created on the host side (JavaScript, C++, etc.) and sent to the graphics card, where a vertex shader processes them. The graphics card (from here on referred to as ‘GPU’) takes the input triangle vertex info and usually performs some kind of matrix transformation (perspective camera, etc.) on it to project the 3D vertices to 2D positions on your screen. The flattened triangles are then rasterized (filled in or drawn), with the fragment shader coloring every pixel that falls inside them.
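To make that pipeline a little more concrete, here is a bare-bones sketch of those two WebGL2 (GLSL ES 3.00) shader stages. This is not code from my renderer or any particular engine; the attribute/uniform names (aPosition, uModelViewMatrix, uProjectionMatrix) are just placeholders:

```glsl
#version 300 es
// ---- vertex shader: runs once per 3D triangle vertex sent from the host ----
in vec3 aPosition;               // triangle vertex supplied by JavaScript/C++
uniform mat4 uModelViewMatrix;   // object placement relative to the camera
uniform mat4 uProjectionMatrix;  // perspective camera projection

void main() {
    // the matrix transformation: project the 3D vertex to 2D screen (clip) space
    gl_Position = uProjectionMatrix * uModelViewMatrix * vec4(aPosition, 1.0);
}
```

```glsl
#version 300 es
// ---- fragment shader: runs once per pixel covered by a flattened triangle ----
precision highp float;
out vec4 fragColor;

void main() {
    // fill in (color) this pixel; real engines do lighting/texturing here
    fragColor = vec4(1.0, 0.0, 0.0, 1.0); // flat red, just to show the idea
}
```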
The major benefit of doing it this way, for several decades now, is that as GPUs became more widespread, they were built for this very technique and are now mind-numbingly fast at it - rendering millions of triangles on the screen at 60 fps!
One of the major drawbacks to this seemingly no-brainer fast approach is that in the process of projecting the 3D scene onto the flat screen as small triangles, you lose global scene information that might be just outside the camera’s view. Most importantly, you lose lighting and shadow info from surrounding triangles that didn’t make the cut to get projected to the user’s final view, and there is no way of retrieving that info once the GPU is done with its work. You would have to do an exhaustive per-triangle look-up of the entire scene, which would make the game/graphics screech to a halt. Therefore, many techniques have been employed and much research has been done to make it seem to the user that none of the global illumination and global scene info has been lost. Shadow maps, screen-space reflections, screen-space ambient occlusion, reflection probes, light maps, spherical harmonics, cube maps, etc. have all been used to try to capture the global scene info that is lost if you choose to render this way.

Keep in mind that all these techniques are ultimately ‘fake’ and are sort of ‘hacks’ that are subject to error and require a lot of tweaking to look good for even one scene/game - then everything might not look right for the next title, and the process begins again. Optical phenomena such as true global reflections, refractions through water and glass, caustics, and transparency become very difficult or even impossible with this approach. BTW, this is known as a scene-centric or geometry-centric approach. In other words, vertex info and scene geometry take the front seat, while global illumination and per-pixel effects take a back seat.
Now another completely different rendering approach has been around the CG world for nearly as long - ray tracing. If the rasterizing approach was geometry-centric, then this ray tracing approach can be classified as pixel-centric. In this scenario on a traditional CPU, the renderer looks at 1 pixel at a time and deals with that pixel and only that pixel; when it is done, it moves over to the next pixel, and the next, etc. When it is done with the bottom-right pixel on your screen, the rendering is complete. Taking the first pixel, rather than giving it a vertex to possibly fill in or rasterize, the renderer asks the pixel, “What color are you supposed to be?”. So in this pixel-centric approach, the pixel color/lighting takes the front seat, and the scene geometry is only queried once it is required. The pixel answers, “Well, let me shoot out a straight ray originating from the camera through me (that pixel of the screen viewing plane) out into the world and I’ll let you know!”

Let’s say the ray runs into a bright light or the sun - great, a bright white or yellowish-white color is recorded and we’re done! Well, with that one pixel. Say the next pixel’s view ray hits a red ball. Now we have a choice: we can either record full red and move on to the next pixel (and thus end up with a boring, unlit scene), or we can keep querying for more info: “Now that we’re on the ball, can you see the sun/light source from this part of the red ball?” If so, we’ll make you bright red and move on; if your line of sight to the light is blocked, we’ll make you a darker red. How do we find this answer? We must spawn another ray out of the red ball’s surface, aim it at the light, and again see what the ray hits. If we turn the red ball into a mirror ball, a ray must be sent out according to the simple optical reflection formula, and we then see what the reflected ray hits. If the ball were instead made of glass, at least 2 more rays would need to be sent: one moving into the glass sphere, and one emerging from the back of the sphere out into the world behind it, whatever might be there. Although more complex, there are well-known optical refraction formulas to aid us.

But the main point is that with this rendering approach, the pixels are asked what they see along their rays, which might involve a complex path of many rays for one pixel, until either a ray escapes the scene, hits a light source, or a maximum number of reflections is reached (think of a room of mirrors - it would crash any computer if not stopped at some point). From the very first ray that comes from the camera, it has to query the entire scene to find out what it runs into and ‘sees’. Then, if it must continue on a reflected/refracted/shadow-query path, it must at each juncture query the entire scene again! No hacks or assumptions can be made. The whole global scene info must be made available at every single turn in the ray’s life.
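To make that ray’s life a little more concrete, here is a rough GLSL-style sketch of the bounce loop just described. To be clear, this is not my renderer’s actual code: SceneIntersect(), the Hit struct, uLightPosition, and the material IDs (LIGHT, DIFFUSE, MIRROR, GLASS) are all made-up placeholders; only reflect() and refract() are real GLSL built-ins.

```glsl
#define MAX_BOUNCES 4  // stop the ray eventually (think: room full of mirrors)

// SceneIntersect() is a hypothetical helper that queries the WHOLE scene and
// returns the nearest hit; Hit is assumed to contain found, position, normal,
// color, emission, and material fields.

vec3 TraceRay(vec3 rayOrigin, vec3 rayDirection) {
    vec3 accumulatedColor = vec3(0.0);
    vec3 throughput = vec3(1.0); // how much color survives each bounce so far

    for (int bounce = 0; bounce < MAX_BOUNCES; bounce++) {
        Hit hit = SceneIntersect(rayOrigin, rayDirection); // query the entire scene

        if (!hit.found)             // ray escaped the scene: nothing more to see
            return accumulatedColor;

        if (hit.material == LIGHT)  // ray reached a light source: we're done
            return accumulatedColor + throughput * hit.emission;

        if (hit.material == DIFFUSE) {
            // shadow query: spawn a ray from the surface, aimed at the light
            vec3 toLight = normalize(uLightPosition - hit.position);
            Hit shadowHit = SceneIntersect(hit.position + hit.normal * 0.001, toLight);
            float lightAmount = (shadowHit.found && shadowHit.material == LIGHT)
                ? max(dot(hit.normal, toLight), 0.0) // light visible: bright red
                : 0.2;                               // blocked: darker red
            return accumulatedColor + throughput * hit.color * lightAmount;
        }

        if (hit.material == MIRROR) {
            // simple optical reflection formula (built into GLSL)
            rayDirection = reflect(rayDirection, hit.normal);
            rayOrigin    = hit.position + hit.normal * 0.001;
            throughput  *= hit.color;
        }
        else if (hit.material == GLASS) {
            // bend the ray into the glass; a second refraction happens later,
            // when the ray exits the back of the sphere into the world behind it
            rayDirection = refract(rayDirection, hit.normal, 1.0 / 1.5); // air -> glass
            rayOrigin    = hit.position - hit.normal * 0.001;
            throughput  *= hit.color;
        }
    }
    return accumulatedColor; // bounce limit reached
}
```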
As far as we know, this is how light behaves in the real world. A major difference, though, is that everything in the real world happens in reverse: the rays spawn from the light source out into the scene, reflecting/refracting, and only a tiny fraction successfully ‘find’ and enter that first pixel of your camera’s sensor. Simulating it that way is a computational nightmare, because the vast majority of light rays are ‘wasted’ - they never happen to enter that exact location of your camera’s single pixel. So we do it in reverse: loop through every viewing-plane pixel (which is vastly more efficient computationally) and instead send a ray out into the world, hoping that it’ll eventually hit objects and ultimately a light source. Therefore this kind of ray tracing is really ‘backwards’ ray tracing.
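In a WebGL2 fragment shader, that ‘loop over every pixel’ comes for free, because the fragment shader already runs once per screen pixel, in parallel. Here is a sketch of how one of those backwards camera rays could be generated; again, the uniform names (uResolution, uCameraPosition, uCameraMatrix, uFieldOfView) are placeholders, and TraceRay() is the hypothetical function from the sketch above:

```glsl
#version 300 es
precision highp float;

uniform vec2  uResolution;      // e.g. (1920.0, 1080.0)
uniform vec3  uCameraPosition;
uniform mat4  uCameraMatrix;    // camera orientation
uniform float uFieldOfView;     // vertical field of view, in radians

out vec4 fragColor;

void main() {
    // map this pixel to a point on the viewing plane, -1..+1 on both axes
    vec2 uv = (gl_FragCoord.xy / uResolution) * 2.0 - 1.0;
    uv.x *= uResolution.x / uResolution.y; // keep the aspect ratio correct

    // a straight ray starting at the camera, passing through this pixel,
    // heading out into the world (camera looks down its local -Z axis)
    vec3 localDir     = normalize(vec3(uv * tan(uFieldOfView * 0.5), -1.0));
    vec3 rayDirection = normalize((uCameraMatrix * vec4(localDir, 0.0)).xyz);

    // ask the scene what this pixel 'sees' along its ray
    fragColor = vec4(TraceRay(uCameraPosition, rayDirection), 1.0);
}
```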
The benefits of rendering this way are that, since we are simulating the physics of light rays and optics, the images generated can be photo-realistic. Sometimes it is impossible to tell whether an image came from a camera or a ray/path tracer. Since at each juncture of the ray’s path the entire ‘global’ scene must be made available and interacted with, this technique is often referred to as global illumination. All hacks and approximations disappear, and once-difficult or impossible effects such as true reflections/refractions, area-light soft shadows, transparency, etc., become easy, efficient, and nearly automatic. And these effects remain photo-realistic - no error-prone tweaking on a per-scene basis necessary.
This all sounds great, but the biggest drawback to rendering this way is speed. Since we must query each pixel, and info cannot be shared between pixels (it is a truly parallel operation with an individual ray path for each individual pixel), we must wait for every pixel to run through the entire scene, making all of its calculations, just to return a single color for one pixel out of the roughly two million on a 1080p monitor. This is CPU-style rendering, by the way - the GPU hasn’t entered the picture yet. Many famous renderers and movies have been made using this approach, coded in C or C++, and run traditionally on the CPU only.
Then fairly recently, maybe 15 years ago, when programmable shaders for GPUs became a more widespread option, graphics programmers started to use GPUs to do some of the heavy lifting in ray tracing. Since it is an inherently parallel operation, and GPUs are built with parallelism in mind, the speed of rendering has gone way up. Renderings for movies that once took days now take hours or even minutes. And in the cutting-edge graphics world, ray-traced reflections and shadows (and to a lesser extent, path tracing with true global illumination) have even become real time at 30-60 fps!
My inspiration for starting the path tracing renderer came from being fascinated by the older Brigade 1 and Brigade 2 YouTube videos. Search for Sam Lapere’s channel, which has 173 insane historic videos. He and his team (which included Jacco Bikker at one point) were able to make computers from 10-15 years ago approach real-time path tracing. I was also inspired by Kevin Beason’s smallpt: he managed to fit a non-real-time CPU renderer, complete with global illumination, into just 100 lines of C++! I used his simple but effective code as a starting point and moved it to the GPU with the help of WebGL and the browser. In the next part, I’ll get into the details of how I had to set everything up in WebGL (with the help of a JavaScript library) in order to get my first image on the screen inside a browser.
So in conclusion, this was a really long way of saying that when you choose to go down this road, you are going against the grain, because GPUs were meant to rasterize as many triangles as possible in a traditional projection scheme. However, with a lot of help from some brilliant shader coders / ray tracing legends of yesteryear, and a decent acceleration structure, the dream of real-time path tracing can be made a reality. And if you’re willing to work around the speed obstacles, you can have photo-realistic images (and even real-time renderings) on any commodity device with a common GPU and a browser! Part 2 will be coming soon!