In this project we're going to be implementing Neural Radiance Fields... aka Nerfs!! At a high level, Nerfs will let us take images of a certain scene/object from many angles as input, then output a model that can take in some view coordinate/direction and output a novel view of our scene/object! This has a ton of uses ranging from creating meshes to rendering a 3d circling gif of an object.
The first part of this project involves training a model to recreate a 2D image given coordinate input. We want our model to take in some coordinate (x, y) then output the associated (r, g, b) values that should show up in the picture. This is a simplified version of the broader Nerf problem and will set up some of the work to be done later! To start with, we'll be working on recreating the following image of a fox:
First, we need to define the Multi-Layer Perceptron (MLP) that we will be using to train our model. We'll be using the following layer structure with 4 linear layers, a ReLU between each one, and a sigmoid at the end to constrain our RGB output to [0, 1]. We'll also be encoding our input coordinates with a positional encoding ~ PE(x). This lets our model overfit to the actual coordinates of our picture and capture very fine, high-frequency detail!
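As a rough sketch, the positional encoding and MLP could look something like this in PyTorch (the 256-unit hidden width and the helper/class names here are my own placeholder choices, not something prescribed):

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    # x: (..., dim) coordinates in [0, 1]; returns (..., dim * (2L + 1)) with the
    # raw coordinates concatenated to sin/cos features at frequencies 2^0..2^(L-1)
    feats = [x]
    for i in range(L):
        feats.append(torch.sin((2.0 ** i) * torch.pi * x))
        feats.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(feats, dim=-1)

class PixelMLP(nn.Module):
    def __init__(self, L=10, hidden=256):
        super().__init__()
        self.L = L
        in_dim = 2 * (2 * L + 1)  # (x, y) after PE
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # rgb constrained to [0, 1]
        )

    def forward(self, xy):
        return self.net(positional_encoding(xy, self.L))
```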
Now, we need to tune some of the hyperparameters for our model! To do this, we're going to train it while varying different values for our learning rate and L (the number of frequency levels in our positional encoding, which controls how much dimensionality it adds). We'll use PSNR (which is based on the mean squared error between two images) as a metric to check how close our reconstructed image is to the original.
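For reference, PSNR can be computed straight from the MSE; here's a tiny version assuming pixel values normalized to [0, 1] (so the peak value is 1):

```python
import torch

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher means the reconstruction
    # is closer to the original image.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```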
Varying these values, I got the following PSNR curves and found that the best hyperparameters were L=35 and lr=5e-3:
Now that we have our final hyperparameters, let's train our final model! After training for 3000 iterations (batching 10k pixel coordinates per iteration) I got the following PSNR curve:
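The training loop itself is nothing fancy; roughly something like this, reusing the sketches above (I'm assuming an Adam optimizer, and the coords/colors tensors are my own placeholders for the precomputed pixel coordinates and ground-truth colors of the image):

```python
import torch

# Assumed precomputed from the image:
#   coords: (H*W, 2) pixel coordinates normalized to [0, 1]
#   colors: (H*W, 3) ground-truth rgb values normalized to [0, 1]
model = PixelMLP(L=35)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)

for it in range(3000):
    idx = torch.randint(0, coords.shape[0], (10_000,))  # batch of 10k pixels
    pred = model(coords[idx])
    loss = torch.mean((pred - colors[idx]) ** 2)         # MSE on rgb values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```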
And here is how the image looked at various iterations in the training process! As you can see, the final result is very detailed and pretty darn close to the original image!
Now let's do reconstruction on an image from the movie Up! Here, I used the same hyperparameters as before (L=35, lr=5e-3) and got the following results:
In this next part of the project, we're going to be implementing the full Nerf architecture to reconstruct a novel 3d view of a lego dump truck! Let's go step by step through how this was done.
First, we need to create a dataloader that takes in the given camera extrinsics data and outputs the origin and direction for rays stemming from each camera in our dataset. To do this, we first have to do a couple of conversions using the C2W matrix (which converts camera coordinates into world coordinates) and the intrinsic matrix K (which converts pixel coordinates into camera coordinates). Here's an overall view of how those matrices fit together when converting between pixel and world coordinates (note that the diagram shows the inverse transformation, going from world coordinates to pixel coordinates):
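In code, going from pixel coordinates (with some depth s along the camera's viewing axis) back to world coordinates could look roughly like this; the batching and function names are my own, and I'm assuming a 3x3 K and a 4x4 C2W matrix straight from the dataset:

```python
import torch

def pixel_to_camera(K, uv, s):
    # uv: (N, 2) pixel coordinates, s: (N,) depths along the camera axis
    # Invert the intrinsic matrix: x_c = s * K^{-1} [u, v, 1]^T
    uv1 = torch.cat([uv, torch.ones_like(uv[:, :1])], dim=-1)  # (N, 3) homogeneous
    return s[:, None] * (torch.linalg.inv(K) @ uv1.T).T        # (N, 3)

def camera_to_world(c2w, x_c):
    # Apply the 4x4 camera-to-world matrix to homogeneous camera coordinates
    x_c1 = torch.cat([x_c, torch.ones_like(x_c[:, :1])], dim=-1)  # (N, 4)
    return (c2w @ x_c1.T).T[:, :3]                                # (N, 3)
```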
Once we are able to convert from pixel coordinates to world coordinates, we need to convert from world coordinates to rays. To do this, we have to get the ray origin (the location of our camera) and the ray direction (where our ray is shooting). To get this done, we use the following formulas:
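Sketching that out with the helpers above: the ray origin is just the camera center in world space (the translation part of C2W), and the ray direction is the normalized vector from that origin to a world-space point unprojected at depth s = 1 through the pixel:

```python
import torch

def pixel_to_ray(K, c2w, uv):
    # Ray origin: the camera center in world coordinates (translation of c2w)
    r_o = c2w[:3, 3].expand(uv.shape[0], 3)
    # Unproject each pixel at depth 1 to get a point along its ray in world space
    x_c = pixel_to_camera(K, uv, torch.ones(uv.shape[0]))
    x_w = camera_to_world(c2w, x_c)
    # Ray direction: normalized vector from the origin to that point
    r_d = (x_w - r_o) / torch.linalg.norm(x_w - r_o, dim=-1, keepdim=True)
    return r_o, r_d
```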
Finally, after getting our rays, we need to sample along them. To do this, I just took evenly spaced points along the length of each ray (with some perturbation during training). Our dataloader returns these rays and the samples along them for training/inference by our model. This is a cool visualization of a sample of 100 rays and the sampled points along them (along with the cameras used):
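Here's roughly what that sampling looks like; the near/far bounds of 2.0 and 6.0 are placeholder assumptions, while the 64 samples per ray matches what gets rendered later:

```python
import torch

def sample_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    # Evenly spaced depths between near and far, shared across all rays
    t = torch.linspace(near, far, n_samples)          # (n_samples,)
    t = t.expand(r_o.shape[0], n_samples).clone()     # (N_rays, n_samples)
    if perturb:
        # Jitter each sample within its bin during training so the model sees
        # a continuous range of depths rather than a fixed grid
        t = t + torch.rand_like(t) * (far - near) / n_samples
    # 3d sample locations: x = r_o + t * r_d
    return r_o[:, None, :] + t[..., None] * r_d[:, None, :]  # (N_rays, n_samples, 3)
```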
Now we need to define our model! We'll be using an MLP like in Part 1 but with a few additions. First, we'll positionally encode both of our inputs (3d coords and ray directions). Our first input will be the 3d coordinate of our current sample along some ray. Partway through the MLP, we will concatenate our original (encoded) coordinates back into the layer output, and later do the same with our second input (the ray direction), so that the network doesn't "forget" them. This is a common technique for very deep models and is a great way to achieve stable results! Beyond these skip connections, our model branches out into two outputs at the end: an rgb output like before, which has a final sigmoid activation to place it between [0, 1], and a density output which will be ReLU'd so that it is non-negative.
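A sketch of that architecture in PyTorch, reusing the positional_encoding helper from Part 1 (the exact depth, 256-unit width, and skip placement here follow the standard Nerf setup and are assumptions on my part):

```python
import torch
import torch.nn as nn

class NerfMLP(nn.Module):
    def __init__(self, L_x=10, L_rd=4, hidden=256):
        super().__init__()
        self.L_x, self.L_rd = L_x, L_rd
        x_dim = 3 * (2 * L_x + 1)   # PE(3d coordinate), raw coords included
        d_dim = 3 * (2 * L_rd + 1)  # PE(ray direction)
        self.stage1 = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Skip connection: re-inject PE(x) partway through so it isn't "forgotten"
        self.stage2 = nn.Sequential(
            nn.Linear(hidden + x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        # rgb branch: concatenate PE(ray direction) before the final layers
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + d_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, rd):
        x_enc = positional_encoding(x, self.L_x)
        d_enc = positional_encoding(rd, self.L_rd)
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = self.density_head(h)                        # density >= 0
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))  # rgb in [0, 1]
        return rgb, sigma
```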
After getting our model output, we need to be able to render what pixel should be shown given the current viewing direction. To do that, we need the density and rgb output from our model at each of the 64 samples on a ray. Then, we use the following tractable equation to solve for the pixel output from the densities and predicted rgb values at those samples along the ray:
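In practice this is the standard discrete volume rendering sum: each sample contributes its rgb weighted by its opacity, 1 - exp(-density * delta), times the transmittance accumulated before it (delta being the distance between adjacent samples; for our uniform spacing it's just (far - near) / 64). A sketch:

```python
import torch

def volume_render(rgb, sigma, deltas):
    # rgb: (N_rays, n_samples, 3), sigma: (N_rays, n_samples, 1)
    # deltas: (N_rays, n_samples) distances between adjacent samples along each ray
    sigma = sigma.squeeze(-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)            # opacity of each segment
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): chance the ray reaches sample i
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                             # (N_rays, n_samples)
    return (weights[..., None] * rgb).sum(dim=-2)       # (N_rays, 3) rendered pixel
```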
We then get our loss by volume rendering the model outputs from all the samples along a ray and comparing the resulting pixel values to our actual image's pixel values with MSE loss.
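Putting the last two pieces together, one loss computation could look like this (the helper names come from the sketches above, and the reshapes are just to push every sample through the MLP in one batch):

```python
import torch

def render_loss(model, samples, ray_dirs, deltas, true_pixels):
    # samples: (N_rays, n_samples, 3) points along each ray
    # ray_dirs: (N_rays, 3) ray directions, broadcast to every sample
    n_rays, n_samples, _ = samples.shape
    rd = ray_dirs[:, None, :].expand(-1, n_samples, -1)
    rgb, sigma = model(samples.reshape(-1, 3), rd.reshape(-1, 3))
    rgb = rgb.reshape(n_rays, n_samples, 3)
    sigma = sigma.reshape(n_rays, n_samples, 1)
    pred = volume_render(rgb, sigma, deltas)           # (N_rays, 3)
    return torch.mean((pred - true_pixels) ** 2)       # MSE against the real pixels
```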
Now, after training our model with lr=5e-4, L_x=10, and L_rd=4 for 10k iterations... we get these results!
Rendering of model output on first validation image over 10k iterations. It gets really good at the end!!
Now, let's use our model on the test set and generate a 3d circling gif around our lego dump truck! It's so crisp let's gooo.
This project was really awesome. I've seen Nerfs in action a lot in recent years, so it was a really fun experience implementing them from scratch and seeing how (at their core) they're pretty simple. There's definitely a lot of auxiliary code, but at the end of the day they boil down to a simple MLP, which is a nice change of pace from the big fancy model architectures of the modern day like transformers.