Disclaimer: This article’s structure was supported by insights from OpenAI’s ChatGPT, while the content was curated by the human author. Any potential inaccuracies or errors within the article are solely the responsibility of the author.
The first week of the Oxford Summer School’s Advanced Deep Learning course was an engaging and insightful journey into the rich landscape of artificial intelligence (AI). The unique cadre of attendees further amplified the learning experience. Among the participants were 15 brilliant computer scientists from Saudi Arabia, selected from a competitive pool of 8,000 applicants, who added a fresh dynamic to our learning environment.
The course also included the participation of accomplished Ph.D. holders from various corners of Europe, each deeply involved in pioneering research within diverse AI domains. These erudite individuals brought their considerable knowledge and varied perspectives to our discussions, resulting in a learning journey that was as multifaceted as it was enlightening.
Each day was meticulously planned, balancing theory with practice. Mornings were dedicated to unraveling the intricate mathematics powering today’s most advanced AI technologies, dissecting complex algorithms, and comprehending the theoretical underpinnings of neural networks. These academic sessions set the stage for our afternoons, where we turned theoretical knowledge into practical applications.
Utilizing Python, one of the most widely used programming languages in the AI field, we applied our newfound understanding to the development of tangible deep learning models. This was where I introduced an innovative approach for one of our assignments, leading to the creation of two sophisticated models. While many chose to use Autoencoders for image colorization, a common method in the field, I followed an intuition for a different route, leveraging Conditional Generative Adversarial Networks (Conditional GANs), an approach that ended up outperforming the other solutions submitted.
The second model I developed truly set a new standard in the course. Utilizing a Diffusion Model, I crafted a system capable of generating fresh and distinctive faces for characters from the iconic TV show, The Simpsons.
The uniqueness of this model was twofold. Firstly, its structure was distinct, employing a sophisticated interplay between an image tensor and a time parameter tensor. This introduced a novel dynamic element to the image generation process. Secondly, I utilized a specific set of operations and arrangements in the model architecture – from a well-planned contracting path (encoder) and expanding path (decoder) to a well-crafted bottleneck section – which made it capable of learning and generating intricate details with remarkable accuracy.
The technique I used in this model was novel, and it led to performance that outshone the alternative solutions. This model underscored how the right combination of theoretical understanding, technical skill, and creative application can push the boundaries of what AI can achieve.
1st Model: Image Colorization using Conditional GANs
The first model I developed harnesses the power of Conditional Generative Adversarial Networks (Conditional GANs) to transform black-and-white images into colored ones. The mechanics of this model draw from the foundations of GANs, which consist of two neural networks engaged in a continuous competition: the generator, which endeavors to create realistic outputs, and the discriminator, which strives to distinguish the generated outputs from actual data.
In the context of our image colorization model, the generator takes a black-and-white image and attempts to colorize it, aiming to make the result indistinguishable from genuine colored images. Meanwhile, the discriminator’s objective is to discern these AI-colored images from real colored images.
The term “conditional” comes into play because this GAN is given specific information – the black-and-white image – that guides its generation process. As a result, the model produces vibrant, colored images that closely align with reality, effectively replicating the results a human might achieve through manual colorization. Applications for this technology extend from historical photo restoration to modern media production, offering the potential to vividly recolor our past and enrich our visual experience.
The GAN is trained on the CIFAR-10 dataset: grayscale versions of the images are fed into the generator, which produces colorized images. These colorized images and the original color images are both given to the discriminator, which then tries to tell them apart.
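To make this concrete, here is a minimal sketch of how such a generator/discriminator pair could be built in Keras. The layer counts, widths, and names are illustrative assumptions rather than the exact architecture used in the assignment.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 32  # CIFAR-10 images are 32x32 pixels

def build_generator():
    """Map a 1-channel grayscale image to a 3-channel color image."""
    inputs = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="generator")

def build_discriminator():
    """Classify a color image as real (1) or generated (0).
    A stricter conditional setup would also feed in the grayscale condition."""
    inputs = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="discriminator")
```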
This is how the script flows:
- The script starts by importing the required libraries and modules.
- Constants are defined for parameters such as batch size, image dimensions, learning rate, and other configurations.
- Loss functions (Mean Squared Error and Binary Cross Entropy) and optimizers (Adam) are initialized for both generator and discriminator.
- The CIFAR10 dataset is downloaded, processed, and saved into a designated directory.
- Images are then loaded, converted to grayscale, resized, and normalized.
- These images are split into training and testing datasets.
- Generator and discriminator models are created using Keras. The generator model takes grayscale images as input and generates color images, while the discriminator takes both the original and generated images and tries to distinguish between them.
- The loss functions for both models are defined. The generator model uses the Mean Squared Error to calculate the loss between the generated image and the original image. The discriminator model uses Binary Cross Entropy to calculate the loss between its predictions and the real labels (real or fake).
- The training process is implemented using a custom training step where gradient descent is performed explicitly (a minimal sketch of such a step is shown after this list). In each training step, the generator generates color images from grayscale images, and these color images are fed into the discriminator along with the original images.
- The training process continues for a certain number of epochs, with the generator and discriminator being updated after each batch of images. The training process can also stop early if the discriminator’s loss does not improve for a certain number of epochs.
- After the training process, the script plots the losses of the generator and discriminator models to illustrate the training process.
- Finally, the script generates some color images from the test grayscale images and plots them alongside the original grayscale images and the ground truth color images to visualize the results.
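To illustrate the custom training step mentioned above, here is a rough sketch of how the gradients for both networks could be computed explicitly with `tf.GradientTape`. The exact loss combination and optimizer settings in the original script may differ; treating the generator loss as reconstruction (MSE) plus an adversarial term is an assumption on my part.

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy()
gen_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(generator, discriminator, gray_batch, color_batch):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_color = generator(gray_batch, training=True)

        real_pred = discriminator(color_batch, training=True)
        fake_pred = discriminator(fake_color, training=True)

        # Generator: reconstruct the true colors and try to fool the discriminator.
        gen_loss = mse(color_batch, fake_color) + bce(tf.ones_like(fake_pred), fake_pred)
        # Discriminator: label real images as 1 and generated images as 0.
        disc_loss = bce(tf.ones_like(real_pred), real_pred) + \
                    bce(tf.zeros_like(fake_pred), fake_pred)

    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    gen_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    disc_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    return gen_loss, disc_loss
```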
The complexities of diving into a new domain, such as deep learning, can sometimes result in overlooking crucial aspects of model development and evaluation. In this instance, due to the limited time available for learning from scratch and being new to Keras, TensorFlow, and deep learning as a whole, an analysis of the validation loss was omitted.
This pivotal metric, vital for guarding against overfitting, was unfortunately not included in the initial evaluation. While this learning curve is part of the process, it’s an essential reminder that, despite time constraints and new learning environments, thorough evaluation methods should always be incorporated to ensure an accurate representation of a model’s performance.
Model results:
2nd Model: Simpsons Face Generation using a Diffusion Model
Now, let’s dive into the fun part of this project: creating Springfield’s most beloved characters with artificial intelligence. This second model, titled “Simpsons Face Generation using Diffusion Models”, is an application of diffusion models to generating new faces from the Simpsons universe.
The first thing we do is gather the data. We need to have many examples of Simpsons faces for our model to learn from. These images are then resized to a uniform size of 32×32 pixels and are also converted to a numerical format, normalized, and split into training and testing datasets.
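As a rough sketch of this preprocessing stage (the directory path, file extensions, and helper name are hypothetical; the normalization shown is one common way of centering the data and may not match the original script exactly):

```python
import glob
import numpy as np
from PIL import Image

IMG_SIZE = 32

def load_simpsons_faces(data_dir, split_ratio=0.9):
    """Load face images, resize to 32x32 RGB, normalize, and split into train/test."""
    paths = sorted(glob.glob(f"{data_dir}/*.png")) + sorted(glob.glob(f"{data_dir}/*.jpg"))
    images = []
    for path in paths:
        img = Image.open(path).convert("RGB").resize((IMG_SIZE, IMG_SIZE))
        images.append(np.asarray(img, dtype=np.float32))
    data = np.stack(images)
    data = (data / 127.5) - 1.0              # scale pixel values to [-1, 1]
    split = int(split_ratio * len(data))     # simple train/test split
    return data[:split], data[split:]

# Hypothetical usage; the actual dataset directory will differ.
# train_imgs, test_imgs = load_simpsons_faces("simpsons_faces")
```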
In order to understand the process, let’s first take a step back and understand what a diffusion model does. At a high level, the forward diffusion process gradually corrupts an image with noise over a series of timesteps until nothing but noise remains. Our goal is to learn to reverse this process: starting from a random noise image and gradually refining it until we get our final generated Simpsons face.
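Once such a model has been trained, generating a face is simply a matter of running this refinement loop. The sketch below assumes a hypothetical Keras model that takes the current image plus a normalized timestep and returns a slightly cleaner image; the timestep encoding and number of steps are assumptions.

```python
import numpy as np

def generate_face(model, timesteps=16, img_size=32):
    """Start from pure noise and let the model refine it step by step."""
    img = np.random.normal(size=(1, img_size, img_size, 3)).astype("float32")
    for step in range(timesteps):
        # Normalized timestep; here larger values mean "closer to a clean image".
        t = np.full((1, 1), step / timesteps, dtype="float32")
        img = model.predict([img, t], verbose=0)  # one refinement step
    return img[0]
```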
The model I have designed for this task is quite sophisticated. It first applies a series of processing blocks to the image, adding layers of complexity at each step. These blocks are essentially Convolutional Neural Networks (CNNs) that learn how to transform the image at each step. We go through multiple such steps, gradually decreasing the size of the image. Once we reach the smallest size, we then build it back up, adding in more and more detail at each step.
In each training iteration, our model takes in a noisy version of an image and a particular timestep. The model’s task is to predict the state of the image at the next timestep, gradually learning to produce less and less noisy images as the timesteps increase.
This training process is repeated for multiple epochs, with a decreasing learning rate for each new cycle. This is a common trick in machine learning that helps the model to make big improvements in the beginning and then fine-tune its predictions in the later stages.
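The sketch below illustrates one such training iteration together with the learning-rate decay between cycles. The noise schedule, the convention that larger timesteps carry less noise, and the decay factor are illustrative assumptions, not the exact values from the original script.

```python
import numpy as np
import tensorflow as tf

mae = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def noisy_pair(images, step, timesteps=16):
    """Blend clean images with a single noise sample at steps `step` and `step + 1`.
    Convention (an assumption): larger steps carry less noise."""
    noise = np.random.normal(size=images.shape).astype("float32")
    def blend(s):
        alpha = s / timesteps                      # fraction of the clean image kept
        return alpha * images + (1.0 - alpha) * noise
    return blend(step), blend(step + 1)

def train_one_iteration(model, images, timesteps=16):
    step = np.random.randint(0, timesteps)         # random point on the noise trajectory
    noisier, cleaner = noisy_pair(images, step, timesteps)
    t_batch = np.full((len(images), 1), step / timesteps, dtype="float32")
    with tf.GradientTape() as tape:
        pred = model([noisier, t_batch], training=True)
        loss = mae(cleaner, pred)                  # Mean Absolute Error on the cleaner target
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)

# Between training cycles the learning rate can be decayed, for example:
# optimizer.learning_rate = 0.5 * float(optimizer.learning_rate)
```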
The beauty of this model lies in its ability to generate a variety of Simpson faces, each unique and novel. We can see this when we visualize the output of the model. With each timestep, the model gets better and better at generating realistic faces, starting from pure noise.
This is how the script flows:
- The script begins with the importation of essential libraries and modules like TensorFlow, Keras, Matplotlib, and PIL. These are crucial for handling data, creating and manipulating models, and visualizing results.
- Constants such as the image size (IMG_SIZE), batch size (BATCH_SIZE), learning rate (LEARNING_RATE), number of timesteps, and training cycles (TRAINING_CYCLES) are established. These are used to define various parameters for training and for the model itself.
- The environment is prepared for TensorFlow to correctly utilize GPU memory with the ‘set_memory_growth’ function.
- The script then loads and processes the Simpsons faces image data. Images are resized to 32×32 pixels and converted to the RGB format. All images are then normalized by centering the data.
- A method for visualizing samples of images is provided with the function ‘visualize_samples(img_batch)’. This function displays a batch of 25 images in a 5×5 grid.
- Noise is introduced to the image data over a series of time steps using the ‘add_noise_over_time(img_input, timestep)’ function. This simulates the diffusion trajectory between a clear image and a noisy one over the specified timesteps; the model is then trained to move along this trajectory from noisy toward clear.
- The ‘processing_block’ function is introduced, which applies a series of convolutional layers, activations, and layer normalization to both the image and time parameter tensors.
- The main architecture of the diffusion model is outlined in the ‘create_model’ function. The architecture features two main processing paths: a downward path which applies processing blocks and pooling operations, and an upward path which uses upsampling operations. An additional Multi-Layer Perceptron (MLP) is situated between the two paths (a simplified sketch of this structure appears after this list).
- An Adam optimizer is created with the specified learning rate, and a Mean Absolute Error loss function is defined.
- A ‘train_one_iteration’ function is defined, which uses the model to train on different states of the image (generated with noise) and calculate the loss for each training iteration.
- The main training process is encapsulated within the ‘execute_training’ function. The function iterates over a number of epochs, each epoch consisting of a specified number of steps. For each step, a random batch of images is selected, and ‘train_one_iteration’ is called to train the model and calculate the loss.
- Two prediction visualizing functions are defined: ‘perform_prediction’ which visualizes the final prediction after all time steps, and ‘perform_step_prediction’ which visualizes each stored prediction sequentially.
- Finally, the ‘main_procedure’ is defined, which encapsulates the entire model training and prediction visualization process. It performs multiple training epochs, with learning rate decay after each epoch, and visualizes predictions after each training cycle.
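To give a feel for what the ‘processing_block’ and ‘create_model’ functions might look like, here is a heavily simplified sketch of a small U-Net-style model that takes an image tensor and a time parameter tensor. Layer widths, the use of skip connections, and the bottleneck size are assumptions rather than the original architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 32

def processing_block(x, t_in, filters):
    """Convolve the image tensor and mix in a broadcast projection of the time input."""
    t = layers.Dense(filters)(t_in)
    t = layers.Reshape((1, 1, filters))(t)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = x + t                                        # inject time information
    x = layers.LayerNormalization()(x)
    return x

def create_model():
    img_in = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
    t_in = layers.Input(shape=(1,))

    # Contracting path (encoder)
    d1 = processing_block(img_in, t_in, 32)          # 32x32
    p1 = layers.MaxPooling2D()(d1)                   # 16x16
    d2 = processing_block(p1, t_in, 64)              # 16x16
    p2 = layers.MaxPooling2D()(d2)                   # 8x8

    # Bottleneck: a small MLP over the flattened features
    b = layers.Flatten()(p2)
    b = layers.Dense(8 * 8 * 64, activation="relu")(b)
    b = layers.Reshape((8, 8, 64))(b)

    # Expanding path (decoder), with skip connections as in a typical U-Net
    u2 = layers.UpSampling2D()(b)                    # 16x16
    u2 = layers.Concatenate()([u2, d2])
    u2 = processing_block(u2, t_in, 64)
    u1 = layers.UpSampling2D()(u2)                   # 32x32
    u1 = layers.Concatenate()([u1, d1])
    u1 = processing_block(u1, t_in, 32)

    out = layers.Conv2D(3, 1, padding="same")(u1)    # predicted (less noisy) image
    return tf.keras.Model([img_in, t_in], out)
```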
This script provides an excellent example of a diffusion model application, which generates Simpsons characters’ faces. It demonstrates the use of concepts such as noise introduction, convolutional neural networks, and the use of timesteps in generating predictions.
Model results:
The study that employed the diffusion model faced a unique challenge: a severe time constraint. The diffusion model concept was only introduced on a Thursday afternoon, while the deadline for the experiment was the next morning, Friday, at 10 AM. This limited window of less than 12 hours, and in reality only a few hours of actual productive time, severely restricted the opportunities to extensively test and train the model.
Given the complexity and novelty of the diffusion model, fully understanding and implementing the concept requires considerable time. This issue is compounded when considering the iterative process of model training, testing, fine-tuning, and retesting, which is common in machine learning. Therefore, a few hours of available time is less than optimal for conducting a thorough exploration and execution of the model.
Given the circumstances, it is noteworthy that the model was able to produce recognizable images despite the severe time restriction. This again underscores the potential of the diffusion model approach, provided that it’s given adequate time and resources for training and refinement.
Conclusions
In the face of AI’s ever-evolving horizon, these initial explorations underscore a promising trajectory. As intriguing as they have been insightful, these projects exemplify the vast capabilities of deep learning. There’s a unique magic in harnessing these advanced technologies to recreate colorful realities from grayscale pasts or generate familiar faces from a beloved TV show.
As I turn the page towards the upcoming weeks at the Oxford Summer School, anticipation ripples through me. The horizon is painted with the prospect of diving deeper into the dynamic domain of AI, charting new territories, and pushing the boundaries of what we can achieve. Stay tuned, as the best is yet to come.