Modern AI Art Is Inspired by a Physics Principle

By ANIL ANANTHASWAMY

5 January 2023

By learning to reverse the process that, among other things, causes ink to spread through water, diffusion models create stunning images.



DALL·E 2, an image generation system developed by OpenAI, can produce uncanny images of precisely what you ask it for: "a goldfish slurping Coca-Cola on a beach," say. During training, the software would have seen pictures of beaches, goldfish, and Coca-Cola, but it is exceedingly unlikely to have seen one in which all three appeared together. Yet DALL·E 2 can combine the concepts into something that might have pleased Dalí.


DALL·E 2 is an example of a generative model, a system that tries to use training data to generate something new that's comparable to the data in quality and variety. This is one of the hardest tasks in machine learning, and getting to this point has been a long journey.

The first major artificial intelligence models for generating images were neural networks, computer programmes made up of many layers of artificial neurons. But even as image quality improved, the models proved unreliable and hard to train. Meanwhile, a powerful generative model, created by a postdoctoral researcher with a passion for physics, lay dormant, until two graduate students made technical breakthroughs that brought the beast to life.

DALL·E 2 is one such beast. The key insight that makes its images possible, as well as those of its competitors Stable Diffusion and Imagen, comes from physics. The system that underpins them, known as a diffusion model, draws heavily on nonequilibrium thermodynamics, the theory that governs phenomena such as the spread of fluids and gases. "There are a lot of techniques that were initially invented by physicists and are now very important in machine learning," said Yang Song, a machine learning researcher at OpenAI.

The power of these models has shocked both the industry and its users. "This is an exciting time for generative models," said Anima Anandkumar, a computer scientist at the California Institute of Technology and senior director of machine learning research at Nvidia. And while the realistic-looking images created by diffusion models can sometimes perpetuate social and cultural biases, she said, "we have demonstrated that generative models are useful for downstream tasks [that] improve the fairness of predictive AI models."


High Likelihood


To understand how data generation works for images, let's begin with a simple image made of just two adjacent grayscale pixels. We can fully describe this image with two numbers, based on each pixel's shade (from zero for completely black to 255 for completely white). With those two values, we can plot the image as a point in two-dimensional space.

If we plot many images as points, clusters may emerge: certain images, and their corresponding pixel values, occur more frequently than others. Now imagine a surface above the plane, where the height of the surface corresponds to how dense the clusters are. This surface maps out a probability distribution. You're most likely to find individual data points beneath the highest parts of the surface, and least likely to find them beneath the lowest parts.

DALL·E 2 created these images of "a goldfish slurping Coca-Cola on a beach." Even though the OpenAI program had likely never seen anything similar, it could still generate the images on its own.


Now you can use this probability distribution to generate new images. All you need to do is "sample" the distribution, randomly generating new data points while honouring the constraint that more likely data is generated more often. Each new point is a new image.
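To make this concrete, here is a minimal Python sketch of the two-pixel case. The training clusters are made up for illustration, and a simple 2D histogram stands in for the probability "surface":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "training set": two-pixel grayscale images drawn from
# two made-up clusters (a mostly dark one and a mostly bright one).
train = np.concatenate([
    rng.normal([40, 60], 15, size=(500, 2)),    # dark cluster
    rng.normal([200, 180], 20, size=(500, 2)),  # bright cluster
]).clip(0, 255)

# Estimate the probability distribution with a 2D histogram: the bin
# heights play the role of the "surface" above the pixel-value plane.
hist, xedges, yedges = np.histogram2d(
    train[:, 0], train[:, 1], bins=32, range=[[0, 255], [0, 255]]
)
probs = hist.ravel() / hist.sum()

# Sampling: pick bins at random, with likelier bins chosen more often.
idx = rng.choice(probs.size, size=5, p=probs)
rows, cols = np.unravel_index(idx, hist.shape)
new_images = np.stack([xedges[rows], yedges[cols]], axis=1)
print(new_images)  # each row is a brand-new two-pixel "image"
```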

The same approach holds for more realistic grayscale images, say with a million pixels each. Only now, plotting each image requires a million axes rather than two. The probability distribution over such images will be some complicated million-plus-one-dimensional surface. If you sample that distribution, you'll produce a million pixel values. Print them on a sheet of paper, and the image will likely look like a photo from the original data set.

The challenge of generative modelling is to learn this complicated probability distribution for some set of images that constitute the training data. The distribution is useful first because it captures broad information about the data, and second because researchers can combine probability distributions over different types of data (such as text and images) to compose surreal outputs, like a goldfish slurping Coca-Cola on a beach. "You can mix and match different concepts to create entirely new scenarios that were never seen in training data," Anandkumar said.

In 2014, a model called a generative adversarial network (GAN) became the first to produce realistic images. "There was so much excitement," Anandkumar said. But GANs are difficult to train: they may not learn the full probability distribution, and they can get stuck producing images from only a subset of it. For example, a GAN trained on pictures of many different animals might generate only images of dogs.


Machine learning needed a more robust model. Jascha Sohl-Dickstein, whose work was inspired by physics, would provide one.


Fluorescent Excitement


Around the time GANs were invented, Sohl-Dickstein was a postdoc at Stanford University, working on generative models with a side interest in nonequilibrium thermodynamics. This branch of physics studies systems that are not in thermal equilibrium, ones that exchange matter and energy internally and with their environment.


An illustrative example is a drop of blue ink diffusing through a container of water. At first, the ink forms a dark blob in one spot. At this point, if you wanted to calculate the probability of finding an ink molecule in some small volume of the container, you'd need a probability distribution that cleanly models the initial state, before the ink starts spreading. But this distribution is complex and hence hard to sample from.

Eventually, however, the ink diffuses throughout the water, turning it a pale blue. This leads to a much simpler, more uniform probability distribution of molecules, one that can be described with a straightforward mathematical expression. Nonequilibrium thermodynamics describes the probability distribution at each step of the diffusion process. Crucially, every step is reversible: with small enough steps, you can go from a simple distribution back to a complex one.
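As a rough illustration of that forward direction, here is a minimal Python sketch (with made-up numbers of molecules, step sizes, and container walls) of an ink blob random-walking until its sharply peaked distribution relaxes into a near-uniform one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ink drop: 10,000 "molecules" all start near one spot,
# a sharply peaked (complex) distribution inside a unit container.
pos = np.full((10_000, 2), [0.9, 0.5])

# Each time step, every molecule takes a tiny random step, and steps
# that would leave the container are reflected back off the walls.
for _ in range(2_000):
    pos = pos + rng.normal(0.0, 0.01, size=pos.shape)
    pos = np.abs(pos)              # reflect off the walls at 0
    pos = 1.0 - np.abs(1.0 - pos)  # reflect off the walls at 1

# After many small steps, the molecules are spread almost uniformly:
# a simple distribution that is easy to describe and sample from.
print(pos.mean(axis=0), pos.std(axis=0))  # ~[0.5 0.5], std ~0.29
```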


Jascha Sohl-Dickstein used the principles of diffusion to develop a new approach to generative modelling.


Sohl-Dickstein developed a generative-modelling algorithm based on these principles of diffusion. The idea is simple: the algorithm first turns the complex images in the training data set into simple noise, akin to going from a blob of ink to diffuse pale blue water, then teaches the system to reverse the process, turning noise into images.

Here's how it works. First, the algorithm takes an image from the training set. As before, say each of the million pixels has some value, so we can plot the image as a dot in million-dimensional space. The algorithm then adds some noise to each pixel at every time step, equivalent to the diffusion of ink after one small step in time.

As this process continues, the pixel values bear less and less of a relationship to their values in the original image, and the pixels come to resemble a simple noise distribution. (At each time step, the algorithm also nudges each pixel value a little toward the origin, the zero value on all of those axes. This nudge prevents the pixel values from growing too large for computers to easily work with.)

Do this for every image in the data set, and the initial complex distribution of dots in million-dimensional space (which cannot be described and sampled from easily) turns into a simple, normal distribution of dots around the origin.
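In the notation of modern diffusion models, this forward process fits in a few lines. The sketch below assumes a made-up linear noise schedule; the sqrt(1 - beta) factor is exactly the nudge toward the origin described above:

```python
import numpy as np

rng = np.random.default_rng(2)

def forward_diffusion(x0, betas, rng):
    """Noise an image step by step, as in a DDPM-style forward process.

    At each step the image is shrunk slightly toward the origin (the
    sqrt(1 - beta) factor) and Gaussian noise is added, so after many
    steps the result is essentially pure noise.
    """
    x = x0.copy()
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    return x

# Hypothetical "image": pixel values rescaled to roughly [-1, 1].
image = rng.uniform(-1.0, 1.0, size=(64, 64))

# A made-up noise schedule: 1,000 small steps of gradually larger noise.
betas = np.linspace(1e-4, 0.02, 1000)

noised = forward_diffusion(image, betas, rng)
print(noised.mean(), noised.std())  # ~0 and ~1: a standard normal sample
```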


Yang Song helped come up with a novel technique for generating images by training a network to effectively unscramble noisy images.


"The sequence of transformations very slowly turns your data distribution into just a big noise ball," Sohl-Dickstein said. This "forward process" leaves you with a distribution you can sample from with ease.

Next comes the machine learning part: feed a neural network the noisy images from a forward pass and train it to predict the less noisy images that came one step earlier. It'll make mistakes at first, so you tweak the network's parameters so it does better. Eventually, the neural network can reliably turn a noisy image, which is a sample from the simple distribution, into an image representative of the complex distribution.

The trained network is a full-blown generative model. Now you don't even need an original image on which to do a forward pass: you have a complete mathematical description of the simple distribution, so you can sample from it directly. The neural network can turn this sample, essentially just static, into a final image that resembles an image from the training set.
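Putting the pieces together, here is a minimal end-to-end sketch in PyTorch on a toy two-pixel data set. All the hyperparameters are made up, and, as in most modern implementations, the network predicts the noise that was added, which is equivalent to predicting the less noisy image:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T = 200
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Toy stand-in for a data set of images: two-pixel "images" clustered
# around two points (a complex distribution we want to learn).
data = torch.cat([
    torch.randn(512, 2) * 0.1 + torch.tensor([-0.7, 0.7]),
    torch.randn(512, 2) * 0.1 + torch.tensor([0.7, -0.7]),
])

# A tiny network: given a noisy image and the time step, predict the
# noise that was added at that point in the forward process.
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):  # training: learn to undo the forward process
    t = torch.randint(0, T, (data.shape[0], 1))
    eps = torch.randn_like(data)
    ab = alphas_bar[t]
    noisy = ab.sqrt() * data + (1 - ab).sqrt() * eps  # jump to step t
    pred = net(torch.cat([noisy, t / T], dim=1))
    loss = ((pred - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: start from pure static and undo one step at a time.
x = torch.randn(8, 2)
for t in reversed(range(T)):
    with torch.no_grad():
        eps_hat = net(torch.cat([x, torch.full((8, 1), t / T)], dim=1))
    x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_hat) / (1 - betas[t]).sqrt()
    if t > 0:
        x = x + betas[t].sqrt() * torch.randn_like(x)

print(x)  # samples should land near the two training clusters
```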


Sohl-Dickstein remembers the first outputs of his diffusion model. "You'd squint and say, 'I guess that coloured blob looks like a vehicle,'" he said. "It was considerably more structured than anything I'd gotten before. I'd spent so many months of my life staring at different patterns of pixels, trying to discern structure, and I was very excited."


Thinking About the Future


Sohl-Dickstein published his diffusion model algorithm in 2015, but it still lagged far behind GANs. While diffusion models could sample over the entire distribution and never get stuck spitting out only a subset of images, the images looked worse, and the process was much too slow. "I don't think at the time this was seen as exciting," Sohl-Dickstein said.

It would take two students, neither of whom knew Sohl-Dickstein or each other, to connect this early work to modern diffusion models such as DALL·E 2. The first was Song, then a doctoral student at Stanford. In 2019, he and his advisor published a novel method for building generative models that didn't estimate the data's probability distribution (the high-dimensional surface). Instead, it estimated the gradient of the distribution (think of it as the slope of the high-dimensional surface).

Song found that his technique worked best when he first perturbed each image in the training data set with increasing levels of noise, then asked his neural network to predict the original image using gradients of the distribution, effectively denoising it. Once trained, his neural network could take a noisy image sampled from a simple distribution and progressively turn it back into an image representative of the training data set. The image quality was great, but his model was agonisingly slow to sample from. And he did all this with no knowledge of Sohl-Dickstein's work. "I wasn't aware of diffusion models at all," Song said. "After our 2019 paper was published, I received an email from Jascha. He pointed out to me that our models have very strong connections."
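Song's method, known as score-based generative modelling, samples by repeatedly nudging points along the estimated gradient (the "score") with Langevin dynamics. In the minimal sketch below, a toy two-mode distribution stands in for the training data, and its score is computed analytically rather than learned, so the sampling procedure itself is the focus:

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy target: a two-mode Gaussian mixture standing in for "the data".
# A real score-based model would learn this gradient with a network
# trained by denoising; here we write it down analytically.
means = np.array([[-2.0, 2.0], [2.0, -2.0]])

def score(x, sigma):
    """Gradient of log p(x) for a mixture of N(mean, sigma^2 I) blobs."""
    d = x[:, None, :] - means[None, :, :]          # (n, modes, 2)
    logw = -0.5 * (d ** 2).sum(-1) / sigma ** 2    # per-mode log density
    logw -= logw.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(logw)
    w = w / w.sum(axis=1, keepdims=True)           # mode responsibilities
    return -(w[..., None] * d).sum(axis=1) / sigma ** 2

# Annealed Langevin dynamics: nudge samples uphill along the
# (noise-smoothed) score plus a random kick, lowering the noise level.
x = rng.normal(size=(1000, 2)) * 4.0
for sigma in [3.0, 1.0, 0.3]:
    step = 0.1 * sigma ** 2
    for _ in range(200):
        x = x + step * score(x, sigma) + np.sqrt(2 * step) * rng.normal(size=x.shape)

print(np.round(x[:5], 2))  # samples cluster near the two modes
```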

In 2020, the second student saw these connections and realized that Song's ideas could improve Sohl-Dickstein's diffusion models. Jonathan Ho had recently completed his doctoral work on generative modelling at the University of California, Berkeley, but he kept working on it. "I thought it was the most mathematically beautiful subdiscipline of machine learning," he said.


Ho redesigned and updated Sohl-Dickstein's diffusion model with some of Song's ideas and other advances from the world of neural networks. "I knew that in order to get the community's attention, I needed to make the model generate great-looking samples," he said. "I was convinced that this was the most important thing I could do at the moment."

His intuition was right. Ho and his colleagues announced this new and improved diffusion model in a 2020 paper titled "Denoising Diffusion Probabilistic Models." It quickly became such a landmark that researchers now refer to it simply as DDPM. On one benchmark of image quality, which compares the distribution of generated images to the distribution of training images, these models matched or surpassed all competing generative models, including GANs. It wasn't long before the big players took notice. Today, DALL·E 2, Stable Diffusion, Imagen, and other commercial models all use some variant of DDPM.


Jonathan Ho and his colleagues combined Sohl-Dickstein's and Song's methods to make possible modern diffusion models, such as DALL·E 2.

Modern diffusion models have one more key ingredient: large language models (LLMs), such as GPT-3. These are generative models trained on text from the internet to learn probability distributions over words instead of images. In 2021, Ho, now a research scientist at a stealth company, and his colleague Tim Salimans showed how to combine information from an LLM with an image-generating diffusion model, using text (say, "a goldfish slurping Coca-Cola on a beach") to guide the process of diffusion and hence of image generation. This "guided diffusion" is behind the success of text-to-image models like DALL·E 2.
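The mechanism that makes this guidance work in practice is Ho and Salimans' classifier-free guidance, which fits in a few lines. In this sketch, `net`, its signature, and the default guidance scale are illustrative assumptions, not any particular system's API:

```python
import torch

def guided_noise_prediction(net, x, t, text_emb, guidance_scale=7.5):
    """Classifier-free guidance, as introduced by Ho and Salimans.

    `net` is a hypothetical diffusion network that predicts the added
    noise from a noisy image `x`, a time step `t`, and an optional text
    embedding. Blending its conditioned and unconditioned predictions
    pushes samples toward images that match the text prompt; the scale
    (7.5 is a commonly used default) controls how strongly.
    """
    eps_uncond = net(x, t, text_emb=None)     # ignore the prompt
    eps_cond = net(x, t, text_emb=text_emb)   # condition on the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```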

"They far exceed my greatest dreams," Ho said. "I'm not going to pretend I predicted any of this."


Creating Difficulties


For all their success, images from DALL·E 2 and its rivals are still far from perfect. Large language models can reflect cultural and societal biases, such as racism and sexism, in the text they generate. That's because they are trained on text scraped from the internet, which often contains racist and sexist language. LLMs that learn a probability distribution over such text become imbued with the same biases. Diffusion models, too, are trained on uncurated images taken from the internet, which can contain similarly biased data. It's no wonder that combining LLMs with today's diffusion models sometimes yields images reflective of society's ills.

Anandkumar has first-hand experience. When she tried generating stylised avatars of herself with an app based on a diffusion model, she was shocked. "So [many] of the images were highly sexualized," she said, "whereas the things it was presenting to men were not." And she's not alone.

These biases can be lessened by curating and filtering the data (an extremely difficult task, given the immense size of the data sets) or by putting checks on both the input prompts and the outputs of these models. "Of course, nothing can replace carefully and thoroughly evaluating a model" for safety, Ho said. "This is an important challenge for the field."

Despite these concerns, Anandkumar believes in the power of generative modelling. "I really like Richard Feynman's quote: 'What I cannot create, I do not understand,'" she said. An increased understanding has enabled her team to develop generative models that produce, for example, synthetic training data of under-represented classes for prediction tasks, such as darker skin tones for facial recognition, helping improve fairness. Generative models may also give researchers insights into how our brains deal with noisy inputs, conjure mental images, and contemplate the future. And building more sophisticated models could endow AIs with similar capabilities.

"I think we are just at the beginning of the possibilities of what we can do with generative AI," Anandkumar said.


