DCGAN stands for deep convolutional generative adversarial network, a class of CNNs with certain architectural constraints, and the paper demonstrates that DCGANs are strong candidates for unsupervised learning: when the discriminator can distinguish fake from real, the features it extracts represent the data well. How about the generator? Let us see what the paper tells us.
In my own words:
- the paper proposed a set of architectural constraints for DCGANs under which training is stable and more likely to converge.
- the paper showed that DCGANs trained on massive amounts of unlabeled images can be very competitive feature extractors, from both the generator and the discriminator (mostly the discriminator), for higher-level supervised tasks such as image classification.
- the paper tried to peek into how the generator works and what kinds of features and filters it learns.
- the paper showed that generators have interesting vector arithmetic properties, allowing easy manipulation of many semantic qualities of generated samples.
1. Model Architecture
Historical attempts to scale up GANs using CNNs to model images (i.e., to generate high-resolution images) have been unsuccessful. After extensive model exploration, the authors identified a family of architectures that trains stably across a range of datasets and allows for higher-resolution, deeper generative models. The architecture guidelines for stable deep convolutional GANs are:
- Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use batchnorm in both the generator and the discriminator.
- Remove fully connected hidden layers for deeper architectures.
- Use ReLU activation in generator for all layers except for the output, which uses Tanh.
- Use LeakyReLU activation in the discriminator for all layers.
One more basic point about the structure: the first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected, as it is just a matrix multiplication; the result is then reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and fed into a single sigmoid output.
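The fractional-strided convolution arithmetic can be made concrete with a short sketch. Assuming a kernel size of 4, stride 2, and padding 1 (common DCGAN choices, not stated in the text above), each transposed-conv layer doubles the spatial size, taking the reshaped 4x4 noise projection up to a 64x64 image in four layers:

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a fractional-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * pad + kernel

# Start from the projected-and-reshaped noise: a 4x4 spatial grid.
size = 4
sizes = [size]
for _ in range(4):  # four transposed-conv layers in the generator stack
    size = deconv_out(size)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```

With kernel 4, stride 2, and padding 1, the formula reduces to exactly doubling the spatial size at every layer, which is why this parameter triple is so common.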
More details about the parameters and training of the network:
- No pre-processing is needed for the training images besides scaling to the range of the tanh activation function [-1,1].
- All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128.
- All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02.
- In the LeakyReLU, the slope of the leak was set to 0.2 in all models.
- While previous work used momentum to accelerate training, this paper used the Adam optimizer with tuned hyperparameters. The learning rate is set to 0.0002.
- The momentum term beta1 is set to 0.5.
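A minimal numpy sketch of the element-wise details above (scaling to the tanh range, LeakyReLU with slope 0.2, and the zero-centered Normal weight initialization); function names are mine, not the paper's:

```python
import numpy as np

def scale_to_tanh_range(img_uint8):
    """Scale pixel values from [0, 255] to the tanh range [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def leaky_relu(x, slope=0.2):
    """LeakyReLU with the paper's leak slope of 0.2."""
    return np.where(x >= 0, x, slope * x)

def init_weights(shape, std=0.02, seed=0):
    """Zero-centered Normal initialization with standard deviation 0.02."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=shape)

x = scale_to_tanh_range(np.array([0, 255], dtype=np.uint8))
print(x)  # [-1.  1.]
```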
2. Unsupervised Representation Learning: the Discriminator as a Feature Extractor
One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
The DCGAN is trained on Imagenet-1k, and then the discriminator's convolutional features from all layers are used, max-pooling each layer's representation to produce a 4x4 spatial grid. These features are then flattened and concatenated to form a 28672-dimensional vector, and a regularized linear L2-SVM classifier is trained on top of them. This achieves impressive performance, though there is still room to do better.
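A numpy sketch of that feature-extraction pipeline, using hypothetical layer shapes (the actual discriminator channel counts are not given here): max-pool each layer's feature maps down to a 4x4 grid, flatten, and concatenate into one vector for the SVM.

```python
import numpy as np

def maxpool_to_4x4(fmap):
    """Max-pool a (channels, h, w) feature map to a (channels, 4, 4) grid."""
    c, h, w = fmap.shape
    assert h % 4 == 0 and w % 4 == 0
    return fmap.reshape(c, 4, h // 4, 4, w // 4).max(axis=(2, 4))

# Hypothetical feature maps from four discriminator layers.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(c, s, s))
          for c, s in [(64, 32), (128, 16), (256, 8), (512, 4)]]

features = np.concatenate([maxpool_to_4x4(f).ravel() for f in layers])
print(features.shape)  # (15360,) = 16 * (64 + 128 + 256 + 512)
```

The final dimensionality is just 16 times the total channel count across the pooled layers; the paper's 28672-dimensional vector comes from its own channel configuration.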
Visualizing the Discriminator Features
Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features. Additionally, supervised CNNs trained on scene classification learn object detectors. The authors demonstrated that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of interesting features. Using guided backpropagation, they show that the features learned by the discriminator activate on typical parts of a bedroom, like beds and windows.
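Guided backpropagation itself is easy to sketch for a single ReLU: the backward pass keeps a gradient only where the forward input was positive (ordinary ReLU backprop) and where the incoming gradient is positive (the "guided" part), so only positive evidence for the feature flows back to the pixels. A minimal numpy illustration:

```python
import numpy as np

def guided_relu_backward(x, grad_out):
    """Guided-backprop gradient through one ReLU.

    Keeps grad_out only where the forward input x was positive
    (standard ReLU backprop) AND the incoming gradient is positive
    (the guidance); everything else is zeroed.
    """
    return np.where((x > 0) & (grad_out > 0), grad_out, 0.0)

x = np.array([-1.0, 2.0, 3.0, 0.5])   # forward-pass inputs to the ReLU
g = np.array([0.7, -0.3, 0.9, 0.1])   # gradients arriving from above
print(guided_relu_backward(x, g))     # only the last two entries survive
```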
3. Vector Arithmetic Properties of Generators
This section explores the question of what representations the generator learns.
The quality of the generated samples suggests that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. The authors conducted an experiment to remove windows from the generator.
To do so, they ran a simple classification experiment (logistic regression, window or not: activations inside the bounding boxes of windows are positives, activations outside are negatives) on features drawn from the second-highest convolution layer. All feature maps with positive weights were then dropped from all spatial locations.
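The dropping step can be sketched in numpy under assumed shapes (all sizes and names here are hypothetical; the actual layer dimensions are not given above). Given one logistic-regression weight per feature map, every map whose weight is positive is zeroed at all spatial locations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 8 feature maps of size 8x8 from the second-highest conv layer.
fmaps = rng.normal(size=(8, 8, 8))
# One assumed logistic-regression weight per feature map ("window" vs. not).
weights = rng.normal(size=8)

# Drop every feature map with a positive weight, at all spatial locations.
dropped = fmaps * (weights <= 0)[:, None, None]
```

Generating from the modified feature maps is what lets the paper show windows being replaced by other objects.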
They showed that averaging the Z vectors of three exemplars produced consistent and stable generations that semantically obeyed linear arithmetic, including object manipulation and face pose. To the authors' knowledge, this is the first demonstration that purely unsupervised models can learn to convincingly model object attributes like scale, rotation, and position.
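The vector arithmetic operates directly on the latent vectors. A sketch with hypothetical concept names (the paper's famous example is "smiling woman" - "neutral woman" + "neutral man"), averaging three exemplar z vectors per concept as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 100  # DCGAN samples a 100-dimensional uniform noise vector

def avg_z(n=3):
    """Average the z vectors of n exemplars of one visual concept."""
    return rng.uniform(-1, 1, size=(n, z_dim)).mean(axis=0)

# Hypothetical concept averages (in practice, chosen from generated samples).
z_smiling_woman = avg_z()
z_neutral_woman = avg_z()
z_neutral_man = avg_z()

# smiling woman - neutral woman + neutral man ~ smiling man
z_result = z_smiling_woman - z_neutral_woman + z_neutral_man
print(z_result.shape)  # (100,) -- feed this z into the generator
```

Averaging before the arithmetic is what makes the result stable; single z vectors give much noisier manipulations.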