The problem of transferring a deep convolutional network trained for object recognition to the task of scene image classification is considered. An embedded implementation of the recently proposed mixture of factor analyzers Fisher vector (MFA-FV) is proposed. This enables the design of a network architecture, the MFAFVNet, that can be trained in an end-to-end manner. The new architecture involves the design of an MFA-FV layer that implements a statistically correct version of the MFA-FV through a combination of network computations and regularization. Compared to previous neural implementations of Fisher vectors, the MFAFVNet relies on a more powerful statistical model and a more accurate implementation. Compared to previous non-embedded models, the MFAFVNet relies on a state-of-the-art model that is now embedded into a CNN, which enables end-to-end training. Experiments show that the MFAFVNet achieves state-of-the-art performance on scene classification.
Architecture of the MFAFVNet: the input image first passes through a feature extractor, e.g. VGG or ResNet. An ROI pooling layer is then applied to derive features of different sizes. Finally, an MFA-FV layer implements a trainable MFA operation for scene recognition.
Architecture of the MFA-FV layer: an implementation of the MFA-based Fisher vector as a neural network layer.
The key innovation of this paper is to implement a mixture of factor analyzers (MFA) within a neural network. The MFA-FV layer is designed as illustrated above, using a few matrix operations together with some approximations.
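To make the "few matrix operations" concrete, here is a minimal NumPy sketch of the textbook quantities behind an MFA Fisher vector: the posterior responsibilities of each mixture component and the gradients of the log-likelihood with respect to the component means and factor loadings. It is not the paper's MFA-FV layer and does not reproduce its approximations or regularization. The function name `mfa_fisher_scores`, the isotropic noise term `sigma2`, and the omission of Fisher-information normalization are all assumptions made for illustration.

```python
import numpy as np


def mfa_fisher_scores(x, pi, mu, lam, sigma2):
    """Per-descriptor scores under a mixture of factor analyzers (illustrative).

    x:      (D,)      one local descriptor
    pi:     (K,)      mixture weights
    mu:     (K, D)    component means
    lam:    (K, D, R) factor loading matrices
    sigma2: float     isotropic noise variance (assumption: Psi = sigma2 * I)

    Component k has covariance S_k = lam_k lam_k^T + sigma2 * I.
    Returns responsibilities (K,), mean scores (K, D), loading scores (K, D, R).
    """
    K, D, R = lam.shape
    log_joint = np.empty(K)
    s_inv_d = np.empty((K, D))       # S_k^{-1} (x - mu_k)
    s_inv_lam = np.empty((K, D, R))  # S_k^{-1} lam_k
    for k in range(K):
        S = lam[k] @ lam[k].T + sigma2 * np.eye(D)
        d = x - mu[k]
        s_inv_d[k] = np.linalg.solve(S, d)
        s_inv_lam[k] = np.linalg.solve(S, lam[k])
        _, logdet = np.linalg.slogdet(S)
        log_joint[k] = np.log(pi[k]) - 0.5 * (
            d @ s_inv_d[k] + logdet + D * np.log(2.0 * np.pi)
        )

    # Posterior p(k | x): softmax over the joint log-probabilities.
    resp = np.exp(log_joint - log_joint.max())
    resp /= resp.sum()

    # d log p(x) / d mu_k = p(k|x) S_k^{-1} (x - mu_k)
    g_mu = resp[:, None] * s_inv_d

    # d log p(x) / d lam_k = p(k|x) (S_k^{-1} d d^T S_k^{-1} - S_k^{-1}) lam_k
    proj = np.einsum("kd,kdr->kr", s_inv_d, lam)  # (S_k^{-1} d)^T lam_k
    g_lam = resp[:, None, None] * (s_inv_d[:, :, None] * proj[:, None, :] - s_inv_lam)
    return resp, g_mu, g_lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, D, R = 3, 5, 2  # toy sizes, chosen only for the demo
    resp, g_mu, g_lam = mfa_fisher_scores(
        rng.normal(size=D),
        np.full(K, 1.0 / K),
        rng.normal(size=(K, D)),
        rng.normal(size=(K, D, R)),
        0.5,
    )
    print(resp.shape, g_mu.shape, g_lam.shape)
```

In an image-level descriptor, these per-descriptor scores would typically be averaged over all local features and concatenated across components. Every step is a differentiable matrix operation, which is what allows such a computation to be written as a trainable layer.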