Object recognition has been studied extensively throughout the history of computer vision as one of its most fundamental problems. Over the years, the research objective has evolved drastically, especially with the growth in the scale of data available on the web. In this dissertation, we study one of the latest and most challenging object recognition tasks: fine-grained visual categorization (FGVC). In particular, we consider several practical issues in conducting FGVC in real-world applications, including classification accuracy, generalization ability, model interpretation, and runtime efficiency. To this end, several FGVC algorithms are proposed to cover application scenarios in which various kinds of supervision are provided for training models. The main contribution of this thesis, therefore, is a general pipeline, built on the proposed algorithms, for conducting fine-grained visual categorization in a variety of real-world applications.
Our first work aims to improve the generalization ability of FGVC algorithms by reducing the heavy requirement for human-labeled annotations. We study FGVC under the weakest form of supervision, where only image-level labels are provided for training. For this challenging task, the proposed weakly supervised FGVC algorithm employs the widely used multi-instance learning framework, but with a carefully designed initialization strategy based on a novel multi-task co-localization algorithm. The localization results, meanwhile, also enable object-level, domain-specific fine-tuning of deep neural networks, which significantly boosts performance.
Our second work targets further improvement of the classification accuracy of FGVC. Motivated by the recent success of part-based models and deep convolutional features in FGVC, the proposed method follows a semi-supervised framework that exploits inexhaustible web data to augment existing strongly supervised FGVC datasets, so that the scale of labeled training data can keep pace with the rapid evolution of convolutional neural network (CNN) architectures. Our key discovery is that by transferring explicit knowledge learned from strongly supervised datasets using sophisticated object recognition methods, each web image can carry additional domain-specific knowledge, which leads to an increased information gain. The proposed method achieves state-of-the-art performance on several FGVC benchmarks, where the improvement comes from both the features and the classifiers.
In addition to pursuing classification performance, we also investigate other practical issues in performing FGVC in real-world applications, namely model interpretability and runtime efficiency. Implemented as a strongly supervised FGVC algorithm, a novel Part-Stacked CNN architecture is proposed, which runs in real time by applying a set of computational sharing and architectural sharing strategies across multiple object parts, and provides human-understandable visual manuals that explain the classification results through part-level analysis. Experiments are conducted to evaluate the algorithm with respect to classification accuracy, runtime efficiency, and the quality of model interpretation.