Abstract
Even with an abundance of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. Despite its wide applicability, it is only recently that a few-shot regime of anomaly detection has become feasible, e.g., with the help of large vision-language pre-trained models such as CLIP. In this paper, we explore the potential of large text-to-image generative models for few-shot anomaly detection. Specifically, recent text-to-image models have shown an unprecedented ability to generalize from a few images by extracting their common and unique concepts, and even to encode them into a textual token to "personalize" the model: the so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often open-ended, local, and hard to detect. We observe that standard textual inversion is not accurate enough for detecting anomalies, and we therefore propose a simple yet effective regularization scheme, derived from the zero-shot transferability of CLIP, to enhance its specificity. We also propose a self-tuning scheme that further optimizes the performance of our detection pipeline by leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme achieves state-of-the-art results on a wide range of few-shot anomaly detection benchmarks.
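To make the core idea concrete, below is a minimal sketch of textual inversion with a transferability-preserving regularizer, written directly in CLIP's joint embedding space rather than through the full text-to-image diffusion pipeline the paper uses. It is not the authors' method: the placeholder token "<obj>", the generic anchor prompt, the file paths, the hyperparameters (`lam`, learning rate, step count), and the anchor-prompt regularizer itself are all illustrative assumptions standing in for the paper's CLIP-derived regularization.

```python
# Minimal sketch (assumptions labeled): invert a learnable token "<obj>" onto a few
# normal shots in CLIP space, regularized toward a generic prompt so the token does
# not drift away from CLIP's zero-shot-transferable concept of the class.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = processor.tokenizer
for p in model.parameters():
    p.requires_grad_(False)

# Register a placeholder token and append one learnable row to the text embedding
# table, initialized from a coarse class word (hypothetical choice: "object").
tokenizer.add_tokens(["<obj>"])
emb = model.text_model.embeddings.token_embedding
init_id = tokenizer("object", add_special_tokens=False)["input_ids"][0]
weight = torch.cat([emb.weight, emb.weight[init_id : init_id + 1].clone()])
model.text_model.embeddings.token_embedding = torch.nn.Embedding.from_pretrained(
    weight, freeze=False
)
n_frozen, frozen_rows = emb.weight.shape[0], emb.weight.detach().clone()

# Few normal shots (hypothetical paths); their frozen image features are the target.
images = [Image.open(f) for f in ["normal_0.png", "normal_1.png"]]
with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_feats = F.normalize(model.get_image_features(**img_inputs), dim=-1)
    anchor = tokenizer(["a photo of an object"], return_tensors="pt")
    anchor_feat = F.normalize(model.get_text_features(**anchor), dim=-1)

prompt = tokenizer(["a photo of <obj>"], return_tensors="pt")
table = model.text_model.embeddings.token_embedding.weight
opt = torch.optim.Adam([table], lr=5e-3)
lam = 0.5  # regularization strength (assumed)
for step in range(300):
    text_feat = F.normalize(model.get_text_features(**prompt), dim=-1)
    fit = 1 - (text_feat @ img_feats.T).mean()    # pull token toward the shots
    reg = 1 - (text_feat @ anchor_feat.T).mean()  # stay near the generic concept
    loss = fit + lam * reg
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():  # only the appended row is learnable; restore the rest
        table.data[:n_frozen] = frozen_rows

def score(img: Image.Image) -> float:
    """Anomaly score: distance of a test image to the personalized concept."""
    with torch.no_grad():
        x = processor(images=[img], return_tensors="pt")
        f = F.normalize(model.get_image_features(**x), dim=-1)
        t = F.normalize(model.get_text_features(**prompt), dim=-1)
    return float(1 - (f @ t.T))
```

The anchor term is the simplest way to encode the abstract's intuition: without it, the inverted token can overfit the few shots into a direction CLIP no longer grounds, so normal test images and subtle anomalies score similarly; with it, the token stays specific to the given instances while remaining comparable against CLIP's broader concept space. The paper's actual regularizer and its diffusion-based self-tuning on synthetic samples are not reproduced here.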