transformers/docs/source/en/model_doc/vit_msn.md

4.3 KiB

ViTMSN

Overview

The ViTMSN model was proposed in Masked Siamese Networks for Label-Efficient Learning by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. The paper presents a joint-embedding architecture to match the prototypes of masked patches with that of the unmasked patches. With this setup, their method yields excellent performance in the low-shot and extreme low-shot regimes.

The abstract from the paper is the following:

We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark.

drawing

MSN architecture. Taken from the original paper.

This model was contributed by sayakpaul. The original code can be found here.

Usage tips

  • MSN (masked siamese networks) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training objective is to match the prototypes assigned to the unmasked views of the images to that of the masked views of the same images.
  • The authors have only released pre-trained weights of the backbone (ImageNet-1k pre-training). So, to use that on your own image classification dataset, use the [ViTMSNForImageClassification] class which is initialized from [ViTMSNModel]. Follow this notebook for a detailed tutorial on fine-tuning.
  • MSN is particularly useful in the low-shot and extreme low-shot regimes. Notably, it achieves 75.7% top-1 accuracy with only 1% of ImageNet-1K labels when fine-tuned.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT MSN.

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

ViTMSNConfig

autodoc ViTMSNConfig

ViTMSNModel

autodoc ViTMSNModel - forward

ViTMSNForImageClassification

autodoc ViTMSNForImageClassification - forward