# Pyramid Vision Transformer V2 (PVTv2)

## Overview
The PVTv2 model was proposed in PVT v2: Improved Baselines with Pyramid Vision Transformer by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. As an improved variant of PVT, it eschews position embeddings, relying instead on positional information encoded through zero-padding and overlapping patch embeddings. This lack of reliance on position embeddings simplifies the architecture and enables running inference at any resolution, with no position-embedding interpolation required.
The PVTv2 encoder structure has been successfully deployed to achieve state-of-the-art scores in Segformer for semantic segmentation, GLPN for monocular depth estimation, and Panoptic Segformer for panoptic segmentation.
PVTv2 belongs to a family of models called hierarchical transformers, which adapt transformer layers to generate multi-scale feature maps. Unlike the columnal structure of Vision Transformer (ViT), which loses fine-grained detail, multi-scale feature maps are known to preserve this detail and aid performance in dense prediction tasks. In the case of PVTv2, this is achieved by generating image patch tokens using 2D convolutions with overlapping kernels in each encoder layer.
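To make this concrete, the overlap can be pictured as a strided 2D convolution whose kernel is wider than its stride. The sketch below is illustrative only (the class and parameter names are not the library's internal ones), but it shows how overlapping patch tokens and a coarser feature map fall out of a single convolution:

```python
import torch
from torch import nn


class OverlappingPatchEmbedding(nn.Module):
    """Illustrative sketch: overlapping patches from a conv whose kernel > stride."""

    def __init__(self, in_channels=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        # kernel_size > stride makes neighboring patches share pixels, and the
        # zero-padding leaks positional information into the tokens.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size,
            stride=stride, padding=patch_size // 2,
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, pixel_values):
        features = self.proj(pixel_values)            # (B, C, H/stride, W/stride)
        _, _, height, width = features.shape
        tokens = features.flatten(2).transpose(1, 2)  # (B, H*W, C) patch tokens
        return self.norm(tokens), height, width


tokens, height, width = OverlappingPatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 64]) for a 224x224 input
```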
The multi-scale features of hierarchical transformers allow them to be easily swapped in for traditional workhorse computer vision backbone models like ResNet in larger architectures. Both Segformer and Panoptic Segformer demonstrated that configurations using PVTv2 for a backbone consistently outperformed those with similarly sized ResNet backbones.
Another powerful feature of PVTv2 is the complexity reduction in its self-attention layers, called Spatial Reduction Attention (SRA). SRA uses strided 2D convolution layers to project the hidden states to a smaller resolution before attending to them with the queries, improving the O(n^2) complexity of self-attention to O(n^2/R^2), with R being the spatial reduction ratio (`sr_ratio`, i.e. the kernel size and stride of the 2D convolution).
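A minimal sketch of the idea (heavily simplified relative to the library implementation, using an off-the-shelf `nn.MultiheadAttention` and illustrative defaults) looks like this:

```python
import torch
from torch import nn


class SpatialReductionAttentionSketch(nn.Module):
    """Simplified SRA: keys/values come from a spatially reduced feature map."""

    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        # sr_ratio is both the kernel size and stride, so the key/value sequence
        # length shrinks by a factor of sr_ratio**2 before attention.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden_states, height, width):
        batch_size, _, dim = hidden_states.shape
        # Fold the token sequence back into a 2D map and reduce its resolution.
        feature_map = hidden_states.transpose(1, 2).reshape(batch_size, dim, height, width)
        reduced = self.sr(feature_map).flatten(2).transpose(1, 2)
        reduced = self.norm(reduced)
        # Full-length queries attend to the much shorter reduced key/value sequence.
        attn_output, _ = self.attn(hidden_states, reduced, reduced)
        return attn_output


hidden_states = torch.randn(1, 56 * 56, 64)
print(SpatialReductionAttentionSketch()(hidden_states, 56, 56).shape)  # (1, 3136, 64)
```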
SRA was introduced in PVT and is the default attention complexity reduction method used in PVTv2. However, PVTv2 also introduced the option of a self-attention mechanism whose complexity is linear in image size, which the authors call "Linear SRA". This method uses average pooling to reduce the hidden states to a fixed size that is invariant to their original resolution (although this is inherently more lossy than regular SRA). This option can be enabled by setting `linear_attention` to `True` in the `PvtV2Config`.
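For instance, a fresh (randomly initialized) model with Linear SRA enabled can be built directly from the config; everything other than `linear_attention` is left at its default here:

```python
from transformers import PvtV2Config, PvtV2ForImageClassification

# Enable Linear SRA (fixed-size average pooling in the attention layers);
# all other configuration values keep their defaults. The resulting model is untrained.
config = PvtV2Config(linear_attention=True)
model = PvtV2ForImageClassification(config)
```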
Abstract from the paper:
Transformer recently has presented encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs, including (1) linear complexity attention layer, (2) overlapping patch embedding, and (3) convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performances than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer researches in computer vision. Code is available at https://github.com/whai362/PVT.
This model was contributed by FoamoftheSea. The original code can be found here.
## Usage tips
- PVTv2 is a hierarchical transformer model which has demonstrated powerful performance in image classification and multiple other tasks, used as a backbone for semantic segmentation in Segformer, monocular depth estimation in GLPN, and panoptic segmentation in Panoptic Segformer, consistently showing higher performance than similar ResNet configurations.
- Hierarchical transformers like PVTv2 achieve superior data and parameter efficiency on image data compared with pure transformer architectures by incorporating design elements of convolutional neural networks (CNNs) into their encoders. This creates a best-of-both-worlds architecture that infuses the useful inductive biases of CNNs like translation equivariance and locality into the network while still enjoying the benefits of dynamic data response and global relationship modeling provided by the self-attention mechanism of transformers.
- PVTv2 uses overlapping patch embeddings to create multi-scale feature maps, which are infused with location information using zero-padding and depth-wise convolutions.
- To reduce the complexity in the attention layers, PVTv2 performs a spatial reduction on the hidden states using either strided 2D convolution (SRA) or fixed-size average pooling (Linear SRA). Although inherently more lossy, Linear SRA delivers impressive performance with a complexity that is linear with respect to image size. To use Linear SRA in the self-attention layers, set `linear_attention=True` in the `PvtV2Config`.
- [`PvtV2Model`] is the hierarchical transformer encoder (which is also often referred to as Mix Transformer or MiT in the literature). [`PvtV2ForImageClassification`] adds a simple classifier head on top to perform image classification. [`PvtV2Backbone`] can be used with the [`AutoBackbone`] system in larger architectures like Deformable DETR; a short feature-map example follows this list.
- ImageNet pretrained weights for all model sizes can be found on the hub.
The best way to get started with PVTv2 is to load a pretrained checkpoint of the size of your choosing with `AutoModelForImageClassification`:
```python
import requests
import torch
from PIL import Image

from transformers import AutoImageProcessor, AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("OpenGVLab/pvt_v2_b0")
image_processor = AutoImageProcessor.from_pretrained("OpenGVLab/pvt_v2_b0")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The image processor returns ready-to-use PyTorch tensors.
processed = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**processed)
```
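From there, the predicted ImageNet-1K class can be read off the logits via the label mapping stored in the model config:

```python
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

Because PVTv2 encodes position through zero-padding and overlapping patch embeddings rather than position embeddings, the same model also accepts inputs of other resolutions without any interpolation step.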
To use PVTv2 as a backbone for more complex architectures like Deformable DETR, you can use `AutoBackbone` (this model would need fine-tuning, as you're replacing the backbone in the pretrained model):
```python
import requests
import torch
from PIL import Image

from transformers import AutoConfig, AutoImageProcessor, AutoModelForObjectDetection

# Build a Deformable DETR model whose backbone is replaced with PVTv2-B5.
model = AutoModelForObjectDetection.from_config(
    config=AutoConfig.from_pretrained(
        "SenseTime/deformable-detr",
        backbone_config=AutoConfig.from_pretrained("OpenGVLab/pvt_v2_b5"),
        use_timm_backbone=False,
    ),
)

image_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processed = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**processed)
```
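After fine-tuning, detections can be decoded with the Deformable DETR image processor's post-processing utility; the snippet below is a sketch of the usual flow, with an arbitrarily chosen 0.5 confidence threshold:

```python
target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (width, height)
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
print(results["scores"].shape, results["labels"].shape, results["boxes"].shape)
```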
PVTv2 performance on ImageNet-1K by model size (B0-B5):
| Method | Size | Acc@1 | #Params (M) |
|---|---|---|---|
| PVT-V2-B0 | 224 | 70.5 | 3.7 |
| PVT-V2-B1 | 224 | 78.7 | 14.0 |
| PVT-V2-B2-Linear | 224 | 82.1 | 22.6 |
| PVT-V2-B2 | 224 | 82.0 | 25.4 |
| PVT-V2-B3 | 224 | 83.1 | 45.2 |
| PVT-V2-B4 | 224 | 83.6 | 62.6 |
| PVT-V2-B5 | 224 | 83.8 | 82.0 |
## PvtV2Config

[[autodoc]] PvtV2Config

## PvtV2ForImageClassification

[[autodoc]] PvtV2ForImageClassification
    - forward

## PvtV2Model

[[autodoc]] PvtV2Model
    - forward