Awesome-Transformer-Attention - An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites

Created at: 2021-09-15 15:16:24

This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites.
This list is maintained by Min-Hung Chen. (Actively keep updating)

If you find any ignored papers, please open issues or pull requests, or email me.
Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list.
Feel free to share this list with others!

[Update: October 2022] Split the 2nd half of the paper list into README_2.md
[Update: October 2022] Added all the related papers from ECCV 2022!
[Update: September 2022] Added the Transformer tutorial slides by Lucas Beyer!
[Update: July 2022] Added all the related papers from ICML 2022!
[Update: June 2022] Added all the related papers from CVPR 2022!


Overview

------ (The following papers have been moved to README_2.md) ------


Survey

  • "A Survey on Visual Transformer", TPAMI, 2022 (Huawei).[論文]]
  • "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands).[論文]]
  • "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (シドニー大学).[論文]]
  • "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft).[論文]]
  • "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago).[論文]]
  • "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia).[論文]]
  • "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS).[論文]]
  • "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI).[論文][ギットハブ]
  • "トランスフォーマーに基づく医用画像解析:レビュー", arXiv, 2022 (NUS, シンガポール).[論文]]
  • "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI).[論文][GitHub]
  • "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU).[論文]]
  • "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI).[論文][GitHub]
  • "Transconductorersによるマルチモーダル学習:調査", arXiv, 2022 (オックスフォード).[論文]]
  • 「トランスフォーマーで医用画像を変革?主要な特性、現在の進歩、および将来の展望の比較レビュー」、arXiv、2022(CAS)。[論文]]
  • "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo).[論文]]
  • "医療応用のための注意メカニズムに関する調査:私たちはより良いアルゴリズムに向かっていますか?", arXiv, 2022 (INESC TEC and University of Porto, Portugal).[論文]]
  • "Efficient Transformers: A Survey", arXiv, 2022 (Google).[論文]]
  • 「私たちは新しいパラダイムシフトの準備ができていますか?A Survey on Visual Deep MLP", arXiv, 2022 (清華).[論文]]
  • "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan).[論文]]
  • "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain).[論文]]
  • "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (南京大学).[論文]]
  • "Vision Transformerの最近の進歩:最近の仕事の調査と見通し"、arXiv、2022(?)。[論文]]
  • "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University).[論文]]
  • "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba).[論文][GitHub]
  • "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt).[論文]]
  • "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI).[論文]]
  • "Survey: Transformer Based Video-Language Pre-training", arXiv, 2021 (中国人民大学).[論文]]
  • "A Survey of Transformers", arXiv, 2021 (Fudan).[論文]]
  • "A Survey of Visual Transformers", arXiv, 2021 (CAS).[論文]]
  • "マシンビジョンのための注意メカニズムとディープラーニング:最先端の調査"、arXiv、2021(カシミール大学、インド)。[論文]]

[Back to Overview]

Image Classification / Backbone

Replace Conv w/ Attention

Pure Attention

Conv-stem + Attention

Conv + Attention

[Back to Overview]

Vision Transformer

General Vision Transformer

  • ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
  • VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
  • PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
  • T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
  • FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
  • DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
  • Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
  • Twins: "Twins: Revisiting the Design of Spatial Attention in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
  • ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
  • DVT: "Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
  • Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
  • TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
  • ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
  • DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
  • So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
  • NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
  • Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
  • CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
  • V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
  • P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
  • PvTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
  • LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
  • ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
  • ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Diverse-ViT: "The Principles of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
  • ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-Tuning", NeurIPS, 2022 (Microsoft). [Paper]
  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][PyTorch]
  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
  • LITv2: "Fast Vision Transformers with HiLo Attention", arXiv, 2022 (Monash University). [Paper][Code (in construction)]
  • PerViT: "Peripheral Vision Transformer", arXiv, 2022 (POSTECH). [Paper]
  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
  • GC-ViT: "Global Context Vision Transformers", arXiv, 2022 (NVIDIA). [Paper][PyTorch]
  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • MAGNETO: "Foundation Transformers", arXiv, 2022 (Microsoft). [Paper]
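
Most of the general backbones listed above follow the same basic recipe introduced by ViT: split the image into patches, embed them as tokens, and process the token sequence with a standard Transformer encoder. Below is a minimal, illustrative PyTorch sketch of that recipe; the hyperparameters and class name are assumptions for demonstration, not any specific paper's implementation.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + Transformer encoder + linear head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution (equivalent to flattening 16x16 patches).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=4 * dim,
                                                   batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # global self-attention over all tokens
        return self.head(x[:, 0])                # classify from the [CLS] token

# Example: a batch of two 224x224 images.
logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

The listed works mostly vary what happens inside the encoder (windowed/local attention, pyramid stages, deformable or k-NN attention, etc.) while keeping this overall tokenize-then-attend structure.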

Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Vision Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • DeiT-Manifold: "Efficient Vision Transformers via Fine-Grained Manifold Distillation", arXiv, 2021 (Huawei). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Token-Pooling: "Token Pooling in Visual Transformers", arXiv, 2021 (Apple). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling for Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", arXiv, 2022 (Nanjing University). [Paper][PyTorch]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", arXiv, 2022 (Snap). [Paper][Code (in construction)]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
  • ToMe: "Token Merging: Your ViT but Faster", arXiv, 2022 (Meta). [Paper][PyTorch]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
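
Several of the efficiency-oriented entries above (for example, the token reorganization / sampling / merging line of work such as EViT, ATS, and ToMe) reduce cost by dropping or merging patch tokens between blocks. The sketch below shows the common idea of keeping the patch tokens most attended by the [CLS] token; the scoring rule, keep ratio, and function name are illustrative assumptions, not any single paper's method.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the patch tokens most attended by the [CLS] token.

    tokens:   (B, 1 + N, D) sequence with the [CLS] token at index 0.
    cls_attn: (B, N) attention weights from [CLS] to each patch token,
              e.g. averaged over heads from the previous attention layer.
    """
    B, N = cls_attn.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) indices of kept patches
    cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
    gathered = torch.gather(
        patch_tok, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tok.size(-1)))
    return torch.cat([cls_tok, gathered], dim=1)               # (B, 1 + k, D)

# Toy example: 196 patch tokens reduced to 98 (plus the [CLS] token).
tokens = torch.randn(2, 197, 192)
cls_attn = torch.rand(2, 196)
print(prune_tokens(tokens, cls_attn).shape)  # torch.Size([2, 99, 192])
```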

Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • iFormer: "Inception Transformer", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]

Training + Transformer

  • iGPT: "Generative Pretraining from Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
  • MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
  • DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
  • CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
  • MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
  • SiT: "SiT: Self-Supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
  • MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
  • EsViT: "Efficient Self-Supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
  • iBOT: "Image BERT Pre-Training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)]
  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
  • data2vec: "data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper]
  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
  • ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
  • SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][PyTorch]
  • mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
  • SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][Code (in construction)]
  • ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-Training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
  • ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
  • ViT-Adapter: "Vision Transformer Adapter for Dense Predictions", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)]
  • UM-MAE: "Uniform Masking: Enabling MAE Pre-Training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", arXiv, 2022 (Westlake University, China). [Paper][PyTorch]
  • GMML: "GMML is All You Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
  • HiViT: "HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling", arXiv, 2022 (CAS). [Paper]
  • ?: "A Closer Look at Self-Supervised Lightweight Vision Transformers", arXiv, 2022 (Megvii). [Paper]
  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
  • ?: "Position Label for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
  • DropKey: "DropKey", arXiv, 2022 (Meitu). [Paper]
  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", arXiv, 2022 (Microsoft). [Paper]
  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", arXiv, 2022 (JHU + UC Santa Cruz). [Paper][Code (in construction)]
  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
  • TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]
  • CLIPpy: "Perceptual Grouping in Vision-Language Models", arXiv, 2022 (Apple). [Paper]
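
Many of the self-supervised entries above (BEiT, MAE, SimMIM, and their follow-ups) are built around masked image modeling: hide a large fraction of patch tokens and train the model to reconstruct the hidden content. The sketch below shows only the MAE-style random masking step, with an illustrative 75% mask ratio; it is not taken from any particular codebase.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    patch_tokens: (B, N, D) embedded patches (no [CLS] token).
    Returns the visible tokens plus a binary mask (1 = masked) used for the loss.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # random score per patch
    shuffle = noise.argsort(dim=1)                  # random permutation of patch indices
    keep_idx = shuffle[:, :num_keep]                # the first part is kept (visible)
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, keep_idx, 0.0)                 # 0 = visible, 1 = masked
    return visible, mask

# Toy example: 196 patches, 75% masked -> 49 visible tokens go through the encoder.
visible, mask = random_masking(torch.randn(2, 196, 192))
print(visible.shape, int(mask[0].sum()))  # torch.Size([2, 49, 192]) 147
```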

Robustness + Transformer

  • ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
  • SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper]
  • ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
  • ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
  • Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
  • ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
  • ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
  • ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
  • Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
  • ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
  • ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
  • ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
  • ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
  • ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][Code (in construction)]
  • ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
  • MA: "Boosting the Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", arXiv, 2022 (Tel Aviv). [Paper][PyTorch]
  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
  • RobustCNN: "Can CNNs Be More Robust than Transformers?", arXiv, 2022 (UC Santa Cruz + JHU). [Paper][PyTorch]
  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
  • ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
  • ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • ?: "Analysis of Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attack on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
  • ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
  • ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
  • ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]

Model Compression + Transformer

  • ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
  • NViT: "NViT: Vision Transformer Compression and Parameter Redistribution", arXiv, 2021 (NVIDIA). [Paper]
  • MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
  • FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer Without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
  • SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
  • PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
  • EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
  • Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
  • Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
  • VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
  • VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
  • SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
  • I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", arXiv, 2022 (CAS). [Paper]
  • PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
  • AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
  • SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
  • oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]

[Back to Overview]

Attention-Free

MLP-Series

  • RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
  • Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
  • ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
  • ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
  • ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
  • S2-MLPv2: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
  • RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
  • Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
  • Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
  • ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
  • MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)]
  • gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
  • S2-MLP: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
  • STD: "Spatial-Channel Token Distillation for Vision MLP", ICML, 2022 (Huawei). [Paper]
  • AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLP", arXiv, 2022 (Microsoft). [Paper]
  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHATechnology, Japan). [Paper]
  • ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]

Other Attention-Free

  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch]
  • FocalNet: "Focal Modulation Networks", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]

[Back to Overview]

Analysis for Transformer

  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch]
  • ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
  • ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
  • ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
  • ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
  • FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
  • ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
  • ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
  • ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotic). [Paper]
  • ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
  • FDSL: "Can Vision Transformers Learn Without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
  • AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
  • ?: "When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
  • ?: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Microsoft). [Paper]
  • ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
  • ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
  • ?: "Vision Transformers learn patch association", NeurIPS, 2022 (Princeton). [Paper]
  • AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
  • ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
  • MJP: "Breaking the Chain of Gradient Leakage in Vision Transformers", arXiv, 2022 (Tencent). [Paper]
  • ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", arXiv, 2022 (UW). [Paper][PyTorch]
  • ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, Netherlands). [Paper]
  • ?: "Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion Israel Institute of Technology). [Paper]
  • ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
  • ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
  • ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]

[Back to Overview]

Detection

Object Detection

  • CNN-based Backbone:
    • DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch]
    • Deformable-DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
    • UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
    • SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
    • Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
    • PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
    • TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
    • Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection with Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
    • ViT-YOLO: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
    • ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
    • DIL-ViT: "Pay Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
    • Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
    • CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
    • DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
    • GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
    • Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
    • Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
    • DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
    • DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
    • SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
    • DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
    • REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
    • DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
    • DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
    • Cornerformer: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (Huawei). [Paper]
    • ?: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (Microsoft). [Paper][Code (in construction)]
    • KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
    • MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", arXiv, 2022 (Tencent). [Paper][PyTorch]
    • imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", arXiv, 2022 (CAS). [Paper]
    • AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", arXiv, 2022 (IDEA, China). [Paper][Code (in construction)]
    • TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
    • Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
    • Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", arXiv, 2022 (Baidu). [Paper]
    • H-DETR: "DETRs with Hybrid Matching", arXiv, 2022 (Microsoft). [Paper]
    • SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
    • Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", arXiv, 2022 (CAS). [Paper][PyTorch]
  • Transformer-based Backbone:
    • ViT-FRCNN: "Towards Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
    • WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
    • YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
    • ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
    • ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
    • FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
    • DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
    • ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
    • UViT: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (Google). [Paper]
    • D2ETR: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper][PyTorch]
    • DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", arXiv, 2022 (IDEA, China). [Paper][Code (in construction)]
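
Most DETR-family detectors above share the same interface: a fixed set of learned object queries cross-attends to encoded image features, and each query predicts one box plus a class (including a "no object" class). Below is a minimal sketch of such a decoding head; it omits the image encoder, bipartite (Hungarian) matching, and losses, and all sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MiniDETRHead(nn.Module):
    """DETR-style set prediction head: learned queries cross-attend to image features."""
    def __init__(self, dim=256, num_queries=100, num_classes=91, depth=3, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learned object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.class_head = nn.Linear(dim, num_classes + 1)            # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                            # (cx, cy, w, h) in [0, 1]

    def forward(self, memory):                    # memory: (B, HW, dim) encoded image features
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)              # each query gathers evidence via cross-attention
        return self.class_head(hs), self.box_head(hs).sigmoid()

# Toy example: features from a 25x34 map flattened into 850 tokens.
logits, boxes = MiniDETRHead()(torch.randn(2, 850, 256))
print(logits.shape, boxes.shape)  # torch.Size([2, 100, 92]) torch.Size([2, 100, 4])
```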

[Back to Overview]

3D Object Detection

  • AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
  • Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
  • CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
  • Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
  • 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
  • M3DETR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
  • SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
  • MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
  • VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
  • CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
  • TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
  • LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
  • BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
  • BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
  • VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
  • STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
  • MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
  • CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
  • BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
  • SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
  • CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
  • SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (Waymo). [Paper]
  • EMMF-Det: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (Hikvision). [Paper]
  • PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
  • MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
  • UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", arXiv, 2022 (CUHK). [Paper][PyTorch]
  • PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", arXiv, 2022 (Megvii). [Paper]
  • PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
  • AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
  • CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
  • CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]

[Back to Overview]

Multi-Modal Detection

  • OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
  • MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
  • FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
  • MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
  • MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper][JAX][Hugging Face]
  • X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
  • simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
  • ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
  • YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
  • OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
  • Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", arXiv, 2022 (Google). [Paper]
  • ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]

[Back to Overview]

HOI Detection

  • HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
  • HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
  • MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
  • SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
  • CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
  • DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
  • STIP: "Exploring Structure-Aware Transformer over Interaction Proposal for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
  • DOQ: "Distillation Using Oracle Query for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
  • UPT: "Efficient Two-Stage Detection of Human-Object Interaction with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
  • CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
  • HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
  • Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
  • ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]

[Back to Overview]

Salient Object Detection

  • VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
  • ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
  • SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
  • SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
  • GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
  • TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
  • AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
  • TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
  • DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
  • GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai University). [Paper]
  • SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
  • DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
  • MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]

[Back to Overview]

Other Detection Tasks

  • X-supervised:
    • LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
    • Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
    • TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (University Grenoble Alpes, France). [Paper][PyTorch][Website]
    • WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
    • TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-Supervised Transformer and Normalized Cut", arXiv, 2022 (University Grenoble Alpes, France). [Paper][PyTorch][Website]
  • X-shot Object Detection:
    • AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
    • Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU Singapore). [Paper][PyTorch]
    • CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
    • FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
    • SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
    • TENET: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (ANU). [Paper][PyTorch]
    • Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
    • Incremental-DETR: "Incremental-DETR: Incremental Few Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
    • FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", arXiv, 2022 (Samsung). [Paper]
  • Open-World/Vocabulary:
    • OW-DETR: "OW-DETR: Open-World Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
    • DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
    • OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
    • VL-PLM: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (Rutgers University). [Paper][PyTorch][Website]
    • DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-Training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
  • Pedestrian Detection:
    • PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
    • Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
  • Lane Detection:
    • LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
    • LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
    • TLC: "Transformer-Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
    • PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
    • MHVA: "Lane Detection Transformer Based on Multi-frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (Beihang University). [Paper]
    • PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
    • CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Query and Attention", arXiv, 2022 (NullMax, China). [Paper]
  • Object Localization:
    • TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
    • LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
    • ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
    • SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
    • CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
  • Relation Detection:
    • PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
    • PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
    • TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
    • RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
  • Anomaly Detection:
    • VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
    • InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
    • AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
    • ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
  • Cross-Domain:
    • SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention", arXiv, 2021 (NTU Singapore). [Paper]
    • MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
    • OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing University of Technology). [Paper]
    • SSTA: "Cross-domain Detection Transformer based on Spatial-Aware and Semantic-Aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • Co-Salient Object Detection:
    • CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
  • Oriented Object Detection:
    • O2DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
  • Multi-view Detection:
    • MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
  • Polygon Detection:
    • ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
  • Drone-view:
    • TPH: "TPH-YOLOv5: Improved YOLOv5 based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
    • TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Infrared:
    • ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
  • Text:
    • SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
    • TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
    • TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
    • oCLIP: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (ByteDance). [Paper]
    • TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
    • ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
    • DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", arXiv, 2022 (JD). [Paper][Code (in construction)]
    • DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
  • Change Detection:
    • ChangeFormer: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (JHU). [Paper][PyTorch]
    • IDET: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (Civil Aviation University of China). [Paper]
  • Edge Detection:
  • Person Search:
    • COAT: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (Kitware). [Paper][PyTorch]
    • PSTR: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (Tianjin University). [Paper][PyTorch]
  • Manipulation Detection:
    • ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
  • Grounded Situation Recognition:
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
  • Mirror Detection:
    • SATNet: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]

[Back to Overview]

Segmentation

Semantic Segmentation

  • SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
  • TrSeg: "TrSeg: Transformer for Semantic Segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
  • CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
  • UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
  • SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
  • OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environment", arXiv, 2021 (IISER, India). [Paper]
  • MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", arXiv, 2021 (UIUC + Facebook). [Paper][Website]
  • TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
  • Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Using Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
  • VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
  • SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
  • TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
  • HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
  • GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
  • SegDeformer: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (Shanghai Jiao Tong + Huawei). [Paper][PyTorch]
  • SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper]
  • RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
  • Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
  • DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
  • FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
  • StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • TSG: "Transformer Scale Gate for Semantic Segmentation", arXiv, 2022 (Monash University, Australia). [Paper]
  • HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
  • HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
  • SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]

[Back to Overview]

Depth Estimation

  • DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch]
  • TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
  • ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
  • MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
  • GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
  • SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
  • DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
  • MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
  • Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
  • DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • SideRT: "SideRT: A Real-Time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
  • MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
  • Depthformer: "Depthformer: Multiscale Vision Transformer for Monocular Depth Estimation with Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
  • TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]

[Back to Overview]

Object Segmentation

  • SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
  • Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • Trans2Seg: "Segmenting Transparent Objects in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
  • SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
  • CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]

[Back to Overview]

Other Segmentation Tasks

  • Vision-Language:
  • Multi-Modal:
    • UCTNet: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (Lehigh University, Pennsylvania). [Paper]
    • CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • Panoptic Segmentation:
    • MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
    • SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
    • VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
    • Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
    • Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • Instance Segmentation:
    • ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
    • Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
    • BoundaryFormer: "Instance Segmentation with Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
    • PPT: "Parallel Pretrained Transformers (PPT) for Synthetic Data Based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
    • OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas, Arkansas). [Paper][Code (in construction)]
    • TOIST: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
  • Optical Flow:
  • Panoramic Semantic Segmentation:
    • Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • X-shot:
    • CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
    • CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
    • VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
    • DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
    • AAFormer: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (USTC). [Paper]
    • IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
    • TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
    • MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
  • X-supervised:
    • MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)]
    • AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
    • ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
    • SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
    • ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][Tensorflow]
    • TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • TransCAM: "TransCAM: Transformer Attention-Based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
    • WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
    • MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
    • eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
    • TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
  • Cross-Domain:
    • DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
  • Crack Detection:
    • CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • Camouflaged Object Detection:
    • UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
    • COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
  • Background Separation:
    • TransBlast: "TransBlast: Self-Supervised Learning using Augmented Subspace with Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
  • Scene Understanding:
    • BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
    • Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
    • InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
  • 3D Segmentation:
    • Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
    • M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
  • Multi-Task:
    • MTFormer: "MTFormer: Multi-Task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (CUHK). [Paper]
    • MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
  • Forecasting:
  • LiDAR:
  • Co-Segmentation:
  • Top-Down Semantic Segmentation:
    • Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
  • Open-World/Vocabulary:
    • ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
    • OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", arXiv, 2022 (Meta). [Paper][Website]
  • Applications:
    • FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]

[Back to Overview]

Video (High-level)

Action Recognition

  • RGB mainly
    • Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
    • ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
    • TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)]
    • MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
    • ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
    • VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
    • TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
    • Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
    • X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
    • SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
    • RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
    • STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
    • GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
    • TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
    • VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
    • UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
    • Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
    • DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
    • MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
    • MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
    • RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
    • SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
    • MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
    • MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
    • TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
    • TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
    • MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
    • AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
    • SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", arXiv, 2022 (Tel Aviv). [Paper][Website]
    • VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
    • Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
    • MAM2: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
    • ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
  • Depth:
    • Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjin University). [Paper]
  • Pose:
    • ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
    • AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
    • GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
    • GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
    • FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
    • STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
    • ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
    • STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
    • STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
  • Multi-modal:
    • MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
    • MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
    • MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
    • M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
    • VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
    • Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
    • MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
    • MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
    • MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", arXiv, 2022 (Peking University). [Paper][Code (in construction)][Website]
    • HIT: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (NTHU). [Paper][Code (in construction)]
  • Group Activity:
    • GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (Sensetime). [Paper]
    • ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]

[Back to Overview]

Action Detection/Localization

  • OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
  • ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
  • TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
  • Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
  • MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
  • UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
  • DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
  • ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
  • EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
  • TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
  • ?: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch]
  • CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
  • Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
  • LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
  • HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
  • AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper][PyTorch]
  • CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]

[Back to Overview]

Action Prediction/Anticipation

  • AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
  • ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
  • FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
  • TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", arXiv, 2022 (CAS). [Paper]
  • VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", arXiv, 2022 (Amazon). [Paper]
  • AFFT: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]

[Back to Overview]

Video Object Segmentation

  • GC: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (Tencent). [Paper]
  • SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
  • JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
  • AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (University of Technology Sydney). [Paper][PyTorch (yoxu515)][Code (in construction)]
  • TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
  • SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
  • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
  • HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
  • BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
  • AOT: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (University of Technology Sydney). [Paper][Code (in construction)]

[Back to Overview]

Video Instance Segmentation

  • VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
  • IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
  • Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
  • TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
  • VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
  • SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • VITA: "VITA: Video Instance Segmentation via Object Token Association", arXiv, 2022 (Yonsei University). [Paper][Code (in construction)]
  • IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
  • DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
  • MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", arXiv, 2022 (NVIDIA). [Paper][PyTorch]
  • InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]

[Back to Overview]

Other Video Tasks

  • Action Segmentation
    • ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
    • Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • UVAST: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (Bosch). [Paper][PyTorch]
    • ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
    • CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
    • EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
    • SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
  • Video X Segmentation:
    • STT: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (Shanghai Jiao Tong). [Paper]
    • CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
    • TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
    • MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
    • PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
    • ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
  • Video Object Detection:
    • TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
    • MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
    • PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
    • ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
  • Dense Video Tasks (Detection + Segmentation):
    • TDViT: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (Queen's University Belfast, UK). [Paper][Code (in construction)]
  • Video Retrieval
    • SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
  • Video Hashing
    • BTH: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (Tsinghua). [Paper][PyTorch]
  • Video-Language:
    • ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
    • X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
    • STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
    • MovieCLIP: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (USC). [Paper][Website]
  • X-supervised Learning:
    • LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
    • SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
    • BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
    • VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", CVPRW, 2022 (Tencent). [Paper][Code (in construction)]
    • VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
    • ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
    • MAE: "Masked Autoencoders As Spatiotemporal Learners", arXiv, 2022 (Meta). [Paper]
    • OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", arXiv, 2022 (Meta). [Paper][PyTorch]
    • MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", arXiv, 2022 (Stanford). [Paper][Website]
    • ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
  • X-shot:
    • ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
    • ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
    • REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
  • Anomaly Detection:
    • CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
    • ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
    • SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Relation Detection:
    • VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
    • VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
    • VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • Saliency Prediction:
    • STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
    • UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
  • Video Inpainting Detection:
    • FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
  • Driver Activity:
    • TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
    • ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
  • Video Alignment:
    • DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
  • Sport-related:
    • Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
  • Action Counting:
    • TransRAC: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (ShanghaiTech). [Paper][PyTorch][Website]
  • Action Quality Assessment:
    • ?: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (Baidu). [Paper]
    • ?: "Action Quality Assessment using Transformers", arXiv, 2022 (USC). [Paper]
  • Human Interaction:
    • IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
  • Domain Adaptation:
    • UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
  • Multi-Camera Editing:
    • TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]

[Back to Overview]

Multi-Modality

Visual Captioning

  • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
  • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
  • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
  • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
  • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
  • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
  • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
  • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
  • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
  • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
  • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
  • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
  • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
  • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
  • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
  • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
  • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
  • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
  • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
  • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
  • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-based Multimedia Indexing (CMBI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
  • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
  • GRIT: "GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
  • ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
  • UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
  • TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
  • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
  • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
  • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
  • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
  • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
  • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]

[Back to Overview]

Visual Question Answering

  • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
  • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
  • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
  • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
  • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
  • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
  • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
  • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
  • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
  • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
  • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
  • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
  • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper][PyTorch]
  • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
  • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
  • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
  • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
  • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
  • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
  • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
  • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
  • WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (University of Michigan). [Paper][Website]
  • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
  • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
  • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
  • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
  • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
  • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
  • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
  • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
  • EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
  • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
  • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
  • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
  • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
  • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
  • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", arXiv, 2022 (Microsoft). [Paper]
  • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
  • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
  • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
  • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
  • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
  • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
  • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]

[Back to Overview]

Visual Grounding

  • General:
    • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
    • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
    • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
    • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
    • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
    • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
    • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
    • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
    • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
    • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
    • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
    • UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
    • TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
    • ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
    • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
    • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", arXiv, 2022 (Microsoft). [Paper]
    • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
    • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
    • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
  • Video:
    • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
    • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
    • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
    • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
    • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
    • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
    • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
    • ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]

[Back to Overview]

Multi-Modal Representation Learning

  • General:
    • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
    • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
    • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
    • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
    • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
    • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
    • CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
    • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
    • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
    • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
    • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
    • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
    • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
    • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
    • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
    • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
    • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
    • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
    • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
    • Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
    • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
    • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
    • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
    • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", arXiv, 2022 (Microsoft). [Paper][PyTorch (in construction)]
    • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", arXiv, 2022 (Meta). [Paper][PyTorch]
    • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
    • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", arXiv, 2022 (DeepMind). [Paper]
    • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", arXiv, 2022 (Tencent). [Paper]
    • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", arXiv, 2022 (Google). [Paper]
    • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", arXiv, 2022 (Google). [Paper]
    • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", arXiv, 2022 (Microsoft). [Paper]
    • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", arXiv, 2022 (UCLA). [Paper]
    • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
    • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", arXiv, 2022 (SenseTime). [Paper]
    • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • DaVinci: "Prefix Language Models are Unified Modal Learners", arXiv, 2022 (ByteDance). [Paper]
    • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
    • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
    • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
    • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", arXiv, 2022 (Amazon). [Paper]
    • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", arXiv, 2022 (Huawei). [Paper]
    • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
    • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
    • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
    • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
    • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
    • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", arXiv, 2022 (Google). [Paper]
    • ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
    • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", arXiv, 2022 (Google). [Paper]
    • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
    • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", arXiv, 2022 (Tsinghua + Waseda). [Paper][PyTorch]
    • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
    • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
    • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (Bytedance). [Paper][PyTorch (in construction)]
    • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", arXiv, 2022 (Microsoft). [Paper]
  • Video:
    • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
    • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
    • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper]
    • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
    • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
    • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
    • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
    • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
    • ?: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper]
    • MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
    • LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
    • EgoVLP: "Egocentric Video-Language Pretraining", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
    • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
    • ?: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", arXiv, 2022 (Microsoft). [Paper]
    • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
    • ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • WebVid-2M: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper]
    • VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
    • LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • BridgeFormer: "BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions", arXiv, 2022 (HKU). [Paper][Website]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]

[Back to Overview]

Multi-Modal Generation

  • General:
    • AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
    • ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
    • TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", arXiv, 2022 (Google). [Paper][Website]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • ?: "Prompt-to-Prompt Image Editing with Cross Attention Control", arXiv, 2022 (Google). [Paper]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
    • Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
    • UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
    • VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
  • Video:
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", arXiv, 2022 (Tsinghua University). [Paper][GitHub (in construction)]
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", arXiv, 2022 (Meta). [Paper]
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
    • ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
  • OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]

[Back to Overview]

Scene Graph

  • BGT-Net: "BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation", CVPRW, 2021 (ETHZ). [Paper]
  • STTran: "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", ICCV, 2021 (Leibniz University Hannover, Germany). [Paper][PyTorch]
  • SGG-NLS: "Learning to Generate Scene Graph from Natural Language Supervision", ICCV, 2021 (University of Wisconsin-Madison). [Paper][PyTorch]
  • SGG-Seq2Seq: "Context-Aware Scene Graph Generation With Seq2Seq Transformers", ICCV, 2021 (Layer 6 AI, Canada). [Paper][PyTorch]
  • RELAX: "Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs", BMVC, 2021 (Samsung). [Paper]
  • Relation-Transformer: "Scenes and Surroundings: Scene Graph Generation using Relation Transformer", arXiv, 2021 (LMU Munich). [Paper]
  • SGTR: "SGTR: End-to-end Scene Graph Generation with Transformer", CVPR, 2022 (ShanghaiTech). [Paper][Code (in construction)]
  • GCL: "Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation", CVPR, 2022 (Shandong University). [Paper][PyTorch]
  • Relationformer: "Relationformer: A Unified Framework for Image-to-Graph Generation", ECCV, 2022 (TUM). [Paper][Code (in construction)]
  • SVRP: "Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning", ECCV, 2022 (Monash University). [Paper]
  • RelTR: "RelTR: Relation Transformer for Scene Graph Generation", arXiv, 2022 (Leibniz University Hannover, Germany). [Paper][PyTorch]

[Back to Overview]

Other Multi-Modal Tasks

  • Prompt Learning:
    • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
    • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
    • PerVL: ""This is my unicorn, Fluffy": Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
    • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
    • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", arXiv, 2022 (Samsung). [Paper]
    • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", arXiv, 2022 (CMU). [Paper]
    • VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
    • MaPLe: "MaPLe: Multi-modal Prompt Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
    • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
    • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • ?: "Visual Classification via Description from Large Language Models", arXiv, 2022 (Columbia). [Paper]
    • CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
    • PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
  • X-Shot:
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", arXiv, 2022 (Google). [Paper]
  • Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", arXiv, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", arXiv, 2022 (Google). [Paper]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", arXiv, 2022 (AI2). [Paper][Website]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • Sign Language Translation:
    • LWTA: "Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation", ICCV, 2021 (Cyprus University of Technology). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
  • Voice Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
  • Language-guided Video Segmentation:
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", arXiv, 2022 (Zhejiang). [Paper][PyTorch]
  • Audio-Visual:
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • ?: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper]
    • AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
    • AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: A MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Named Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]

[Back to Overview]


Citation

If you find this repository helpful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

References