Abstract:
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks in computer vision, demonstrating superior performance across numerous tasks due to their ability to capture global relationships through the self-attention mechanism. Unlike previously established methods that focus primarily on local features, ViTs process images by analysing relationships between patches, allowing for flexible architectures that excel at modelling global dependencies. Despite their success, the self-attention mechanism in ViTs is computationally expensive, its cost scaling quadratically with the number of patches, which poses challenges in resource-constrained environments or real-time applications where efficiency is paramount. This limitation has motivated the exploration of approaches based on quantum computing that
preserve the global modelling capabilities of ViTs while reducing the computational overhead of the self-attention mechanism. While some approaches aim to replicate the self-attention mechanism using trainable compound matrices, this thesis proposes a Quantum Vision Transformer (QViT) architecture native to the quantum paradigm that leverages the Quantum Singular Value Transformation (QSVT) to approximate the self-attention mechanism. The proposed model integrates parameterized quantum circuits (PQCs) to encode patch-wise image embeddings, employs a Linear Combination of Unitaries (LCU) to mix patch representations in an attention-like manner, and applies the QSVT to introduce non-linear expressivity. In addition, a quantum classification circuit extends the data register with trainable class qubits, which serve as a quantum analogue to the classical class token and are measured to obtain the final outputs. To assess the model's capability in image classification tasks, it is evaluated on the Bars-and-Stripes and binary MNIST datasets, where it achieves up to ~99% accuracy. An analysis of the model's computational complexity shows improved theoretical scaling with input size, as well as a lower parameter count. The results obtained in this thesis serve as proof-of-concept for the proposed QViT model and its application to computer vision tasks.
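To illustrate the LCU idea mentioned above, the following is a minimal classical sketch (not the thesis implementation): an LCU applies the operator A = Σ_k α_k U_k to a quantum state. The Pauli operators and weights here are purely illustrative; on hardware, LCU is realized with ancilla "prepare" and controlled "select" subcircuits followed by post-selection, rather than by forming A explicitly.

```python
import numpy as np

# Hypothetical single-qubit unitaries used only for illustration.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)   # Pauli-X
Z = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli-Z

alphas = np.array([0.5, 0.3, 0.2])              # illustrative mixing weights
unitaries = [I, X, Z]

# The LCU operator A = sum_k alpha_k * U_k; in general A is not unitary.
A = sum(a * U for a, U in zip(alphas, unitaries))

psi = np.array([1.0, 0.0], dtype=complex)       # input state |0>
out = A @ psi                                   # unnormalized result A|psi>

# On a quantum device this outcome is obtained via post-selection on the
# ancilla register; the success probability is ||A|psi>||^2 / (sum_k alpha_k)^2.
success_prob = np.linalg.norm(out) ** 2 / np.sum(alphas) ** 2
out_normalized = out / np.linalg.norm(out)      # state after successful post-selection
```

In the proposed QViT, an analogous weighted combination mixes patch embeddings in an attention-like manner before the QSVT introduces non-linearity.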
Author:
Joel Furtak
Advisors:
Jonas Stein, Michael Kölle, Claudia Linnhoff-Popien
Student Thesis | Published November 2025 | Copyright © QAR-Lab
Please direct inquiries about this work to the advisors.