{"id":16469,"date":"2024-10-31T08:43:17","date_gmt":"2024-10-31T12:43:17","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=16469"},"modified":"2024-11-13T07:23:30","modified_gmt":"2024-11-13T12:23:30","slug":"transformer-models-from-hype-to-implementation","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/10\/31\/transformer-models-from-hype-to-implementation\/","title":{"rendered":"Transformer Models: From Hype to Implementation"},"content":{"rendered":"<h6><\/h6>\r\nIn the world of deep learning, transformer models have generated a significant amount of buzz. They have dramatically improved performance across many <a href=\"https:\/\/www.mathworks.com\/discovery\/artificial-intelligence.html\">AI<\/a> applications, from natural language processing (NLP) to computer vision, and have set new benchmarks for tasks like translation, summarization, and even image classification. But what lies beyond the hype? Are they simply the latest trend in AI, or do they offer tangible benefits over previous architectures, like LSTM networks?\r\n<h6><\/h6>\r\nIn this post, we will explore the key aspects of transformer models, why you should consider using transformers for your AI projects, and how to use transformer models with MATLAB.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>The Basics of Transformer Models<\/strong><\/p>\r\nTransformer models are a special class of <a href=\"https:\/\/www.mathworks.com\/discovery\/deep-learning.html\">deep learning<\/a> models, which were introduced in the 2017 paper: <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\">Attention Is All You Need<\/a>. 
At their core, transformer models are designed to process sequential data, such as language or time series data, more efficiently than previous models like <a href=\"https:\/\/www.mathworks.com\/discovery\/rnn.html\">recurrent neural networks<\/a> (RNNs) and <a href=\"https:\/\/www.mathworks.com\/discovery\/lstm.html\">long short-term memory<\/a> (LSTM) networks.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16475 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/ml_models.png\" alt=\"Transformer models (e.g. BERT model) are a sub-category of deep learning models (e.g. LSTM), which are a sub-category of machine learning models (e.g. linear regression)\" width=\"669\" height=\"431\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> Hierarchy of <a href=\"https:\/\/www.mathworks.com\/discovery\/machine-learning-models.html\">machine learning models<\/a> down to transformer models. Examples are shown for each category.\r\n<h6><\/h6>\r\nThe key innovation behind transformers is the\u00a0<strong>self-attention mechanism<\/strong>, which enables the model to focus on different parts of an input sequence simultaneously, regardless of their position in the sequence. Unlike RNNs, which process data step-by-step, transformers process inputs in parallel. This means the transformer model can capture relationships across the entire input sequence simultaneously, making it significantly faster and more scalable for large datasets.\r\n<h6><\/h6>\r\nTransformers have generated a lot of hype not just for their performance, but also for their flexibility. Transformer models, unlike LSTMs, don\u2019t rely on sequence order when processing inputs. 
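To make this concrete, the entire self-attention computation is just a few matrix operations, which is why it parallelizes so well. Below is a minimal NumPy sketch (illustrative Python, not MATLAB code from this post; the sizes and weight matrices are invented for the example). Note that the output is permutation-equivariant: shuffling the input tokens simply shuffles the outputs, so attention by itself carries no notion of word order.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: each row sums to 1
    return weights @ V                              # every token attends to every token

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                     # 5 tokens, embedding dimension 8
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # (5, 4): one output vector per token
```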
Instead, they use <strong>positional encoding<\/strong>\u00a0to add information about the position of each token, making them better suited for handling tasks that require capturing both local and global relationships within a sequence.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>Key Components of Transformer Architecture<\/strong><\/p>\r\nTo better understand transformers, let\u2019s take a look at their main building blocks:\r\n<h6><\/h6>\r\n<ol>\r\n \t<li><strong>Positional Encoding<\/strong>:\r\nSince transformers process data in parallel, they need a way to understand the order of tokens in a sequence. Positional encoding injects information about the token's position into the input, allowing the model to maintain an understanding of sequence structure, even though it's processed in parallel.<\/li>\r\n \t<li><strong>Encoder-Decoder Framework<\/strong>:\r\nThe original transformer model is based on an encoder-decoder structure. The encoder takes an input sequence, processes it through multiple layers, and creates an internal representation. The decoder, in turn, uses this representation to generate an output sequence, which could be a translation, classification, or another type of prediction.<\/li>\r\n \t<li><strong>Multi-Head Attention Mechanism<\/strong>:\r\nSelf-attention allows the model to focus on relevant parts of the sequence. Multi-head attention runs several attention operations in parallel, allowing the model to learn different aspects of the sequence at once. Each head can focus on different parts of the input, giving the transformer more flexibility and accuracy.<\/li>\r\n \t<li><strong>Feed-Forward Layers<\/strong>:\r\nAfter the attention layers, each token passes through a fully connected feed-forward neural network. 
These layers help the model refine its understanding of each token's relationship within the sequence.<\/li>\r\n<\/ol>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16478 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/transformer_model_original.png\" alt=\"Originally proposed architecture of a transformer model showing inputs, outputs, and processing blocks\" width=\"439\" height=\"646\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> The transformer model architecture as originally presented in <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\">Vaswani et al., 2017<\/a>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>Variants of Transformer Architecture<\/strong><\/p>\r\n\r\n<ul>\r\n \t<li><strong>Encoder-Decoder Framework:<\/strong> Since the original encoder-decoder framework was proposed, encoder-only and decoder-only variants have also been implemented.\r\n<ol>\r\n \t<li>The encoder-decoder framework is mostly used for machine translation tasks and, to some extent, for object detection (e.g.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2005.12872\" target=\"_blank\" rel=\"noopener\">Detection Transformer<\/a>) and image segmentation (e.g.,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2304.02643\" target=\"_blank\" rel=\"noopener\">Segment Anything Model<\/a>).<\/li>\r\n \t<li>The encoder-only framework is used in models like BERT and its variants, which are mostly used for classification and question-answering tasks as well as embedding models.<\/li>\r\n \t<li>The decoder-only framework is used in models like GPT and LLaMA, which are mostly used for text generation, summarization, and chat.<\/li>\r\n<\/ol>\r\n<\/li>\r\n \t<li><strong>Multi-Head Attention Mechanism:<\/strong>\u00a0There are two variants of the multi-head attention mechanism: self-attention and cross-attention. 
Self-attention allows the transformer model to focus on relevant parts of <em>the same <\/em>sequence. The self-attention mechanism is present in encoder-decoder, encoder-only, and decoder-only frameworks.\r\n<h6><\/h6>\r\nCross-attention, on the other hand, allows the transformer model to focus on relevant parts of a different sequence. One sequence provides the queries (for example, the sentence being generated in the target language of a translation task) and attends to another sequence, which provides the keys and values (for example, the source sentence being translated). This mechanism is only found in the encoder-decoder framework.<\/li>\r\n<\/ul>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>Benefits of Transformer Models<\/strong><\/p>\r\nTransformer models represent a major shift in how sequence data is handled compared to previous architectures. They can handle long-range dependencies and large datasets. Their key benefits are:\r\n<h6><\/h6>\r\n<ol>\r\n \t<li><strong>Parallel Processing:<\/strong> One of the primary reasons for the adoption of transformer models is their ability to process data in parallel. Unlike LSTMs, which must handle inputs step by step, transformers analyze the entire sequence at once using the self-attention mechanism. This parallelism allows for faster training, particularly with large datasets, and it significantly improves the model\u2019s ability to capture dependencies across distant parts of the sequence.<\/li>\r\n \t<li><strong>Handling Long-Range Dependencies: <\/strong>Traditional sequence models like LSTMs struggle to retain information from earlier parts of long sequences. While they use mechanisms like gates to help manage this memory, the effect diminishes as the sequence grows longer. Transformers, on the other hand, leverage the self-attention mechanism, allowing the model to weigh the importance of each token in a sequence regardless of its position. 
This makes them particularly effective for tasks requiring an understanding of long-term dependencies, such as document summarization or text generation.<\/li>\r\n \t<li><strong>Scalability: <\/strong>The parallel nature of transformers makes them highly efficient on modern hardware, such as GPUs, which are designed to handle large-scale matrix operations. This enables transformers to scale well with increasing data sizes, a feature that is critical when working on real-world applications involving large amounts of text, audio, or visual data.<\/li>\r\n \t<li><strong>Versatility Across Domains: <\/strong>Although initially designed for NLP tasks, transformers have demonstrated their adaptability across different domains. From vision transformers, which apply the transformer architecture to image data, to applications in time-series forecasting and even biomedical data analysis, the flexible design of transformers has proven successful in various fields.<\/li>\r\n \t<li><strong>Pretrained Models: <\/strong>The availability of pre-trained transformer models, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained\u00a0Transformer), and ViTs (vision transformers), means that you can use these models right away without needing to build and train them from scratch. 
Pre-trained models allow you to fine-tune them on your specific dataset, which saves both time and computational resources.<\/li>\r\n<\/ol>\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16481 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/bert_model.png\" alt=\"Architecture of BERT transformer model\" width=\"311\" height=\"357\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> BERT model architecture, which only uses the encoder part of the originally proposed transformer.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>When to Choose LSTMs<\/strong><\/p>\r\nLSTMs still have their place in certain applications and tasks. While transformers are powerful, they come with higher computational costs. LSTMs are often a good choice for tasks involving short sequences, such as time-series forecasting with limited data, where their simpler structure and lower computational requirements can be advantageous. For example, if you are training a model for an <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/09\/04\/embedded-ai-integration-with-matlab-and-simulink\/\">Embedded AI<\/a> application, an LSTM is a good option. Also, you might choose to design an LSTM instead of a transformer for applications where pretrained models are not available.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>Applications of Transformer Models<\/strong><\/p>\r\nTransformers have proven their versatility across multiple domains in NLP and beyond.\r\n<h6><\/h6>\r\n<ul>\r\n \t<li><a href=\"https:\/\/www.mathworks.com\/discovery\/natural-language-processing.html\"><strong>Natural Language Processing<\/strong><\/a><strong> (NLP)<\/strong>: From machine translation to text summarization and even chatbots, transformer-based models like BERT and GPT have set new standards in performance. 
Their ability to process long sequences and capture context has made them the go-to architecture for most NLP tasks. One of the most significant outcomes of transformer models is the development of Large Language Models (LLMs), such as GPT and LLaMA, which are built on the transformer architecture.<\/li>\r\n \t<li><a href=\"https:\/\/www.mathworks.com\/discovery\/computer-vision.html\"><strong>Computer Vision<\/strong><\/a>: With the introduction of ViTs, the transformer architecture has begun to outperform <a href=\"https:\/\/www.mathworks.com\/discovery\/convolutional-neural-network.html\">convolutional neural networks<\/a> (CNNs) in image classification tasks, especially on large-scale datasets.<\/li>\r\n \t<li><a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ug\/time-series-forecasting-using-deep-learning.html\"><strong>Time-Series Forecasting<\/strong><\/a>: While LSTMs have traditionally been used for time-series data, transformers are increasingly being applied to these tasks due to their ability to handle longer sequences and capture complex patterns.<\/li>\r\n<\/ul>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16484 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/transformers_applications.png\" alt=\"Applications of transformer models include NLP, computer vision, and time-series forecasting\" width=\"557\" height=\"356\" \/>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>Transformer Models and GenAI<\/strong><\/p>\r\nTransformer-based architectures, such as BERT and GPT, have become the foundation for state-of-the-art NLP systems, enabling breakthroughs in the ability to understand and generate human language with unprecedented accuracy. 
BERT focuses on understanding language through bidirectional training, making it highly effective for tasks such as question answering and <a href=\"https:\/\/www.mathworks.com\/discovery\/sentiment-analysis.html\">sentiment analysis<\/a>. On the other hand, GPT and other LLMs focus on generating text by predicting the next word in a sequence, allowing them to generate coherent, human-like content.\r\n<h6><\/h6>\r\nGenerative AI (GenAI) builds on this momentum, leveraging transformer models to create text, images, and even music. With the ability to fine-tune large-scale models on domain-specific datasets, transformer-driven GenAI applications are becoming increasingly sophisticated in content generation, customer service automation, software development, and many other areas.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>Transformer Models with MATLAB<\/strong><\/p>\r\n<p style=\"font-size: 18px;\"><strong>Transformers for NLP<\/strong><\/p>\r\nWith MATLAB and Text Analytics Toolbox, you can load a built-in pretrained <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/bert.html\">BERT model<\/a>. You can fine-tune this BERT model for <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/train-bert-document-classifier.html\">document classification<\/a>, <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/extract-answers-from-documents-using-BERT.html\">extractive question answering<\/a>, and other NLP tasks.\r\n<h6><\/h6>\r\nYou can also detect out-of-distribution (OOD) data, which can be an important part of the <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/04\/30\/verification-and-validation-for-ai-from-model-implementation-to-requirements-validation\/\">verification and validation of an AI model<\/a>. OOD data detection is the process of identifying inputs to a deep neural network that might yield unreliable predictions. 
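One common discriminator for this, and the one shown in the figure below, is an energy score computed from the classifier's raw logits: in-distribution inputs tend to produce confident (peaked) logits and therefore low energy, while OOD inputs produce flatter logits and higher energy. Here is an illustrative NumPy sketch (not the Text Analytics Toolbox implementation; the logits and any threshold are invented for the example):

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score, -T * logsumexp(logits / T), computed stably.

    Higher (less negative) energy suggests the input may be out-of-distribution.
    logits: (n, num_classes) raw network outputs for n inputs.
    """
    z = logits / temperature
    m = z.max(axis=-1, keepdims=True)
    return -temperature * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

confident = np.array([[9.0, 0.5, 0.2]])   # peaked logits: likely in-distribution
uncertain = np.array([[1.0, 1.1, 0.9]])   # flat logits: possibly OOD
# Flag an input as OOD when its score exceeds a threshold chosen on validation data.
```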
OOD data refers to data that is different from the data used to train the model. For example, it might be data collected in a different way, at a different time, under different conditions, or for a different task than the data on which the model was originally trained. For an example, see <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ug\/out-of-distribution-detection-for-bert-document-classifier.html\">Out-of-Distribution Detection (OOD) for BERT Document Classifier<\/a>.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16487 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/ODD_BERT.png\" alt=\"In-distribution and out-of-distribution scores for energy distribution discriminator\" width=\"558\" height=\"399\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> Detection of out-of-distribution (OOD) data for a BERT document classifier\r\n<h6><\/h6>\r\nYou can access popular LLMs, such as gpt-4, llama3, and mixtral, from MATLAB through an API or by installing the models locally. Then, you can use your preferred model to analyze and generate text. The code you need to access and interact with LLMs using MATLAB is in the\u00a0<a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/163796-large-language-models-llms-with-matlab\">Large Language Models (LLMs) with MATLAB repository<\/a>.\r\n<h6><\/h6>\r\nYou have three options for accessing LLMs. You can connect MATLAB to the OpenAI\u00ae Chat Completions API (which powers ChatGPT\u2122), Ollama\u2122 (for local LLMs), and Azure\u00ae OpenAI services. 
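Whichever of the three options you use, the request that ultimately reaches the model has the same basic chat-completions shape. The Python sketch below shows that payload structure for orientation only (the model name and prompts are placeholders; the MATLAB repository above handles this wiring for you):

```python
# Structure of a chat-completions request, as used by the OpenAI API and
# compatible services. The model name and messages are placeholder values.
payload = {
    "model": "gpt-4",   # or a local model name when using Ollama
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize self-attention in one sentence."},
    ],
    "temperature": 0.7,  # sampling temperature; lower values are more deterministic
}
```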
To learn more about these options, check out these previous blog posts: <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/01\/22\/large-language-models-with-matlab\/\">OpenAI LLMs with MATLAB<\/a> and <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/07\/09\/local-llms-with-matlab\/\">Local LLMs with MATLAB<\/a>.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16490 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/llms_with_matlab_repo.png\" alt=\"Screenshot of LLMs with MATLAB repository\" width=\"1329\" height=\"931\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> File Exchange Repository: Large Language Models (LLMs) with MATLAB\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>Transformers for Computer Vision<\/strong><\/p>\r\nWith MATLAB and Computer Vision Toolbox, you can load a built-in pretrained vision transformer (ViT), which you can <a href=\"https:\/\/www.mathworks.com\/help\/vision\/ug\/transfer-learning-using-pretrained-vit-network.html\">fine-tune for image classification<\/a> and other computer vision tasks like <a href=\"https:\/\/www.mathworks.com\/discovery\/object-detection.html\">object detection<\/a> and semantic segmentation. ViTs can also be used for image generation. You can also use the <a href=\"https:\/\/www.mathworks.com\/help\/images\/ref\/segmentanythingmodel.html\">Segment Anything Model (SAM)<\/a> for semantic segmentation of objects in an image. 
For more details, see <a href=\"https:\/\/www.mathworks.com\/help\/images\/getting-started-with-segment-anything-model.html\">Get Started with SAM for Image Segmentation<\/a>.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16493 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/TransferLearningUsingPretrainedViTNetworkExample_01.png\" alt=\"Architecture of vision transformer model\" width=\"780\" height=\"441\" \/>\r\n<h6><\/h6>\r\n<strong>Figure:<\/strong> Fine-tuning a vision transformer (ViT) model with MATLAB\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 18px;\"><strong>Design Transformer Models<\/strong><\/p>\r\nWith MATLAB and Deep Learning Toolbox, you can design a transformer model from scratch by using built-in layers, such as <a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ref\/nnet.cnn.layer.attentionlayer.html\">attentionLayer<\/a>, <a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ref\/nnet.cnn.layer.selfattentionlayer.html\">selfAttentionLayer<\/a>, and <a href=\"https:\/\/www.mathworks.com\/help\/deeplearning\/ref\/nnet.cnn.layer.positionembeddinglayer.html\">positionEmbeddingLayer<\/a>. In my next blog post, I will show you how to design a transformer model for time-series forecasting. In the meantime, you can check out <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/161016-transformer-networks-for-time-series-prediction\">this demo<\/a> on using transformers for time-series prediction in quantitative finance.\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 22px; color: #c04c0b;\"><strong>Conclusion<\/strong><\/p>\r\nThe transformer model has evolved from an academic breakthrough into an incredibly useful tool for real-world applications. 
The ability of transformers to handle long-range dependencies, process sequences in parallel, and scale to massive datasets has made them the go-to architecture for tasks in NLP, computer vision, and beyond.\r\n<h6><\/h6>\r\nThe technology continues to evolve and is accessible for you to use in MATLAB. Take advantage of these transformer capabilities to enhance your own projects. Comment below to share the outcomes that transformer models enable in your work.\r\n<h6><\/h6>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/ml_models.png\" class=\"img-responsive attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/><\/div><p>\r\nIn the world of deep learning, transformer models have generated a significant amount of buzz. They have dramatically improved performance across many AI applications, from natural language... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/10\/31\/transformer-models-from-hype-to-implementation\/\">read more 
>><\/a><\/p>","protected":false},"author":194,"featured_media":16475,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16469"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/194"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=16469"}],"version-history":[{"count":40,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16469\/revisions"}],"predecessor-version":[{"id":16653,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16469\/revisions\/16653"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media\/16475"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=16469"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=16469"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=16469"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}