{"id":16604,"date":"2024-11-12T12:20:52","date_gmt":"2024-11-12T17:20:52","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=16604"},"modified":"2024-12-02T18:13:33","modified_gmt":"2024-12-02T23:13:33","slug":"how-to-design-transformer-model-for-time-series-forecasting","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/11\/12\/how-to-design-transformer-model-for-time-series-forecasting\/","title":{"rendered":"How to Design Transformer Model for Time-Series Forecasting"},"content":{"rendered":"<h6><\/h6>\r\nIn <a href=\"https:\/\/blogs.mathworks.com\/deep-learning\/?p=16469&amp;draftsforfriends=AUzYbIxyODBFaCLNelUrn5RfHzkfYDj9\">this previous blog post<\/a>, we explored the key aspects and benefits of transformer models, described how you can use pretrained models with MATLAB, and promised a blog post that shows you how to design transformers from scratch using built-in deep learning layers. In this blog post, I am going to provide you the code you need to design a transformer model for time-series forecasting.\r\n<h6><\/h6>\r\nThe originally proposed architecture for transformers (Figure 1) includes encoder and decoder blocks. Since then, encoder-only (like the BERT model) and decoder-only (like GPT models) have been implemented. 
In this post, I will show you how to design a transformer model for time-series forecasting using only decoder blocks.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16607 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/architecture_adapted.png\" alt=\"Simplified encoder-decoder architecture of originally proposed transformer model\" width=\"413\" height=\"495\" \/>\r\n<h6><\/h6>\r\n<strong>Figure 1:<\/strong> The original encoder-decoder architecture of a transformer model (adapted from <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener\">Vaswani et al., 2017<\/a>)\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Decoder-Only Transformer Architecture<\/strong><\/p>\r\nThe architecture of the transformer model that we are designing is shown in Figure 2. The model includes two decoder blocks that use masked multi-head attention. In decoder-only transformers, masked self-attention ensures that the model can only access previous tokens in the input sequence. In encoder-only transformers, by contrast, self-attention attends to all tokens in the input sequence.\r\n<h6><\/h6>\r\nBy applying a mask over future positions in the input sequence, the model preserves the causality constraint necessary for tasks like text generation and time-series forecasting, where each output token must be generated in a left-to-right manner. 
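\r\n<h6><\/h6>\r\nTo convince yourself that the causal mask really enforces this constraint, you can run a minimal experiment: build a network containing a single causal self-attention layer, perturb only the last time step of an input sequence, and check that the outputs at earlier time steps do not change. The head count and key-channel size below are arbitrary, chosen just for illustration.\r\n<pre>layers = [\r\n    sequenceInputLayer(1)\r\n    selfAttentionLayer(4,64,AttentionMask=\"causal\")];\r\nmaskNet = dlnetwork(layers);\r\n\r\nx = dlarray(randn(1,12),\"CT\");   % one channel, 12 time steps\r\ny1 = predict(maskNet,x);\r\n\r\nx(1,12) = x(1,12) + 5;           % perturb only the final time step\r\ny2 = predict(maskNet,x);\r\n\r\n% should be (numerically) zero: steps 1-11 never attend to step 12\r\nmax(abs(extractdata(y1(:,1:11) - y2(:,1:11))))\r\n<\/pre>\r\n<h6><\/h6>\r\n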
Without masked self-attention, the model could access information from future tokens, which would violate the sequential nature of generation and introduce unintended data leakage into the forecast.\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16610 \" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/decoder_architecture.png\" alt=\"Architecture of transformer model with two decoders for time-series forecasting\" width=\"770\" height=\"722\" \/>\r\n<h6><\/h6>\r\n<strong>Figure 2: <\/strong>Architecture of the decoder-only transformer model that we are designing\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Decoder-Only Transformer Design<\/strong><\/p>\r\nHere, I am going to provide you with the MATLAB code to design, train, and analyze a decoder-only transformer architecture.\r\n<h6><\/h6>\r\nDefine the layers of the transformer network.\r\n<pre>numFeatures = 1;\r\nnumHeads = 4; \r\nnumKeyChannels = 256; \r\nfeedforwardHiddenSize = 512; \r\nmodelHiddenSize = 256; \r\nmaxSequenceLength = 120;\r\n\r\ndecoderLayers = [ \r\n    sequenceInputLayer(numFeatures,Name=\"in\")\r\n    fullyConnectedLayer(modelHiddenSize,Name=\"embedding\")\r\n    positionEmbeddingLayer(modelHiddenSize,maxSequenceLength,Name=\"position_embed\") \r\n    additionLayer(2,Name=\"embed_add\") \r\n    layerNormalizationLayer(Name=\"embed_norm\") \r\n    selfAttentionLayer(numHeads,numKeyChannels,AttentionMask=\"causal\") \r\n    additionLayer(2,Name=\"attention_add\") \r\n    layerNormalizationLayer(Name=\"attention_norm\") \r\n    fullyConnectedLayer(feedforwardHiddenSize) \r\n    geluLayer \r\n    fullyConnectedLayer(modelHiddenSize) \r\n    additionLayer(2,Name=\"feedforward_add\") \r\n    layerNormalizationLayer(Name=\"decoder1_norm\") \r\n    selfAttentionLayer(numHeads,numKeyChannels,AttentionMask=\"causal\") \r\n    additionLayer(2,Name=\"attention2_add\") \r\n    
layerNormalizationLayer(Name=\"attention2_norm\") \r\n    fullyConnectedLayer(feedforwardHiddenSize) \r\n    geluLayer \r\n    fullyConnectedLayer(modelHiddenSize) \r\n    additionLayer(2,Name=\"feedforward2_add\") \r\n    layerNormalizationLayer(Name=\"decoder2_norm\") \r\n    fullyConnectedLayer(numFeatures,Name=\"head\")];\r\n<\/pre>\r\n<h6><\/h6>\r\nConvert the layer array to a dlnetwork object.\r\n<pre>net = dlnetwork(decoderLayers,Initialize=false);\r\n<\/pre>\r\n<h6><\/h6>\r\nConnect the layers in the network.\r\n<pre>net = connectLayers(net,\"embedding\",\"embed_add\/in2\");\r\nnet = connectLayers(net,\"embed_norm\",\"attention_add\/in2\");\r\nnet = connectLayers(net,\"attention_norm\",\"feedforward_add\/in2\");\r\nnet = connectLayers(net,\"decoder1_norm\",\"attention2_add\/in2\");\r\nnet = connectLayers(net,\"attention2_norm\",\"feedforward2_add\/in2\");\r\n<\/pre>\r\n<h6><\/h6>\r\nInitialize the learnable and state parameters of the network.\r\n<pre>net = initialize(net);\r\n<\/pre>\r\n<h6><\/h6>\r\nVisualize and understand the architecture of the transformer network.\r\n<pre>analyzeNetwork(net)\r\n<\/pre>\r\n<h6><\/h6>\r\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16616 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/decoder_analyzed.png\" alt=\"Screenshot of transformer's architecture and layers created by MATLAB network analyzer\" width=\"1315\" height=\"1220\" \/>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6><\/h6>\r\n<p style=\"font-size: 20px; color: #c04c0b;\"><strong>Conclusion<\/strong><\/p>\r\nMany pretrained transformer models exist for natural language processing and computer vision tasks. In fact, such pretrained models are available for you in MATLAB (see <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/bert.html\">BERT<\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/vision\/ref\/visiontransformer.html\">ViT<\/a>). 
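\r\n<h6><\/h6>\r\nIf you want to train the network defined above on your own data before turning to pretrained options, a minimal sketch looks like the following. Here I assume y is a numeric row vector holding your historical series; the windowing scheme, loss, and hyperparameters are illustrative, not tuned. Each target sequence is the input window shifted one step ahead, which is what makes the causal model a next-step forecaster.\r\n<pre>% Slide a window over the series; the target is the input shifted one step ahead\r\nX = {}; T = {};\r\nfor i = 1:numel(y)-maxSequenceLength\r\n    X{end+1,1} = y(i:i+maxSequenceLength-1);\r\n    T{end+1,1} = y(i+1:i+maxSequenceLength);\r\nend\r\n\r\noptions = trainingOptions(\"adam\", ...\r\n    MaxEpochs=50, ...\r\n    MiniBatchSize=32, ...\r\n    Shuffle=\"every-epoch\", ...\r\n    Verbose=false);\r\n\r\nnet = trainnet(X,T,net,\"mse\",options);\r\n\r\n% One-step forecast from the most recent observed window\r\nx = dlarray(y(end-maxSequenceLength+1:end),\"CT\");\r\nz = predict(net,x);\r\nyNext = extractdata(z(:,end));\r\n<\/pre>\r\n<h6><\/h6>\r\n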
However, time-series forecasting is a newer application of transformers, and few pretrained models are available for it. Take advantage of the code provided in this post to build your own transformer model for time-series forecasting, or adapt it to your task, and comment below to share your results.\r\n<h6><\/h6>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2024\/10\/decoder_architecture.png\" class=\"img-responsive attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/><\/div><p>\r\nIn this previous blog post, we explored the key aspects and benefits of transformer models, described how you can use pretrained models with MATLAB, and promised a blog post that shows you how to... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2024\/11\/12\/how-to-design-transformer-model-for-time-series-forecasting\/\">read more >><\/a><\/p>","protected":false},"author":194,"featured_media":16610,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[36,9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16604"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/194"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=16604"}],"version-history":[{"count":11,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16604\/revisions"}],"predecessor-version":[{"id":16668,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/16604\/revisions\/
16668"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media\/16610"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=16604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=16604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=16604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}