How to Design a Transformer Model for Time-Series Forecasting
![Simplified encoder-decoder architecture of originally proposed transformer model](https://blogs.mathworks.com/deep-learning/files/2024/10/architecture_adapted.png)
Decoder-Only Transformer Architecture
The architecture of the transformer model we are designing is shown in Figure 2. The model includes two decoder blocks that use masked multi-head attention. In decoder-only transformers, masked self-attention ensures that the model can access only previous tokens in the input sequence, whereas encoder-only transformers use self-attention that attends to all tokens in the input sequence. By masking future positions in the input sequence, the model preserves the causality constraint required for tasks like text generation and time-series forecasting, where each output token must be generated in a left-to-right manner. Without masked self-attention, the model could access information from future tokens, which would violate the sequential nature of generation and leak future data into the forecast.

![Architecture of transformer model with two decoders for time-series forecasting](https://blogs.mathworks.com/deep-learning/files/2024/10/decoder_architecture.png)
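To make the masking concrete, here is a small numeric sketch of the idea behind the causal mask. This is my own illustration, not the internal implementation of selfAttentionLayer: scores for future positions are set to -Inf before the softmax, so each time step can attend only to itself and earlier steps.

```matlab
% Toy illustration of causal masking (not the actual selfAttentionLayer code).
seqLen = 5;
scores = randn(seqLen);                        % raw attention scores; rows are query time steps
scores(triu(true(seqLen),1)) = -Inf;           % block future (upper-triangular) positions
weights = exp(scores)./sum(exp(scores),2);     % row-wise softmax; masked entries become 0
disp(weights)                                  % lower-triangular: step t attends only to steps 1..t
```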
Decoder-Only Transformer Design
Here, I am going to provide you with MATLAB code to design, train, and analyze a decoder-only transformer architecture. Define the layers of the transformer network.

```matlab
numFeatures = 1;
numHeads = 4;
numKeyChannels = 256;
feedforwardHiddenSize = 512;
modelHiddenSize = 256;
maxSequenceLength = 120;

decoderLayers = [
    sequenceInputLayer(numFeatures,Name="in")
    fullyConnectedLayer(modelHiddenSize,Name="embedding")
    positionEmbeddingLayer(modelHiddenSize,maxSequenceLength,Name="position_embed")
    additionLayer(2,Name="embed_add")
    layerNormalizationLayer(Name="embed_norm")
    selfAttentionLayer(numHeads,numKeyChannels,AttentionMask="causal")
    additionLayer(2,Name="attention_add")
    layerNormalizationLayer(Name="attention_norm")
    fullyConnectedLayer(feedforwardHiddenSize)
    geluLayer
    fullyConnectedLayer(modelHiddenSize)
    additionLayer(2,Name="feedforward_add")
    layerNormalizationLayer(Name="decoder1_norm")
    selfAttentionLayer(numHeads,numKeyChannels,AttentionMask="causal")
    additionLayer(2,Name="attention2_add")
    layerNormalizationLayer(Name="attention2_norm")
    fullyConnectedLayer(feedforwardHiddenSize)
    geluLayer
    fullyConnectedLayer(modelHiddenSize)
    additionLayer(2,Name="feedforward2_add")
    layerNormalizationLayer(Name="decoder2_norm")
    fullyConnectedLayer(numFeatures,Name="head")];
```

Convert the layer array to a dlnetwork object.

```matlab
net = dlnetwork(decoderLayers,Initialize=false);
```

Connect the layers in the network.

```matlab
net = connectLayers(net,"embedding","embed_add/in2");
net = connectLayers(net,"embed_norm","attention_add/in2");
net = connectLayers(net,"attention_norm","feedforward_add/in2");
net = connectLayers(net,"decoder1_norm","attention2_add/in2");
net = connectLayers(net,"attention2_norm","feedforward2_add/in2");
```

Initialize the learnable and state parameters of the network.

```matlab
net = initialize(net);
```

Visualize and understand the architecture of the transformer network.

```matlab
analyzeNetwork(net)
```
![Screenshot of transformer's architecture and layers created by MATLAB network analyzer](https://blogs.mathworks.com/deep-learning/files/2024/10/decoder_analyzed.png)
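With the network defined, training comes down to next-step prediction: the targets are the input sequences shifted by one time step. The sketch below is an illustration of what that step could look like, not code from this post; the variable dataTrain and the specific option values are assumptions.

```matlab
% Hedged training sketch: assumes dataTrain is a cell array of
% numFeatures-by-numTimeSteps sequences. Targets are inputs shifted by one step.
XTrain = cellfun(@(s) s(:,1:end-1),dataTrain,UniformOutput=false);
TTrain = cellfun(@(s) s(:,2:end),dataTrain,UniformOutput=false);

options = trainingOptions("adam", ...
    MaxEpochs=100, ...
    MiniBatchSize=32, ...
    InitialLearnRate=1e-3, ...
    Shuffle="every-epoch", ...
    Plots="training-progress", ...
    Verbose=false);

net = trainnet(XTrain,TTrain,net,"mse",options);
```

Once trained, the model forecasts autoregressively: feed the observed history in, take the prediction for the last time step, append it, and repeat. A minimal closed-loop sketch, assuming history holds an observed numFeatures-by-numTimeSteps sequence (again, the variable names are illustrative assumptions):

```matlab
% Closed-loop forecasting sketch.
numPredictions = 24;
sequence = history;
forecast = zeros(numFeatures,numPredictions);
for t = 1:numPredictions
    inputWindow = sequence(:,max(1,end-maxSequenceLength+1):end);  % respect the position embedding limit
    X = dlarray(reshape(inputWindow,numFeatures,1,[]),"CBT");      % channel x batch x time
    Y = extractdata(predict(net,X));                               % one output per input time step
    forecast(:,t) = Y(:,1,end);                                    % last step = next-value prediction
    sequence = [sequence forecast(:,t)];                           % append and feed back
end
```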
Conclusion
Many pretrained transformer models exist for natural language processing and computer vision tasks. In fact, such pretrained models are available for you in MATLAB (see BERT and ViT). However, time-series forecasting is a newer application for transformers with limited availability of pretrained models. Take advantage of the code provided in this post to build your own transformer model for time-series forecasting or adapt it for your task, and comment below to share your results.