Deep Learning

Understanding and using deep learning networks

Defining Your Own Network Layer

Posted by Steve Eddins

One of the new Neural Network Toolbox features of R2017b is the ability to define your own network layer. Today I'll show you how to make an exponential linear unit (ELU) layer.

Joe helped me with today's post. Joe is one of the few developers who have been around MathWorks longer than I have. In fact, he's one of the people who interviewed me when I applied for a job here. I've had the pleasure of working closely with Joe for the past several years on many aspects of MATLAB design. He really loves tinkering with deep learning networks.

Joe came across the paper "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," by Clevert, Unterthiner, and Hochreiter, and he wanted to make an ELU layer using R2017b. The ELU activation function is defined as:

$f(x) = \left\{\begin{array}{ll} x & x > 0\\ \alpha(e^x - 1) & x \leq 0 \end{array} \right.$

Let's compare the ELU shape with a couple of other commonly used activation functions.

alpha1 = 1;
elu_fcn = @(x) x.*(x > 0) + alpha1*(exp(x) - 1).*(x <= 0);

alpha2 = 0.1;
leaky_relu_fcn = @(x) alpha2*x.*(x <= 0) + x.*(x > 0);

relu_fcn = @(x) x.*(x > 0);

fplot(elu_fcn,[-10 3],'LineWidth',2)
hold on
fplot(leaky_relu_fcn,[-10 3],'LineWidth',2)
fplot(relu_fcn,[-10 3],'LineWidth',2)
hold off
ax = gca;
ax.XAxisLocation = 'origin';
ax.YAxisLocation = 'origin';
box off
legend({'ELU','Leaky ReLU','ReLU'},'Location','northwest')
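The fplot calls above draw the three curves in MATLAB. For readers who want to spot-check a few values outside MATLAB, here is a small NumPy sketch of the same three activations (the function names are my own, not from the post):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha*(exp(x) - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def leaky_relu(x, alpha=0.1):
    # small linear slope alpha on the negative side
    return np.where(x > 0, x, alpha * x)

def relu(x):
    # zero on the negative side
    return np.maximum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(elu(x))         # saturates toward -alpha for large negative x
print(leaky_relu(x))
print(relu(x))
```

Note how the ELU is smooth through the negative region and levels off at -alpha, while the leaky ReLU keeps a constant small slope and the ReLU is exactly zero.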

Joe wanted to make an ELU layer with one learned alpha value per channel. He followed the procedure outlined in Define a Layer with Learnable Parameters to make an ELU layer that works with the Neural Network Toolbox.

Below is the template for a layer with learnable parameters. We'll explore how to fill in this template to make an ELU layer.

classdef myLayer < nnet.layer.Layer

    properties
        % (Optional) Layer properties

        % Layer properties go here
    end

    properties (Learnable)
        % (Optional) Layer learnable parameters

        % Layer learnable parameters go here
    end
    
    methods
        function layer = myLayer()
            % (Optional) Create a myLayer
            % This function must have the same name as the layer

            % Layer constructor function goes here
        end
        
        function Z = predict(layer, X)
            % Forward input data through the layer at prediction time and
            % output the result
            %
            % Inputs:
            %         layer    -    Layer to forward propagate through
            %         X        -    Input data
            % Output:
            %         Z        -    Output of layer forward function
            
            % Layer forward function for prediction goes here
        end

        function [Z, memory] = forward(layer, X)
            % (Optional) Forward input data through the layer at training
            % time and output the result and a memory value
            %
            % Inputs:
            %         layer  - Layer to forward propagate through
            %         X      - Input data
            % Output:
            %         Z      - Output of layer forward function
            %         memory - Memory value which can be used for
            %                  backward propagation

            % Layer forward function for training goes here
        end

        function [dLdX, dLdW1, ..., dLdWn] = backward(layer, X, Z, dLdZ, memory)
            % Backward propagate the derivative of the loss function through 
            % the layer
            %
            % Inputs:
            %         layer             - Layer to backward propagate through
            %         X                 - Input data
            %         Z                 - Output of layer forward function            
            %         dLdZ              - Gradient propagated from the deeper layer
            %         memory            - Memory value which can be used in
            %                             backward propagation
            % Output:
            %         dLdX              - Derivative of the loss with respect to the
            %                             input data
            %         dLdW1, ..., dLdWn - Derivatives of the loss with respect to each
            %                             learnable parameter
            
            % Layer backward function goes here
        end
    end
end

For our ELU layer with a learnable alpha parameter, here's one way to write the constructor and the Learnable property block.

classdef eluLayer < nnet.layer.Layer

    properties (Learnable)
        alpha
    end
    
    methods
        function layer = eluLayer(num_channels,name)
            layer.Type = 'Exponential Linear Unit';
            
            % Assign layer name if it is passed in.
            if nargin > 1
                layer.Name = name;
            end
            
            % Give the layer a meaningful description.
            layer.Description = "Exponential linear unit with " + ...
                num_channels + " channels";
            
            % Initialize the learnable alpha parameter.
            layer.alpha = rand(1,1,num_channels);
        end
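The rand(1,1,num_channels) initialization gives alpha a shape that broadcasts across rows, columns, and observations, so each channel gets its own scalar alpha. Here is a NumPy sketch of that same broadcasting behavior (the sizes are illustrative, not from the post):

```python
import numpy as np

H, W, C, N = 28, 28, 20, 8           # rows, cols, channels, batch (illustrative)
rng = np.random.default_rng(0)

alpha = rng.random((1, 1, C, 1))     # one alpha per channel, like rand(1,1,num_channels)
X = rng.standard_normal((H, W, C, N))

# alpha broadcasts over rows, columns, and observations, so every element
# in channel c is scaled by the same alpha[0, 0, c, 0]
Z = np.where(X > 0, X, alpha * (np.exp(np.minimum(X, 0)) - 1))
print(Z.shape)   # (28, 28, 20, 8)
```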

The predict function is where we implement the activation function. Remember its mathematical form:

$f(x) = \left\{\begin{array}{ll} x & x > 0\\ \alpha(e^x - 1) & x \leq 0 \end{array} \right.$

Note: The expression (exp(min(X,0)) - 1) in the predict function is written that way to avoid computing the exponential of large positive numbers, which could result in infinities and NaNs popping up.

        function Z = predict(layer,X)
            % Forward input data through the layer at prediction time and
            % output the result
            %
            % Inputs:
            %         layer    -    Layer to forward propagate through
            %         X        -    Input data
            % Output:
            %         Z        -    Output of layer forward function
            
            % Expressing the computation in vectorized form allows it to
            % execute directly on the GPU.
            Z = (X .* (X > 0)) + ...
                (layer.alpha.*(exp(min(X,0)) - 1) .* (X <= 0));
        end
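To see concretely why the min(X,0) guard matters, here is a NumPy comparison (my own check, not from the post) of the naive expression against the guarded one. The masked term in the naive form still gets evaluated, so a large positive input overflows exp and the resulting Inf times 0 poisons the output with NaN:

```python
import numpy as np

alpha = 1.0
X = np.array([-5.0, 0.0, 1000.0])    # 1000 would overflow exp()

# naive form: exp(1000) overflows to inf, and inf * 0 gives NaN
with np.errstate(over='ignore', invalid='ignore'):
    naive = X * (X > 0) + alpha * (np.exp(X) - 1) * (X <= 0)

# guarded form used in predict: exp is never fed a positive value
safe = X * (X > 0) + alpha * (np.exp(np.minimum(X, 0)) - 1) * (X <= 0)

print(naive)   # last element is nan despite the (X <= 0) mask
print(safe)    # [-0.9933  0.  1000.]
```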

The backward function computes the derivatives of the loss with respect to the layer input and to alpha, which are needed for training. The Define a Layer with Learnable Parameters documentation page explains how to derive the needed quantities.

        function [dLdX, dLdAlpha] = backward(layer, X, Z, dLdZ, ~)
            % Backward propagate the derivative of the loss function through 
            % the layer
            %
            % Inputs:
            %         layer             - Layer to backward propagate through
            %         X                 - Input data
            %         Z                 - Output of layer forward function            
            %         dLdZ              - Gradient propagated from the deeper layer
            %         memory            - Memory value which can be used in
            %                             backward propagation [unused]
            % Output:
            %         dLdX              - Derivative of the loss with
            %                             respect to the input data
            %         dLdAlpha          - Derivatives of the loss with
            %                             respect to alpha
            
            % Original expression:
            % dLdX = (dLdZ .* (X > 0)) + ...
            %     (dLdZ .* (layer.alpha + Z) .* (X <= 0));
            %
            % Optimized expression (factor out dLdZ):
            dLdX = dLdZ .* ((X > 0) + ...
                ((layer.alpha + Z) .* (X <= 0)));
            
            dLdAlpha = (exp(min(X,0)) - 1) .* dLdZ;
            % Sum over the image rows and columns.
            dLdAlpha = sum(sum(dLdAlpha,1),2);
            % Sum over all the observations in the mini-batch.
            dLdAlpha = sum(dLdAlpha,4);
        end
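These gradients use the fact that for x <= 0, the derivative of the ELU with respect to x is alpha*exp(x), which equals alpha + Z, and the derivative with respect to alpha is exp(x) - 1. A finite-difference check in NumPy (my own verification, not part of the post) confirms both expressions:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, N = 4, 4, 3, 2                      # small illustrative sizes
X = rng.standard_normal((H, W, C, N))
alpha = rng.random((1, 1, C, 1))
dLdZ = rng.standard_normal((H, W, C, N))     # upstream gradient

def elu(X, alpha):
    return X * (X > 0) + alpha * (np.exp(np.minimum(X, 0)) - 1) * (X <= 0)

Z = elu(X, alpha)

# analytic gradients, mirroring the backward method:
# d(ELU)/dx = 1 for x > 0, alpha*exp(x) = alpha + Z for x <= 0
dLdX = dLdZ * ((X > 0) + (alpha + Z) * (X <= 0))
# d(ELU)/dalpha = exp(x) - 1 for x <= 0 (zero for x > 0, since min(x,0) = 0),
# summed over rows, columns, and observations
dLdAlpha = ((np.exp(np.minimum(X, 0)) - 1) * dLdZ).sum(axis=(0, 1, 3), keepdims=True)

# finite-difference check of dLdAlpha for channel 0, with L = sum(Z .* dLdZ)
eps = 1e-6
a_plus, a_minus = alpha.copy(), alpha.copy()
a_plus[0, 0, 0, 0] += eps
a_minus[0, 0, 0, 0] -= eps
num = ((elu(X, a_plus) - elu(X, a_minus)) * dLdZ).sum() / (2 * eps)
print(abs(num - dLdAlpha[0, 0, 0, 0]))       # close to zero (finite-diff error only)
```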

That's all we need for our layer. We don't need to implement the forward function because our layer doesn't have memory and doesn't need to do anything special for training.

Load in the sample digits training set, and show one of the images from it.

[XTrain, YTrain] = digitTrain4DArrayData;
imshow(XTrain(:,:,:,1010),'InitialMagnification','fit')
YTrain(1010)
ans = 

  categorical

     2 

Make a network that uses our new ELU layer.

layers = [ ...
    imageInputLayer([28 28 1])
    convolution2dLayer(5,20)
    batchNormalizationLayer
    eluLayer(20)
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];

Train the network.

options = trainingOptions('sgdm');
net = trainNetwork(XTrain,YTrain,layers,options);
Training on single GPU.
Initializing image normalization.
|=========================================================================================|
|     Epoch    |   Iteration  | Time Elapsed |  Mini-batch  |  Mini-batch  | Base Learning|
|              |              |  (seconds)   |     Loss     |   Accuracy   |     Rate     |
|=========================================================================================|
|            1 |            1 |         0.03 |       2.5173 |        5.47% |       0.0100 |
|            2 |           50 |         0.63 |       0.4548 |       85.16% |       0.0100 |
|            3 |          100 |         1.20 |       0.1550 |       96.88% |       0.0100 |
|            4 |          150 |         1.78 |       0.0951 |       99.22% |       0.0100 |
|            6 |          200 |         2.37 |       0.0499 |       99.22% |       0.0100 |
|            7 |          250 |         2.96 |       0.0356 |      100.00% |       0.0100 |
|            8 |          300 |         3.55 |       0.0270 |      100.00% |       0.0100 |
|            9 |          350 |         4.13 |       0.0168 |      100.00% |       0.0100 |
|           11 |          400 |         4.74 |       0.0145 |      100.00% |       0.0100 |
|           12 |          450 |         5.32 |       0.0118 |      100.00% |       0.0100 |
|           13 |          500 |         5.89 |       0.0119 |      100.00% |       0.0100 |
|           15 |          550 |         6.45 |       0.0074 |      100.00% |       0.0100 |
|           16 |          600 |         7.03 |       0.0079 |      100.00% |       0.0100 |
|           17 |          650 |         7.60 |       0.0086 |      100.00% |       0.0100 |
|           18 |          700 |         8.18 |       0.0065 |      100.00% |       0.0100 |
|           20 |          750 |         8.76 |       0.0066 |      100.00% |       0.0100 |
|           21 |          800 |         9.34 |       0.0052 |      100.00% |       0.0100 |
|           22 |          850 |         9.92 |       0.0054 |      100.00% |       0.0100 |
|           24 |          900 |        10.51 |       0.0051 |      100.00% |       0.0100 |
|           25 |          950 |        11.12 |       0.0044 |      100.00% |       0.0100 |
|           26 |         1000 |        11.73 |       0.0049 |      100.00% |       0.0100 |
|           27 |         1050 |        12.31 |       0.0040 |      100.00% |       0.0100 |
|           29 |         1100 |        12.93 |       0.0041 |      100.00% |       0.0100 |
|           30 |         1150 |        13.56 |       0.0040 |      100.00% |       0.0100 |
|           30 |         1170 |        13.80 |       0.0043 |      100.00% |       0.0100 |
|=========================================================================================|

Check the accuracy of the network on our test set.

[XTest, YTest] = digitTest4DArrayData;
YPred = classify(net, XTest);
accuracy = sum(YTest==YPred)/numel(YTest)
accuracy =

    0.9872

Look at one of the images in the test set and see how it was classified by the network.

k = 1500;
imshow(XTest(:,:,:,k),'InitialMagnification','fit')
YPred(k)
ans = 

  categorical

     2 

Now you've seen how to define your own layer, include it in a network, and train it up.



Published with MATLAB® R2017b

6 Comments (Oldest to Newest)

Eric replied on : 1 of 6

Thanks for a great blog post. Is there an easy way to modify this code so that the user can determine at run-time whether alpha is learned or fixed? Or does a separate class need to be defined with alpha outside of the Learnable properties block?

Eric Shields replied on : 2 of 6

Batch normalization may not be necessary with ELUs. Clevert, et al, indicate that “Batch normalization improved ReLU and LReLU networks, but did not improve ELU and SReLU networks.” On the example code I get 10% faster performance for the same accuracy by removing the batch normalization layer.

guillaume godin replied on : 5 of 6

Hi Steve,

I implemented the ELU like this to optimize the computation (reusing the memory value from forward):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    function Z = predict(layer, X)
            % Forward input data through the layer and output the result
            Z = max(0, X) + layer.Alpha .*(exp(min(0, X))-1);
            
        end
        
        
        function [Z, memory] = forward(layer, X)
            % (Optional) Forward input data through the layer at training
            % time and output the result and a memory value
            %
            % Inputs:
            %         layer  - Layer to forward propagate through
            %         X      - Input data
            % Output:
            %         Z      - Output of layer forward function
            %         memory - Memory value which can be used for
            %                  backward propagation

            % Layer forward function for training goes here
            memory = exp(min(0, X))-1;
            Z = max(0, X) + layer.Alpha .* memory;

        end
        
        
        function [dLdX, dLdAlpha] = backward(layer, X, Z, dLdZ, memory)
            % Backward propagate the derivative of the loss function through 
            % the layer 
            % y = dLdZ
            
            dLdX = (layer.Alpha+Z) .* dLdZ; % negative part => (a+f(x))*y
            dLdX(X>0) = dLdZ(X>0); % positive part => y
            
            % Derivative with respect to alpha
            dLdAlpha = memory .* dLdZ; % negative part only => exp(x)-1
            dLdAlpha = sum(sum(dLdAlpha,1),2);
            
            % Sum over all observations in mini-batch
            dLdAlpha = sum(dLdAlpha,4);            
         end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

my results are similar to yours:

|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:00 |        8.59% |       2.6340 |          0.0100 |
|       2 |          50 |       00:00:04 |       80.47% |       0.5700 |          0.0100 |
|       3 |         100 |       00:00:07 |       96.88% |       0.1470 |          0.0100 |
|       4 |         150 |       00:00:11 |       97.66% |       0.1322 |          0.0100 |
|       6 |         200 |       00:00:14 |       99.22% |       0.0621 |          0.0100 |
|       7 |         250 |       00:00:18 |       99.22% |       0.0395 |          0.0100 |
|       8 |         300 |       00:00:21 |      100.00% |       0.0212 |          0.0100 |
|       9 |         350 |       00:00:24 |      100.00% |       0.0191 |          0.0100 |
|      11 |         400 |       00:00:28 |      100.00% |       0.0170 |          0.0100 |
|      12 |         450 |       00:00:31 |      100.00% |       0.0119 |          0.0100 |
|      13 |         500 |       00:00:35 |      100.00% |       0.0116 |          0.0100 |
|      15 |         550 |       00:00:38 |      100.00% |       0.0056 |          0.0100 |
|      16 |         600 |       00:00:42 |      100.00% |       0.0099 |          0.0100 |
|      17 |         650 |       00:00:45 |      100.00% |       0.0080 |          0.0100 |
|      18 |         700 |       00:00:49 |      100.00% |       0.0058 |          0.0100 |
|      20 |         750 |       00:00:52 |      100.00% |       0.0063 |          0.0100 |
|      21 |         800 |       00:00:56 |      100.00% |       0.0055 |          0.0100 |
|      22 |         850 |       00:01:00 |      100.00% |       0.0060 |          0.0100 |
|      24 |         900 |       00:01:03 |      100.00% |       0.0045 |          0.0100 |
|      25 |         950 |       00:01:06 |      100.00% |       0.0039 |          0.0100 |
|      26 |        1000 |       00:01:10 |      100.00% |       0.0033 |          0.0100 |
|      27 |        1050 |       00:01:13 |      100.00% |       0.0046 |          0.0100 |
|      29 |        1100 |       00:01:17 |      100.00% |       0.0042 |          0.0100 |
|      30 |        1150 |       00:01:20 |      100.00% |       0.0040 |          0.0100 |
|      30 |        1170 |       00:01:21 |      100.00% |       0.0042 |          0.0100 |
|========================================================================================|
>> [XTest, YTest] = digitTest4DArrayData;
YPred = classify(net, XTest);
accuracy = sum(YTest==YPred)/numel(YTest)

accuracy =

    0.9896

BR,

Guillaume
