# Defining Your Own Network Layer (Revisited)

Today I want to follow up on my previous post, Defining Your Own Network Layer. There were two reader comments that caught my attention.

The first comment, from Eric Shields, points out a key conclusion from the Clevert, Unterthiner, and Hichreiter paper that I overlooked. I initially focused just on the definition of the exponential linear unit function, but Eric pointed out that the authors concluded that batch normalization, which I used in my simple network, might not be needed when using an ELU layer.

Here's a reminder (from the previous post) about what the ELU curve looks like.

And here's the simple network that used last time.

layers = [ ...
imageInputLayer([28 28 1])
convolution2dLayer(5,20)
batchNormalizationLayer
eluLayer(20)
fullyConnectedLayer(10)
softmaxLayer
classificationLayer];


I used the sample digits training set.

[XTrain, YTrain] = digitTrain4DArrayData;
imshow(XTrain(:,:,:,1010),'InitialMagnification','fit')
YTrain(1010)

ans =

categorical

2



Now I'll train the network again, using the same options as last time.

options = trainingOptions('sgdm');
net = trainNetwork(XTrain,YTrain,layers,options);

Training on single GPU.
Initializing image normalization.
|=========================================================================================|
|     Epoch    |   Iteration  | Time Elapsed |  Mini-batch  |  Mini-batch  | Base Learning|
|              |              |  (seconds)   |     Loss     |   Accuracy   |     Rate     |
|=========================================================================================|
|            1 |            1 |         0.01 |       2.5899 |       10.16% |       0.0100 |
|            2 |           50 |         0.61 |       0.4156 |       85.94% |       0.0100 |
|            3 |          100 |         1.20 |       0.1340 |       96.88% |       0.0100 |
|            4 |          150 |         1.80 |       0.0847 |       98.44% |       0.0100 |
|            6 |          200 |         2.41 |       0.0454 |      100.00% |       0.0100 |
|            7 |          250 |         3.01 |       0.0253 |      100.00% |       0.0100 |
|            8 |          300 |         3.60 |       0.0219 |      100.00% |       0.0100 |
|            9 |          350 |         4.19 |       0.0141 |      100.00% |       0.0100 |
|           11 |          400 |         4.85 |       0.0128 |      100.00% |       0.0100 |
|           12 |          450 |         5.46 |       0.0126 |      100.00% |       0.0100 |
|           13 |          500 |         6.05 |       0.0099 |      100.00% |       0.0100 |
|           15 |          550 |         6.66 |       0.0079 |      100.00% |       0.0100 |
|           16 |          600 |         7.27 |       0.0084 |      100.00% |       0.0100 |
|           17 |          650 |         7.86 |       0.0075 |      100.00% |       0.0100 |
|           18 |          700 |         8.45 |       0.0081 |      100.00% |       0.0100 |
|           20 |          750 |         9.05 |       0.0066 |      100.00% |       0.0100 |
|           21 |          800 |         9.64 |       0.0057 |      100.00% |       0.0100 |
|           22 |          850 |        10.24 |       0.0054 |      100.00% |       0.0100 |
|           24 |          900 |        10.83 |       0.0049 |      100.00% |       0.0100 |
|           25 |          950 |        11.44 |       0.0055 |      100.00% |       0.0100 |
|           26 |         1000 |        12.04 |       0.0046 |      100.00% |       0.0100 |
|           27 |         1050 |        12.66 |       0.0041 |      100.00% |       0.0100 |
|           29 |         1100 |        13.25 |       0.0044 |      100.00% |       0.0100 |
|           30 |         1150 |        13.84 |       0.0038 |      100.00% |       0.0100 |
|           30 |         1170 |        14.08 |       0.0042 |      100.00% |       0.0100 |
|=========================================================================================|


Note that the training took about 14.1 seconds.

Check the accuracy of the trained network.

[XTest, YTest] = digitTest4DArrayData;
YPred = classify(net, XTest);
accuracy = sum(YTest==YPred)/numel(YTest)

accuracy =

0.9878



Now let's make another network without the batch normalization layer.

layers2 = [ ...
imageInputLayer([28 28 1])
convolution2dLayer(5,20)
eluLayer(20)
fullyConnectedLayer(10)
softmaxLayer
classificationLayer];


Train it up again.

net2 = trainNetwork(XTrain,YTrain,layers2,options);

Training on single GPU.
Initializing image normalization.
|=========================================================================================|
|     Epoch    |   Iteration  | Time Elapsed |  Mini-batch  |  Mini-batch  | Base Learning|
|              |              |  (seconds)   |     Loss     |   Accuracy   |     Rate     |
|=========================================================================================|
|            1 |            1 |         0.01 |       2.3022 |        7.81% |       0.0100 |
|            2 |           50 |         0.52 |       1.6631 |       51.56% |       0.0100 |
|            3 |          100 |         1.04 |       1.4368 |       52.34% |       0.0100 |
|            4 |          150 |         1.58 |       1.0426 |       61.72% |       0.0100 |
|            6 |          200 |         2.12 |       0.8223 |       72.66% |       0.0100 |
|            7 |          250 |         2.67 |       0.6842 |       80.47% |       0.0100 |
|            8 |          300 |         3.21 |       0.6461 |       78.13% |       0.0100 |
|            9 |          350 |         3.79 |       0.4181 |       85.94% |       0.0100 |
|           11 |          400 |         4.33 |       0.4163 |       86.72% |       0.0100 |
|           12 |          450 |         4.88 |       0.2115 |       96.09% |       0.0100 |
|           13 |          500 |         5.42 |       0.1817 |       97.66% |       0.0100 |
|           15 |          550 |         5.96 |       0.1809 |       96.09% |       0.0100 |
|           16 |          600 |         6.53 |       0.1001 |      100.00% |       0.0100 |
|           17 |          650 |         7.07 |       0.0899 |      100.00% |       0.0100 |
|           18 |          700 |         7.61 |       0.0934 |       99.22% |       0.0100 |
|           20 |          750 |         8.14 |       0.0739 |       99.22% |       0.0100 |
|           21 |          800 |         8.68 |       0.0617 |      100.00% |       0.0100 |
|           22 |          850 |         9.22 |       0.0462 |      100.00% |       0.0100 |
|           24 |          900 |         9.76 |       0.0641 |      100.00% |       0.0100 |
|           25 |          950 |        10.29 |       0.0332 |      100.00% |       0.0100 |
|           26 |         1000 |        10.86 |       0.0317 |      100.00% |       0.0100 |
|           27 |         1050 |        11.41 |       0.0378 |       99.22% |       0.0100 |
|           29 |         1100 |        11.96 |       0.0235 |      100.00% |       0.0100 |
|           30 |         1150 |        12.51 |       0.0280 |      100.00% |       0.0100 |
|           30 |         1170 |        12.73 |       0.0307 |      100.00% |       0.0100 |
|=========================================================================================|


That took about 12.7 seconds to train, about a 10% reduction. Check the accuracy.

[XTest, YTest] = digitTest4DArrayData;
YPred = classify(net2, XTest);
accuracy = sum(YTest==YPred)/numel(YTest)

accuracy =

0.9808



Eric said he got the same accuracy, whereas I am seeing a slightly lower accuracy. But I haven't really explored this further, and I so I wouldn't draw any conclusions. I just wanted to take the opportunity to go back and mention one of the important points of the paper that I overlooked last time.

A second reader, another Eric, wanted to know if alpha could be specified as a learnable or non learnable parameter at run time.

The answer: Yes, but not without defining a second class. Recall this portion of the template for defining a layer with learnable properties:

    properties (Learnable)
% (Optional) Layer learnable parameters

% Layer learnable parameters go here
end



That Learnable attribute of the properties block is a fixed part of the class definition. It can't be changed dynamically. So, you need to define a second class. I'll call mine eluLayerFixedAlpha. Here's the properties block:

    properties
alpha
end



And here's a constructor that takes alpha as an input argument.

    methods
function layer = eluLayerFixedAlpha(alpha,name)
layer.Type = 'Exponential Linear Unit';
layer.alpha = alpha;

% Assign layer name if it is passed in.
if nargin > 1
layer.Name = name;
end

% Give the layer a meaningful description.
layer.Description = "Exponential linear unit with alpha: " + ...
alpha;
end



I also modifed the backward method to remove the computation and output argument associated with the derivative of the loss function with respect to alpha.

        function dLdX = backward(layer, X, Z, dLdZ, ~)
% Backward propagate the derivative of the loss function through
% the layer
%
% Inputs:
%         layer             - Layer to backward propagate through
%         X                 - Input data
%         Z                 - Output of layer forward function
%         dLdZ              - Gradient propagated from the deeper layer
%         memory            - Memory value which can be used in
%                             backward propagation [unused]
% Output:
%         dLdX              - Derivative of the loss with
%                             respect to the input data

dLdX = dLdZ .* ((X > 0) + ...
((layer.alpha + Z) .* (X <= 0)));
end



Let's try it. I'm just going to make up a value for alpha.

alpha = 1.0;
layers3 = [ ...
imageInputLayer([28 28 1])
convolution2dLayer(5,20)
eluLayerFixedAlpha(alpha)
fullyConnectedLayer(10)
softmaxLayer
classificationLayer];
net3 = trainNetwork(XTrain,YTrain,layers3,options);
YPred = classify(net3, XTest);
accuracy = sum(YTest==YPred)/numel(YTest)

Training on single GPU.
Initializing image normalization.
|=========================================================================================|
|     Epoch    |   Iteration  | Time Elapsed |  Mini-batch  |  Mini-batch  | Base Learning|
|              |              |  (seconds)   |     Loss     |   Accuracy   |     Rate     |
|=========================================================================================|
|            1 |            1 |         0.01 |       2.3005 |       14.06% |       0.0100 |
|            2 |           50 |         0.48 |       1.4979 |       53.13% |       0.0100 |
|            3 |          100 |         0.97 |       1.2162 |       57.81% |       0.0100 |
|            4 |          150 |         1.48 |       1.1427 |       67.97% |       0.0100 |
|            6 |          200 |         1.99 |       0.9837 |       67.19% |       0.0100 |
|            7 |          250 |         2.50 |       0.8110 |       70.31% |       0.0100 |
|            8 |          300 |         3.04 |       0.7347 |       75.00% |       0.0100 |
|            9 |          350 |         3.55 |       0.5937 |       81.25% |       0.0100 |
|           11 |          400 |         4.05 |       0.5686 |       78.13% |       0.0100 |
|           12 |          450 |         4.56 |       0.4678 |       85.94% |       0.0100 |
|           13 |          500 |         5.06 |       0.3461 |       88.28% |       0.0100 |
|           15 |          550 |         5.57 |       0.3515 |       87.50% |       0.0100 |
|           16 |          600 |         6.07 |       0.2582 |       92.97% |       0.0100 |
|           17 |          650 |         6.58 |       0.2216 |       92.97% |       0.0100 |
|           18 |          700 |         7.08 |       0.1705 |       96.09% |       0.0100 |
|           20 |          750 |         7.59 |       0.1212 |       98.44% |       0.0100 |
|           21 |          800 |         8.09 |       0.0925 |       98.44% |       0.0100 |
|           22 |          850 |         8.59 |       0.1045 |       97.66% |       0.0100 |
|           24 |          900 |         9.10 |       0.1289 |       96.09% |       0.0100 |
|           25 |          950 |         9.60 |       0.0710 |       99.22% |       0.0100 |
|           26 |         1000 |        10.10 |       0.0722 |       99.22% |       0.0100 |
|           27 |         1050 |        10.60 |       0.0600 |       99.22% |       0.0100 |
|           29 |         1100 |        11.10 |       0.0688 |       99.22% |       0.0100 |
|           30 |         1150 |        11.61 |       0.0519 |      100.00% |       0.0100 |
|           30 |         1170 |        11.82 |       0.0649 |       99.22% |       0.0100 |
|=========================================================================================|

accuracy =

0.9702