Robust Machine Learning using Feed-Forward Neural Networks and Algorithmic Information Theory
October 2, 2019 at 12:00 AM
by John A. Kassebaum, PhD, PE - Kassebaum Engineering LLC
Custom software engineering that is application-specific.

Feed-Forward Neural Networks (FFNNs) are universal approximators and may be adapted to solve any representable problem to an arbitrary degree of accuracy (whether for predictive modelling or classification), limited only by the amount and quality of training data. One problem in using FFNNs is memorization of a limited set of training data, which causes the FFNN to perform poorly on non-training data. The time and computational effort to produce a well-trained FFNN may also be significant and can seem arbitrary. In addition, the architecture and size of the network to be used are often not obvious.

All of these problems can be solved easily using algorithmic-information-theoretic notions pioneered by Kolmogorov, along with the realization that any one FFNN can be well trained with little effort but will always have an individual bias based on its architecture, size, and initialization prior to training. Fortunately, one can quickly create a very robust (well-generalizing) collection of FFNNs of the appropriate size and architecture using an algorithmic information measurement called the Minimum Description Length (MDL). I will give you the process, then define the various parts, such as MDL and the training methodologies.

Process

  1. Study the training data and the data space; if other data is not available, set aside some of the data to use for testing/validation only.
  2. Generate, one by one, a collection of FFNNs of various sizes as a search for the best network size or architecture by minimizing the MDL criterion (the full process is sketched in code after this list).
    1. Choose a size and architecture for each network arbitrarily, as part of the search for an ideal.
    2. Initialize all layers except the output layer of the FFNN using the Nguyen-Widrow (or similarly motivated) method.
    3. Compute the parameters (weights) of the output layer by minimizing least mean square error over the set of training data.
    4. Compute the MDL fitness criterion for each FFNN generated.
    5. Change the network size and/or architecture and repeat from (1). (The MDL will converge.)
  3. Once the approximately ideal network size is discovered, use that size to create a small collection of FFNNs, which will serve as a committee (a linear opinion pool), voting or averaging their individual results to solve the learning problem.
    1. As above (in 2), generate FFNNs of the selected size, and measure the standard deviation (a measure of how much the members disagree) over the set of pool members until it does not change much per additional FFNN. The required number of members is likely to be small, like 3-10, but if your problem is very complex it may be higher.
    2. Use the generated committee: test it against your reserved testing data.
      1. Performance on your testing data should be similar to the performance on your training data if the two data sets are essentially similar.
      2. Test multiple committees (pools) of FFNNs to see that collectively they perform similarly.
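
The sketch below walks through the whole process above on synthetic one-dimensional data. It is a minimal illustration under stated assumptions, not the author's implementation: the function names (build_ffnn, predict, mdl_bits), the tanh hidden layer, the synthetic data, and the simplified hidden-layer initialization (a stand-in for Nguyen-Widrow, which is sketched in its own section below) are all my choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: a known 1-D target plus noise, with a held-out test set.
    def target(x):
        return np.sin(3.0 * x) + 0.1 * x

    X = rng.uniform(-1, 1, size=(200, 1))
    y = target(X[:, 0]) + 0.05 * rng.standard_normal(200)
    X_test = rng.uniform(-1, 1, size=(50, 1))
    y_test = target(X_test[:, 0])

    def build_ffnn(X, y, n_hidden, rng):
        """One hidden tanh layer. Hidden weights are drawn to cover the input
        space (a simplified stand-in for Nguyen-Widrow); output weights are the
        closed-form least-squares solution, with no iterative training."""
        n_in = X.shape[1]
        W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))   # hidden weights
        b = rng.uniform(-1.0, 1.0, size=n_hidden)           # hidden biases
        H = np.column_stack([np.tanh(X @ W.T + b), np.ones(len(X))])
        w_out, *_ = np.linalg.lstsq(H, y, rcond=None)        # output layer
        return W, b, w_out

    def predict(net, X):
        W, b, w_out = net
        H = np.column_stack([np.tanh(X @ W.T + b), np.ones(len(X))])
        return H @ w_out

    def mdl_bits(net, X, y, w=0.5):
        """MDL = w*BRe + (1-w)*BPnn, per the formulas in the MDL section below."""
        W, b, w_out = net
        mse = np.mean((predict(net, X) - y) ** 2)
        params = np.concatenate([W.ravel(), b, w_out])
        b_re = len(y) * np.log2(1.0 + mse)
        b_pnn = params.size * np.log2(1.0 + np.mean(params ** 2))
        return w * b_re + (1.0 - w) * b_pnn

    # Step 2: search over hidden-layer sizes by minimizing the MDL criterion.
    scores = {h: mdl_bits(build_ffnn(X, y, h, rng), X, y) for h in range(1, 21)}
    best_h = min(scores, key=scores.get)

    # Step 3: a committee of networks of the best size; average their outputs
    # (a linear opinion pool) and track how much the members disagree.
    committee = [build_ffnn(X, y, best_h, rng) for _ in range(7)]
    member_preds = np.stack([predict(net, X_test) for net in committee])
    pooled = member_preds.mean(axis=0)
    disagreement = member_preds.std(axis=0).mean()

    print(f"best hidden size: {best_h}")
    print(f"pool test MSE: {np.mean((pooled - y_test) ** 2):.4f}")
    print(f"mean member disagreement (std dev): {disagreement:.4f}")

The only fitting step is the closed-form least-squares solve for the output weights, so each candidate network is cheap to build; the committee average gives the pooled prediction, and the per-point standard deviation across members measures how much they disagree.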

Discussion

This is a very quick process. The least-squares computation is the most intensive step, and even it is lightweight and requires no iteration. Although each individual FFNN may be biased as a result of its particular size, architecture, and initialization, the committees (pools) formed from them are robust to these biases. The resulting collections of FFNNs are effectively unbiased.

Nguyen-Widrow Initialization

Nguyen-Widrow initialization of the hidden layer weights/parameters spreads the hidden nodes over the input data space (or, for other layers, the output space of the previous layer). It does this in such a fashion that the whole input data space is covered uniformly. In general, that is the best you can do without using the training data. Don't do additional training or adjustment of the hidden layer to improve it; such training just causes overspecialization and makes the resulting network perform poorly on non-training data later.
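
For concreteness, here is a minimal sketch of one common formulation of Nguyen-Widrow initialization for a single hidden layer, assuming the inputs have been scaled to [-1, 1]; the function name and the exact scale factor are my reading of the usual formulation rather than code from this post.

    import numpy as np

    def nguyen_widrow_init(n_inputs, n_hidden, rng=None):
        """Hidden-layer weights W (n_hidden x n_inputs) and biases b (n_hidden,)
        spread so that the hidden nodes' active regions cover the input space."""
        rng = rng if rng is not None else np.random.default_rng()
        beta = 0.7 * n_hidden ** (1.0 / n_inputs)              # scale factor
        W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs))
        W *= beta / np.linalg.norm(W, axis=1, keepdims=True)   # each row scaled to length beta
        b = rng.uniform(-beta, beta, size=n_hidden)            # biases spread across the range
        return W, b

    W, b = nguyen_widrow_init(n_inputs=2, n_hidden=10)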

Minimize Least Mean Square Error

To set the parameter/weight values of the output layer, apply the training data to the hidden layer to get its outputs. Then find the output-layer weights/parameters that minimize the mean-square error between a weighted sum of the hidden-node output values and the training output values. This gives you the values to use for the output layer weights/parameters.
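
A minimal sketch of that step, assuming a single hidden layer whose outputs have already been computed; the helper name fit_output_layer, the appended bias column, and the toy data are illustrative choices, not the author's code.

    import numpy as np

    def fit_output_layer(H, y):
        """H: (n_samples, n_hidden) hidden-node outputs for the training inputs.
        y: (n_samples,) training output values.
        Returns output-layer weights (plus a bias term) minimizing mean-square error."""
        H1 = np.column_stack([H, np.ones(len(H))])        # append a bias column
        w_out, *_ = np.linalg.lstsq(H1, y, rcond=None)    # closed-form least squares
        return w_out

    # Usage: the hidden outputs come from the initialized hidden layer; no iteration.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(100, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2
    W, b = rng.uniform(-1, 1, (8, 2)), rng.uniform(-1, 1, 8)   # e.g. Nguyen-Widrow values
    w_out = fit_output_layer(np.tanh(X @ W.T + b), y)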

Minimum Description Length Criterion

The Minimum Description Length (MDL) criterion is an algorithmic information-theoretic measure (from Kolmogorov information theory) for the amount of information (in bits) used in the algorithm represented by an FFNN. It is a weighted sum of two parts: 1) the information in the residual error resulting from the use of an FFNN; and 2) the information in the algorithm - the bits in the parameters/weights of the FFNN.

MDL = w * BRe + (1 - w) * BPnn, where w ≈ 0.5 but may be varied from 0.0 to 1.0
BRe = (# of samples) * log2(1 + mean square error)
BPnn = (# of parameters) * log2(1 + mean square parameter value)

Note that small parameters are mostly insignificant to network performance; that is the reason to use the mean square parameter value.
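
For concreteness, a minimal sketch of the criterion exactly as written above; the function and argument names are illustrative.

    import numpy as np

    def mdl(residuals, parameters, w=0.5):
        """residuals: the FFNN's prediction errors on the training data.
        parameters: all of the FFNN's weights/biases, flattened into one array.
        Returns the description length in bits: w*BRe + (1-w)*BPnn."""
        b_re = residuals.size * np.log2(1.0 + np.mean(residuals ** 2))      # bits in the residual error
        b_pnn = parameters.size * np.log2(1.0 + np.mean(parameters ** 2))   # bits in the parameters
        return w * b_re + (1.0 - w) * b_pnn

Comparing this value across candidate network sizes is the search in step 2 of the process above: the size with the smallest MDL balances the bits spent on residual error against the bits spent on parameters.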

Summary

In short, you create a small set of FFNNs of the ideal size to avoid overspecialization and then combine their outputs as an average. Each FFNN is trained quickly on the available training data by using the Nguyen-Widrow initialization to set the hidden layer weights and using Least Mean Squares to set the output layer weights. A search is conducted for the optimal size of the FFNN’s hidden layer(s) by minimizing the calculated MDL as a fitness criterion. This method is quick and robust to overspecialization.

For further information, feedback, or to pose questions please contact the author at: jak@KassebaumEngineering.com.