Can a dense layer on many inputs be represented as a single matrix multiplication?

Published on 2020-11-28 06:43:35

Denote by a[2, 3] a matrix of dimension 2x3. Say each input has 10 elements and the network is a two-class classifier (cat or dog, for example) with just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. The output of a dense layer can be calculated as

output = matmul(input, weights)

Where weights is a 10x2 weight matrix, input is a 1x10 input vector, and output is a 1x2 output vector.

My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute

output = matmul(input, weights)

Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
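For concreteness, here is a minimal NumPy sketch of the shapes I have in mind (the sizes and the random data are made up for this example, not taken from any real network):

import numpy as np

rng = np.random.default_rng(0)

weights = rng.standard_normal((10, 2))    # 10 input features -> 2 classes

# Single input: (1, 10) @ (10, 2) -> (1, 2)
single_input = rng.standard_normal((1, 10))
print((single_input @ weights).shape)     # (1, 2)

# All 100 inputs at once: (100, 10) @ (10, 2) -> (100, 2)
inputs = rng.standard_normal((100, 10))
outputs = inputs @ weights
print(outputs.shape)                      # (100, 2)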

In backpropagation, you could do something similar:

input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err

Where weights is the same, output_err is 100x2, and input is 100x10.
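Again as a rough NumPy sketch with made-up data, where output_err stands in for the gradient of the loss with respect to the layer's output:

import numpy as np

rng = np.random.default_rng(0)

weights = rng.standard_normal((10, 2))
inputs = rng.standard_normal((100, 10))
output_err = rng.standard_normal((100, 2))   # stand-in for dLoss/dOutput

input_err = output_err @ weights.T           # (100, 2) @ (2, 10) -> (100, 10)
weights_err = inputs.T @ output_err          # (10, 100) @ (100, 2) -> (10, 2)

learning_rate = 0.01
weights -= learning_rate * weights_err       # same shape as weights: (10, 2)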

However, I tried to implement a neural network this way from scratch and have so far been unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.

Thanks!

Questioner: Meredith

Meredith, 2020-11-29 02:23:24

If anyone else is wondering, I found the answer to my question. This does not in fact work, for a few reasons. Computing all of the inputs this way is essentially running the network with a batch size equal to the total number of inputs: the weights are not updated between inputs, but all at once after every example has been seen. So while the calculation itself is valid, each input no longer influences the training step by step. However, with a reasonable batch size you can still do 2D matrix multiplications, where the input is batch_size by input_size, to speed up training.
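A toy sketch of what I mean, assuming NumPy, random made-up data, and a simple squared-error gradient (none of the names or sizes come from a real framework):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 10 features, 2 targets (purely illustrative).
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 2))

weights = rng.standard_normal((10, 2)) * 0.1
learning_rate = 0.01
batch_size = 20

for epoch in range(10):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]        # (batch_size, 10)
        yb = Y[start:start + batch_size]        # (batch_size, 2)

        out = xb @ weights                      # forward pass for the whole batch
        output_err = (out - yb) / len(xb)       # squared-error gradient, averaged over the batch

        weights_err = xb.T @ output_err         # one matmul per batch
        weights -= learning_rate * weights_err  # weights updated once per batch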

In addition, when predicting on many inputs (at test time, for example), no weights are updated, so a single matrix multiplication with an input of num_inputs by input_size can compute all of the predictions in parallel.
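For example, something like this (again just a NumPy sketch with invented sizes):

import numpy as np

rng = np.random.default_rng(0)

weights = rng.standard_normal((10, 2))
test_inputs = rng.standard_normal((1000, 10))   # num_inputs x input_size

# No weight updates at test time, so one matmul covers every prediction.
predictions = test_inputs @ weights             # (1000, 2)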