uz_mlp_three_layer#
This IP-Core implements a three-layer MLP network based on [1]. The implementation and nomenclature follow the principles outlined in uz_nn. The MLP is hard-coded to have three hidden layers with ReLU activation for the hidden layers. The output layer uses a linear activation.
Features:
- Feedforward calculation in fixed point (32 bit, 14 fractional bits, signed)
- Precision: 6.1e-5
- Max values: +/- 131072
- No overflow detection for the fixed-point data type! The user has to make sure that the min/max values are not violated
- Variable number of inputs, configurable by software from 2 to 16
- Variable number of outputs, configurable by software (2, 4, 6, or 8)
- Number of hidden layers is fixed to 3!
- Number of neurons in each hidden layer is fixed to 64!
- Activation function is ReLU for all hidden layers (with saturation at the max value!)
- Activation function is linear for the output layer
- Bias and weights can be written to the network at runtime
- Fully compatible with uz_nn to use the IP-Core as an accelerator
- Uses Matrix math as input and output
The IP-Core is always configured by the processing system.
The configuration (e.g., number of inputs) is valid both for inputs from AXI and from the PL (depending on use_axi_input).
The calculation of one forward pass is triggered by a rising edge either from AXI or from the PL (enable_nn port).
The calculation trigger is an OR between the AXI and PL ports; thus, no priority is applied.
The output valid signal is low during calculation and high when a valid result is available on the output ports.
If a calculation is triggered before the previous calculation is finished, the trigger is ignored.
Usage#
The usage of the IP-Core driver depends heavily on uz_nn.
First, an instance of the software network has to be initialized, e.g., by loading parameters from a header.
Additionally, an array for the output data of the IP-Core has to be declared (see Matrix math).
The uz_mlp_three_layer_ip_init function writes all parameters of the network into the IP-Core.
Thus, the network exists twice: one copy in the processor and one copy in the IP-Core (parameters are stored in BRAM).
During execution, only the input and output values are written.
Note that the Global configuration has to be adjusted to include at least one MLP IP-Core driver instance, one software network and four layers.
#include "../uz/uz_nn/uz_nn.h"
#include "../IP_Cores/uz_mlp_three_layer/uz_mlp_three_layer.h"
#define NUMBER_OF_INPUTS 13U
#define NUMBER_OF_NEURONS_IN_FIRST_LAYER 64U
#define NUMBER_OF_NEURONS_IN_SECOND_LAYER 64U
#define NUMBER_OF_NEURONS_IN_THIRD_LAYER 64U
#define NUMBER_OF_OUTPUTS 4
#define NUMBER_OF_HIDDEN_LAYER 3
float x[NUMBER_OF_INPUTS] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f};
float w_1[NUMBER_OF_INPUTS * NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {
#include "layer1_weights.csv"
};
float b_1[NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {
#include "layer1_bias.csv"
};
float y_1[NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {0};
float w_2[NUMBER_OF_NEURONS_IN_FIRST_LAYER * NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {
#include "layer2_weights.csv"
};
float b_2[NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {
#include "layer2_bias.csv"
};
float y_2[NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {0};
float w_3[NUMBER_OF_NEURONS_IN_SECOND_LAYER * NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {
#include "layer3_weights.csv"
};
float b_3[NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {
#include "layer3_bias.csv"
};
float y_3[NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {0};
float w_4[NUMBER_OF_NEURONS_IN_THIRD_LAYER * NUMBER_OF_OUTPUTS] = {
#include "layer4_weights.csv"
};
float b_4[NUMBER_OF_OUTPUTS] = {
#include "layer4_bias.csv"
};
float y_4[NUMBER_OF_OUTPUTS] = {0};
struct uz_nn_layer_config software_nn_config[4] = {
    [0] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_FIRST_LAYER,
        .number_of_inputs = NUMBER_OF_INPUTS,
        .length_of_weights = UZ_MATRIX_SIZE(w_1),
        .length_of_bias = UZ_MATRIX_SIZE(b_1),
        .length_of_output = UZ_MATRIX_SIZE(y_1),
        .weights = w_1,
        .bias = b_1,
        .output = y_1},
    [1] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_SECOND_LAYER,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_FIRST_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_2),
        .length_of_bias = UZ_MATRIX_SIZE(b_2),
        .length_of_output = UZ_MATRIX_SIZE(y_2),
        .weights = w_2,
        .bias = b_2,
        .output = y_2},
    [2] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_THIRD_LAYER,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_SECOND_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_3),
        .length_of_bias = UZ_MATRIX_SIZE(b_3),
        .length_of_output = UZ_MATRIX_SIZE(y_3),
        .weights = w_3,
        .bias = b_3,
        .output = y_3},
    [3] = {
        .activation_function = linear,
        .number_of_neurons = NUMBER_OF_OUTPUTS,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_THIRD_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_4),
        .length_of_bias = UZ_MATRIX_SIZE(b_4),
        .length_of_output = UZ_MATRIX_SIZE(y_4),
        .weights = w_4,
        .bias = b_4,
        .output = y_4}};
float mlp_ip_output[NUMBER_OF_OUTPUTS] = {0}; // Data storage of network output for uz_matrix
void init_network(void){
    uz_nn_t* software_network = uz_nn_init(software_nn_config, 4);
    struct uz_mlp_three_layer_ip_config_t config = {
        .base_address = BASE_ADDRESS,
        .use_axi_input = true,
        .software_network = software_network};
    uz_mlp_three_layer_ip_t* mlp_ip_instance = uz_mlp_three_layer_ip_init(config);
    struct uz_matrix_t input_data = {0};
    struct uz_matrix_t output_data = {0};
    uz_matrix_t* p_input_data = uz_matrix_init(&input_data, x, UZ_MATRIX_SIZE(x), 1, UZ_MATRIX_SIZE(x));
    uz_matrix_t* p_output_data = uz_matrix_init(&output_data, mlp_ip_output, UZ_MATRIX_SIZE(mlp_ip_output), 1, UZ_MATRIX_SIZE(mlp_ip_output));
    uz_mlp_three_layer_ff_blocking(mlp_ip_instance, p_input_data, p_output_data);
    uz_nn_ff(software_network, p_input_data);
    // y_4 (calculated by the software network) is now "equal" (minus rounding error due to fixed point)
    // to mlp_ip_output (calculated by the IP-Core)
    // Use uz_nn_get_output_data to get the software nn data for further processing
}
Concurrent execution#
The regular calculation with the IP-Core through the software driver, with the inputs written by AXI (use_axi_input is true), is a blocking operation.
The driver triggers the calculation and waits until it is finished.
The processor cannot do any other tasks in the meantime.
uz_mlp_three_layer_ff_blocking(instance, input, output); // Takes 30us (example)
uz_sleep_useconds(10); // Takes 10us
// Takes 40us total
sequenceDiagram
    participant Processor
    participant Driver
    Processor->>Driver: uz_mlp_three_layer_ff_blocking
    Driver->>IP-Core: Write input
    Driver->>IP-Core: Trigger calculation
    loop
        Driver->>IP-Core: Read valid output
        Driver->>Driver: Valid output true?
    end
    Driver->>IP-Core: Read output
    Driver->>Processor: Return output values
An alternative to the blocking calculation is a concurrent approach. Here, the IP-Core calculation is triggered, the processor is free to do other tasks, and the result is fetched after the calculation is finished. This way, the calculation between trigger and result fetch does not add to the total required time if the task in between takes less time than the IP-Core calculation. Note that this refers to the actual calculation time of the network, without the communication overhead of the read/write operations.
uz_mlp_three_layer_ff_trigger(instance, input); // Returns immediately; IP-Core calculation takes 30us (example)
uz_sleep_useconds(10); // Takes 10us
uz_mlp_three_layer_ff_get_result_blocking(instance, output);
// Takes 30us total
sequenceDiagram
    participant Processor
    participant Driver
    Processor->>Driver: uz_mlp_three_layer_ff_trigger
    Driver->>IP-Core: Write input
    Driver->>IP-Core: Trigger calculation
    Driver->>Processor: return
    Processor->>Software: Do something else
    Software->>Processor: return
    Processor->>Driver: uz_mlp_three_layer_ff_get_result_blocking
    loop
        Driver->>IP-Core: Read valid output
        Driver->>Driver: Valid output true?
    end
    Driver->>IP-Core: Read output
    Driver->>Processor: Return output values
Unsafe version#
In addition to the regular functions to calculate a feedforward pass, unsafe versions of the driver exist (_unsafe suffix).
These functions are considerably faster than their safe counterparts (up to \(30~\mu s\)) but violate the software rules outlined in the Software Development Guidelines.
It is strongly advised to manually test by comparing the results of the safe and unsafe versions before using _unsafe!
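One way to perform this comparison is to run both variants on the same input and check the outputs element-wise. The helper below is an illustrative sketch, not part of the driver API; the tolerance is chosen slightly above the fixed-point resolution of 6.1e-5.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

// Illustrative helper (not part of the driver API): element-wise comparison
// of the outputs produced by the safe and unsafe feedforward variants.
static bool outputs_match(const float *safe_out, const float *unsafe_out,
                          size_t length, float tolerance)
{
    for (size_t i = 0; i < length; i++) {
        if (fabsf(safe_out[i] - unsafe_out[i]) > tolerance) {
            return false;
        }
    }
    return true;
}

// Sketched usage on the target (requires the hardware, shown as comments):
//   uz_mlp_three_layer_ff_blocking(instance, p_input_data, p_safe_output);
//   uz_mlp_three_layer_ff_blocking_unsafe(instance, p_input_data, p_unsafe_output);
//   bool ok = outputs_match(safe_array, unsafe_array, NUMBER_OF_OUTPUTS, 1e-4f);
```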
Driver reference#
-
typedef struct uz_mlp_three_layer_ip_t uz_mlp_three_layer_ip_t#
Object definition of the IP-Core driver.
-
struct uz_mlp_three_layer_ip_config_t#
Configuration struct for the IP-Core.
-
uz_mlp_three_layer_ip_t *uz_mlp_three_layer_ip_init(struct uz_mlp_three_layer_ip_config_t config)#
Initializes one driver instance of the IP-Core.
- Parameters:
config –
- Returns:
uz_mlp_three_layer_ip_t*
-
void uz_mlp_three_layer_ff_blocking(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data, uz_matrix_t *output_data)#
Calculates one forward pass of the network. This function is blocking in the sense that data is written to the IP-Core, the valid output flag is polled, and the output is read. I.e., this function waits for the IP-Core to finish the calculation.
- Parameters:
self – Pointer to IP-Core driver instance
input_data – Pointer to input data
output_data – Pointer to which the output data is written
-
void uz_mlp_three_layer_ff_trigger(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data)#
Triggers the calculation of one forward pass of the network. This function is non-blocking and returns after the calculation is started. This enables concurrent execution of code on the processor while the IP-Core calculates the result of the network.
- Parameters:
self –
input_data –
-
void uz_mlp_three_layer_ff_get_result_blocking(uz_mlp_three_layer_ip_t *self, uz_matrix_t *output_data)#
Returns the calculation result of the last forward pass of the network. The user has to trigger the calculation before getting the results. The function is blocking, i.e., the valid output flag of the IP-Core is polled and the first available valid data is read. Intended to be used with uz_mlp_three_layer_ff_trigger to allow concurrent calculations on PS and PL.
- Parameters:
self –
output_data –
-
void uz_mlp_three_layer_ff_blocking_unsafe(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data, uz_matrix_t *output_data)#
Same functionality as uz_mlp_three_layer_ff_blocking but violates coding rules to improve calculation speed. Approximately 30us faster than the safe version. Compare the results of the safe and unsafe versions before usage!
- Parameters:
self –
input_data –
output_data –
Implementation details#
Configuration#
The IP-Core has the following configuration possibilities.
enable_nn
Calculates one feedforward pass of the network with the current inputs. The calculation starts on a rising edge of enable_nn. Can be triggered either by software (AXI) or by an external port from the PL.
disable_pl_trigger
If set, the trigger from the PL is disabled. Thus, a rising edge on enable_nn from the PL does not trigger a calculation, and the calculation can only be triggered from the PS. Intended for debugging purposes if the PL trigger is connected to a recurring trigger such as the PWM or ADC IP-Core.
use_axi_input
The network uses the FPGA inputs for the feedforward pass if use_axi_input is FALSE. If use_axi_input is TRUE, the inputs from the AXI signals are used.
axi_number_of_inputs
Sets the number of inputs of the network. axi_number_of_inputs can be set to any value between 2 and 16. The value has to be consistent with the bias and weight values that are stored in the IP-Core!
axi_output_number_configuration
Sets the number of outputs of the network. axi_output_number_configuration can be set to 2, 4, 6, or 8 outputs. The value in this config register has to be set to \((number\_of\_outputs/2)-1\).
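The register encoding can be sketched as a small helper; the function name is illustrative and not part of the driver API.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative helper (not part of the driver API): encodes the number of
// outputs into the value expected by axi_output_number_configuration,
// i.e., (number_of_outputs / 2) - 1 for 2, 4, 6, or 8 outputs.
static uint32_t encode_output_number(uint32_t number_of_outputs)
{
    assert(number_of_outputs == 2U || number_of_outputs == 4U ||
           number_of_outputs == 6U || number_of_outputs == 8U);
    return (number_of_outputs / 2U) - 1U;
}
```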
Output scheme#
The output is always a vector with 8 elements, independent of the number of outputs configured by AXI. Due to the parallel calculation of the result, the following output mapping applies. Note that the software driver handles this mapping if the output is read by software. For using the output on the external ports in the PL, the user has to take the mapping into account.
For 8 outputs:
For 6 outputs:
For 4 outputs:
For 2 outputs:
Parallel calculation#
The calculation of the network is split up and done in parallel to speed it up. The split is done on a per-neuron basis in each layer, i.e., with a parallelization of 4, four DSP slices are used and each DSP calculates 1/4 of the output vector independently of the others.
Example with four inputs, parallelization of four, and eight neurons:
The multiplication \(xw\) is split up by splitting \(w\) into 4 parts.
The bias is split up by splitting \(b\) into 4 parts.
The results are calculated by:
The weight parameters are written to block RAM (BRAM) in the IP-Core for each layer with the following memory layout:
The bias parameters are written to block RAM (BRAM) in the IP-Core for each layer with the following memory layout:
Due to the parallelization, the matrix is split, e.g., into four parts for four parallel DSPs:
Note
This ordering is the transposed definition compared to what is used in Matrix math, to match the hardware setup of the IP-Core. Thus, a matrix of type uz_matrix_t has to be transposed. The init function of the driver handles this by calling uz_mlp_three_layer_set_weights, which writes the correct parameters into the BRAM of the IP-Core!
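For illustration, a plain transpose of a row-major matrix looks like the sketch below. This is only meant to convey the reordering; the driver performs the actual conversion internally via uz_mlp_three_layer_set_weights, and the exact BRAM layout is defined by the IP-Core.

```c
#include <stddef.h>

// Illustrative sketch only (not part of the driver API): transpose a
// row-major matrix of shape [rows x cols] into dst of shape [cols x rows].
static void transpose_row_major(const float *src, float *dst,
                                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```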
Write parameters to network#
To write parameters to the BRAM of the IP-Core, the following mechanism is used:
1. Write a zero to axi_wrEnBias to prevent writes to the wrong address
2. Write the number of the layer (one-based: input is 1, first hidden layer is 2, output layer is 4)
3. Write the data
4. Write the address (bias addresses are zero-based, weight addresses are one-based)
5. Write the number of the parallel PCU that shall be set (one-based!) to the enable register (axi_wrEnBias)
For bias:
1. Write the address to axi_bias_addr; the address of the bias is zero-based!
2. Write the data to axi_bias
3. Write the number of the parallel DSP to axi_write_bias_enable (one-based)
For weights:
The address is one-based!
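The write sequence can be illustrated against a mock register file. The struct and helper below are illustrative only; the field names mirror the AXI ports of the IP-Core, but the real AXI accesses, register offsets, and fixed-point conversion are handled by the driver.

```c
#include <stdint.h>

// Illustrative mock (not part of the driver API) of the parameter-write
// registers used by the bias write mechanism described above.
struct mock_param_regs {
    uint32_t axi_layerNr;
    int32_t axi_bias;
    uint32_t axi_bias_addr;
    uint32_t axi_wrEnBias;
};

static void write_one_bias(struct mock_param_regs *regs, uint32_t layer,
                           uint32_t addr, int32_t data, uint32_t parallel_unit)
{
    regs->axi_wrEnBias = 0U;            // 1. zero the enable to prevent stray writes
    regs->axi_layerNr = layer;          // 2. select the layer (one-based)
    regs->axi_bias = data;              // 3. write the data
    regs->axi_bias_addr = addr;         // 4. write the address (bias is zero-based)
    regs->axi_wrEnBias = parallel_unit; // 5. enable the parallel unit (one-based)
}
```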
Interfaces#
| Port Name | Port Type | Data Type | Target Platform Interfaces | Function |
|---|---|---|---|---|
| enable_nn | Input | bool | AXI | Triggers one feedforward pass of the network |
| x1 | Input | sfix32_En14 | External | Network input from FPGA |
| x2 | Input | sfix32_En14 | External | Network input from FPGA |
| x3 | Input | sfix32_En14 | External | Network input from FPGA |
| x4 | Input | sfix32_En14 | External | Network input from FPGA |
| x5 | Input | sfix32_En14 | External | Network input from FPGA |
| x6 | Input | sfix32_En14 | External | Network input from FPGA |
| x7 | Input | sfix32_En14 | External | Network input from FPGA |
| x8 | Input | sfix32_En14 | External | Network input from FPGA |
| x9 | Input | sfix32_En14 | External | Network input from FPGA |
| x10 | Input | sfix32_En14 | External | Network input from FPGA |
| x11 | Input | sfix32_En14 | External | Network input from FPGA |
| x12 | Input | sfix32_En14 | External | Network input from FPGA |
| x13 | Input | sfix32_En14 | External | Network input from FPGA |
| x14 | Input | sfix32_En14 | External | Network input from FPGA |
| x15 | Input | sfix32_En14 | External | Network input from FPGA |
| x16 | Input | sfix32_En14 | External | Network input from FPGA |
| axi_bias | Input | sfix32_En14 | AXI | Bias data |
| axi_weight | Input | sfix32_En14 | AXI | Weight data |
| axi_bias_addr | Input | ufix10 | AXI | Address to which bias data is written |
| axi_weight_addr | Input | ufix10 | AXI | Address to which weight data is written |
| axi_wrEnBias | Input | uint8 | AXI | Enables write to bias – number of parallel MAC (1 to 8) |
| axi_wrEnWeights | Input | uint8 | AXI | Enables write to weights – number of parallel MAC (1 to 8) |
| axi_layerNr | Input | uint8 | AXI | Determines to which layer the parameters are written |
| use_axi_input | Input | bool | AXI | Uses AXI signals as inputs for the network if signal is TRUE |
| axi_x_input | Input | sfix32_En14 (8) | AXI | Input to network |
| axi_number_of_inputs | Input | ufix10 | AXI | Sets the number of inputs of the network |
| axi_output_number_configuration | Input | ufix10 | AXI | Sets the number of outputs – set to (num_outputs/2)-1 – only 2, 4, 6, 8 are supported |
| layer2_rdy | Output | bool | External | Unused debug |
| finished | Output | bool | External | Single TRUE pulse after calculation is finished |
| out | Output | sfix32_En14 | External | Output of neural network |
| out1 | Output | sfix32_En14 | External | Output of neural network |
| out2 | Output | sfix32_En14 | External | Output of neural network |
| out3 | Output | sfix32_En14 | External | Output of neural network |
| out4 | Output | sfix32_En14 | External | Output of neural network |
| out5 | Output | sfix32_En14 | External | Output of neural network |
| out6 | Output | sfix32_En14 | External | Output of neural network |
| out7 | Output | sfix32_En14 | External | Output of neural network |
| biasData | Output | sfix32_En14 | External | Debug |
| biasWriteAddr | Output | ufix10 | External | Debug |
| biasWriteEnable | Output | bool | External | Enable to write bias value (debug) |
| valid_output | Output | bool | External | TRUE after a calculation is finished and a new one has not started yet |
| axi_nn_output | Output | sfix32_En14 (8) | AXI | Output vector of network with 8 elements |
| axi_valid_output | Output | bool | AXI | TRUE after a calculation is finished and a new one has not started yet |
| disable_pl_trigger | Input | bool | AXI | enable_nn does not trigger calculation if set to TRUE |