uz_mlp_three_layer#
This IP-Core implements a three-layer MLP network based on [1]. The implementation and nomenclature follow the principles outlined in uz_nn. The MLP is hard-coded to have three hidden layers with ReLU activation for the hidden layers. The output layer uses a linear activation.
Features:
- Feedforward calculation in fixed point (32 bit, 14 fractional bits, signed)
- Precision: 6.1e-5
- Max values: +/- 131072
- No overflow detection for the fixed-point data type! The user has to make sure that the min/max values are not violated
- Variable number of inputs, configurable by software from 2 to 16
- Variable number of outputs, configurable by software (2, 4, 6, or 8)
- Number of hidden layers is fixed to 3!
- Number of neurons in each hidden layer is fixed to 64!
- Activation function is ReLU for all hidden layers (with saturation at the max value!)
- Activation function is linear for the output layer
- Bias and weights can be written to the network at runtime
- Fully compatible with uz_nn to use the IP-Core as an accelerator
- Uses Matrix math as input and output
The IP-Core is always configured by the processing system.
The configuration (e.g., number of inputs) is valid both for inputs from AXI and from the PL (depending on use_axi_input).
The calculation of one forward pass is triggered by a rising edge either from AXI or from the PL (enable_nn port).
The calculation trigger is an OR between the AXI and PL ports; thus, no priority is applied.
The output valid signal is low during calculation and high when a valid result is available on the output ports.
If a calculation is triggered before the previous calculation is finished, the trigger is ignored.
Usage#
The usage of the IP-Core driver depends heavily on uz_nn.
First, an instance of the software network has to be initialized, e.g., by loading parameters from a header.
Additionally, an array for the output data of the IP-Core has to be declared (see Matrix math).
The uz_mlp_three_layer_ip_init function writes all parameters of the network into the IP-Core.
Thus, the network exists twice: one copy in the processor and one copy in the IP-Core (parameters are stored in BRAM).
During execution, only the input and output values are written.
Note that the Global configuration has to be adjusted to include at least one MLP IP-Core driver instance, one software network and four layers.
#include "../uz/uz_nn/uz_nn.h"
#include "../IP_Cores/uz_mlp_three_layer/uz_mlp_three_layer.h"
#define NUMBER_OF_INPUTS 13U
#define NUMBER_OF_NEURONS_IN_FIRST_LAYER 64U
#define NUMBER_OF_NEURONS_IN_SECOND_LAYER 64U
#define NUMBER_OF_NEURONS_IN_THIRD_LAYER 64U
#define NUMBER_OF_OUTPUTS 4
#define NUMBER_OF_HIDDEN_LAYER 3
float x[NUMBER_OF_INPUTS] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f};
float w_1[NUMBER_OF_INPUTS * NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {
#include "layer1_weights.csv"
};
float b_1[NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {
#include "layer1_bias.csv"
};
float y_1[NUMBER_OF_NEURONS_IN_FIRST_LAYER] = {0};
float w_2[NUMBER_OF_NEURONS_IN_FIRST_LAYER * NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {
#include "layer2_weights.csv"
};
float b_2[NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {
#include "layer2_bias.csv"
};
float y_2[NUMBER_OF_NEURONS_IN_SECOND_LAYER] = {0};
float w_3[NUMBER_OF_NEURONS_IN_SECOND_LAYER * NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {
#include "layer3_weights.csv"
};
float b_3[NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {
#include "layer3_bias.csv"
};
float y_3[NUMBER_OF_NEURONS_IN_THIRD_LAYER] = {0};
float w_4[NUMBER_OF_NEURONS_IN_THIRD_LAYER * NUMBER_OF_OUTPUTS] = {
#include "layer4_weights.csv"
};
float b_4[NUMBER_OF_OUTPUTS] = {
#include "layer4_bias.csv"
};
float y_4[NUMBER_OF_OUTPUTS] = {0};
struct uz_nn_layer_config software_nn_config[4] = {
    [0] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_FIRST_LAYER,
        .number_of_inputs = NUMBER_OF_INPUTS,
        .length_of_weights = UZ_MATRIX_SIZE(w_1),
        .length_of_bias = UZ_MATRIX_SIZE(b_1),
        .length_of_output = UZ_MATRIX_SIZE(y_1),
        .weights = w_1,
        .bias = b_1,
        .output = y_1},
    [1] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_SECOND_LAYER,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_FIRST_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_2),
        .length_of_bias = UZ_MATRIX_SIZE(b_2),
        .length_of_output = UZ_MATRIX_SIZE(y_2),
        .weights = w_2,
        .bias = b_2,
        .output = y_2},
    [2] = {
        .activation_function = ReLU,
        .number_of_neurons = NUMBER_OF_NEURONS_IN_THIRD_LAYER,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_SECOND_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_3),
        .length_of_bias = UZ_MATRIX_SIZE(b_3),
        .length_of_output = UZ_MATRIX_SIZE(y_3),
        .weights = w_3,
        .bias = b_3,
        .output = y_3},
    [3] = {
        .activation_function = linear,
        .number_of_neurons = NUMBER_OF_OUTPUTS,
        .number_of_inputs = NUMBER_OF_NEURONS_IN_THIRD_LAYER,
        .length_of_weights = UZ_MATRIX_SIZE(w_4),
        .length_of_bias = UZ_MATRIX_SIZE(b_4),
        .length_of_output = UZ_MATRIX_SIZE(y_4),
        .weights = w_4,
        .bias = b_4,
        .output = y_4}};
float mlp_ip_output[NUMBER_OF_OUTPUTS] = {0}; // Data storage of network output for uz_matrix
void init_network(void){
    uz_nn_t* software_network = uz_nn_init(software_nn_config, 4);
    struct uz_mlp_three_layer_ip_config_t config = {
        .base_address = BASE_ADDRESS,
        .use_axi_input = true,
        .software_network = software_network};
    uz_mlp_three_layer_ip_t* mlp_ip_instance = uz_mlp_three_layer_ip_init(config);
    struct uz_matrix_t input_data = {0};
    struct uz_matrix_t output_data = {0};
    uz_matrix_t* p_input_data = uz_matrix_init(&input_data, x, UZ_MATRIX_SIZE(x), 1, UZ_MATRIX_SIZE(x));
    uz_matrix_t* p_output_data = uz_matrix_init(&output_data, mlp_ip_output, UZ_MATRIX_SIZE(mlp_ip_output), 1, UZ_MATRIX_SIZE(mlp_ip_output));
    uz_mlp_three_layer_ff_blocking(mlp_ip_instance, p_input_data, p_output_data);
    uz_nn_ff(software_network, p_input_data);
    // y_4 (calculated by the software network) is now "equal" (minus rounding error due to fixed point)
    // to mlp_ip_output (calculated by the IP-Core)
    // Use uz_nn_get_output_data to get the software nn data for further processing
}
Concurrent execution#
The regular calculation with the IP-Core through the software driver, with the inputs written by AXI (use_axi_input is true), is a blocking operation.
The driver triggers the calculation and waits until it is finished.
The processor cannot do any other tasks in the meantime.
uz_mlp_three_layer_ff_blocking(instance, input, output); // Takes 30us (example)
uz_sleep_useconds(10); // Takes 10us
// Takes 40us total
sequenceDiagram
    participant Processor
    participant Driver
    Processor->>Driver: uz_mlp_three_layer_ff_blocking
    Driver->>IP-Core: Write input
    Driver->>IP-Core: Trigger calculation
    loop
        Driver->>IP-Core: Read valid output
        Driver->>Driver: Valid output true?
    end
    Driver->>IP-Core: Read output
    Driver->>Processor: Return output values
An alternative to the blocking calculation is a concurrent approach. Here, the IP-Core calculation is triggered, the processor is free to do other tasks, and the result is fetched after the calculation is finished. This way, the calculation between trigger and result fetch does not add to the total required time if the task in between takes less time than the IP-Core calculation. Note that this refers to the actual calculation time of the network, without the communication overhead of the read/write operations.
uz_mlp_three_layer_ff_trigger(instance, input); // Returns immediately; IP-Core calculation takes 30us (example)
uz_sleep_useconds(10); // Takes 10us
uz_mlp_three_layer_ff_get_result_blocking(instance, output);
// Takes 30us total
sequenceDiagram
    participant Processor
    participant Driver
    Processor->>Driver: uz_mlp_three_layer_ff_trigger
    Driver->>IP-Core: Write input
    Driver->>IP-Core: Trigger calculation
    Driver->>Processor: return
    Processor->>Software: Do something else
    Software->>Processor: return
    Processor->>Driver: uz_mlp_three_layer_ff_get_result_blocking
    loop
        Driver->>IP-Core: Read valid output
        Driver->>Driver: Valid output true?
    end
    Driver->>IP-Core: Read output
    Driver->>Processor: Return output values
Unsafe version#
In addition to the regular functions to calculate a feedforward pass, unsafe versions of the driver exist (_unsafe suffix).
These functions are considerably faster than their safe counterparts (up to \(30~\mu s\)) but violate the software rules outlined in the Software Development Guidelines.
It is strongly advised to manually test by comparing the results of the safe and unsafe versions before using _unsafe!
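One way to perform this comparison is to run both variants on the same input and check the outputs element-wise. The helper below is an illustrative sketch, not part of the driver API; the tolerance is chosen slightly above the fixed-point resolution of 6.1e-5.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

// Illustrative helper (not part of the driver API): element-wise comparison
// of the outputs produced by the safe and unsafe feedforward variants.
static bool outputs_match(const float *safe_out, const float *unsafe_out,
                          size_t length, float tolerance)
{
    for (size_t i = 0; i < length; i++) {
        if (fabsf(safe_out[i] - unsafe_out[i]) > tolerance) {
            return false;
        }
    }
    return true;
}

// Sketched usage on the target (requires the hardware, shown as comments):
//   uz_mlp_three_layer_ff_blocking(instance, p_input_data, p_safe_output);
//   uz_mlp_three_layer_ff_blocking_unsafe(instance, p_input_data, p_unsafe_output);
//   bool ok = outputs_match(safe_array, unsafe_array, NUMBER_OF_OUTPUTS, 1e-4f);
```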
Driver reference#
-
typedef struct uz_mlp_three_layer_ip_t uz_mlp_three_layer_ip_t#
Object definition of the IP-Core driver.
-
struct uz_mlp_three_layer_ip_config_t#
Configuration struct for the IP-Core.
-
uz_mlp_three_layer_ip_t *uz_mlp_three_layer_ip_init(struct uz_mlp_three_layer_ip_config_t config)#
Initializes one driver instance of the IP-Core.
- Parameters:
config –
- Returns:
uz_mlp_three_layer_ip_t*
-
void uz_mlp_three_layer_ff_blocking(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data, uz_matrix_t *output_data)#
Calculates one forward pass of the network. This function is blocking in the sense that data is written to the IP-Core, the valid output flag is polled, and the output is read. I.e., this function waits for the IP-Core to finish the calculation.
- Parameters:
self – Pointer to IP-Core driver instance
input_data – Pointer to input data
output_data – Pointer to which the output data is written
-
void uz_mlp_three_layer_ff_trigger(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data)#
Triggers the calculation of one forward pass of the network. This function is non-blocking and returns after the calculation is started. This enables concurrent execution of code on the processor while the IP-Core calculates the result of the network.
- Parameters:
self –
input_data –
-
void uz_mlp_three_layer_ff_get_result_blocking(uz_mlp_three_layer_ip_t *self, uz_matrix_t *output_data)#
Returns the calculation result of the last forward pass of the network. The user has to trigger the calculation before getting the results. The function is blocking, i.e., the valid output flag of the IP-Core is polled and the first available valid data is read. Intended to be used with uz_mlp_three_layer_ff_trigger to allow concurrent calculations on PS and PL.
- Parameters:
self –
output_data –
-
void uz_mlp_three_layer_ff_blocking_unsafe(uz_mlp_three_layer_ip_t *self, uz_matrix_t *input_data, uz_matrix_t *output_data)#
Same functionality as uz_mlp_three_layer_ff_blocking but violates coding rules to improve calculation speed. Approximately 30us faster than the safe version. Compare the results of the safe and unsafe versions before usage!
- Parameters:
self –
input_data –
output_data –
Implementation details#
Configuration#
The IP-Core has the following configuration possibilities.
enable_nn
Calculates one feedforward pass of the network with the current inputs. The calculation starts on a rising edge of enable_nn. Can be triggered either by software (AXI) or by an external port from the PL.
disable_pl_trigger
If set, the trigger from the PL is disabled. Thus, a rising edge on enable_nn from the PL does not trigger a calculation, and the calculation can only be triggered from the PS. Intended for debugging purposes if the PL trigger is connected to a recurring trigger such as the PWM or ADC IP-Core.
use_axi_input
The network uses the FPGA inputs for the feedforward pass if use_axi_input is FALSE. If use_axi_input is TRUE, the inputs from the AXI signals are used.
axi_number_of_inputs
Sets the number of inputs of the network. axi_number_of_inputs can be set to any value between 2 and 16. The value has to be consistent with the bias and weight values that are stored in the IP-Core!
axi_output_number_configuration
Sets the number of outputs of the network. axi_output_number_configuration can be set to 2, 4, 6, or 8 outputs. The value in this config register has to be set to \((number\_of\_outputs/2)-1\).
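The register encoding can be sketched as a small helper; the function name is illustrative and not part of the driver API.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative helper (not part of the driver API): encodes the number of
// outputs into the value expected by axi_output_number_configuration,
// i.e., (number_of_outputs / 2) - 1 for 2, 4, 6, or 8 outputs.
static uint32_t encode_output_number(uint32_t number_of_outputs)
{
    assert(number_of_outputs == 2U || number_of_outputs == 4U ||
           number_of_outputs == 6U || number_of_outputs == 8U);
    return (number_of_outputs / 2U) - 1U;
}
```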
Output scheme#
The output is always a vector with 8 elements, independent of the number of outputs configured by AXI. Due to the parallel calculation of the result, the following output mapping applies. Note that the software driver handles this mapping if the output is read by software. For using the output on the external ports in the PL, the user has to take the mapping into account.
For 8 outputs:
For 6 outputs:
For 4 outputs:
For 2 outputs:
Parallel calculation#
The calculation of the network is split up and done in parallel to speed it up. The split is done on a per-neuron basis in each layer, i.e., with a parallelization of 4, four DSP slices are used and each DSP calculates 1/4 of the output vector independently of the others.
Example with four inputs, parallelization of four, and eight neurons:
The multiplication \(xw\) is split up by splitting \(w\) into 4 parts.
The bias is split up by splitting \(b\) into 4 parts.
The results are calculated by:
The weight parameters are written to block RAM (BRAM) in the IP-Core for each layer with the following memory layout:
The bias parameters are written to block RAM (BRAM) in the IP-Core for each layer with the following memory layout:
Due to the parallelization, the matrix is split, e.g., into four parts for four parallel DSPs:
Note
This ordering is the transposed definition compared to what is used in Matrix math, to match the hardware setup of the IP-Core. Thus, a matrix of type uz_matrix_t has to be transposed. The init function of the driver handles this by calling uz_mlp_three_layer_set_weights, which writes the correct parameters into the BRAM of the IP-Core!
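For illustration, a plain transpose of a row-major matrix looks like the sketch below. This is only meant to convey the reordering; the driver performs the actual conversion internally via uz_mlp_three_layer_set_weights, and the exact BRAM layout is defined by the IP-Core.

```c
#include <stddef.h>

// Illustrative sketch only (not part of the driver API): transpose a
// row-major matrix of shape [rows x cols] into dst of shape [cols x rows].
static void transpose_row_major(const float *src, float *dst,
                                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```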
Write parameters to network#
To write parameters to the BRAM of the IP-Core, the following mechanism is used:
1. Write a zero to axi_wrEnBias to prevent writes to the wrong address
2. Write the number of the layer (one-based: input is 1, first hidden layer is 2, output layer is 4)
3. Write the data
4. Write the address (bias addresses are zero-based, weight addresses are one-based)
5. Write the number of the parallel PCU that shall be set (one-based!) to the enable register (axi_wrEnBias)
For bias:
1. Write the address to axi_bias_addr; the address of the bias is zero-based!
2. Write the data to axi_bias
3. Write the number of the parallel DSP to axi_write_bias_enable (one-based)
For weights:
The address is one-based!
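The write sequence can be illustrated against a mock register file. The struct and helper below are illustrative only; the field names mirror the AXI ports of the IP-Core, but the real AXI accesses, register offsets, and fixed-point conversion are handled by the driver.

```c
#include <stdint.h>

// Illustrative mock (not part of the driver API) of the parameter-write
// registers used by the bias write mechanism described above.
struct mock_param_regs {
    uint32_t axi_layerNr;
    int32_t axi_bias;
    uint32_t axi_bias_addr;
    uint32_t axi_wrEnBias;
};

static void write_one_bias(struct mock_param_regs *regs, uint32_t layer,
                           uint32_t addr, int32_t data, uint32_t parallel_unit)
{
    regs->axi_wrEnBias = 0U;            // 1. zero the enable to prevent stray writes
    regs->axi_layerNr = layer;          // 2. select the layer (one-based)
    regs->axi_bias = data;              // 3. write the data
    regs->axi_bias_addr = addr;         // 4. write the address (bias is zero-based)
    regs->axi_wrEnBias = parallel_unit; // 5. enable the parallel unit (one-based)
}
```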
Interfaces#
| Port Name | Port Type | Data Type | Target Platform Interfaces | Function |
|---|---|---|---|---|
| enable_nn | Input | bool | AXI | Triggers one feedforward pass of the network |
| x1 | Input | sfix32_En14 | External | Network input from FPGA |
| x2 | Input | sfix32_En14 | External | Network input from FPGA |
| x3 | Input | sfix32_En14 | External | Network input from FPGA |
| x4 | Input | sfix32_En14 | External | Network input from FPGA |
| x5 | Input | sfix32_En14 | External | Network input from FPGA |
| x6 | Input | sfix32_En14 | External | Network input from FPGA |
| x7 | Input | sfix32_En14 | External | Network input from FPGA |
| x8 | Input | sfix32_En14 | External | Network input from FPGA |
| x9 | Input | sfix32_En14 | External | Network input from FPGA |
| x10 | Input | sfix32_En14 | External | Network input from FPGA |
| x11 | Input | sfix32_En14 | External | Network input from FPGA |
| x12 | Input | sfix32_En14 | External | Network input from FPGA |
| x13 | Input | sfix32_En14 | External | Network input from FPGA |
| x14 | Input | sfix32_En14 | External | Network input from FPGA |
| x15 | Input | sfix32_En14 | External | Network input from FPGA |
| x16 | Input | sfix32_En14 | External | Network input from FPGA |
| axi_bias | Input | sfix32_En14 | AXI | Bias data |
| axi_weight | Input | sfix32_En14 | AXI | Weight data |
| axi_bias_addr | Input | ufix10 | AXI | Address to which bias data is written |
| axi_weight_addr | Input | ufix10 | AXI | Address to which weight data is written |
| axi_wrEnBias | Input | uint8 | AXI | Enables write to bias – number of parallel MAC (1 to 8) |
| axi_wrEnWeights | Input | uint8 | AXI | Enables write to weights – number of parallel MAC (1 to 8) |
| axi_layerNr | Input | uint8 | AXI | Determines to which layer the parameters are written |
| use_axi_input | Input | bool | AXI | Uses AXI signals as inputs for the network if signal is TRUE |
| axi_x_input | Input | sfix32_En14 (8) | AXI | Input to network |
| axi_number_of_inputs | Input | ufix10 | AXI | Sets the number of inputs of the network |
| axi_output_number_configuration | Input | ufix10 | AXI | Sets the number of outputs – set to (num_outputs/2)-1 – only 2, 4, 6, 8 are supported |
| layer2_rdy | Output | bool | External | Unused debug |
| finished | Output | bool | External | Single TRUE pulse after calculation is finished |
| out | Output | sfix32_En14 | External | Output of neural network |
| out1 | Output | sfix32_En14 | External | Output of neural network |
| out2 | Output | sfix32_En14 | External | Output of neural network |
| out3 | Output | sfix32_En14 | External | Output of neural network |
| out4 | Output | sfix32_En14 | External | Output of neural network |
| out5 | Output | sfix32_En14 | External | Output of neural network |
| out6 | Output | sfix32_En14 | External | Output of neural network |
| out7 | Output | sfix32_En14 | External | Output of neural network |
| biasData | Output | sfix32_En14 | External | Debug |
| biasWriteAddr | Output | ufix10 | External | Debug |
| biasWriteEnable | Output | bool | External | Enable to write bias value (debug) |
| valid_output | Output | bool | External | TRUE after a calculation is finished and a new one has not started yet |
| axi_nn_output | Output | sfix32_En14 (8) | AXI | Output vector of network with 8 elements |
| axi_valid_output | Output | bool | AXI | TRUE after a calculation is finished and a new one has not started yet |
| disable_pl_trigger | Input | bool | AXI | enable_nn does not trigger calculation if set to TRUE |