What are the basics of TensorFlow?

Introduction to TensorFlow

TensorFlow is currently one of the most important frameworks for programming neural networks, deep learning models and other machine learning algorithms. It is based on a low-level C++ backend, which is controlled via a Python library. TensorFlow can be run on both CPUs and GPUs (including clusters). Recently, an R package has also become available with which TensorFlow can be used. TensorFlow is not software for deep learning or data science beginners; it is clearly aimed at experienced users with solid programming knowledge. For some time now, however, there has been Keras, a high-level API built on top of TensorFlow that provides simplified functions and makes the implementation of standard models quick and easy. This is interesting not only for deep learning beginners but also for experts, who can prototype their models faster and more efficiently using Keras.

The following article explains the central elements and concepts of TensorFlow in more detail and illustrates them with a practical example. The focus is not on the formal mathematical representation of how neural networks work, but on the basic concepts and terminology of TensorFlow and their implementation in Python.

Tensors

In its original meaning, a tensor described the absolute value of so-called quaternions, complex numbers that extend the range of the real numbers. Nowadays, however, this meaning is no longer used. Today, a tensor is understood as a generalization of scalars, vectors and matrices. A two-dimensional tensor is therefore a matrix with rows and columns (i.e. two dimensions). Higher-dimensional matrices in particular are often referred to as tensors, but the concept is in principle independent of the number of dimensions: a vector, for example, can be described as a one-dimensional tensor. It is these tensors that flow in TensorFlow, namely through the so-called graph.

The graph

The basic functionality of TensorFlow is based on a so-called graph. This refers to an abstract representation of the underlying mathematical problem in the form of a directed graph. The graph consists of nodes and edges that are connected to each other. The nodes represent data and mathematical operations. By connecting the nodes correctly, a graph can be created that contains all the data and mathematical operations necessary to build a neural network. The following example illustrates the basic functionality:

In the picture above, we want to add two numbers. The two numbers are stored in two input variables, which flow through the graph up to the square node, where the addition is carried out. The result of the addition is stored in a third variable. The input variables can be understood as placeholders (called "placeholder" in TensorFlow): all numbers fed into them are processed in the same way. This abstract representation of the mathematical operations to be performed is at the core of TensorFlow. The following code shows the implementation of this simple example in Python:
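The listing below is a minimal sketch in the TensorFlow 1.x API; the variable names a, b and c are illustrative.

```python
import tensorflow as tf

# Placeholders for the two numbers to be added (8-bit integers)
a = tf.placeholder(dtype=tf.int8)
b = tf.placeholder(dtype=tf.int8)

# Node that carries out the addition
c = tf.add(a, b)

# Initialize the graph in a session and execute it at node c
with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: 5, b: 4})
    print(result)  # 9
```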

First, the TensorFlow library is imported. Then the two placeholders a and b are defined using tf.placeholder. Since TensorFlow is based on a C++ backend, the data types of the placeholders must be defined in advance and cannot be adjusted at runtime. This is done within tf.placeholder via the argument dtype=tf.int8, which corresponds to an 8-bit integer. The function tf.add is then used to add the two placeholders and store the result in the variable c. The graph is initialized using tf.Session() and then executed at node c using sess.run(). Of course, this example is a trivial operation; the steps and calculations required in neural networks are significantly more complex. The basic principle of graph-based execution remains the same, however.

Placeholders

As already described, placeholders play a central role in TensorFlow. Placeholders usually contain all the data required for training the neural network. These are usually the inputs (the inputs to the model) and the outputs (the variables to be predicted).
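The following sketch shows how such placeholders can be defined; the names X and Y, the data type tf.float32 and the input dimension of 10 are illustrative and match the application example below.

```python
import tensorflow as tf

# Placeholder for the model inputs: an unspecified number of
# observations (None), each with 10 input features
X = tf.placeholder(dtype=tf.float32, shape=[None, 10])

# Placeholder for the actually observed outputs: a one-dimensional
# tensor of unspecified length
Y = tf.placeholder(dtype=tf.float32, shape=[None])
```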

In the code example above, two placeholders are defined. X serves as a placeholder for the inputs of the model, Y as a placeholder for the actually observed outputs in the data. In addition to the data type of the placeholders, the dimension of the tensors stored in them must also be defined. This is controlled via the function argument shape. In the example, the inputs are a tensor of dimension [None, 10] and the outputs a one-dimensional tensor. The parameter None instructs TensorFlow to keep this dimension flexible, since at this stage it is still unclear how many observations the training data will contain.

Variables

In addition to placeholders, variables are another core concept of how TensorFlow works. While placeholders are used to store the input and output data, variables are flexible and can change their values during the runtime of the calculation. The most important areas of application for variables in neural networks are the weighting matrices of the neurons (weights) and the bias vectors (biases), which are continuously adapted to the data during training. The variables for a single-layer feedforward network are defined in the following code block.
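The block below is a sketch assuming 10 inputs, a hidden layer with 64 neurons and 1 output; the names and sizes are illustrative, and the initializer functions are explained in the next section.

```python
import tensorflow as tf

n_inputs = 10   # number of input features
n_hidden = 64   # number of neurons in the hidden layer
n_outputs = 1   # number of outputs

# Initializer functions for weights and biases (see next section)
weight_initializer = tf.variance_scaling_initializer()
bias_initializer = tf.zeros_initializer()

# Weights and biases between the input and hidden layer
W_hidden = tf.Variable(weight_initializer([n_inputs, n_hidden]))
bias_hidden = tf.Variable(bias_initializer([n_hidden]))

# Weights and biases between the hidden and output layer
W_out = tf.Variable(weight_initializer([n_hidden, n_outputs]))
bias_out = tf.Variable(bias_initializer([n_outputs]))
```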

In the example code, 10 inputs and 1 output are defined; the number of neurons in the hidden layer is 64. In the next step, the required variables are instantiated. For a simple feedforward network, the weighting matrix and bias values between the input and hidden layer are needed first. These are created in the objects W_hidden and bias_hidden using the function tf.Variable(). Within tf.Variable(), an initializer function is used, which we will discuss in more detail in the next section. After defining the required variables between the input and hidden layer, the weights and biases between the hidden and output layer are instantiated in the same way.

It is important to understand which dimensions the matrices of weights and biases must assume so that they can be processed correctly. The rule of thumb for weighting matrices in simple feedforward networks is that the second dimension of the previous layer is the first dimension of the current layer. What sounds complex at first is ultimately nothing more than the passing on of outputs from layer to layer in the network. The dimension of the bias values usually corresponds to the number of neurons in the current layer. In the above example, 10 inputs and 64 neurons result in a weighting matrix of shape [10, 64] and a bias vector of size [64] in the hidden layer. Between the hidden and output layer, the weighting matrix has the shape [64, 1] and the bias vector the shape [1].

Initialization

When defining the variables in the code block of the previous section, the initializer functions tf.variance_scaling_initializer() and tf.zeros_initializer() were used. The way in which the initial weighting matrices and bias vectors are filled has a major influence on how quickly and how well the model can adapt to the available data. This is because neural networks and deep learning models are trained using numerical optimization methods, which always start adapting the parameters of the model from a certain starting position. If an advantageous starting position for training the neural network is selected, this generally has a positive effect on the computation time and the quality of fit of the model.

Various initialization strategies are implemented in TensorFlow, ranging from matrices with a constant value (e.g. tf.zeros_initializer), over random values (e.g. tf.random_normal_initializer or tf.truncated_normal_initializer), to more complex functions such as tf.glorot_normal_initializer or tf.variance_scaling_initializer. Depending on which initialization of the weights and bias values is carried out, the result of the model training can vary to a greater or lesser extent.

In our example, we use two different initialization strategies for the weights and bias values: while tf.variance_scaling_initializer() is used to initialize the weights, tf.zeros_initializer() is used for the bias values.

Network architecture design

After implementing the necessary weighting and bias variables, the next step is to create the network architecture, also known as the topology. Both placeholders and variables are combined with one another in the form of successive matrix multiplications.

Furthermore, when specifying the topology, the activation functions of the neurons are defined. Activation functions carry out a non-linear transformation of the outputs of the hidden layer before they are passed on to the next layer. This makes the entire system non-linear, enabling it to adapt to both linear and non-linear functions. Countless activation functions for neural networks have evolved over time. Today's standard in the development of deep learning models is the so-called Rectified Linear Unit (ReLU), which has proven advantageous in many applications.
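Using the placeholders and variables sketched above, the topology of our simple feedforward network could be specified as follows:

```python
# Hidden layer: multiply the inputs by the weighting matrix, add the
# bias values and apply the non-linear ReLU transformation
hidden = tf.nn.relu(tf.add(tf.matmul(X, W_hidden), bias_hidden))

# Output layer: linear combination of the hidden layer's outputs,
# transposed so that its shape matches the output placeholder Y
out = tf.transpose(tf.add(tf.matmul(hidden, W_out), bias_out))
```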

ReLU activation functions are implemented in TensorFlow using tf.nn.relu. The activation function receives the output of the matrix multiplication between the placeholder X and the weighting matrix, plus the addition of the bias values, and transforms it non-linearly. The result of the transformation is passed on as output to the next layer, which uses it as input for a new matrix multiplication. Since the second layer is already the output layer, no new ReLU transformation is carried out in this example. In order for the dimensionality of the output layer to match that of the data, the output matrix must also be transposed using tf.transpose. Otherwise, problems may arise when estimating the model.

The above figure is intended to schematically illustrate the architecture of the network. The model consists of three parts: (1) the input layer, (2) the hidden layer and (3) the output layer. This architecture is called a feedforward network. Feedforward describes the fact that the data only flows in one direction through the network. Other types of neural networks and deep learning models include architectures that allow the data to move "backwards" or in loops in the network.

Cost function

The cost function of the neural network is used to calculate a measure of the deviation between the model's forecast and the actually observed data. Various cost functions are available, depending on whether the problem is a classification or a regression. Today, so-called cross-entropy is mostly used for classification, while the mean squared error (MSE) is used for regression. In principle, any mathematically differentiable function can be used as a cost function.
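With the names from the sketches above, an MSE cost function could be defined like this:

```python
# Mean squared error between the model's forecasts (out)
# and the actually observed outputs (Y)
mse = tf.reduce_mean(tf.squared_difference(out, Y))
```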

In the example above, the mean squared error is implemented as a cost function. TensorFlow provides the functions tf.reduce_mean and tf.squared_difference for this purpose, which can easily be combined with one another. The function arguments of tf.squared_difference are, on the one hand, the placeholder Y, which contains the actually observed outputs, and, on the other hand, the object out, which contains the forecasts generated by the model. The actually observed data and the model forecasts thus converge at the cost function, where they are compared with one another.

Optimizer

The optimizer has the task of adapting the weights and bias values of the network during training, based on the model deviations calculated by the cost function. To do this, TensorFlow computes so-called gradients of the cost function, which indicate the direction in which the weights and bias values must be adjusted in order to minimize the cost function of the model. The development of fast and stable optimizers is a major branch of research in the field of neural networks and deep learning.
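Continuing the sketch, the optimizer can be attached to the previously defined cost function as follows:

```python
# Adam optimizer, instructed to minimize the MSE cost function
opt = tf.train.AdamOptimizer().minimize(mse)
```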

Here the so-called Adam optimizer is used, which is currently one of the most frequently used optimizers. Adam stands for Adaptive Moment Estimation and is a methodical combination of two other optimization techniques (AdaGrad and RMSProp). At this point, we will not go into the mathematical details of the optimizers, as this would go far beyond the scope of this introduction. The important point is that there are different optimizers, based on different strategies for calculating the necessary adjustments to the weights and bias values.

Session

The TensorFlow session is the basic framework for executing the graph. A session is started with the command tf.Session(). Before a session has been started, no calculation can be carried out within the graph.
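A minimal sketch (the object name sess is illustrative):

```python
# Instantiate a session and initialize all variables in the graph
sess = tf.Session()
sess.run(tf.global_variables_initializer())
```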

In the code example, a TensorFlow session is instantiated in the object sess, which can then execute any point in the graph. In the current development builds of TensorFlow, there are first approaches to executing code without defining a session (eager execution); however, this is not yet included in the stable build.

Training

After the necessary components of the neural network have been defined, they can now be connected with one another as part of the model training. Today, neural networks are usually trained via so-called minibatch training. Minibatch means that repeated random samples of the inputs and outputs are used to adjust the weights and bias values of the network. For this purpose, a parameter is defined that controls the size of the random sample (batch) of data. The batches are typically drawn without replacement, so that each observation in the data set is presented to the network exactly once per training round (also called an epoch). The number of epochs is likewise defined as a parameter by the user.

The individual batches are transferred to the TensorFlow graph via the previously created placeholders, using the so-called feed dictionary, and are processed accordingly in the model. This happens in combination with the previously defined session.
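A single training step could look like this, where batch_x and batch_y stand for an illustrative minibatch of the input and output data:

```python
# Run one optimization step on the current minibatch; the feed_dict
# argument replaces the placeholders X and Y with the batch data
sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
```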

In the example above, the optimization step opt is carried out in the graph. In order for TensorFlow to be able to perform the necessary calculations, data must be transferred to the graph via the feed_dict argument, replacing the placeholders X and Y for the calculation.

After the transfer via feed_dict, the input data is fed into the network by means of the multiplication with the weighting matrix between the input and hidden layer and is transformed non-linearly by the activation function. The result of the hidden layer is then multiplied by the weighting matrix between the hidden and output layer and passed on to the output layer. Here, the cost function calculates the difference between the forecast of the network and the actually observed values. Based on the optimizer, the gradients are then calculated for each individual weighting parameter in the network. The gradients, in turn, are the basis on which the weights are adjusted in the direction of minimizing the cost function; this procedure is also called gradient descent. The process just described is then carried out again with the next batch. With each iteration, the neural network moves closer to the cost minimum, i.e. towards a smaller deviation between forecast and observed values.

Application example

In the following example, the concepts presented above are illustrated using a practical application. To do this, we first need some data on the basis of which the model can be trained. Such data can be simulated quickly and easily, for example with the make_regression function contained in scikit-learn.
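One possible way to simulate a suitable data set (the sample size and noise level are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression

# Simulate a regression data set with 10 inputs and 1 output
X_data, y_data = make_regression(n_samples=1500, n_features=10,
                                 noise=10.0)
```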

As in the example above, we use 10 inputs and 1 output to create a neural network to forecast the observed output. Then we define placeholders, initialization, variables, network architecture, cost function, optimizer and session in TensorFlow.

Now the training of the model begins. To do this, we first need an outer loop that is executed over the number of defined epochs. Within each iteration of the outer loop, the data is randomly divided into batches and presented to the network one after the other in an inner loop. At the end of an epoch, the MSE, i.e. the mean square deviation of the model from the actually observed data, is calculated and output.
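Putting the pieces from the previous sections together, the training loop could be sketched as follows (100 epochs and a batch size of 100, i.e. 15 batches for 1,500 observations, are illustrative choices):

```python
epochs = 100
batch_size = 100
n_obs = X_data.shape[0]

for epoch in range(epochs):
    # Shuffle the data so that each epoch draws the batches
    # in a new random order (drawing without replacement)
    shuffle = np.random.permutation(n_obs)
    X_shuffled, y_shuffled = X_data[shuffle], y_data[shuffle]

    # Inner loop: present one minibatch after the other to the network
    for start in range(0, n_obs, batch_size):
        batch_x = X_shuffled[start:start + batch_size]
        batch_y = y_shuffled[start:start + batch_size]
        sess.run(opt, feed_dict={X: batch_x, Y: batch_y})

    # At the end of each epoch, calculate and output the MSE
    mse_train = sess.run(mse, feed_dict={X: X_data, Y: y_data})
    print("Epoch %d, MSE: %.2f" % (epoch + 1, mse_train))
```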

A total of 15 batches per epoch are presented to the network. Within 100 epochs, the model is able to reduce the MSE from an initial 20,988.60 to 55.13 (note: these values differ from execution to execution due to the random initialization of the weights and bias values and the random drawing of the batches). The figure below shows the course of the mean squared deviation during training.

It can be seen that, with a sufficiently high number of epochs, the model is able to reduce the training error to close to 0. What sounds advantageous at first is a problem for real machine learning projects: the capacity of neural networks is often so high that they simply “memorize” the training data and then generalize poorly to new, unseen data. This is known as overfitting the training data. For this reason, the forecast error on unseen test data is often monitored during training. As a rule, the training of the model is stopped at the point at which the forecast error on the test data begins to increase again. This is called early stopping.

Summary and Outlook

With TensorFlow, Google has set a milestone in the field of deep learning research. Thanks to Google's concentrated intellectual capacity, software was created that established itself as a quasi-standard for the development of neural networks and deep learning models within a very short time. At STATWORX, we also work successfully with TensorFlow in our data science consulting, developing deep learning models and neural networks for our customers.

About the author

Sebastian Heinz

I am the founder and CEO of STATWORX. I enjoy writing about machine learning and AI, especially about neural networks and deep learning. In my spare time, I love to cook, eat and drink as well as traveling the world.