In my efforts to learn some AI programming, I stumbled upon the subject of Neural Networks and set myself the goal of creating some nets using AutoHotkey. After digging into the subject by searching and reading tutorials here and there, I came across this post by Milo Spencer-Harper, which describes in detail the creation of an extremely simple neural net in Python. After successfully translating the code to AutoHotkey (or sort of), I decided to write this tutorial based on what I understood, in order to better cement my knowledge of the basics.
Hopefully this will also help someone out in the AutoHotkey community
*Note: Section I of this tutorial was covered in an AutoHotkey webinar on March 20th, 2018. You can check the webinar at this link (special thanks to Joe Glines and Jackie Sztuk for making it possible). The written tutorial below also provides an expanded view of the subject, so don't skip reading it either. Section II will further expand on the subject with a simple Multi-Layer Neural Network implementation. MLNNs are considered the vanilla form of Neural Networks, so be sure to check that too
SECTION I
1. What is a Neural Network?
Artificial Neural Networks (ANNs) are models that implement machine learning algorithms based on some aspects of our current understanding of how the brain works (the low-level electrical part only, not the bio-chemical, of course). The most important thing about ANNs is that they allow us to implement algorithms that would be far too complex to program by hand: instead, the machine works out the code itself through learning sessions. With ANNs, programming complex tasks such as handwriting and voice recognition is now doable. There are numerous other examples of successful implementations, and as knowledge of ANNs spreads, we get almost daily news of tasks that were not programmable before but have now been successfully programmed using ANNs.
Here are a few AI video examples if you want to know more about AI implemented through ANNs (you can watch them later if you just want to follow this tutorial):
https://www.youtube.com/watch?v=qv6UVOQ0F44
https://www.youtube.com/watch?v=P7XHzqZjXQs (turn on the subtitles in this one for English)
https://www.youtube.com/watch?v=Ipi40cb_RsI
2. How do Neural Networks work?
As stated in section 1, ANNs are based on ideas brought up by studies of biological neurons, synapses, their electrical impulses and so on. Thus, the most basic constituent of a Neural Network is a neuron-like component represented by a mathematical formula. This formula simulates the thought process of each neuron and its consequent behaviour, i.e., which impulses it will send through its synapses to other neurons when different stimuli arrive at it. A single-neuron Neural Network with 3 synapses as input and 1 as output is represented in the image below:
As you can see in the image above, the "neuron" receives signals (stimuli) through its input synapses, works these signals in a unique way and then fires a corresponding output through another synapse. The output of this neuron can then be used as an input to another neuron in a neural network such as the one below (or as a final output):
It is important to understand that each neuron does a unique thing with the signals it receives (the formulas that work the inputs are different in each neuron). In ANNs, we say that each neuron attributes a unique "weight" to each of the inputs it receives and uses these weights to rework the actual inputs into an output. The weight is like a measure of the strength of a synapse in a biological neuron. If you check the center-bottommost neuron in the image above, what does it do with the inputs received? It weights a possible input from B as a "light" negative component (multiplying it by -1) and a possible input from C as a "heavy" positive component (multiplying it by +5). So if the stimulus value coming to it through B is 10 and the stimulus value coming from C is 6, what will be the output H of this neuron?
Total = WeightB * InputB + WeightC * InputC
Or
(-1 x 10) + (5 x 6) = 20
So twenty will be the output. Thus, we can see that there are two basic components in the processing of a neuron: the "weight" of an input and the "value" of an input. This is what recreates an important aspect of biological neural networks in ANNs: the weight attributed to an input can be reworked through the training of the network, so that the same input value can be accounted for in infinitely many different ways. This is how we simulate the strengthening (or withering) of synapses in biological neurons. If, through training, the network creator code discovers that a part of what is being fed as input is irrelevant to the intended results, it will lower its weight substantially. If, however, it discovers that another input is very relevant to the intended result, it will raise that input's weight. This is the basics of how training works in an ANN: finding the correct weights with which to treat the incoming inputs (for now, this is sufficient; we'll keep other concepts that could also apply for another occasion).
So, following up on our results above: in our 5-neuron network, the value of 20, as calculated by the left-bottommost neuron, would be fed as input to the left-uppermost neuron (and then be treated with a weight of 4 by it), and so on, in an intricate chain of inputs and outputs through different neurons, up to a point at which the many individual weights applied will have reworked all the inputs into a final output.
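If it helps to see that calculation in code form, here is a minimal sketch of the single-neuron example above in AutoHotkey (the variable names are my own, chosen to match the example):
Code: Select all
; A minimal sketch of the single-neuron calculation above (variable names are illustrative).
WeightB := -1, WeightC := 5    ; The weights this neuron attributes to its two input synapses.
InputB := 10, InputC := 6      ; The stimuli values arriving through B and C.
Total := WeightB * InputB + WeightC * InputC    ; (-1 x 10) + (5 x 6)
MsgBox % "Output H of this neuron: " . Total    ; Shows 20, which would then be fed to the next neuron.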
3. OK, so how are these weights calculated (or, how does the training work)?
Through trial and error and approximation. During training cycles, the net creator code feeds the net with samples of inputs for which the expected output is known, and the net then processes these into a final output using its weights. Then, the net creator code compares the output values with the expected outputs and readjusts the weights of each neuron to bring the final output closer to the expected output (thus recreating the net in each iteration). It is important to note that this trial and error is highly oriented towards a goal: if the actual output is much lower than the expected one, the weights are pushed up considerably by the net creator. If it is just a little bit lower, the weights are adjusted up just a bit. Likewise, if the output is far above the expected result, the weights are lowered considerably, and if it is just a bit above, the weights go down just a bit.
The math to do all this training and the net recreation is simple. It follows the pseudo-code below:
Code: Select all
THIS_TRAINING_ITERATION_OUTPUT := TRAINING_VALUES * CURRENT_WEIGHTS ; The current net is used to calculate a final output.
OUTPUT_SET_BETWEEN_0_AND_1 := SIGMOID(THIS_TRAINING_ITERATION_OUTPUT) ; Then, we rework the final output into a representative value between 0 and 1 using a sigmoid* function.
ADJUST_CURRENT_WEIGHTS(AVAILABLE_KNOWLEDGE, OUTPUT_SET_BETWEEN_0_AND_1, GRADIENT(OUTPUT_SET_BETWEEN_0_AND_1)) ; And then we readjust our weights (recreate the net) using the available knowledge in the samples, this iteration's output, and the gradient* of the sigmoid at that output.
Notes:
Sigmoid function: an activation function. The sigmoid function, sigmoid(x) = 1 / (1 + e^(-x)), is used to rework any value (from -infinite to +infinite) into a point on an S-shaped curve between 0 and 1. Negative values are mapped to between 0 and 0.5, while positive values are mapped to between 0.5 and 1:
Gradient of the sigmoid function: since the sigmoid function is an S-shaped curve, equal distances between two values A and B represent different distances on the 0-to-1 scale depending on where these values are located (in other words, there are distortions in the distances). For this reason, we use the gradient of the sigmoid, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), which represents these exact distortions in the sigmoid curve. The value of the gradient for any particular distance A - B allows us to rework the sigmoid distance A - B to better picture the actual distance between these values (remember: during training we have to readjust the weights based on the distance between the current calculation and the expected value of the training sample!).
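Before looking at the plot below, here is a minimal sketch of both functions in AutoHotkey (the function names are my own; the tutorial code further down writes the same formulas inline):
Code: Select all
; A small sketch of the two helper formulas (my own function names, for illustration only).
MsgBox % "Sigmoid(-2) = " . Sigmoid(-2) . "`nSigmoid(0) = " . Sigmoid(0) . "`nSigmoid(2) = " . Sigmoid(2) . "`nGradient at Sigmoid(2) = " . SigmoidGradient(Sigmoid(2))
Return

Sigmoid(x) ; Maps any value (from -infinite to +infinite) onto the 0-to-1 S-curve: 1 / (1 + e^(-x)).
{
    Return 1 / (1 + Exp(-1 * x))
}

SigmoidGradient(s) ; The slope of the sigmoid, written in terms of a sigmoid output s: s * (1 - s).
{
    Return s * (1 - s)
}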
Sigmoid in blue, gradient of the sigmoid in green:

4. Enough theory! Let's get practical!
Suppose we have the following situation: given any combination of 3 binary values, an unknown underlying rule is applied to find a fourth binary value (dependent exclusively on the first 3, of course). We do not know what this rule is, and neither does the network we are going to create, but we do have 4 samples of binary combinations and, for each of them, we know the fourth bit's value (i.e., the correct answer based on the unknown underlying rule). The case table is presented below.
Can you figure out the underlying rule and the most probable value of the three question marks?
If you took a few seconds to analyze the samples in the table above, you probably figured it out already. The answer is always the same as the first input bit. This means that this input bit should hold a decisive weight in the final answer, while the others, not so much. If we were to check the 3 other possible validation cases one can provide ([0,0,0], [0,1,0] and [1,1,0]), it is rather obvious now that we will not even have to look at the values of input bits 2 and 3 to find the answers. So if we were to program a function to find the solution, it could just be something like this:
Code: Select all
FIND_SOLUTION(first_byte_value, second_byte_value, third_byte_value)
{
return first_byte_value
}
Let me present you a commented code that does exactly that, and then we will draw our conclusions!
Note: The comments in the code are part of this tutorial! Don't skip reading the code!
Code: Select all
/*
1. PREPARATION STEPS
*/
Random, Weight1, -4.0, 4.0 ; We start by initializing random numbers into the weight variables (this simulates a first hypothesis of a solution and allows the beginning of the training).
Random, Weight2, -4.0, 4.0
Random, Weight3, -4.0, 4.0
WEIGHTS := Array([Weight1],[Weight2],[Weight3]) ; And then organize them into a matrix.
TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too).
EXPECTED_OUTPUTS := Array([0],[1],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.
/*
2 . ACTUAL TRAINING
*/
Loop 10000 ; And now we run the net creator code (which is the training code). It will perform 10,000 training cycles.
{
Loop 4 ; For each training cycle, this net creator code will train the network of weights using the four training samples.
{
ACQUIRED_OUTPUT := 1 / (1 + exp(-1 * MATRIX_ROW_TIMES_COLUMN_MULTIPLY(TRAINING_INPUTS, WEIGHTS, A_Index))) ; First, the net is set to calculate some possible results using the weights we currently have. (At the first iteration of the loop these weights are absolutely random, but don't forget they will be recalculated every time). We use a sigmoid function here to set any results (from -infinite to +infinite) to a value between 0 and 1.
SIGMOID_GRADIENT := ACQUIRED_OUTPUT * (1 - ACQUIRED_OUTPUT) ; But since the sigmoid function has a curve-like shape, the distance between values is highly distorted depending on the position they occupy in the S-shaped curve, so we will also use the sigmoid's gradient to correctly account for that (this gives a better picture of the actual distances between values while still keeping the results between 0 and 1).
WEIGHTS[1,1] += TRAINING_INPUTS[A_Index, 1] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT) ; Then, each weight is recalculated using the available knowledge in the samples and also the currently calculated results.
WEIGHTS[2,1] += TRAINING_INPUTS[A_Index, 2] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT)
WEIGHTS[3,1] += TRAINING_INPUTS[A_Index, 3] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT)
; Breaking down the formula above: each weight is adjusted (we use +=, not :=) by taking the value of the input bit and multiplying it by the difference between the expected output and the calculated output (sigmoidally treated), after this difference is adjusted by the gradient of the sigmoid (removing the sigmoidal distortions).
}
}
/*
3. FINAL RESULTS
*/
; VALIDATION CASE [1,0,0]:
Input1 := 1, Input2 := 0, Input3 := 0
; After recalculating the weights over 10,000 iterations of training, we apply them by multiplying these weights with inputs that resemble a new case
; (this new case is a validation sample, not one of the training ones: [1, 0 ,0])
MSGBOX % "VALIDATION CASE: `n" . Input1 . "`, " . Input2 . "`, " . Input3 . "`n`nFINAL WEIGHTS: `nWEIGHT1: " . WEIGHTS[1,1] . "`nWEIGHT2: " . WEIGHTS[2,1] . "`nWEIGHT3: " . WEIGHTS[3,1] . "`n`nWEIGHTED SOLUTION: `n" Input1 * WEIGHTS[1,1] + Input2 * WEIGHTS[2,1] + Input3 * WEIGHTS[3,1] . "`n`nFINAL SOLUTION: `n" . (1 / (1 + EXP(-1 * (Input1 * WEIGHTS[1,1] + Input2 * WEIGHTS[2,1] + Input3 * WEIGHTS[3,1])))) . "`n`nComments: `nA FINAL SOLUTION between 0.5 and 1.0 means the final network thinks the solution is 1. How close the value is to 1 means how certain the net is of that. `nA FINAL SOLUTION between 0 and 0.5 means the final network thinks the solution is 0. How close the value is to 0 means how certain the net is of that."
; Breaking the output numbers:
; WEIGHTED_SOLUTION: If this is positive, the net believes the answer is 1 (if zero or negative, it believes the answer is 0). The higher a positive value is,
; the more certain the net is of its answer being 1. The lower a negative value is, the more certain the net is of its answer being 0.
; FINAL SOLUTION: A sigmoidally treated weighted_solution. If this is above 0.50, the net believes the answer to be 1. The closer to 1, the more certain
; the net is about that. If this is 0.50 or below it, the net believes the answer to be 0. The closer to 0, the more certain the net is about that.
Return
; The function below is just a single step in multiplying matrices (this is repeated many times to multiply an entire matrix). It is used because the input_data, weights and expected results were set into matrices for organization purposes.
MATRIX_ROW_TIMES_COLUMN_MULTIPLY(A,B,RowOfA)
{
If (A[RowOfA].MaxIndex() != B.MaxIndex())
{
msgbox, 0x10, Error, Number of Columns in the first matrix must be equal to the number of rows in the second matrix.
Return
}
Result := 0
Loop % A[RowOfA].MaxIndex()
{
Result += A[RowOfA, A_index] * B[A_Index, 1]
}
Return Result
}
5. Conclusions of the test above
The code was indeed able to approximate the expected result for [1,0,0]: it presented it as ~0.999 (which is close enough to 1).
Furthermore, if we study the commented code above (and if we play with it, changing some values), we will notice some interesting facts about ANN creator codes and ANNs themselves.
1. First, a network is just something like this (if we stick to basics, of course):
Result := Weight1 * Input1 + Weight2 * Input2 + Weight3 * Input3
2. If the network is a function of weights that reworks inputs, then we also have that a network creator code is really just code that obtains these correct weights (and it does so by training the network, which just means adjusting the values of the weights to account for any underlying rules that may be present in the samples).
3. The programmer DOES NOT provide the underlying rules in a network creator code, and unlike in our case study, most times he/she DOESN'T EVEN KNOW these underlying rules, as they are just too complex (e.g.: how to tell which digit is handwritten in a 30x50 image based on the individual pixel values?). In that case, the programmer just provides a number of samples with correct labels and a means for the machine to recalculate a number of weights in order to absorb some underlying rules from the samples and imprint them into these weights.
4. The final weights are somewhat unusual if you actually look at them. The case we presented may at first have made us think that the final network would have weights like [+infinite, 0, 0], but the network presented them as [+9.68, -0.20, -4.64] (or something along those values). These values may seem odd at first (almost as if totally random), but they are not: since we DID NOT provide any underlying rule to the output net, the program is free to find ANY VALUES that accommodate the underlying rule. This means the output weights just have to be ANY values that correctly implement the underlying rule (which is this: the first bit being 1 makes the weighted solution positive, while being 0 makes it negative (or zero), and the second and third bits don't really change this).
5. If you study the results of neural networks, sometimes you can actually find some quite interesting ideas. The network in the first and second video examples presented in section 1 of this tutorial actually surprised me: The networks discovered that the google dinosaur was better off ducking all the time and mario can be played with more ease if you move around spin-jumping all the time. (clever brats! ).
6. Back to the case study we provided: did you also notice that no training sample had a value of 0 for the third bit? This resulted in a big difference between the weights of the second and third bits, but try adding a fifth training sample with such a case (like [0,1,0] = 0, remembering to also change the inner Loop 4 to Loop 5 so the new sample is actually used) and see what the final weights become: surprisingly, they become something like [+12.80, -4.21, -4.21], which just means the network can be changed to treat the second input bit in a similar fashion to the third bit while still providing a valid answer to the underlying rule: the first bit is the only one that truly matters to make the weighted solution positive or negative.
7. This new set of weights also implies something interesting: no matter what the values of the second and third bits are, the output will never be positive if the first bit is not 1 and will always be positive if it is 1. This means that [+12.80, -4.21, -4.21] is actually also equivalent to [+infinite, 0, 0] when it comes to correctly implementing the underlying rule.
8. Another curious aspect of this ANN we created is that if we run the net on the input [0,0,0] we get a very interesting result: 0.50. This is caused by the sigmoid function we are using to represent the final value: if 0 would be -infinite and 1 would be +infinite, then 0.50 is in fact the weighted output for [0,0,0], which is always 0: there is no actual work being done by the net here, as any weight multiplied by 0 equals 0. Thus, any conclusion we derive from this case is just arbitrary at this moment; we cannot actually say the network concluded that the rule would yield a zero or a one in this case. There is, however, a way to have the net work even on a [0,0,0] case: we just add what we call a bias to the calculation. Biases will not be added to the codes in this tutorial, but they are a regular addition to Neural Networks that serves to tackle things like noise in images. If you want to experiment with biases yourself, try adding a fourth input parameter whose value is always 1 and whose weight is also to be calculated by the net (or just set a number to be added (or subtracted) alongside the calculations for each synapse); a rough sketch of this change is shown right after this list.
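To make note 8 a bit more concrete, below is a rough sketch of what the bias change could look like in the preparation step of our code (this is my own illustration of the idea, not a tested drop-in; the training loop and the final calculation would also need the extra weight, as indicated in the comments):
Code: Select all
; Rough sketch of note 8: a bias is just a fourth "input" that is always 1, with its own trainable weight.
Random, Weight4, -4.0, 4.0                                        ; One extra random starting weight for the bias.
WEIGHTS := Array([Weight1],[Weight2],[Weight3],[Weight4])         ; The weight matrix becomes 4x1.
TRAINING_INPUTS := Array([0,0,1,1],[1,1,1,1],[1,0,1,1],[0,1,1,1]) ; Same samples, each with a constant 1 appended.
; Inside the training loop, the bias weight is adjusted exactly like the other three:
; WEIGHTS[4,1] += TRAINING_INPUTS[A_Index, 4] * ((EXPECTED_OUTPUTS[A_Index, 1] - ACQUIRED_OUTPUT) * SIGMOID_GRADIENT)
; And the validation calculation gets one extra term: ... + 1 * WEIGHTS[4,1]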
6. A single network creator code can create nets to solve more than one problem.
In section 4 we created a network creator code that trained a network to learn an underlying rule from a specific pattern and then apply that same rule to solve new questions. The underlying rule in question was: from an input of 3 bits, return the value of the first bit. But what if it was a different one? Something like: from an input of 3 bits, return the inverse of the second bit? Would we have to change our network creator code to do this instead?
The answer is NO. All we need to do is to change our training samples. The network will absorb whichever rule it can find in the training samples.
(actually that is a huge overstatement, but we will take it as true for now).
Let's see how this works. First, our new case table:
And now, to change the training samples to accommodate the new rule. The first change is to the EXPECTED_OUTPUTS line.
Let's change it from this:
Code: Select all
TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too).
EXPECTED_OUTPUTS := Array([0],[1],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.
To this:
Code: Select all
TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1]) ; We will also feed the net creator code with the values of the inputs in the training samples (all organized in a matrix too).
EXPECTED_OUTPUTS := Array([1],[0],[1],[0]) ; And we will also provide the net creator with the expected answers to our training samples so that the net creator can properly train the net.
The second change is the validation case. Let's change it from this:
Code: Select all
; VALIDATION CASE [1,0,0]:
Input1 := 1, Input2 := 0, Input3 := 0
To this:
Code: Select all
; VALIDATION CASE [1,1,0]:
Input1 := 1, Input2 := 1, Input3 := 0
Checking the newly attributed weights is also quite interesting:
Weight for Input1: ~+0.2086
Weight for Input2: ~-9.6704
Weight for Input3: ~+4.6278
However you look at it, these weights mean what we expected them to mean: an Input2 of 1 will result in a negatively weighted number and an Input2 of 0 will result in a positively weighted number. Success!
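If you want to double-check that conclusion by hand, here is a quick sketch using the approximate weights quoted above (your own run will produce different, but equivalent, values):
Code: Select all
; Quick manual check of the new net using the approximate weights above (exact values differ on every run).
Weight1 := 0.2086, Weight2 := -9.6704, Weight3 := 4.6278
Input1 := 1, Input2 := 1, Input3 := 0          ; The validation case [1,1,0].
Weighted := Input1 * Weight1 + Input2 * Weight2 + Input3 * Weight3
MsgBox % "Weighted solution: " . Weighted . "`nFinal solution: " . 1 / (1 + Exp(-1 * Weighted)) ; ~0.00008, so the net answers 0 (the inverse of the second bit).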
7. But be sure to check what you are really feeding it!
Since the neural network will be trained to find any underlying rule in the samples (and not just the rule we think to be present in the samples), care must be taken when choosing what to feed it as training samples. For one such example, consider the table below. The underlying rule I am trying to teach the net here is this: if both the second and the third inputs are 1, then the result must be 1. If not, then the result must be 0.
We have 2 examples that yield 1 and two examples that yield 0 in the table above. This should be enough to train the net to successfully answer a validation sample of [1,1,0] as 0, right?
Well, NO! Look carefully: there is ANOTHER possible underlying rule in place here: if the second input is 0, the result is 0; if it is 1, the result is 1. If the network absorbs this rule instead, it will answer the validation sample as 1 (and since this rule is simpler than the first one, it will probably be the one absorbed!).
To have the network solve the riddle as we want it to, we will need to change our training samples and feed something more appropriate to the network creator code.
This should do:
By changing the first training sample to a new one, [0,1,0] = 0, we have now made absolutely sure that the rule "the second bit is the answer" is not possible. Running the code now will properly yield our desired results.
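Since the case tables in this section are shown as images in the original post, here is one possible set of training samples consistent with the description above (my own reconstruction, not necessarily the exact table): the first sample [0,1,0] = 0 rules out "the second bit alone is the answer", and every sample follows the "second AND third bit" rule.
Code: Select all
; One possible set of samples for the "1 only if both the second and third bits are 1" rule (my reconstruction; the original table is an image).
TRAINING_INPUTS := Array([0,1,0],[1,1,1],[0,1,1],[1,0,0])
EXPECTED_OUTPUTS := Array([0],[1],[1],[0])
; With these samples, the validation case [1,1,0] should now come out close to 0, as intended.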
8. Excellent! And what else?
As we have seen in section 6, a network creator code is very powerful: it can learn many different patterns and create many different networks to solve them. But there is, unfortunately, a limit to its power. Some underlying rules are just too complex for the simple model of ANN we are implementing.
Consider the following case table.
The underlying rule of this table is this: if (input1 is 1 and input2 is 0) OR (input1 is 0 and input2 is 1), the result is 1. Otherwise it is 0. This is what we call a XOR (exclusive-OR) problem. It is a perfectly plausible situation, and it is also very possible to devise an ANN to solve it. But if you just adjust our code to the table above, you will see it does NOT work. Even if you add every possible input as a training sample, it will still not work: it will simply never work with our current code. The output you get if you try it (because you always get an output) will probably be inconclusive, so that if you run the code ten times or so, the trained net will actually shift between positive and negative at random. The net is simply unable to solve this problem.
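Since the XOR case table is also an image in the original post, here is what those samples would look like in our array format (keeping the same TRAINING_INPUTS as before, with the third bit fixed at 1, which is my own choice for illustration); feel free to plug them into the code and watch the training fail:
Code: Select all
; XOR samples in the same format as before (the constant third bit is my own choice; the original table is an image).
TRAINING_INPUTS := Array([0,0,1],[1,1,1],[1,0,1],[0,1,1])
EXPECTED_OUTPUTS := Array([0],[0],[1],[1])   ; 1 only when exactly one of the first two bits is 1 (XOR).
; No single set of three weights can satisfy all four samples at once, so the training never settles on a solution.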
Fortunately, though, and as mentioned, this problem has already been solved, and we CAN create ANNs that solve XOR problems. The way to do this is simple: we need a multi-layered ANN, so that the net can represent solutions that a single layer of weights cannot (the XOR cases cannot be separated by a single weighted sum). We can train a multi-layer ANN using a technique called BackPropagation. So if we just change our network creator code to an implementation of ANNs that includes these concepts, we will succeed!
Multi-Layered ANNs will be covered in Section II of this tutorial, but for the moment, let's enjoy what we have achieved so far!
Artificial Neural Networks are a field of knowledge that has flourished in recent years and is in continuous development. New concepts, new models, new ideas: there is just so much to talk about that this basic tutorial cannot include it all. But if it was somehow successful, it may have ignited a spark of curiosity in you, and you may well be on your way to becoming an experienced ANN programmer. How about consulting what is available elsewhere and helping push the current boundaries of what tasks are considered programmable?
We are currently in a decade in which the making of ANNs is still considered a crafting of sorts, and those who craft them now hold a new power to change the world. If you can develop a network creator code that comes up with a net to solve a new problem, this can be quite valuable!
Thanks for reading all this, and feel free to post any questions you like