Ultimately, optical chips look very promising. Yet, it may be years before optical chips see the light of day in commercial applications. But before jumping into the various networks, a more complete understanding of the inner workings of a neural network is needed.

As stated earlier, artificial neural networks are a large class of parallel processing architectures which are useful in specific types of complex problems. These architectures should not be confused with common parallel processing configurations which apply many sequential processing units to standard computing topologies. Instead, neural networks are radically different from conventional Von Neumann computers in that they crudely mimic the fundamental properties of man's brain. As mentioned earlier, artificial neural networks are loosely based on biology.

Current research into the brain's physiology has unlocked only a limited understanding of how neurons work or even what constitutes intelligence in general.

Researchers are working in both the biological and engineering fields to further decipher the key mechanisms for how man learns and reacts to everyday experiences. Improved knowledge in neural processing helps create better, more succinct artificial networks. It also creates a cornucopia of new, and ever evolving, architectures. Kunihiko Fukushima, a senior research scientist in Japan, describes the give and take of building a neural network model: "We try to follow physiological evidence as faithfully as possible.

For parts not yet clear, however, we construct a hypothesis and build a model that follows that hypothesis. We then analyze or simulate the behavior of the model and compare it with that of the brain. If we find any discrepancy in the behavior between the model and the brain, we change the initial hypothesis and modify the model.

We repeat this procedure until the model behaves in the same way as the brain." Neural computing is about machines, not brains. It is the process of trying to build processing systems that draw upon the highly successful designs naturally occurring in biology. This linkage with biology is the reason that there is a common architectural thread throughout today's artificial neural networks. Figure 4 shows a typical processing element model from NeuralWare, which sells an artificial neural network design and development software package.

Their processing element model shows that networks designed for prediction can be very similar to networks designed for classification or any other network category. Prediction, classification and other network categories will be discussed later. The point here is that all artificial neural processing elements have common components. These components are valid whether the neuron is used for input, output, or is in one of the hidden layers.

Component 1. Weighting Factors: A neuron usually receives many simultaneous inputs. Each input has its own relative weight which gives the input the impact that it needs on the processing element's summation function. These weights perform the same type of function as do the varying synaptic strengths of biological neurons. In both cases, some inputs are made more important than others so that they have a greater effect on the processing element as they combine to produce a neural response.

Weights are adaptive coefficients within the network that determine the intensity of the input signal as registered by the artificial neuron. They are a measure of an input's connection strength. Component 2. Summation Function: The first step in a processing element's operation is to compute the weighted sum of all of the inputs. Mathematically, the inputs and the corresponding weights are vectors which can be represented as i = (i1, i2, ..., in) and w = (w1, w2, ..., wn).

The total input signal is the dot, or inner, product of these two vectors. This simplistic summation function is found by multiplying each component of the i vector by the corresponding component of the w vector and then adding up all the products.

The result is a single number, not a multi-element vector. Geometrically, the inner product of two vectors can be considered a measure of their similarity. If the vectors point in the same direction, the inner product is maximum; if the vectors point in opposite directions (180 degrees out of phase), their inner product is minimum. The summation function can be more complex than just the simple input and weight sum of products. In addition to a simple product summing, the summation function can select the minimum, maximum, majority, product, or several normalizing algorithms.
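To make the basic sum-of-products concrete, here is a minimal sketch of a processing element's summation step; the input and weight values are hypothetical.

```python
# Minimal sketch of a processing element's summation function.
# The input vector i and weight vector w hold hypothetical example values.
inputs  = [0.5, -1.0, 2.0]
weights = [0.8,  0.2, 0.1]

# Weighted sum: multiply each input by its weight and add up the products.
# The result is a single number, not a multi-element vector.
net = sum(i * w for i, w in zip(inputs, weights))
print(net)  # 0.5*0.8 + (-1.0)*0.2 + 2.0*0.1 = 0.4
```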

The specific algorithm for combining neural inputs is determined by the chosen network architecture and paradigm. Some summation functions have an additional process applied to the result before it is passed on to the transfer function. This process is sometimes called the activation function. The purpose of utilizing an activation function is to allow the summation output to vary with respect to time.

Activation functions currently are pretty much confined to research. Most of the current network implementations use an "identity" activation function, which is equivalent to not having one.

Additionally, such a function is likely to be a component of the network as a whole rather than of each individual processing element component. Component 3. Transfer Function: The result of the summation function, almost always the weighted sum, is transformed to a working output through an algorithmic process known as the transfer function.

In the transfer function the summation total can be compared with some threshold to determine the neural output. If the sum is greater than the threshold value, the processing element generates a signal. If the sum of the input and weight products is less than the threshold, no signal or some inhibitory signal is generated.

Both types of response are significant. The threshold, or transfer function, is generally non-linear. Linear, straight-line functions are limited because the output is simply proportional to the input, and so are not very useful. That was the problem in the earliest network models, as noted in Minsky and Papert's book Perceptrons.

The transfer function could be as simple as a test of whether the result of the summation function is positive or negative. The network could output zero and one, one and minus one, or other numeric combinations. The transfer function would then be a "hard limiter" or step function.

See Figure 4. Another type of transfer function, the threshold or ramping function, could mirror the input within a given range and still act as a hard limiter outside that range. It is a linear function that has been clipped to minimum and maximum values, making it non-linear. Yet another option would be a sigmoid or S-shaped curve. That curve approaches a minimum and maximum value at the asymptotes. It is common for this curve to be called a sigmoid when it ranges between 0 and 1, and a hyperbolic tangent when it ranges between -1 and 1.
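As an illustration, the hard limiter, clipped ramp, sigmoid, and hyperbolic tangent shapes just described can be sketched as follows; the specific ranges and constants are only examples.

```python
import math

def hard_limiter(x):
    # Step function: fires (outputs 1) only when the summation result is positive.
    return 1.0 if x > 0 else 0.0

def ramp(x, lo=-1.0, hi=1.0):
    # Mirrors the input within [lo, hi] and hard-limits outside that range.
    return max(lo, min(hi, x))

def sigmoid(x):
    # S-shaped curve approaching 0 and 1 at its asymptotes.
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    # The same S shape, but ranging between -1 and 1.
    return math.tanh(x)
```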

Mathematically, the exciting feature of these curves is that both the function and its derivatives are continuous. This option works fairly well and is often the transfer function of choice. Other transfer functions are dedicated to specific network architectures and will be discussed later. Prior to applying the transfer function, uniformly distributed random noise may be added. The source and amount of this noise is determined by the learning mode of a given network paradigm.

This noise is normally referred to as "temperature" of the artificial neurons. The name, temperature, is derived from the physical phenomenon that as people become too hot or cold their ability to think is affected.

Electronically, this process is simulated by adding noise. Indeed, by adding different levels of noise to the summation result, more brain-like transfer functions are realized.

To more closely mimic nature's characteristics, some experimenters are using a gaussian noise source. Gaussian noise is similar to uniformly distributed noise except that the distribution of random numbers within the temperature range is along a bell curve. The use of temperature is an ongoing research area and is not yet applied to many engineering applications. A separate concept, the global temperature coefficient, is a term applied to the gain of the transfer function. It should not be confused with the more common term, temperature, which is simply noise added to individual neurons.

In contrast, the global temperature coefficient allows the transfer function to have a learning variable much like the synaptic input weights. This concept is claimed to create a network which has a significantly faster (by several orders of magnitude) learning rate and provides more accurate results than other feedforward, back-propagation networks. Component 4. Scaling and Limiting: After the processing element's transfer function, the result can pass through additional processes which scale and limit.

This scaling simply multiplies a scale factor times the transfer value, and then adds an offset. Limiting is the mechanism which ensures that the scaled result does not exceed an upper or lower bound.
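A minimal sketch of this scaling-and-limiting step is shown below; the scale factor, offset, and bounds are arbitrary example values.

```python
def scale_and_limit(transfer_value, scale=2.0, offset=0.5, lower=0.0, upper=1.0):
    # Scaling: multiply the transfer value by a scale factor, then add an offset.
    scaled = scale * transfer_value + offset
    # Limiting: keep the scaled result within the upper and lower bounds.
    return max(lower, min(upper, scaled))
```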

This limiting is in addition to the hard limits that the original transfer function may have performed. This type of scaling and limiting is mainly used in topologies to test biological neuron models, such as James Anderson's brain-state-in-the-box. Component 5. Output Function (Competition): Each processing element is allowed one output signal which it may send to hundreds of other neurons.

This is just like the biological neuron, where there are many inputs and only one output action. Normally, the output is directly equivalent to the transfer function's result.

Some network topologies, however, modify the transfer result to incorporate competition among neighboring processing elements. Neurons are allowed to compete with each other, inhibiting processing elements unless they have great strength. Competition can occur at one or both of two levels. First, competition determines which artificial neuron will be active, or provides an output. Second, competitive inputs help determine which processing element will participate in the learning or adaptation process.

Component 6. Error Function and Back-Propagated Value: In most learning networks the difference between the current output and the desired output is calculated. This raw error is then transformed by the error function to match a particular network architecture. The most basic architectures use this error directly, but some square the error while retaining its sign, some cube the error, and other paradigms modify the raw error to fit their specific purposes.
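A few of these raw-error transformations can be sketched directly; the particular functions below are only illustrations of the kinds of shaping a paradigm might apply.

```python
def raw_error(desired, actual):
    # Difference between the desired output and the current output.
    return desired - actual

def squared_error_keep_sign(e):
    # Square the error while retaining its sign.
    return e * abs(e)

def cubed_error(e):
    # Cubing an error automatically preserves its sign.
    return e ** 3
```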

The artificial neuron's error is then typically propagated into the learning function of another processing element. This error term is sometimes called the current error. The current error is typically propagated backwards to a previous layer. Normally, this back-propagated value, after being scaled by the learning function, is multiplied against each of the incoming connection weights to modify them before the next learning cycle. Component 7. Learning Function: The purpose of the learning function is to modify the variable connection weights on the inputs of each processing element according to some neural based algorithm.

This process of changing the weights of the input connections to achieve some desired result can also be called the adaptation function, as well as the learning mode. There are two types of learning: supervised and unsupervised. Supervised learning requires a teacher. The teacher may be a training set of data or an observer who grades the performance of the network results.

Either way, having a teacher is learning by reinforcement. When there is no external teacher, the system must organize itself by some internal criteria designed into the network. This is learning by doing.

The vast majority of artificial neural network solutions have been trained with supervision. In this mode, the actual output of a neural network is compared to the desired output. Weights, which are usually randomly set to begin with, are then adjusted by the network so that the next iteration, or cycle, will produce a closer match between the desired and the actual output.

The learning method tries to minimize the current errors of all processing elements. This global error reduction is created over time by continuously modifying the input weights until an acceptable network accuracy is reached. With supervised learning, the artificial neural network must be trained before it becomes useful. Training consists of presenting input and output data to the network. This data is often referred to as the training set.

That is, for each input set provided to the system, the corresponding desired output set is provided as well. In most applications, actual data must be used. This training phase can consume a lot of time. In prototype systems, with inadequate processing power, learning can take weeks. This training is considered complete when the neural network reaches a user-defined performance level.

This level signifies that the network has achieved the desired statistical accuracy as it produces the required outputs for a given sequence of inputs. When no further learning is necessary, the weights are typically frozen for the application.

Some network types allow continual training, at a much slower rate, while in operation. This helps a network to adapt to gradually changing conditions. The training sets themselves also matter: not only do the sets have to be large, but the training sessions must include a wide variety of data.

If the network is trained just one example at a time, all the weights set so meticulously for one fact could be drastically altered in learning the next fact. The previous facts could be forgotten in learning something new. As a result, the system has to learn everything together, finding the best weight settings for the total set of facts. For example, in teaching a system to recognize pixel patterns for the ten digits, if there were twenty examples of each digit, all the examples of the digit seven should not be presented at the same time.

How the input and output data is represented, or encoded, is a major component to successfully instructing a network. Artificial networks only deal with numeric input data. Therefore, the raw data must often be converted from the external environment. Additionally, it is usually necessary to scale the data, or normalize it to the network's paradigm. This pre-processing of real-world stimuli, be they cameras or sensors, into machine readable format is already common for standard computers.

Many conditioning techniques which directly apply to artificial neural network implementations are readily available. It is then up to the network designer to find the best data format and matching network architecture for a given application.

After a supervised network performs well on the training data, then it is important to see what it can do with data it has not seen before. If a system does not give reasonable outputs for this test set, the training period is not over.

Indeed, this testing is critical to ensure that the network has not simply memorized a given set of data but has learned the general patterns involved within an application. Unsupervised learning is the great promise of the future. It shouts that computers could someday learn on their own in a true robotic sense. Currently, this learning method is limited to networks known as self-organizing maps.

These kinds of networks are not in widespread use. They are basically an academic novelty. Yet, they have shown they can provide a solution in a few instances, proving that their promise is not groundless.

They have been proven to be more effective than many algorithmic techniques for numerical aerodynamic flow calculations. They are also being used in the lab, where they are split into a front-end network that recognizes short, phoneme-like fragments of speech which are then passed on to a back-end network.

The second artificial network recognizes these strings of fragments as words. These networks use no external influences to adjust their weights. Instead, they internally monitor their performance. These networks look for regularities or trends in the input signals, and make adaptations according to the function of the network.

Even without being told whether it's right or wrong, the network still must have some information about how to organize itself.

This information is built into the network topology and learning rules. An unsupervised learning algorithm might emphasize cooperation among clusters of processing elements.

In such a scheme, the clusters would work together. If some external input activated any node in the cluster, the cluster's activity as a whole could be increased. Likewise, if external input to nodes in the cluster was decreased, that could have an inhibitory effect on the entire cluster.

Competition between processing elements could also form a basis for learning. Training of competitive clusters could amplify the responses of specific groups to specific stimuli. As such, it would associate those groups with each other and with a specific appropriate response. Normally, when competition for learning is in effect, only the weights belonging to the winning processing element will be updated.

At the present state of the art, unsupervised learning is not well understood and is still the subject of research. This research is currently of interest to the government because military situations often do not have a data set available to train a network until a conflict arises. The rate at which ANNs learn depends upon several controllable factors. In selecting the approach there are many trade-offs to consider.

Obviously, a slower rate means a lot more time is spent in accomplishing the off-line learning to produce an adequately trained system. With the faster learning rates, however, the network may not be able to make the fine discriminations possible with a system that learns more slowly. Researchers are working on producing the best of both worlds. Generally, several factors besides time have to be considered when discussing the off-line training task, which is often described as "tiresome."

These factors play a significant role in determining how long it will take to train a network. Changing any one of these factors may either extend the training time to an unreasonable length or even result in an unacceptable accuracy. Chief among them is the learning rate. Usually this term is positive and between zero and one. If the learning rate is greater than one, it is easy for the learning algorithm to overshoot in correcting the weights, and the network will oscillate.
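The toy example below illustrates this trade-off on a single weight whose error grows as the square of the weight; the learning rates and starting weight are arbitrary.

```python
def gradient_descent(lr, w=1.0, steps=5):
    # Toy one-weight problem: minimize error = w**2, whose gradient is 2*w.
    history = [w]
    for _ in range(steps):
        w = w - lr * 2 * w
        history.append(round(w, 3))
    return history

print(gradient_descent(lr=1.1))  # [1.0, -1.2, 1.44, -1.728, 2.074, -2.488]: overshoots and oscillates
print(gradient_descent(lr=0.1))  # [1.0, 0.8, 0.64, 0.512, 0.41, 0.328]: small steps converge toward zero
```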

Small values of the learning rate will not correct the current error as quickly, but if small steps are taken in correcting errors, there is a good chance of arriving at the best minimum convergence. Many learning laws are in common use. Most of these laws are some sort of variation of the best known and oldest learning law, Hebb's Rule. Research into different learning functions continues as new ideas routinely show up in trade publications. Some researchers have the modeling of biological learning as their main objective.

Others are experimenting with adaptations of their perceptions of how nature handles learning. Either way, man's understanding of how neural processing actually works is very limited.

Learning is certainly more complex than the simplifications represented by the learning laws currently developed. A few of the major laws are presented as examples.

Hebb's Rule: The first, and undoubtedly the best known, learning rule was introduced by Donald Hebb. The description appeared in his book The Organization of Behavior in 1949. His basic rule is: if a neuron receives an input from another neuron, and if both are highly active (mathematically, have the same sign), the weight between the neurons should be strengthened. Hopfield Law: It is similar to Hebb's rule with the exception that it specifies the magnitude of the strengthening or weakening.

It states, "if the desired output and the input are both active or both inactive, increment the connection weight by the learning rate, otherwise decrement the weight by the learning rate. It is one of the most commonly used.

This rule is based on the simple idea of continuously modifying the strengths of the input connections to reduce the difference (the delta) between the desired output value and the actual output of a processing element.

This rule changes the synaptic weights in the way that minimizes the mean squared error of the network. The way that the Delta Rule works is that the delta error in the output layer is transformed by the derivative of the transfer function and is then used in the previous neural layer to adjust input connection weights. In other words, this error is back-propagated into previous layers one layer at a time. The network type called Feedforward, Back-propagation derives its name from this method of computing the error term.
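As a sketch of the weight adjustment this describes, the function below applies the delta rule to a single processing element; the learning rate and the way the transfer-function derivative is supplied are illustrative choices, not a prescribed implementation.

```python
def delta_rule_update(weights, inputs, actual, desired, transfer_derivative, lr=0.1):
    # The raw error is transformed by the derivative of the transfer function.
    delta = (desired - actual) * transfer_derivative
    # Each input connection weight changes in proportion to the delta and its input signal.
    for j in range(len(weights)):
        weights[j] += lr * delta * inputs[j]
    return delta  # this value is what gets back-propagated to the previous layer
```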

When using the delta rule, it is important to ensure that the input data set is well randomized. Well ordered or structured presentation of the training set can lead to a network which cannot converge to the desired accuracy. If that happens, then the network is incapable of learning the problem. The Gradient Descent Rule: This rule is similar to the Delta Rule in that the derivative of the transfer function is still used to modify the delta error before it is applied to the connection weights.

Here, however, an additional proportional constant tied to the learning rate is appended to the final modifying factor acting upon the weight. This rule is commonly used, even though it converges to a point of stability very slowly. It has been shown that different learning rates for different layers of a network help the learning process converge faster.

In these tests, the learning rates for those layers close to the output were set lower than those layers near the input. This is especially important for applications where the input data is not derived from a strong underlying model.

Kohonen's Learning Law: This procedure, developed by Teuvo Kohonen, was inspired by learning in biological systems. In this procedure, the processing elements compete for the opportunity to learn, or update their weights.

The processing element with the largest output is declared the winner and has the capability of inhibiting its competitors as well as exciting its neighbors. Only the winner is permitted an output, and only the winner plus its neighbors are allowed to adjust their connection weights. Further, the size of the neighborhood can vary during the training period. The usual paradigm is to start with a larger definition of the neighborhood, and narrow in as the training process proceeds.
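A minimal winner-take-all update step might look like the sketch below; the learning rate, neighborhood radius, and use of Euclidean distance are assumptions for illustration.

```python
import math

def kohonen_step(weights, x, lr=0.5, neighborhood=1):
    # weights: one weight vector per processing element; x: the input pattern.
    # The winner is the element whose weight vector most closely matches x
    # (for normalized vectors this is also the element with the largest output).
    distances = [math.dist(w, x) for w in weights]
    winner = distances.index(min(distances))
    # Only the winner and its neighbors adjust their weights, moving them toward x.
    for j in range(len(weights)):
        if abs(j - winner) <= neighborhood:
            weights[j] = [wj + lr * (xj - wj) for wj, xj in zip(weights[j], x)]
    return winner
```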

Because the winning element is defined as the one that has the closest match to the input pattern, Kohonen networks model the distribution of the inputs.

This is good for statistical or topological modeling of the data and is sometimes referred to as self-organizing maps or self-organizing topologies.

The majority of the variations stem from the various learning rules and how those rules modify a network's typical topology. The following sections outline some of the most common artificial neural networks. They are organized in very rough categories of application. These categories are not meant to be exclusive; they are merely meant to separate out some of the confusion over network architectures and their best matches to specific applications.

Basically, most applications of neural networks fall into the following five categories:

- prediction
- classification
- data association
- data conceptualization
- data filtering

Table 5 matches specific network architectures to these categories; for example, prediction networks such as back-propagation and delta bar delta use input values to predict some output. This chart is intended as a guide and is not meant to be all inclusive.

While there are many other network derivations, this chart only includes the architectures explained within this section of this report. Feedforward back-propagation in particular has been used to solve almost all types of problems and indeed is the most popular for the first four categories. The next five subsections describe these five network types. There are many applications where prediction can help in setting priorities.

For example, the emergency room at a hospital can be a hectic place. Knowing who needs the most time-critical help can enable a more successful operation. Basically, all organizations must establish priorities which govern the allocation of their resources. This projection of the future is what drove the creation of prediction networks.

The feedforward, back-propagation architecture was developed independently by several sources (Werbos; Parker; Rumelhart, Hinton and Williams). This independent co-development was the result of a proliferation of articles and talks at various conferences which stimulated the entire industry.

Currently, this synergistically developed back-propagation architecture is the most popular, effective, and easy to learn model for complex, multi-layered networks.

This network is used more than all others combined. It is used in many different types of applications. This architecture has spawned a large class of network types with many different topologies and training methods. Its greatest strength is in non-linear solutions to ill-defined problems. The typical back-propagation network has an input layer, an output layer, and at least one hidden layer.

There is no theoretical limit on the number of hidden layers, but typically there are just one or two. Some work has been done which indicates that a maximum of four layers (three hidden layers plus an output layer) is required to solve problems of any complexity. Each layer is fully connected to the succeeding layer, as shown in Figure 5. The in and out layers indicate the flow of information during recall.

Recall is the process of putting input data into a trained network and receiving the answer. Back-propagation is not used during recall, but only when the network is learning a training set. The number of layers and the number of processing elements per layer are important decisions. These parameters to a feedforward, back-propagation topology are also the most ethereal. They are the "art" of the network designer. There is no quantifiable, best answer to the layout of the network for any particular application.

There are only general rules picked up over time and followed by most researchers and engineers applying this architecture to their problems. Rule One: As the complexity in the relationship between the input data and the desired output increases, the number of processing elements in the hidden layer should also increase. Rule Two: If the process being modeled is separable into multiple stages, then additional hidden layers may be required.

If the process is not separable into stages, then additional layers may simply enable memorization and not a true general solution. Rule Three: The amount of training data available sets an upper bound for the number of processing elements in the hidden layers.

To calculate this upper bound, use the number of input-output pair examples in the training set and divide that number by the total number of input and output processing elements in the network; then divide that result again by a scaling factor.
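As a worked illustration of this rule of thumb (all numbers below are hypothetical):

```python
# Hypothetical sizing calculation for Rule Three.
training_pairs = 1000        # input-output examples in the training set
io_elements    = 10 + 5      # total input plus output processing elements
scaling_factor = 5           # relatively clean data
max_hidden = training_pairs / io_elements / scaling_factor
print(max_hidden)            # roughly 13 hidden processing elements at most
```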

Larger scaling factors are used for relatively noisy data. Extremely noisy data may require a factor of twenty or even fifty, while very clean input data with an exact relationship to the output might drop the factor to around two. It is important that the hidden layers have few processing elements. Too many artificial neurons and the training set will be memorized.

If that happens, then no generalization of the data trends will occur, making the network useless on new data sets. Once the above rules have been used to create a network, the process of teaching begins. This teaching process for a feedforward network normally uses some variant of the Delta Rule, which starts with the calculated difference between the actual outputs and the desired outputs.

Using this error, connection weights are increased in proportion to the error times a scaling factor for global accuracy. Doing this for an individual node means that the inputs, the output, and the desired output all have to be present at the same processing element.

The complex part of this learning mechanism is for the system to determine which input contributed the most to an incorrect output and how that element gets changed to correct the error. An inactive node would not contribute to the error and would have no need to change its weights. To solve this problem, training inputs are applied to the input layer of the network, and desired outputs are compared at the output layer.

During the learning process, a forward sweep is made through the network, and the output of each element is computed layer by layer. The difference between the output of the final layer and the desired output is back-propagated to the previous layer(s), usually modified by the derivative of the transfer function, and the connection weights are normally adjusted using the Delta Rule. This process proceeds for the previous layer(s) until the input layer is reached.
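The sketch below walks through one such training cycle for a tiny two-input, two-hidden, one-output network with sigmoid transfer functions; the layer sizes, the absence of bias terms, and the learning rate are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, w_out, lr=0.5):
    # Forward sweep: compute the output of each element, layer by layer.
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    out = sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)))

    # Output-layer delta: the error modified by the derivative of the transfer function.
    delta_out = (target - out) * out * (1 - out)

    # Back-propagate the delta to the hidden layer.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                    for j in range(len(hidden))]

    # Adjust connection weights with the Delta Rule, layer by layer.
    for j in range(len(w_out)):
        w_out[j] += lr * delta_out * hidden[j]
    for j, w in enumerate(w_hidden):
        for k in range(len(w)):
            w[k] += lr * delta_hidden[j] * x[k]
    return out
```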

There are many variations to the learning rules for back-propagation networks. Different error functions, transfer functions, and even the modifying method of the derivative of the transfer function can be used. The concept of "momentum error" was introduced to allow for more prompt learning while minimizing unstable behavior. Here, the error function, or delta weight equation, is modified so that a portion of the previous delta weight is fed through to the current delta weight. This acts, in engineering terms, as a low-pass filter on the delta weight terms since general trends are reinforced whereas oscillatory behavior is cancelled out.
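A sketch of this momentum modification on a single delta-weight term, with an illustrative momentum coefficient, might look like this:

```python
def momentum_delta(gradient_term, prev_delta, lr=0.1, momentum=0.9):
    # A portion of the previous delta weight is fed through to the current delta weight,
    # reinforcing general trends and damping oscillatory behavior.
    return lr * gradient_term + momentum * prev_delta
```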

This allows a low, normally slower, learning coefficient to be used, but creates faster learning. Another technique that has an effect on convergence speed is to only update the weights after many pairs of inputs and their desired outputs are presented to the network, rather than after every presentation. The number of input-output pairs that are presented during the accumulation is referred to as an "epoch."

There are limitations to the feedforward, back-propagation architecture. Back-propagation requires lots of supervised training, with lots of input-output examples. Additionally, the internal mapping procedures are not well understood, and there is no guarantee that the system will converge to an acceptable solution.

At times, the learning gets stuck in a local minimum, preventing the network from reaching the best solution. This occurs when the network system finds an error that is lower than the surrounding possibilities but does not finally get to the smallest possible error.

Many learning applications add a term to the computations to bump or jog the weights past shallow barriers and find the actual minimum rather than a temporary error pocket. Typical feedforward, back-propagation applications include speech synthesis from text, robot arms, evaluation of bank loans, image processing, knowledge representation, forecasting and prediction, and multi-target tracking.

Each month more back-propagation solutions are announced in the trade journals. The delta bar delta network utilizes the same architecture as a back- propagation network. The difference of delta bar delta lies in its unique algorithmic method of learning.

Delta bar delta was developed by Robert Jacobs to improve the learning rate of standard feedforward, back-propagation networks. As outlined above, the back-propagation procedure is based on a steepest descent approach which minimizes the network's prediction error during the process where the connection weights to each artificial neuron are changed.

The standard learning rates are applied on a layer by layer basis and the momentum term is usually assigned globally. Some back-propagation approaches allow the learning rates to gradually decrease as large quantities of training sets pass through the network. Although this method is successful in solving many applications, the convergence rate of the procedure is still too slow to be used on some practical problems.

The delta bar delta paradigm uses a learning method where each weight has its own self-adapting learning coefficient. It also does not use the momentum factor of the back-propagation architecture. The remaining operations of the network, such as feedforward recall, are identical to the normal back-propagation architecture. Delta bar delta is a "heuristic" approach to training artificial networks. What that means is that past error values can be used to infer future calculated error values.

However, this process is complicated in that empirical evidence suggests that each weight may have quite different effects on the overall error. Jacobs then suggested the common sense notion that back-propagation learning rules should account for these variations in the effect on the overall error.

In other words, every connection weight of a network should have its own learning rate. The claim is that the step size appropriate for one connection weight may not be appropriate for all weights in that layer. Further, these learning rates should be allowed to vary over time. By assigning a learning rate to each connection and permitting this learning rate to change continuously over time, more degrees of freedom are introduced to reduce the time to convergence. Rules which directly apply to this algorithm are straightforward and easy to implement.

Each connection weight has its own learning rate. These learning rates are varied based on the current error information found with standard back-propagation. When the connection weight changes, if the local error has the same sign for several consecutive time steps, the learning rate for that connection is linearly increased. Incrementing linearly prevents the learning rates from becoming too large too fast.

When the local error changes signs frequently, the learning rate is decreased geometrically. Decrementing geometrically ensures that the connection learning rates are always positive. Further, they can be decreased more rapidly in regions where the change in error is large. By permitting different learning rates for each connection weight in a network, a steepest descent search in the direction of the negative gradient is no longer being performed.
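A sketch of this per-connection learning-rate adjustment is given below; the increment and decrement constants, and the use of an averaged previous gradient as the sign reference, are illustrative assumptions rather than Jacobs' exact formulation.

```python
def adjust_learning_rate(lr, current_grad, prev_avg_grad, kappa=0.01, phi=0.5):
    # Each connection weight carries its own learning rate.
    if current_grad * prev_avg_grad > 0:
        lr += kappa            # error keeps the same sign: increase linearly
    elif current_grad * prev_avg_grad < 0:
        lr *= (1.0 - phi)      # sign keeps changing: decrease geometrically (stays positive)
    return lr
```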
