Automatic Target Recognition using Neural Networks
ABSTRACT
Two applications of automatic target recognition (ATR) using artificial neural networks are
presented: target position detection and target classification. The neural networks are based
on the probabilistic RAM (pRAM) neuron, which is briefly described. The pRAM has been
built using VLSI techniques and includes on-chip learning, which allows it to be used as an
adaptive embedded controller in robust systems.
1. Target Position Detection
Given an image or scene, S, and a target image, P, the neural system is to find the coordinates
of the target image, P, in the scene, S. Additionally, given any sub-scene S’ containing the
target, P, the system is expected to find the image P and to return its coordinates with respect
to S’. It is assumed that a reference scene S0 and a target image, P0, are known a priori. It is
also assumed that the range, azimuth and elevation of the observation point of scene, S, from
the target, P, are known to a reasonable accuracy. This information will be used to transform
any sub-scene S’ to the same scale as S0, so as to produce a scene S for analysis of where the
target P is located in S.
The target image is typically derived from a photograph and the image to be matched may
come from a video camera or another photograph.
Figure 1. (a) the target image, (b) change of viewpoint, (c) change of scale
In the simple images in Fig. 1, if (a) is the target image, P0, then (b) and (c) are the same
target seen from a different angle or range. It can be seen that (b) and (c) can be made to
approximate (a) by the application of suitable geometric transforms. It is not possible to
obtain a perfect reconstruction of (a) from (b) owing to the three-dimensional nature of the
object, since the images are available to the system only in two-dimensional form.
However, in real images, the target will not be readily segmented from its background.
Therefore the scene in which the target is placed is significant. It is essential that additional
information concerning the viewpoints of the reference scene and the observed scene is
known; otherwise the transformation of the observed scene to match the reference scene
cannot be performed accurately. The accuracy of this transformation is limited, in any case,
and the maximum difference in the angle of view between the reference and observed images
is around 30°. Standard correlation techniques do not give good performance for these
complex images in arbitrary scenes, which is why a neural network is used to handle the
non-linear characteristics of this problem.
There are additional problems inherent in using these techniques for outdoor scenes, which
are the time of day and the time of year. Only one reference scene may exist and this will be
for a certain time of day. If the scene is later observed at a different time of day, then the
effects of shadows or the lack of shadows will be significant. Shadows can distort the
apparent outline of objects, examples of which might be buildings or vehicles. Other
problems will be caused by night/day or summer/winter differences in the images.
Substantial changes, such as a change from full foliage to an absence of foliage or a thick
covering of snow, cannot reasonably be accommodated.
In the example described later, the reference scene was derived from a high-resolution
photograph and the observed images were received from an infra-red sensor. Here, there is a
cross-spectral problem where parts of the image which are optically dark may appear to be
light in the infra-red image owing to their high temperature. Therefore, this recognition
system must be insensitive to colour.
Because of the above artefacts in the observed image, preprocessing is essential in order to
remove, or at least reduce, these unwanted features. Most of the problems are caused by
scene illumination; however, it is assumed that the structure of the target image will not
change. This suggests that classification based on some form of feature detection will give
the best results, rather than template matching techniques alone. The exact features to be
extracted are dependent upon the structure of the target.
1.1 An example of position detection
In the example described below, a photograph of a building was used as the target image.
The observed image was part of a set of infra-red images at different ranges.
As stated above, the relevant features for position detection are dependent upon the structure
of the target. For a building, these features might be the corners or edges of the building with
architectural features such as windows providing additional information. The simplest form
of feature extraction is to use edge-detection. This method works well with high-resolution
and high-contrast images. When low-resolution and low-contrast infra-red images are used,
an edge may not be completely represented.
Therefore, a combination of conventional image processing and non-linear principal
component analysis was investigated as a solution to this problem. The extracted features
were passed to a pyramidal neural structure, and noise was injected during training to give
greater tolerance to variations in the observed image.
1.2 Preprocessing
In this example, the parameters of the reference scene viewpoint are known a priori. It is
assumed that navigational information is available to perform second-order geometric
transformations [1] on the observed image to give an approximate match to the reference
scene in terms of target size and angle-of-view.
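To make this preprocessing step concrete, the following is a minimal sketch of applying a second-order (quadratic) geometric correction, assuming the twelve polynomial coefficients have already been estimated from the navigational information. The function name, coefficient values and use of scipy are illustrative assumptions, not the method of [1].

```python
import numpy as np
from scipy.ndimage import map_coordinates

def second_order_warp(image, coeff_x, coeff_y):
    """Resample 'image' through a second-order (quadratic) coordinate transform.

    coeff_x and coeff_y are 6-element arrays (a0..a5) defining
        x_src = a0 + a1*x + a2*y + a3*x*x + a4*x*y + a5*y*y
    and similarly for y_src.  In the system described above these would be
    derived from the range/azimuth/elevation (navigational) information.
    """
    rows, cols = image.shape
    y, x = np.mgrid[0:rows, 0:cols].astype(float)
    terms = np.stack([np.ones_like(x), x, y, x * x, x * y, y * y])
    src_x = np.tensordot(coeff_x, terms, axes=1)   # source x for every output pixel
    src_y = np.tensordot(coeff_y, terms, axes=1)   # source y for every output pixel
    # map_coordinates expects (row, col) order; points outside the image are zero-filled
    return map_coordinates(image, [src_y, src_x], order=1, mode='constant', cval=0.0)

# Illustrative use: identity transform plus a small quadratic distortion
observed = np.random.rand(256, 256)
cx = np.array([0.0, 1.0, 0.0, 1e-4, 0.0, 0.0])
cy = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1e-4])
corrected = second_order_warp(observed, cx, cy)
```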
Two methods of further processing were then used and compared.
1.3 Principal Components Analysis
Principal Components Analysis (PCA) [2] of an image yields an ordered set of masks which
represent the most common features in that image and their order gives their frequency of
occurrence. Experiments were conducted to see how many components were required to
reconstruct the original image to a given accuracy, which was normally taken to be better than
95%. In this example, six PCA masks of size 8 by 8 pixels were used.
The image is multiplied by the six PCA masks, which results in six matrices; these matrices
are input to the neural network. Since each set of 64 pixels yields one 6-element vector, a
useful reduction of the input dimension by a factor of 64/6 (roughly ten to one) was achieved.
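As an illustration of this dimensionality reduction, the sketch below derives PCA masks from the 8 x 8 segments of a reference image and projects a segment onto them to obtain a 6-element feature vector. The function names and the patch-sampling step are assumptions; the original work does not specify how the masks were computed.

```python
import numpy as np

def pca_masks(reference, patch=8, n_components=6, step=2):
    """Derive PCA masks from all patch x patch segments of the reference scene."""
    rows, cols = reference.shape
    segs = np.array([reference[r:r + patch, c:c + patch].ravel()
                     for r in range(0, rows - patch + 1, step)
                     for c in range(0, cols - patch + 1, step)])
    segs = segs - segs.mean(axis=0)                  # remove the mean patch
    # Eigenvectors of the 64 x 64 covariance matrix, largest eigenvalues first
    eigval, eigvec = np.linalg.eigh(np.cov(segs, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, order].T.reshape(n_components, patch, patch)

def project(segment, masks):
    """Reduce one 8 x 8 segment (64 pixels) to a 6-element feature vector."""
    return np.array([(segment * m).sum() for m in masks])
```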
To train the neural network, the PCA masks were extracted from the reference scene. A
target point on the building was then marked manually. The image was then scanned in 8 x 8
pixel segments at 2 pixel increments. At each step, the image segment was convolved with
the PCA masks and the 6-element vector applied to a neural network. The network was
trained to give an output of "0" for all areas outside the marked segment and to give an output
of "1" at the marked point only. The network was assessed on a geometrically-corrected
infra-red image containing the same object. If successful, the network should give a
maximum response when the 8 x 8 pixel area containing the previously marked point in the
photograph is seen in the infra-red image.
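A sketch of this scan-and-label procedure is given below. The net.train_step call is a hypothetical placeholder for the pRAM training update, which is not detailed here; the step size and the single positive label at the marked point follow the description above.

```python
import numpy as np

def scan_and_label(scene, masks, target_rc, patch=8, step=2):
    """Yield (feature vector, label) pairs for every patch x patch segment of
    the reference scene, labelling the segment at the marked target point
    with 1 and every other segment with 0."""
    rows, cols = scene.shape
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            seg = scene[r:r + patch, c:c + patch]
            vec = np.array([(seg * m).sum() for m in masks])   # 6-element input
            label = 1 if (r, c) == target_rc else 0
            yield vec, label

# Illustrative training loop; 'net.train_step' stands in for the pRAM update rule.
# for vec, label in scan_and_label(reference_scene, masks, marked_point):
#     net.train_step(vec, label)
```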
The results of using PCA were disappointing. Although there was a peak in the response at
the desired point, there were a number of other peaks in the response, some of which were
larger in amplitude than the desired response. This is mainly because the PCA masks derived
from the photograph were applied to the infra-red image; these masks are clearly not
sufficient to discriminate the spatial features in the infra-red image.
1.4 Edge detection
In place of PCA, edge-detection using eight preferred orientations was used. The absolute
value of the edge-detected images was used and a single image was produced by summing the
eight outputs. It is clear that the discrimination performance will be improved if each
edge-detected feature is separately processed, but the advantages expected will be small. It is
noted that this method of edge-detection is a special case of PCA - where the components (or
masks) are predefined.
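The following sketch sums the absolute responses of eight oriented edge kernels into a single image, as described above. The base kernel and the use of an interpolated rotation are illustrative assumptions; the kernels used in the original work are not specified.

```python
import numpy as np
from scipy.ndimage import convolve, rotate

def oriented_edge_image(image, n_orientations=8):
    """Sum of absolute responses to edge kernels at eight preferred orientations."""
    # Simple 3 x 3 horizontal-edge kernel, rotated to each orientation (0..180 degrees)
    base = np.array([[-1.0, -1.0, -1.0],
                     [ 0.0,  0.0,  0.0],
                     [ 1.0,  1.0,  1.0]])
    total = np.zeros_like(image, dtype=float)
    for k in range(n_orientations):
        kernel = rotate(base, angle=k * 180.0 / n_orientations, reshape=False)
        total += np.abs(convolve(image.astype(float), kernel, mode='nearest'))
    return total
```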
The reference photograph was processed to produce an edge-detected image. Again, one
point on the building was marked manually. A 16 x 16 pixel window centred on this point
was used to train the neural network to give an output of "1", and the complement of this area
was applied to the network and trained to give an output of "0". Training noise [3] was used
to improve generalisation of the network.
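The sketch below builds training pairs in this way: noisy copies of the 16 x 16 window around the marked point as positive examples, and windows drawn from the complement of that area as negatives. The flip-based noise model and the numbers of examples are assumptions made for this sketch; the plot labels recovered later quote 20% training noise.

```python
import numpy as np

def training_pairs(edge_image, centre_rc, rng, patch=16, noise=0.2, n_examples=20):
    """Build (window, target) pairs from a binary edge-detected photograph."""
    r0, c0 = centre_rc[0] - patch // 2, centre_rc[1] - patch // 2
    positive = edge_image[r0:r0 + patch, c0:c0 + patch]
    pairs = []
    for _ in range(n_examples):
        # Positive example: flip a fraction 'noise' of the binary edge pixels
        flip = rng.random(positive.shape) < noise
        pairs.append((np.where(flip, 1 - positive, positive), 1))
    for _ in range(n_examples):
        # Negative example: a noisy window from elsewhere in the image
        r = rng.integers(0, edge_image.shape[0] - patch)
        c = rng.integers(0, edge_image.shape[1] - patch)
        if abs(r - r0) < patch and abs(c - c0) < patch:
            continue                      # skip windows overlapping the target
        neg = edge_image[r:r + patch, c:c + patch]
        flip = rng.random(neg.shape) < noise
        pairs.append((np.where(flip, 1 - neg, neg), 0))
    return pairs

# rng = np.random.default_rng(0)
# pairs = training_pairs(edge_photo, marked_point, rng)
```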
Figure 2. A section of the edge-detected infra-red (IR) and photographic (PHOTO) images (64 x 64 pixels)
When the network of Fig. 3, trained on the edge-detected photographic data, was used to
search for the marked point in the geometrically-corrected infra-red image, the maximum
response was found at the target point. The infra-red image was searched by scanning
across the entire image (256 x 256 pixels) using the 16 x 16 mask moving in 2 pixel
increments.
Figure 3. The pRAM neural network used in the position detection system.
Since the pRAM neuron produces an output in the form of a spike-train, and receives
real-valued inputs in the same form, each input was presented for a number of iterations
(typically 1000) and the output response was accumulated. The results are shown as the
firing rate in Fig. 4, where the maximum response is seen at an offset of zero pixels from the
marked position.
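This firing-rate accumulation can be pictured as follows. The sketch is not the pRAM hardware model itself: net_response_prob is a hypothetical stand-in for the network's spike-generation probability, and the conversion of real-valued inputs to Bernoulli spike trains simply follows the description above.

```python
import numpy as np

def firing_rate(net_response_prob, inputs, rng, iterations=1000):
    """Estimate the firing rate of a spiking (pRAM-style) network output.

    'net_response_prob' is assumed to map a binary input vector to the
    probability of an output spike on one iteration; the real pRAM stores
    such probabilities in RAM locations addressed by its binary inputs.
    Each real-valued input in [0, 1] is presented as a spike train by
    sampling it as a Bernoulli probability on every iteration."""
    spikes = 0
    for _ in range(iterations):
        binary_in = (rng.random(len(inputs)) < inputs).astype(int)  # input spike train
        if rng.random() < net_response_prob(binary_in):
            spikes += 1                                             # output spike
    return spikes / iterations
```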
Figure 4. The search results for the target position in an infra-red image.
1.5 Discussion of results
The results in Fig. 4 represent a single horizontal scan across the infra-red image, passing
through the target point. A similar scan for the vertical direction also shows a unique
maximum response. In the final system, the response is a 2-D map of the response of the net
as it scans the received image, in which a unique (maximum) response is required. However,
a maximum response does not indicate any certainty of having found the target. If the peak
of the response is not sharp, the confidence of the result is low.
A circle of error probability (CEP) estimate is required in order to assess the accuracy of the
result. The CEP estimate can be made using the variance of the spike trains with the formula
16 x 16 pixel input
(256 8-bit
vectors )
randomised
connections
to
pRAM inputs
256
inputs
output
Correlation of window pictures (horizontal move)
training--binary edge image of photo
testing--grey-level image of trans. edge IR
with 20% training noise
-64 -56 -48 -40 -32 -24 -16 -8 0 8 16 24 32 40 48 56 64
Number of pixels shifted
F iring rate
where < α > is the mean response at the maximum output of the net and R is the spike train
length. We take the CEP as defined by the intersection of the horizontal line at 2σ below
the maximum with the response curve. Thus a broad response gives a large CEP and a sharp
response, a small CEP. We find that, with R=103, α ∼ 0.8, so σ ∼ 1%, that for the data of
Fig 4, that the CEP ≈ ±6 pixels in the horizontal direction.
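As a worked illustration of the calculation above, the following sketch takes a one-dimensional firing-rate scan (such as the horizontal scan of Fig. 4), computes σ from the spike-train variance, and reads off the half-width of the region lying above the line 2σ below the peak. The function name and the array-based input format are assumptions made for this sketch.

```python
import numpy as np

def cep_estimate(offsets, rates, R=1000):
    """Circle-of-error-probability estimate from a 1-D response scan.

    offsets : pixel offsets of the scan (e.g. -64 .. 64 in steps of 8)
    rates   : mean firing rate of the net at each offset
    R       : spike-train length used to accumulate each rate
    """
    offsets = np.asarray(offsets, dtype=float)
    rates = np.asarray(rates, dtype=float)
    alpha = rates.max()
    sigma = np.sqrt(alpha * (1.0 - alpha) / R)      # std. dev. of the peak firing rate
    level = alpha - 2.0 * sigma                     # horizontal line 2*sigma below the maximum
    above = offsets[rates >= level]                 # offsets whose response exceeds that line
    return (above.max() - above.min()) / 2.0        # half-width of that region, in pixels

# With alpha ~ 0.8 and R = 1000, sigma is about 1%, in line with the figures
# quoted above; the CEP then depends on how sharply the response falls away
# from its peak.
```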