Viewpoint Invariant Object Recognition
Viewpoint invariant object recognition is a theory in cognitive science that arose as a possible explanation for the human ability to quickly and easily identify the same object from different depths and positions. Despite tremendous variation in the appearance of visual objects, primates can recognize a multitude of objects, each in a fraction of a second, with no apparent effort.[1] Viewpoint-invariant theories of object recognition propose that objects are represented on the basis of distinctive features and their inter-relations, which remain constant across changes in viewpoint.[2] For example, whether we see a cat from the front, from the back or from above, most people can easily tell that they are looking at a cat. Humans are very good at this kind of visual recognition task. The ability is known in cognitive science as pattern recognition, and it is important because it allows us to know what we are looking at and whether we have seen it before.
When we look at a cat or any other object, part of what the brain does is represent the object as a set of parts, such as a cat's four legs, two ears and two eyes. This allows recognition to take place regardless of our viewpoint, as long as the object's parts are recognizable.[3] Humans can recognize an object even if they cannot see all of it and even if they have never seen it from that angle before. For example, imagine the cat again, but this time flying through the air above us: although we may not be able to see the whole animal, we can see enough of its structural components, such as the tail, ears and paws, to recognize it as a cat. In general, people show little difficulty recognizing an object when they see it from a different perspective.[3] This is viewpoint invariant object recognition.
However, not all object recognition is viewpoint invariant. Irregularly shaped objects, such as bushes or crumpled socks, are one exception. Likewise, an object that looks unexpectedly different from the new perspective may not be recognized.
History
Given the tremendous variation in the visual images that people see and our capability to recognize the objects in them, it is clear that human beings have some mechanism in the brain that allows for object recognition. Somehow the visual system is able to identify the features that remain constant across these images. How it does so is known as the object constancy problem,[4] and it is at the heart of viewpoint invariant object recognition. When scientists realized how good people are at recognizing objects, they began to theorize about how the brain does it. Disagreement lay not in which process the visual system uses to match the input, but in how the information is represented in our minds. One of the first theorists to publish research on viewpoint invariant object recognition was Irving Biederman. In 1987 he published a paper on Recognition-by-Components Theory.[5] In it he discussed how image input to the brain is broken down into simple shape-based representations, such as cylinders and cubes, called geons. In his theory, geons are used to recognize anything we look at and are an essential component of our ability to recognize objects. While this has been the major influence on the field, his ideas have been challenged by the Template Matching Theory and by viewpoint-dependent theories, which are discussed in the theories section.
Physical Mechanisms of Object Recognition
Because an image can be projected onto the retina from varying angles and depths, there is a potentially limitless number of ways in which we can view an object. Yet humans can still recognize many of these variations as the same object, even if parts of the object are missing. In fact, people rarely make mistakes when tested under these conditions, even though the actual projections of the object onto the retina can be quite different.[6] This is possible because all normal objects contain features that stand in constant (invariant) relation to each other: on a face, for example, the hairline is always at the top of the head and the nose is always in the middle. The brain faces two problems with this information. First, it must bind all of this information together so that we see a whole object rather than separate features. Second, it must detect the invariant relationships between these features. There is evidence that humans can do both from birth.[4]
Feature Detection
When visual information enters the brain it does not remain one whole image; instead, the image of an object is broken down into parts. Each object we see is a combination of these parts. For example, a stop sign is a cylinder with an octagon on top. The brain takes these parts and puts them together in order to recognize what we are looking at.[7]
Visual System
When light reflects off an object and enters a person's eyes, it forms an inverted image on the retina. The image is then processed, and visual information about the object is carried to the primary visual cortex.
From there the information travels on to the ventral stream, which can be thought of as the "what" pathway of the brain and is important in deciding what an object is. At the same time, information also travels to the dorsal stream, which is used to determine where an object is located.[3] Although objects in the world are three-dimensional, the image on the retina is only two-dimensional. Therefore, when we see something, the image provides a view of only one side of the object. When we move around the object, parts appear and disappear from view.[8] This again raises the question of how humans are able to consistently recognize an object as the same thing, even though the images we see are constantly changing. Clearly this is not simply a visual process, and other areas of the brain must be involved.
Neural Mechanisms
In order to solve a recognition task, a person must use some internal representation of what they see. This representation takes the form of patterns of neural activation. Neurons receive signals from other neurons, which can cause them to fire in turn, typically by releasing neurotransmitters onto further neurons. During object recognition, input to the brain causes certain neurons to fire, which then affects other neurons. This series of inputs and firings forms a pattern. These patterns represent what we see and can change easily based on new information.[4] For example, the pattern of neurons that fire when we look at the cover of a book is very different from the pattern once we open the book to its first page. Yet we still perceive it as the same book. So how is this possible? Clearly there is more to recognition than the patterns of neural firing alone: another mechanism in the brain must be responsible for our ability to perceive objects as the same thing despite changes in their appearance. One theory is that the visual system learns from these changes, since it encounters many examples of such transformations from the first day of a person's life.[4] Although exactly how neurons do this is not known, studies have suggested that interactions between cells in the primary visual cortex and the temporal cortex may be responsible.[4] One study, in which electrodes were implanted into the brains of macaque monkeys, concluded that some neurons are entirely view independent (invariant), some are somewhat view independent, and some are view dependent. The researchers also found some neurons that are object dependent. Their conclusion is that view-independent representations are built in these regions by associating together neurons that respond to different views of the same individual.[9]
To better understand how the brain can do this, the same researchers built an artificial neural network that was able to replicate these mechanisms. Their results have been supported by other researchers who built similar and more complicated networks, which also showed invariant recognition.[10]
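The idea that a view-independent unit can be built by associating units tuned to different views of the same object can be illustrated with a toy calculation. The sketch below is not the network from the cited studies: the Gaussian tuning curve, its 30-degree width, the 45-degree spacing of preferred views and the use of a maximum to pool responses are all assumptions chosen for illustration.

```python
import math


def view_tuned_response(view_deg, preferred_deg, width_deg=30.0):
    """Response of one view-dependent unit (hypothetical Gaussian tuning)."""
    # Wrap the angular difference into [-180, 180) so 350 and 10 are 20 apart.
    diff = (view_deg - preferred_deg + 180.0) % 360.0 - 180.0
    return math.exp(-(diff ** 2) / (2.0 * width_deg ** 2))


def view_invariant_response(view_deg, preferred_views):
    """Pool (here: take the maximum) over units that code the same object."""
    return max(view_tuned_response(view_deg, p) for p in preferred_views)


# Assumed layout: one view-tuned unit every 45 degrees around the object.
PREFERRED_VIEWS = [0, 45, 90, 135, 180, 225, 270, 315]

if __name__ == "__main__":
    for view in (0, 20, 100, 200, 350):
        single = view_tuned_response(view, preferred_deg=0)
        pooled = view_invariant_response(view, PREFERRED_VIEWS)
        print(f"view={view:3d}  single unit={single:.2f}  pooled unit={pooled:.2f}")
```

In this sketch a single view-tuned unit responds strongly only near its preferred view, whereas the pooled unit responds strongly at every viewing angle, which is the behaviour described for view-independent neurons.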
Memory
Because viewpoint invariant theories propose that mental representations are stored as parts of an object rather than as wholes, they have implications for how much memory these tasks require: only structural parts need to be encoded. Storage of multiple viewpoints of an object would therefore not be required, as the parts can be assembled from any perspective. This stands in contrast with the Template Matching Theory, which requires a separate representation of every possible viewpoint of every possible object to be stored in memory, and which would likely take far more memory space in the brain.
Theories of Object Recognition
Recognition by Components Theory
This theory, developed by Irving Biederman, states that invariant object recognition is possible because objects are made up of structurally invariant components. These components are known as geons. The theory claims that our ability to recognize an object as the same object, even when we see it from a different angle or depth, is due neither to mental rotation nor to having multiple templates. Rather, people need only one view of an object to be able to recognize it from new perspectives. The brain breaks the object down into its geons and then uses these basic shapes to reconstruct a structural image of the object from any viewpoint, almost instantaneously. As long as both views of the object show similar parts (geons), such as a cylinder or a cone, the brain will recognize them as the same object.[3] One important aspect of Recognition-by-Components Theory is that if an arrangement of two or three geons can be recovered from the visual input, objects can be recognized quickly even when they are partially blocked, rotated, degraded or seen from a new viewpoint. Biederman ran experiments in which he briefly presented pictures of objects and then presented them again from a different angle. The participants were able to recognize the objects with very little difficulty. These results provide empirical support for the theory.[6] A fundamental assumption of Recognition-by-Components Theory is that recognition of individual geons (and therefore of objects composed of geons) should be equally accurate and fast from almost any viewpoint, barring what are called accidental viewpoints.[11] An accidental viewpoint is one from which a shape's true form is hidden, so that it looks like another shape. For example, a cylinder seen end-on may look like a flat circle, and a three-dimensional box seen end-on may look like a two-dimensional square.
Geons
Geons are the foundation of the Recognition-by-Components Theory. They are simple two- or three-dimensional shapes such as cylinders, cones, circles and rectangles. They correspond to the parts of an object and, when put together, make up its whole. There are only 36 geons in the theory, but the number of possible combinations is effectively endless.[6] This is a strength of the theory, because geons would not require the large amount of memory that the Template Matching Theory would.
There are four main properties of geons as described in the Recognition-by-Components theory:
- View-invariance: Each geon can be distinguished from the others from almost any viewpoint, except for accidental viewpoints, where one geon projects an image that looks like another. Objects represented as an arrangement of geons are also viewpoint invariant.
- Resistance to visual noise: Visual noise occurs when the object or geon is partially blocked or degraded, as when a rectangular billboard is partly hidden by a tree. Geons remain recognizable under such conditions.
- Invariance to lighting direction and surface markings: Geons remain clearly identifiable regardless of the lighting or of visible damage to the object's surface.
- Distinctiveness: Geons differ on only a few traits, such as straight vs. curved, parallel vs. non-parallel, and positive vs. negative curvature. These differences are easy to detect, however, which makes each geon easy to identify and helps whole-object recognition by eliminating confusion.[6]
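The way a parts-based description supports recognition from a new or partial view can be sketched in a few lines of code. This is a minimal illustration, not Biederman's model: the geon names, the relation names, the four stored objects and the 0.5 threshold are all hypothetical. Each object is stored once, as a set of (geon, relation, geon) triples, and nothing in the stored description refers to a viewing angle.

```python
# Each object is stored once, as a set of (geon, relation, geon) triples.
# The geon names, relation names and objects are hypothetical illustrations.
STORED_OBJECTS = {
    "mug": {("curved_handle", "side_of", "cylinder")},
    "pail": {("curved_handle", "top_of", "cylinder")},
    "suitcase": {("curved_handle", "top_of", "brick")},
    "lamp": {("cone", "top_of", "cylinder"), ("cylinder", "top_of", "brick")},
}


def match_score(observed, stored):
    """Fraction of the stored description that is present in the observed view."""
    return len(observed & stored) / len(stored)


def recognize(observed, threshold=0.5):
    """Return the best-matching object name, or None if no match is good enough."""
    best_name, best_score = None, 0.0
    for name, description in STORED_OBJECTS.items():
        score = match_score(observed, description)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    # Same two geons, different relation: the arrangement decides the object.
    print(recognize({("curved_handle", "side_of", "cylinder")}))
    print(recognize({("curved_handle", "top_of", "cylinder")}))
    # Partially blocked lamp: only the shade and stem are visible.
    print(recognize({("cone", "top_of", "cylinder")}))
```

The mug and the pail in this sketch share the same two geons and differ only in how they are arranged, so the relation between parts, not the parts alone, determines which object is recognized.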
Template Matching Theory
In the Template Matching Theory of object recognition, any image a person sees is matched against a stored mental representation of that image, known as a template. Depending on how much of the incoming image matches the template, the person will either recognize the object or not.[7] The theory accounts for invariant object recognition in several steps: multiple copies of the object are stored from different angles; a stored view is mentally rotated; the brain interpolates, or imagines, what the object would look like from the new view; and finally the rotated representation is compared with the new image to see whether they match.[3] The theory has drawn significant criticism because of the sheer variability of the objects a person sees throughout their lifetime, all of which would need a template in the mind. There would literally need to be a template for every object that a person can recognize. Variation in the letter A across fonts is a good example. Although each version of the letter is slightly different, the fact that they are all the letter A is obvious to anyone who can read English, even someone who has never seen those specific fonts before. Storing a template for each object, and for each different way we could see that object, would be inefficient and impractical. Because of these limitations the theory is not considered viable in cognitive psychology, even though this method of recognition is used by computers in certain situations.
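The memory criticism can be made concrete with a small sketch of template matching on binary pixel grids. Everything here is an illustrative assumption: the 5x5 letter shapes, the pixel-overlap score and the 0.9 threshold are not taken from the cited sources.

```python
# Templates are 5x5 grids of '#' (ink) and '.' (background); shapes are made up.
T_UPRIGHT = ["#####", "..#..", "..#..", "..#..", "..#.."]
L_UPRIGHT = ["#....", "#....", "#....", "#....", "#####"]


def rotate_clockwise(grid):
    """Rotate a square grid by 90 degrees clockwise."""
    return ["".join(row[c] for row in reversed(grid)) for c in range(len(grid[0]))]


def match_score(image, template):
    """Fraction of pixels on which the image and the template agree."""
    cells = [(a, b) for img_row, tpl_row in zip(image, template)
             for a, b in zip(img_row, tpl_row)]
    return sum(a == b for a, b in cells) / len(cells)


def classify(image, templates, threshold=0.9):
    """Return the label of the best template, or None if the match is too poor."""
    best_label, best_score = None, 0.0
    for label, template in templates:
        score = match_score(image, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


if __name__ == "__main__":
    templates = [("T", T_UPRIGHT), ("L", L_UPRIGHT)]
    rotated_t = rotate_clockwise(T_UPRIGHT)
    print(classify(T_UPRIGHT, templates))   # stored view
    print(classify(rotated_t, templates))   # new view, no template for it
    # The only remedy within pure template matching: store one more template.
    templates.append(("T", rotated_t))
    print(classify(rotated_t, templates))
```

In this sketch the rotated letter matches no stored template well enough to be recognized until a template for that exact view is added, which is why the number of templates grows with every new viewpoint.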
Pandemonium Model
The Pandemonium Model of object recognition uses feature detection to work out what an object is. Incoming visual information is assessed in terms of its features, such as a vertical line or a circle. The model processes these features through four kinds of demon: the image demon, the feature demons, the cognitive demons and the decision demon. A flaw of this method of recognition is that it relies entirely on bottom-up processing. It does not account for context effects such as the word superiority effect, which shows that letters are easier to recognize when they appear in a word.[12]
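The flow of information through the four kinds of demon can be sketched as a short pipeline. This is a simplified illustration under stated assumptions: the feature inventory, the four letters and the scoring rule (a penalty for every missing or surplus feature) are hypothetical, and the input dictionary stands in for what the image demon holds.

```python
# Expected feature counts per letter (a made-up, deliberately tiny inventory).
LETTER_FEATURES = {
    "A": {"oblique": 2, "horizontal": 1},
    "H": {"vertical": 2, "horizontal": 1},
    "L": {"vertical": 1, "horizontal": 1},
    "V": {"oblique": 2},
}


def feature_demons(image_features):
    """Each feature demon reports how often its own feature occurs in the image."""
    known = {f for expected in LETTER_FEATURES.values() for f in expected}
    return {f: image_features.get(f, 0) for f in sorted(known)}


def cognitive_demons(counts):
    """Each letter demon 'shouts' louder (closer to 0) the better its features fit."""
    return {
        letter: -sum(abs(expected.get(f, 0) - n) for f, n in counts.items())
        for letter, expected in LETTER_FEATURES.items()
    }


def decision_demon(shouts):
    """Pick the letter whose cognitive demon shouts loudest."""
    return max(shouts, key=shouts.get)


def recognize_letter(image_features):
    """Purely bottom-up: image -> features -> letter candidates -> decision."""
    return decision_demon(cognitive_demons(feature_demons(image_features)))


if __name__ == "__main__":
    print(recognize_letter({"vertical": 2, "horizontal": 1}))
    print(recognize_letter({"oblique": 2}))
```

Nothing in this pipeline takes neighbouring letters or expectations as input, which mirrors the criticism that a purely bottom-up model cannot account for context effects.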
Viewpoint Dependent Recognition Theory
Viewpoint-dependent recognition theory suggests that the recognition of an object is affected by the viewpoint from which it is seen. In this theory, objects are stored in memory as multiple viewpoints. This form of recognition requires a lot of memory, as each viewpoint must be stored; however, recent research has found that viewpoint-dependent recognition may play a role in object recognition. This is a direct challenge to viewpoint invariant object recognition theories. Based on the geon structural description approach, Biederman and Gerhardstein proposed three conditions under which object recognition is predicted to be viewpoint invariant. A later study reported two experiments that satisfied all three conditions yet revealed performance that was clearly viewpoint dependent. These results suggest that the conditions proposed by Biederman and Gerhardstein are not generally applicable, and that the recognition of distinct objects may in certain situations rely on viewpoint-dependent mechanisms.[13] Other theorists have suggested that human recognition should be viewed as a continuum rather than as either one or the other.[14] In a 1998 study, the authors claim that recognition of objects does depend on the angle from which we see them. They present experimental evidence that when an object is seen and then shown again from a different angle, people take longer to recognize it as the same object than when they see it again from the same angle.[11] Other studies have supported this theory as well. The recognition of static faces has been shown to exhibit viewpoint dependence. Studies using unfamiliar faces demonstrate that, in both recognition memory and matching tasks, judgements of faces seen from a novel viewpoint, whether measured by speed or accuracy, are typically impaired in proportion to the difference in angle of view. In general, the recognition of faces seems to be viewpoint dependent.[15]
Future Research
Some theorists have claimed that it is the environment that determines whether the brain uses viewpoint-invariant or viewpoint-dependent recognition. They suggest a multiple-views mechanism.[14] One study found that when faces are viewed in motion or upside down, they are recognized using viewpoint-invariant mechanisms, whereas faces that are viewed while still or right side up are recognized using viewpoint-dependent mechanisms.[15] Clearly the distinction between view-dependent and view-invariant recognition is not as black and white as some of the earlier theories claimed. Future research should seek to combine these two accounts of recognition rather than trying to prove one or the other wrong.
Another study has developed a neuroscience model that does something the other models presented in this article do not: it uses both top-down and bottom-up processing rather than bottom-up processing alone. The model also uses feedback between neurons that are organized hierarchically.[16] This means that while visual information is coming in and traveling through the brain to the areas that will recognize the object, information about things like context is being used at the same time to help the brain decide more quickly what the object is. An example helps to make the distinction clearer. Imagine you are at a rock concert and you see your grandmother. Although the visual information is the same as every other time you see her, you would probably take longer than usual to recognize her, because you would not expect her to be there. If you went to her house and saw her there, however, you would recognize her immediately because you expect it. This use of context is top-down processing; it likely plays a role in object recognition and is an exciting area for future research.
References
- DiCarlo, J., & Cox, D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8), 333-341.
- Harris, I., & Dux, P. (2005). Orientation-invariant object recognition: Evidence from repetition blindness. Cognition, 95(1), 73-93.
- Biederman, I., & Gerhardstein, P. (1993). Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19(6), 1162-1182.
- Walsh, V., & Kulikowski, J. (1998). Perceptual constancy: Why things look as they do. Cambridge: Cambridge University Press.
- Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2), 115-147.
- Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2), 115-147.
- Friedenberg, J., & Silverman, G. (2012). Cognitive science: An introduction to the study of mind. California: Sage Publications Inc.
- O'Regan, J., & Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24, 939-1031.
- Rolls, E. (1994). Brain mechanisms for invariant visual recognition and learning. Behavioural Processes, 33, 113-138.
- Kikuchi, M., & Fukushima, K. (2001). Invariant pattern recognition with eye movement: A neural network model. Neurocomputing, 38-40, 1359-1365.
- Tarr, M., Williams, P., Hayward, W., & Gauthier, I. (1998). Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1(4), 275-277.
- Marchetti, F., & Mewhort, D. (1986). On the word-superiority effect. Psychological Research, 48(1), 23-35.
- Hayward, W., & Tarr, M. (1997). Testing conditions for viewpoint invariance in object recognition. Journal of Experimental Psychology: Human Perception and Performance, 23(5), 1511-1521.
- Tarr, M., & Bülthoff, H. (1995). Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993). Journal of Experimental Psychology: Human Perception and Performance, 21(6), 1494-1505.
- Watson, T., Johnston, A., Hill, H., & Troje, N. (2005). Motion as a cue for viewpoint invariance. Visual Cognition, 12(7), 1291-1308.
- Deco, G., & Rolls, E. (2004). A neurodynamical cortical model of visual attention and invariant object recognition. Vision Research, 44(6), 621-642.