If I asked you to name the objects in the picture below, you would probably come up with a list of words such as “tablecloth, basket, grass, boy, girl, man, woman, orange juice bottle, tomatoes, lettuce, disposable plates…” without thinking twice. Now, if I told you to describe the picture below, you would probably say, “It’s the picture of a family picnic” again without giving it a second thought.
Those are two very easy tasks that any person with below-average intelligence and above the age of six or seven could accomplish. However, in the background, a very complicated process takes place. The human vision is a very intricate piece of organic technology that involves our eyes and visual cortex, but also takes into account our mental models of objects, our abstract understanding of concepts and our personal experiences through billions and trillions of interactions we’ve made with the world in our lives.
Digital equipment can capture images at resolutions and with detail that far surpasses the human vision system. Computers can also detect and measure the difference between colors with very high accuracy. But making sense of the content of those images is a problem that computers have been struggling with for decades. To a computer, the above picture is an array of pixels, or numerical values that represent colors.
Computer vision is the field of computer science that focuses on replicating parts of the complexity of the human vision system and enabling computers to identify and process objects in images and videos in the same way that humans do. Until recently, computer vision only worked in limited capacity.
Thanks to advances in artificial intelligence and innovations in deep learning and neural networks, the field has been able to take great leaps in recent years and has been able to surpass humans in some tasks related to detecting and labeling objects.
Applications of computer vision
The importance of computer vision is in the problems it can solve. It is one of the main technologies that enables the digital world to interact with the physical world.
Computer vision enables self-driving cars to make sense of their surroundings. Cameras capture video from different angles around the car and feed it to computer vision software, which then processes the images in real-time to find the extremities of roads, read traffic signs, detect other cars, objects and pedestrians. The self-driving car can then steer its way on streets and highways, avoid hitting obstacles, and (hopefully) safely drive its passengers to their destination.
Computer vision also plays an important role in facial recognition applications, the technology that enables computers to match images of people’s faces to their identities. Computer vision algorithms detect facial features in images and compare them with databases of face profiles. Consumer devices use facial recognition to authenticate the identities of their owners. Social media apps use facial recognition to detect and tag users. Law enforcement agencies also rely on facial recognition technology to identify criminals in video feeds.
Computer vision also plays an important role in augmented and mixed reality, the technology that enables computing devices such as smartphones, tablets and smart glasses to overlay and embed virtual objects on real world imagery. Using computer vision, AR gear detect objects in real world in order to determine the locations on a device’s display to place a virtual object. For instance, computer vision algorithms can help AR applications detect planes such as tabletops, walls and floors, a very important part of establishing depth and dimensions and placing virtual objects in physical world.
Online photo libraries like Google Photos use computer vision to detect objects and automatically classify your images by the type of content they contain. This can save you a much time that you would have otherwise spent to add tags and descriptions to your pictures. Computer vision can also help annotate the content of videos and enable users to search through hours of video by typing in the type of content they’re looking for instead of manually looking through entire videos.
Computer vision has also been an important part of advances in health-tech. Computer vision algorithms can help automate tasks such as detecting cancerous moles in skin images or finding symptoms in x-ray and MRI scans.
Computer vision has other, more nuanced applications. For instance, imagine a smart home security camera that is constantly sending video of your home to the cloud and enables you to remotely review the footage. Using computer vision, you can configure the cloud application to automatically notify you if something abnormal happens, such as an intruder lurking around your home or something catching fire inside the house. This can save you a lot of time by giving you assurance that there’s a watchful eye constantly looking at your home. The U.S. military is already using computer vision to analyze and flag video content captured by cameras and drones (though the practice has already become the source of many controversies).
Taking the above example a step further, you can instruct the security application to only store footage that the computer vision algorithm has flagged as abnormal. This will help you save tons of storage space in cloud, because in nearly all cases, most of the footage your security camera captures is benign and doesn’t need review.
Furthermore, if you can deploy computer vision at the edge on the security camera itself, you’ll be able to instruct it to only send its video feed to the cloud if it has flagged its content as needing further review and investigation. This will enable you to save network bandwidth by only sending what’s necessary to the cloud.
The evolution of computer vision
Before the advent of deep learning, the tasks that computer vision could perform were very limited and required a lot of manual coding and effort by developers and human operators. For instance, if you wanted to perform facial recognition, you would have to perform the following steps:
- Create a database: You had to capture individual images of all the subjects you wanted to track in a specific format.
- Annotate images: Then for every individual image, you would have to enter several key data points, such as distance between the eyes, the width of nose bridge, distance between upper-lip and nose, and dozens of other measurements that define the unique characteristics of each person.
- Capture new images: Next, you would have to capture new images, whether from photographs or video content. And then you had to go through the measurement process again, marking the key points on the image. You also had to factor in the angle the image was taken.
After all this manual work, the application would finally be able to compare the measurements in the new image with the ones stored in its database and tell you whether it corresponded with any of the profiles it was tracking. In fact, there was very little automation involved and most of the work was being done manually. And the error margin was still large.
Machine learning provided a different approach to solving computer vision problems. With machine learning, developers no longer needed to manually code every single rule into their vision applications. Instead they programmed “features,” smaller applications that could detect specific patterns in images. They then used a statistical learning algorithm such as linear regression, logistic regression, decision trees or support vector machines (SVM) to detect patterns and classify images and detect objects in them.
Machine learning helped solve many problems that were historically challenging for classical software development tools and approaches. For instance, years ago, machine learning engineers were able to create a software that could predict breast cancer survival windows better than human experts. However, as AI expert Jeremy Howard explains, building the features of the software required the efforts of dozens of engineers and breast cancer experts and took a lot of time develop.
Deep learning provided a fundamentally different approach to doing machine learning. Deep learning relies on neural networks, a general-purpose function that can solve any problem representable through examples. When you provide a neural network with many labeled examples of a specific kind of data, it’ll be able to extract common patterns between those examples and transform it into a mathematical equation that will help classify future pieces of information.
For instance, creating a facial recognition application with deep learning only requires you to develop or choose a preconstructed algorithm and train it with examples of the faces of the people it must detect. Given enough examples (lots of examples), the neural network will be able to detect faces without further instructions on features or measurements.
Deep learning is a very effective method to do computer vision. In most cases, creating a good deep learning algorithm comes down to gathering a large amount of labeled training data and tuning the parameters such as the type and number of layers of neural networks and training epochs. Compared to previous types of machine learning, deep learning is both easier and faster to develop and deploy.
Most of current computer vision applications such as cancer detection, self-driving cars and facial recognition make use of deep learning. Deep learning and deep neural networks have moved from the conceptual realm into practical applications thanks to availability and advances in hardware and cloud computing resources. However, deep learning algorithms have their own limits, most notable among them being lack of transparency and interpretability.
The limits of computer vision
Thanks to deep learning, computer vision has been able to solve the first of the two problems mentioned at the beginning of this article, meaning the detecting and classifying of objects in images and video. In fact, deep learning has been able to exceed human performance in image classification.
However, despite the nomenclature that is reminiscent of human intelligence, neural networks function in a way that is fundamentally different from the human mind. The human visual system relies on identifying objects based on a 3D model that we build in our minds. We are also able to transfer knowledge from one domain to another. For instance, if we see a new animal for the first time, we can quickly identify some of the body parts found in most animals such as nose, ears, tail, legs…
Deep neural networks have no notion of such concepts and they develop their knowledge of each class of data individually. At their heart, neural networks are statistical models that compare batches of pixels, though in very intricate ways. That’s why they need to see many examples before they can develop the necessary foundations to recognize every object. Accordingly, neural networks can make stupid (and dangerous) mistakes when not trained properly.
But where computer vision is really struggling is understanding the context of images and the relation between the objects they see. We humans can quickly tell without a second thought that the picture at the beginning of the article is that of a family picnic, because we have an understanding of abstract concepts it represents. We know what a family is. We know that a stretch of grass is a pleasant place to be. We know that people usually eat at tables, and an outdoor event sitting on the ground around a tablecloth is probably a leisure event, especially when all the people in the picture are happy. All of that and countless other little experiences we’ve had in our lives quickly goes through our minds when we see the picture. Likewise, if I tell you about something unusual, like a “winter picnic” or a “volcano picnic” you can quickly put together a mental image of what such an exotic event would look like.
For a computer vision algorithm, pictures are still arrays of color pixels that can be statistically mapped to a certain descriptions. Unless you specifically train a neural network on pictures of family picnics, it won’t be able to make the connection between the different objects it sees in a photo. Even when trained, the network will only have a statistical model that will probably label any picture that has a lot of grass, several people and tablecloths as a “family picnic.” It won’t know what a picnic is contextually. Accordingly, it might mistakenly classify a picture of a poor family with sad looks and sooty faces eating in the outdoors as a happy family picnic. And it probably won’t be able to tell the following picture is a drawing of an animal picnic.
Some experts believe that true computer vision can only be achieved when we crack the code of general AI, artificial intelligence that has the abstract and commonsense capabilities of the human mind. We don’t know when—or if—that will ever happen. Until then, or until we find some other way to represent concepts in a way that can also leverage the strengths of neural networks, we’ll have to throw more and more data at our computer vision algorithms, hoping that we can account for every possible type of object and context they should be able to recognize.