Introduction to Computer Vision

In a world where images and videos are part of our daily communication, Computer Vision has emerged as one of the fastest-developing and most indispensable technologies in the Artificial Intelligence field. This discipline teaches machines to “see” the world, and it has transformed how we interact with smart devices and how those devices interpret and respond to our environment.

This article will explore the world of Computer Vision, discussing its definition, how it works, and its applications in various industries. We will also look at recent advancements in the field and potential benefits and challenges. Whether you are a technology enthusiast or just curious about how computers "see" the world, this article will provide an insightful overview of this fascinating technology.

What is Computer Vision?

Computer Vision is a subfield of Artificial Intelligence and Machine Learning that develops techniques allowing machines to understand digital images and videos. It involves analyzing those images to extract the information needed to solve a problem. This approach is similar to how we humans use our eyes to recognize faces, objects, and scenes.

Computer vision involves many disciplines and techniques, including statistics and mathematics. Statistics helps with pattern recognition, noise reduction, and the training of models; for example, Gaussian filters are based on a probability distribution. On the other hand, mathematical fields like linear algebra, calculus, and geometry support image transformation and optimization. With these disciplines, machines can learn to classify images, detect objects, track motion, and even recognize human emotions.

Computer Vision vs. Image Processing

Some people confuse computer vision with image processing because the two fields are closely related, but they differ in their objectives and applications. Image processing focuses on preparing images for further analysis or improving their quality for human perception; its goal is not to interpret the content or understand the objects that may appear.

Image processing manipulates images through algorithms and can perform tasks like image enhancement (such as brightness and contrast adjustment), filtering and noise removal, and conversion between different image formats.

On the other hand, Computer Vision, as we stated before, aims to extract data in order to understand the visual environment. Some of the tasks it can perform include object detection, recognition, tracking, and segmentation. These capabilities are used in various fields, including self-driving cars, medical imaging, surveillance, and robotics.

Requirements and Key Concepts for Computer Vision

Venturing into computer vision requires comprehensive resources and skills to effectively develop and implement vision-based applications. There is certain hardware and software you must have.

On the hardware side, you will need cameras and sensors of various types, since capturing images or videos is at the core of Computer Vision. These range from standard digital cameras to more specialized devices such as thermal or depth cameras and LiDAR systems, which enhance depth perception and mapping capabilities by providing distance measurements, ideal for applications in robotics and 3D mapping.

Furthermore, the computationally intensive nature of computer vision tasks demands powerful processing units. Graphics Processing Units (GPUs) often carry the heaviest workload, since deep learning models and complex image processing tasks leverage their parallel processing capabilities; 8 GB of VRAM is a sensible minimum, with 16 GB recommended. While processors (CPUs) handle various tasks, a powerful CPU with multiple cores helps with pre-processing and running other essential computer vision algorithms. A minimum of 4 cores is recommended, with more cores offering better performance.

On the software side, you will need libraries and frameworks that enable you to work with images, train models, and build applications. Some popular choices are the Open Source Computer Vision Library (OpenCV), TensorFlow, and PyTorch.

OpenCV is a widely used open-source library offering a comprehensive set of functions for real-time computer vision tasks like image processing, feature extraction, and object detection. According to its website, it has more than 2500 optimized algorithms, spanning both classic and state-of-the-art computer vision and machine learning techniques.

TensorFlow and PyTorch are popular deep-learning frameworks that allow you to build, train, and deploy computer vision models. Both offer high-level APIs and extensive functionality for various deep-learning tasks.

Apart from these technical requirements, a solid foundation in mathematics and statistics, including areas like linear algebra, calculus, and probability, is essential for understanding these algorithms. But don’t worry if these aren’t your strong suit yet; there are plenty of resources available to help you build your mathematical and statistical confidence!

Remember, even a basic grasp of these mathematical concepts will go a long way in understanding the intuition behind the algorithms, and you can always delve deeper as you progress. The important thing is to jump in and be curious.

Fundamental Processes of Computer Vision

Computer vision systems process visual information through a pipeline that converts it into meaningful interpretations or decisions. These processes can be broadly grouped into several key stages:

  1. Image Acquisition: this is the first step where the visual data is captured through cameras and/or sensors. The quality of image acquisition directly impacts the effectiveness of subsequent processing steps. Image acquisition can involve considerations like lighting, angles, and resolution.

  2. Image pre-processing: once captured, the image data might need some cleaning and preparation. This can include tasks such as the following (a short code sketch after this list illustrates a few of them):

    • Noise Reduction, which consists of removing irrelevant information or distortions.

    • Contrast enhancement, achieved by adjusting the image to make certain features more distinguishable.

    • Normalization, where images are scaled to a standard size or intensity range.

    • Edge detection, which highlights the edges of objects within an image to simplify analysis.

  3. Feature extraction: in this step, the system identifies and extracts key features from the image, which is crucial for reducing the resources required to process large volumes of data. The result is a set of details such as edges, corners, or specific shapes, which are used to represent the image in subsequent processes.

  4. Segmentation: the goal is to partition an image into meaningful segments that can represent individual objects, distinct regions with specific properties, or even groups of pixels that share similar characteristics and are easier to analyze. Segmentation techniques can be based on attributes like color, intensity, or texture. Some examples might be:

    • Thresholding, a simple method where pixels are classified based on their intensity values. For example, in biological research, specifically in the study of cells, we can use thresholding to separate the foreground (cells with high intensity) from the background (low intensity).

    • Region-based segmentation, which consists of grouping pixels with similar characteristics into regions. Unlike thresholding, it considers multiple properties of pixels to achieve more sophisticated and accurate segmentation.

  5. Object Detection and Recognition: in simpler terms, object detection is like finding and locating objects in a picture, while recognition is figuring out what each object is. To do this, computers can use patterns, features, or powerful learning models like convolutional neural networks (CNNs) to pinpoint and categorize all sorts of objects in an image or video.

  6. Classification: imagine teaching a computer to sort things. Classification is like training an expert to identify objects by showing it a bunch of labeled pictures, like “dog” or “table”. By studying these examples, the expert can figure out what appears in new pictures, allowing it to predict the correct label for objects it hasn’t seen before.

  7. Tracking: when analyzing videos, tracking keeps tabs on things that move. It does this by finding the object in each frame, one after another. This is useful for things like security systems where we want to see where people or objects are going.

  8. Post-processing: the final step is like polishing a gem: we take the results from the earlier steps and make them even more accurate. This can involve getting rid of false alarms (where the computer thought it saw something that wasn’t there), combining all the information we found, and making sense of it all. Finally, we use this refined data to make the best decisions possible.
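To make a few of these stages concrete, here is a minimal OpenCV sketch (the file path is just a placeholder) that applies noise reduction, edge detection, and a simple intensity threshold to a grayscale image:

```python
import cv2

# 1. Image acquisition: load an image in grayscale (placeholder path)
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# 2. Pre-processing: reduce noise with a Gaussian blur
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# 3. Feature extraction: highlight object boundaries with Canny edge detection
edges = cv2.Canny(blurred, 100, 200)

# 4. Segmentation: classify pixels as foreground or background by intensity
_, mask = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY)
```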

There are many tools and tricks computers use to do each of these steps, and new ideas from machine learning and deep learning keep making them better and faster. By putting all these steps together, computers can “see” and understand pictures and videos more and more accurately, which lets them be used in all sorts of cool ways across many different fields.

Practical Applications 

Computer vision has a wide array of applications across various industries, significantly transforming how tasks are performed and services are delivered. Here are some key applications of computer vision:

  1. Healthcare

    • Medical Imaging Analysis: computer vision algorithms are transforming medical imaging analysis. By analyzing X-rays, MRI scans, and CT scans, these algorithms can detect subtle abnormalities that might be missed by the human eye.

    • Patient Monitoring: vision systems can be used to continuously monitor patients in intensive care units or at home. They can track vital signs, analyze facial expressions for signs of pain, and even detect changes in gait or posture that might indicate emerging medical issues.

  2. Automotive and Transportation

    • Traffic Management: computer vision systems can be installed at intersections and along roadways, continuously analyzing traffic patterns. They can detect congestion in real time, allowing authorities to dynamically adjust traffic signals to optimize flow, or even feed automated algorithms that change traffic lights on their own.

  3. Sports and Entertainment

    • Performance Analysis: by analyzing footage of athletes’ movements and plays, computer vision systems can provide detailed feedback on technique and strategy. This can help athletes identify areas for improvement, such as optimizing swing mechanics in baseball. 

Challenges and Limitations

Computer vision has made remarkable strides in recent years, but it still faces significant challenges and limitations. These issues can affect the performance and applicability of vision systems across various domains. Here are some of the main challenges and limitations:

  1. Variability in Lighting and Weather Conditions

    Computer vision systems often struggle in environments with variable lighting, such as shadows, glare, or insufficient light, which can significantly degrade the quality of image analysis.

    Adverse weather conditions like fog, rain, or snow can obscure visual sensors, making it difficult for systems to accurately interpret visual data.

  2. Complexity of Real-World Environments

    Real-world scenarios are incredibly diverse and unpredictable. Objects can be partially obscured, perspectives can vary, and scenes can be cluttered, all of which pose challenges for accurate object detection and recognition.

    Background noise and similar-looking objects can lead to misclassifications or false positives/negatives.

  3. High Computational Requirements

    Many computer vision tasks, especially those involving deep learning, require substantial computational power and memory. This can limit the deployment of advanced computer vision systems to environments where high-end hardware is feasible.

    Real-time processing needs, such as in autonomous vehicles or real-time surveillance, impose even greater demands on system capabilities.

  4. Data Bias and Ethical Concerns

    Training datasets may not be representative of all demographics, leading to biases in model predictions. For example, facial recognition systems have been found to have higher error rates for certain racial groups.

    The use of computer vision in surveillance and personal data analysis raises significant privacy and ethical concerns, requiring careful regulation and transparent practices.

  5. Integration with Other Systems

    Computer vision often needs to be integrated with other systems and technologies, such as IoT devices or robotics. This integration can be complex, involving synchronization of different technologies and data types.

    Ensuring robust and secure communication between systems can be challenging, especially in critical applications like healthcare or transportation.

  6. Scalability

    Scaling computer vision systems from controlled experimental settings or small-scale deployments to widespread real-world applications can introduce unexpected challenges. These include handling significantly larger datasets, coping with diverse operating conditions, and ensuring consistent performance across different platforms.

  7. Algorithmic Limitations

    Despite advances, certain tasks remain difficult for computer vision systems, such as understanding context from a single image, dealing with abstract or non-literal images, and long-term scene understanding.

    Algorithms can be sensitive to small changes in input data or parameters, leading to unstable performance in new or slightly different environments.

  8. Standardization and Quality Control

    There is a lack of standardization in terms of how vision data is processed, stored, and shared, which can hamper the development and evaluation of systems.

    Ensuring the quality and reliability of computer vision applications is critical, particularly in safety-critical areas like autonomous driving and medical diagnostics.

Tutorial

For this tutorial, we will build an image segmentation algorithm using Jupyter Notebooks with the Conda distribution.

Image segmentation is the process of partitioning an image into multiple segments or clusters. The goal is to simplify the representation of an image to make it more meaningful and easier to analyze. This technique is commonly used to locate objects and boundaries in images.

We will also be using the OpenCV library, so we need to run the following line of code to install it.
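Assuming a pip-based setup inside your Conda environment, the install command typically looks like this:

```
pip install opencv-python
```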


Once installed, we will need to import the libraries into the code.
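A minimal import block matching the libraries described below:

```python
import cv2                       # OpenCV, for computer vision tasks
import numpy as np               # NumPy, for numerical operations
import matplotlib.pyplot as plt  # Matplotlib, for plotting images
import urllib.request            # For downloading images from a URL
```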

cv2 is the OpenCV library for computer vision tasks. The NumPy library, which we have used before in my previous tutorials, is for numerical operations. matplotlib.pyplot is a library for plotting images. Finally, urllib.request is a library for downloading images from a URL.
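Next, we fetch the image we want to segment. The URL below is only a placeholder; point it at any image you like:

```python
url = "https://example.com/sample.jpg"  # placeholder URL

# Download the raw bytes and decode them into an OpenCV image (BGR format)
resp = urllib.request.urlopen(url)
image_array = np.asarray(bytearray(resp.read()), dtype=np.uint8)
image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
```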

This code downloads an image from a URL and loads it into a format that can be processed by OpenCV.

Once the image is loaded in the code, we will convert it from OpenCV’s default BGR ordering to RGB for better compatibility with the Matplotlib library. Then the image is reshaped into a 2D array where each row represents a pixel and each column represents a color channel (R, G, B). This makes it easier to apply the k-means clustering algorithm.
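A sketch of those two steps (the float32 conversion is needed because OpenCV’s k-means implementation expects floating-point input):

```python
# BGR -> RGB so Matplotlib displays the colors correctly
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# One row per pixel, one column per color channel; cv2.kmeans needs float32
pixel_values = np.float32(image.reshape((-1, 3)))
```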

Now we will initialize the criteria for the k-means algorithm, which define the conditions under which the algorithm stops iterating. Specifically, it terminates after completing 100 iterations or achieving an accuracy threshold of 0.2, whichever comes first. This ensures that the algorithm does not run indefinitely and provides control over the convergence quality.

The k parameter specifies the number of clusters into which the image is to be divided, influencing how detailed the segmentation will be. Finally, cv2.kmeans is the function from OpenCV that implements the k-means algorithm. It processes the image data, partitions it into the specified number of clusters, and returns two key outputs: the centers of these clusters, which represent the average color of pixels in each cluster, and the labels for each pixel, indicating which cluster a particular pixel belongs to.
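Putting that together (k = 3 here, matching the three clusters in the results further down; 10 restart attempts is a common choice):

```python
# Stop after 100 iterations or when the centers move less than epsilon = 0.2
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.2)

k = 3  # number of color clusters to segment the image into

# labels: cluster index per pixel; centers: mean color of each cluster
_, labels, centers = cv2.kmeans(pixel_values, k, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)
```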

This method effectively groups pixels with similar characteristics and is widely used in image processing to simplify and reduce the complexity of images for analysis.

This is a crucial part of the image segmentation process using k-means clustering.

The first code block, shown below, converts the cluster centers to 8-bit integers. The cluster centers calculated by the k-means algorithm are typically in floating-point format, representing the mean color values of the pixels in each cluster. This conversion is necessary because the pixel intensity values in images are usually represented as integers ranging from 0 to 255, and it ensures that the cluster centers can be directly used as color values for visualizing the image.
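```python
# Convert the floating-point cluster centers to 8-bit integer color values
centers = np.uint8(centers)
```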

The second code block, shown after the next two paragraphs, handles the label mapping and the image reshaping. After the conversion, each pixel in the original image is labeled with a cluster index by the k-means algorithm. These labels indicate which cluster (or color center) each pixel belongs to, and the mapping line replaces each pixel’s label with the corresponding cluster center value.

‘labels.flatten()’ converts the label array into a 1D array, and ‘centers[...]’ uses this array to look up and replace each label with its corresponding cluster center color. The result is a 1D array where each pixel’s color is now one of the cluster centers.

Since the mapping process flattens the image into a 1D array of colors, it needs to be reshaped back to the original dimensions of the image to restore its original structure. ‘image.shape’ provides the dimensions (height, width, and color channels) needed to reshape the array back into a proper image format.
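Here is that second block:

```python
# Replace each pixel's label with its cluster's center color
segmented = centers[labels.flatten()]

# Restore the original (height, width, channels) structure of the image
segmented_image = segmented.reshape(image.shape)
```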

Now we display both the original and the segmented images side by side for visual comparison, using the Matplotlib library:
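A minimal version of that comparison plot:

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image)
axes[0].set_title("Original Image")
axes[0].axis("off")
axes[1].imshow(segmented_image)
axes[1].set_title("Segmented Image")
axes[1].axis("off")
plt.show()
```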

This is the main tutorial, but you may be wondering: how can I use this image for an analysis?

Using the segmented image obtained from the algorithm, you can perform various analyses to extract meaningful information. For this example, we’ll use a color analysis.

The color analysis of the segmented image focuses on the simplified representation provided by the k-means clustering, which groups similar colors into clusters. This contrasts with the analysis of the original image, where the color information is more detailed and varied. 

In the following code, we will show how closely the colors of the segments in the processed image match the average colors of the corresponding regions in the original image. This can be particularly useful in applications where color fidelity and segmentation accuracy are critical, such as in digital arts or medical imaging.
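A sketch of that comparison, reusing the variables from the steps above:

```python
# Compare each cluster's mean color in the original image with the
# quantized center color used in the segmented image
flat_labels = labels.flatten()
for i in range(k):
    original_mean = pixel_values[flat_labels == i].mean(axis=0)
    print(f"Cluster {i + 1}: Original Mean Color: {original_mean}, "
          f"Segmented Mean Color: {centers[i]}")
```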

As a result, we get the following output:

```
Cluster 1: Original Mean Color: [ 92.90724 142.18944 165.9082 ], Segmented Mean Color: [ 92 142 165]
Cluster 2: Original Mean Color: [ 42.713947  84.39792  110.38863 ], Segmented Mean Color: [ 42  84 110]
Cluster 3: Original Mean Color: [160.1698  177.44855 189.85788], Segmented Mean Color: [160 177 189]
```

Remember, you can always find the source code on our GitHub page!


In Conclusion,

This article offered a thorough exploration of Computer Vision, introducing its core concepts, functionalities, and applications across various industries. It distinguished Computer Vision from related technologies such as image processing and detailed the essential hardware and software required for its implementation, including tools like OpenCV, TensorFlow, and PyTorch.

We walked through the practical applications of Computer Vision, highlighting its transformative impact on sectors such as healthcare, automotive, and entertainment, and acknowledged the challenges and limitations of the technology, including lighting issues, weather conditions, and computational demands.

A practical tutorial on image segmentation using k-means clustering rounds out the article, providing you, my dear readers, with a hands-on demonstration of applying Computer Vision techniques.

I’m happy to announce that I’m working on a project using Computer Vision in education, more specifically for kindergarten children. Once I have finished and presented this work, I hope to share with you a quick overview of the whole process behind the project. Thanks for reading!

