In the most recent webinar in the SNIA Data, Storage & Networking “AI Stack” webinar series, “From Data to Decisions: Understanding How AI Models Learn,” Cal Foshee and Eric Gamble provided an in-depth look at how computer vision models learn. They shared specific techniques and concrete examples of how to train and test these models, drawing on their many years of experience. If you missed the live session, you can view it here in the SNIA Educational Library, along with a PDF of the webinar slides.

We did not have time during the live webinar to answer the questions from the live audience. Cal and Eric have kindly provided detailed answers here:   

Q: Are high resolution images useful in computer vision, even with pixel downsampling?

A: Yes, regardless of pixel downsampling, high resolution images can be useful. 

First of all, there are “high-resolution” vision models available that downsample far less than other models. These models are often much larger and consume more resources, but they have their uses. Some of them use a technique called “tiling,” where the model breaks a high-resolution image into multiple tiles and each tile is then inferenced separately. This allows downsampling to happen on parts of the image rather than on the whole image, which decreases the overall loss of detail.
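The tiling idea can be sketched in a few lines. This is a minimal sketch using NumPy; the 640-pixel tile size and frame dimensions are illustrative assumptions, not values from the webinar:

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 640):
    """Split an H x W x C image array into fixed-size tiles.

    Each tile is inferenced (and downsampled) on its own, so detail
    is lost per tile rather than across the whole frame. Edge tiles
    that do not divide evenly keep their partial size.
    """
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            tiles.append(((y, x), img[y:y + tile, x:x + tile]))
    return tiles

# A hypothetical 1280x1920 frame splits into a 2x3 grid of 640px tiles.
frame = np.zeros((1280, 1920, 3), dtype=np.uint8)
grid = tile_image(frame, tile=640)
```

Each tile's detections are then mapped back to full-frame coordinates using the `(y, x)` offset stored with it.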

Additionally, if a modeler is utilizing multi-stage models, high-resolution images offer the benefit of more detail for the subsequent models to work with.  There are modeling techniques that allow a user to crop out much of the “noise,” or extra pixels, from an image, leaving only the more relevant pixels to be inferenced. With this reduction, downsampling is applied only to the region of interest, rather than shrinking the whole image (noise included) and leaving the user with fewer relevant pixels to inference with.
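The crop-then-downsample benefit can be illustrated with a toy resize. This is a sketch under assumed dimensions (a 4K frame with a hypothetical 640x640 region of interest), using naive nearest-neighbor resizing to avoid extra dependencies; real pipelines use proper interpolation:

```python
import numpy as np

def nearest_resize(img: np.ndarray, size: int) -> np.ndarray:
    """Naive nearest-neighbor downsample to a size x size square."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[ys][:, xs]

# Hypothetical 4K frame with a 640x640 region of interest (ROI).
frame = np.zeros((2160, 3840, 3), dtype=np.uint8)
y0, x0, y1, x1 = 500, 1000, 1140, 1640

# Downsampling the whole frame shrinks the ROI 6x horizontally;
# cropping first means the ROI reaches the model at full detail.
whole = nearest_resize(frame, 640)
roi = nearest_resize(frame[y0:y1, x0:x1], 640)
```

In the whole-frame case the 640-pixel-wide object survives as only ~106 pixels after resizing; cropped first, it keeps all 640.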

Q: What causes the labeling distortion? The words sometimes come out misspelled or nonsensical. Does that have to do with the downsampling and upsampling?

A: In the presentation examples, the distortion is largely a combination of downsampling and parallax/curvature working against the OCR.  On a flat document, OCR is highly effective.  In the “wild open space,” OCR needs to be “tamed.” Most OCR sampling corresponds to the equivalent of ~300 DPI on a flat document.  Stand that document up in space and off-angle, and we now have curvature, parallax, reflections, variable lighting, etc., on top of a ~300 DPI extraction.  Upsampling will not truly enhance text and character “readability” unless it is combined with additional software designed to do a more sophisticated extraction than common OCR (which is unlikely to upsample anyway).

To overcome the distortion for more accurate OCR extraction, similar to the above, we want to boost acuity by localizing/isolating the features to be read, and in most situations this prevents the downsampling.  In a generalized environment, this is more difficult than in a repetitive environment (manufacturing lines, store shelves, cargo pallets, etc.).  We design for value, so we do our best to golf on the greens and avoid the rough unless the value prop or mission needs lead us there.

Q: What is a bounding box, and how do they work?

A: Bounding boxes are the tool a modeler uses to teach the model what an “object” looks like.  Some model types allow for polygon bounding boxes, meaning the modeler can use whatever shape they believe best teaches the model what a specific object is. However, many models can only utilize square/rectangular bounding boxes, and any other shape will be converted to a rectangle during model training.

When a modeler draws a bounding box on an image and labels it with the object name (e.g., the brand badge on a car), the model will learn that all the pixels within the box make up that object. However, since not every pixel in that bounding box is likely part of the object, it is important that enough different images are used so that the model learns which pixels within the bounding box are unimportant (such as the color of the vehicle the brand badge is on). If enough unique images are used for training, the model will get a good idea of which pixel patterns make up an object.
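The polygon-to-rectangle conversion mentioned above is simple to sketch, along with the normalized center/width/height form many detection models store labels in. The badge coordinates and 640x480 image size are hypothetical:

```python
def polygon_to_rect(points):
    """Collapse a polygon label into the axis-aligned rectangle that
    rectangle-only model types fall back to during training."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def to_normalized(box, img_w, img_h):
    """Express a pixel box as (x_center, y_center, width, height),
    each scaled to 0-1 -- a common on-disk label format."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2 / img_w, (y0 + y1) / 2 / img_h,
            (x1 - x0) / img_w, (y1 - y0) / img_h)

# Hypothetical four-point polygon around a tilted brand badge.
badge = [(100, 50), (180, 40), (190, 120), (110, 130)]
rect = polygon_to_rect(badge)          # (100, 40, 190, 130)
label = to_normalized(rect, 640, 480)
```

Note how the rectangle necessarily includes corner pixels that were outside the polygon; those are the “unimportant” pixels the model must learn to ignore across many images.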

Q: How do we feed data into the model?  What are the steps or tools to train the model on that data (how do we do it)?

A: We feed data into vision models by adding images to the model’s data set, with objects properly labeled with bounding boxes (object detection model) or with the image properly classified (classification model).  Unlabeled or unclassified images are typically ignored by the algorithm when a model is trained.  There are modeling techniques that we use to bring in images that show the “absence of an object,” such as “gas cap present vs. gas cap missing.”

Adding a variety of images with different situations, lighting, angles, distances, good and bad states, etc. is how “data” is added to the model.  The only way a model can learn what an object is (or isn’t) is to have labeled images with those objects added to the data set.
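One practical way to check that a data set has the variety described above is to count examples per class and per condition before training. This is a minimal sketch; the annotation tuples and the “gas cap” class names are illustrative, borrowed from the answer above:

```python
from collections import Counter

# Hypothetical annotation records: (image_id, object_label, condition_tag).
annotations = [
    ("img_001", "gas_cap_present", "day"),
    ("img_002", "gas_cap_present", "night"),
    ("img_003", "gas_cap_missing", "day"),
    ("img_004", "gas_cap_present", "day"),
]

# Tallies reveal gaps in variety before training -- here, only one
# "missing" example and one "night" example, both likely too few.
by_class = Counter(label for _, label, _ in annotations)
by_condition = Counter(cond for _, _, cond in annotations)
```

A skewed tally is an early warning that the model will underperform on the underrepresented class or condition.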

Q: How do we evolve a model?  How can a model evolve itself?

A: Evolving a model is a critical part of a modeler’s role. A skilled modeler constantly looks for ways to improve a model—making it more accurate, robust, and capable. This process starts by analyzing what the model does well and where it struggles. Based on these insights, the modeler adds context by incorporating properly labeled images into the dataset.

For example, suppose a model consistently detects the target object but shows lower confidence scores for images taken at night compared to those taken during the day. While the predictions are still correct, the modeler can anticipate potential issues with “night images.” To address this, they introduce more nighttime examples into the dataset and train a new iteration of the model. If testing confirms improved performance, that updated model becomes the primary one for inspections.
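The day/night analysis in that example amounts to grouping correct detections by condition and comparing average confidence. A minimal sketch with made-up scores (the confidence values and condition tags are illustrative assumptions):

```python
from statistics import mean

# Hypothetical correct detections tagged with their capture condition.
predictions = [
    ("day", 0.96), ("day", 0.94), ("day", 0.95),
    ("night", 0.78), ("night", 0.81), ("night", 0.75),
]

by_condition = {}
for condition, confidence in predictions:
    by_condition.setdefault(condition, []).append(confidence)

avg = {cond: round(mean(scores), 2) for cond, scores in by_condition.items()}

# A large day/night gap flags where to add labeled images next,
# before the weakness turns into outright missed detections.
gap = avg["day"] - avg["night"]
```

After adding nighttime images and retraining, rerunning the same comparison on the new iteration confirms (or refutes) the improvement.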

Q: Can a Model Evolve on Its Own?

A: Technically, yes. Many tools offer features for automatic retraining and self-evolution. However, as Dr. Ian Malcolm famously said:

"Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."

We strongly caution against fully autonomous retraining because of error propagation. Models lack context—only modelers provide that. If a model retrains itself without oversight, it could reinforce and amplify existing errors. Moreover, a good modeler understands how a model behaves and uses that knowledge to deliver reliable results. If the model evolves independently, the modeler risks losing that understanding, which can undermine the entire use case.

Bottom line: A modeler should maintain full control over a vision model and only implement changes when they are confident those changes will improve performance.

Q: How has vision modeling helped you become better at traditional data modeling?

A: Analytics at its core is about identifying and making use of patterns in data and changes in those patterns.  This includes much of AI, predictive modeling, and change detection (anomaly modeling). Visualizing patterns and changes in patterns is powerful, and this is why a lot of data modeling begins with descriptive analytics and exploratory data analysis. Pattern recognition is key. Vision modeling is 100% about pattern recognition, and to be successful, we must develop and exercise our brains to see and think in patterns.  This development has a direct carryover to data pattern recognition... it might be numbers, it might be text, it might include time series, it often includes multiple input variables and variable transformations... and pattern modeling is required for useful predictive modeling.

Working in the vision modeling space has helped us develop the “better golfer” skills that we discussed in the presentation.  “Thinking in patterns” is a core skill in working with images. Identifying signal vs. noise in a visual sense is a core skill. Recognizing working patterns vs. broken patterns and when to “tune” or “adjust” patterns are skills that come from “doing and testing models”.  We learn how to use algorithms like tools to get productive and useful outcomes, and we learn how to compare and contrast the effectiveness of these tools – we need to get good at this because the tools will constantly change and we need to change as modelers over time, adapting our core skills to advance outcomes and capabilities.

Q: The “dog test”… how do we tie this into modeling?

A: When we talk about providing context to a computer vision model, think of it like this:

Imagine the “Dog Test.” A dog sees a leaf and mistakes it for food, so it licks it. After tasting it, the dog realizes it’s not food and moves on. The dog learns through experience because it has multiple senses and real-world context.

Now, your model has far less context than a dog; it only sees pixels. It can’t lick or interact with objects to learn. So how does it gain context? The answer: you, the modeler, must provide it.

Example: Wheel Inspection on a Manufacturing Line

A client wanted to inspect vehicle wheels for lug nut defects. Occasionally, the inspection images included a forklift passing by in the background. Forklifts have wheels too, and those wheels look similar to the ones being inspected. Since the model lacks context, it assumes these forklift wheels are the same as the target wheels and fails the inspection because forklift wheels don’t have lug nuts.

To fix this, the modeler takes the confusing images and updates the labels:

  • Correct or remove incorrect labels.

  • Add a new object class called “forklift_wheel.”

By doing this, the model now understands the difference between the wheels it should inspect and forklift wheels. This added context improves accuracy.
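The relabeling step can be sketched as a pass over the annotation records. The image IDs, box coordinates, and class names here are hypothetical stand-ins for the wheel-inspection example:

```python
def relabel(annotations, to_fix, new_label):
    """Rename the flagged annotations to a new class, giving the model
    the context to tell inspected wheels apart from forklift wheels."""
    return [
        (img, new_label, box) if (img, box) in to_fix else (img, label, box)
        for img, label, box in annotations
    ]

# Hypothetical labels: the forklift's wheel was originally labeled "wheel".
labels = [
    ("frame_12", "wheel", (400, 300, 520, 420)),  # wheel under inspection
    ("frame_12", "wheel", (40, 310, 150, 410)),   # forklift wheel in background
]
fixed = relabel(
    labels,
    to_fix={("frame_12", (40, 310, 150, 410))},
    new_label="forklift_wheel",
)
```

Retraining on the corrected set teaches the model a separate class for the look-alike object instead of letting it fail inspections on background wheels.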

A model will never create context on its own. It’s always the responsibility of the modeler to provide that context through properly labeled images.


Q: How do you know when a model is ready to use?

A: There’s no single universal threshold; it depends on the goal of the inspection and what the stakeholders expect. A model will never be perfect, but it can be tuned to maximize strengths and minimize weaknesses.

1. High-Accuracy Scenarios

For inspections that require near-perfect accuracy (e.g., >99.99%), the model must:

  • Perform correctly on real-world, process-representative images.
  • Pass stress tests where the modeler actively tries to “trick” or “break” it with challenging cases.
  • Run successfully on the production line for a sustained period, building confidence.

Once these conditions are met, the model can be integrated with shop-floor control systems to make real decisions.

2. Situations with Ambiguous Defects

For defects that are hard to define (scratches, dents, partially seated connectors):

  • Perfection isn’t realistic; catching ~80% of common defects can be a major improvement.
  • Focus on the most frequent issues rather than rare edge cases.
  • Remember: Don’t let perfect be the enemy of good.

When the model consistently catches the majority of defects, it’s likely ready for deployment.

Critical Factor: False Failures

The fastest way to get an inspection turned off is false failures: flagging good parts as defective. These waste time and resources and frustrate clients. Before declaring a model “ready to use,” ensure false failures are minimized or eliminated.
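The false-failure rate is easy to track explicitly before sign-off. A minimal sketch; the "ok"/"defect" result labels and the sample counts are illustrative assumptions:

```python
def false_failure_rate(results):
    """results: (ground_truth, prediction) pairs, each "ok" or "defect".

    A false failure is a good part the model flags as defective --
    the error most likely to get an inspection turned off."""
    good = [pred for truth, pred in results if truth == "ok"]
    if not good:
        return 0.0
    return sum(1 for pred in good if pred == "defect") / len(good)

# Hypothetical sustained run: 100 good parts (3 wrongly failed)
# plus 10 true defects, all caught.
runs = ([("ok", "ok")] * 97 + [("ok", "defect")] * 3
        + [("defect", "defect")] * 10)
rate = false_failure_rate(runs)  # 0.03
```

Tracking this number over the sustained production run mentioned above gives a concrete threshold to hold the model to before it is wired into shop-floor controls.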

Bottom Line

A model is ready when:

  • It meets the accuracy expectations for its use case.
  • It performs reliably under real-world conditions.
  • False failures are under control.

Q: How does OCR relate to vision models?

A: At the end of the day, OCR is a form of vision model; it is just a focused version specifically created to recognize the patterns that make up letters, numbers, and special characters.  Can you train a computer vision model to be an OCR?  Yes, it is definitely possible to train a model that recognizes characters. However, there is often no advantage to creating your own versus integrating an existing open-source OCR into your inspection solution. Occasionally, a modeler may find a specific font or character type that traditional OCRs do not read well, and in that situation, it could be worth creating your own.

Additionally, pairing OCR models with other vision models can be a very powerful strategy.  An example could be using a custom vision model to find a label in an image and then crop that label out and send it to a specialized OCR model that can then read the label and digitize the information. 

A modeler who wants to utilize OCR technology would be smart to learn more about regular expressions (regex).  A regex tells the OCR what sort of character patterns to look for when reading a label.  For example, a user may want to read serial numbers off of a label, and the serial numbers may follow a specific pattern, such as two capital letters followed by a dash “-” followed by six numbers. If you learn how to write a regex to communicate this pattern, the OCR has a much higher chance of reading the characters properly.  Another example: if you tell the OCR that the first character is a letter, it will not mistake the letter “O” for the number “0,” and vice versa.
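The serial-number pattern described above translates directly into a regex. A minimal sketch using Python's `re` module; the serial format and the leading-letter fix-up are the illustrative cases from this answer:

```python
import re

# Serial pattern from the example: two capital letters, a dash, six digits.
SERIAL = re.compile(r"[A-Z]{2}-\d{6}")

def read_serial(ocr_text):
    """Accept OCR output only if it matches the expected serial pattern."""
    match = SERIAL.fullmatch(ocr_text.strip())
    return match.group(0) if match else None

def fix_leading_zero(ocr_text):
    """Knowing the first character must be a letter disambiguates
    the classic "O" vs "0" confusion at that position."""
    return ("O" + ocr_text[1:]) if ocr_text[:1] == "0" else ocr_text

ok = read_serial(" AB-123456 ")   # matches after stripping whitespace
bad = read_serial("AB123456")     # missing dash -> rejected as a misread
```

Validation like this turns the regex into a filter: OCR output that cannot match the known pattern is flagged as a misread rather than silently accepted.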

Like with all AI and vision tools, understanding the limitations and strengths of the tool will help the user know how to best use the tool and extract the most value from it. 

For a deeper dive into this topic, check out this interview with our speakers.

As mentioned, this webinar is part of the SNIA Data, Storage & Networking “AI Stack” webinar series. We encourage you to register for upcoming sessions and view past presentations on demand.

We have many more webinars planned for 2026. Follow us on LinkedIn or X for upcoming dates and topics.