Vision system Vs Human - Test for 'closeness' and statistical significance?

Jonathanlai928

Starting to get Involved
We are moving to a vision system that will replace our current 100% human inspection, and I need to verify that the machine is at least 99% accurate, i.e. that it identifies the same faults as the human does (the benchmark).

There are 4 categories.
The data will come out as binary or pass/fail.
For example, machine counts 3 defects in category 1 for batch 1234. Human also counts 3 in category 1 for batch 1234.

Would a simple independent t-test suffice?
 

Bev D

Heretical Statistician
Leader
Super Moderator
NO. A t test is for continuous data. You will have categorical data.
Look for my article on MSA V and V in the resources section. Read the sections on categorical data.
 

Jonathanlai928

Starting to get Involved
Thank you. I've had a look at your document and it's really helpful. Just to elaborate:

There are 4 categories of defects possible.

Example: the machine counts 3x defect 1 vs. the human, who counts 5x defect 1.

How do I test whether the difference is statistically significant, and would I then be able to say it's something like 95% accurate?

I have some ideas for the above (e.g. chi-square), but what really stumps me is this: while I can compare the count or occurrence of one category between the machine and the human, how can I compare whether what they counted actually MATCHED?

For the above, should I do something like category 1 match vs. no match, and so on for the other categories?
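Roughly what I'm imagining, just as a sketch with made-up pass/fail data for ONE category (I'd repeat it for each of the 4 categories):

```python
# Made-up data for one defect category: one row per unit,
# 1 = defect flagged, 0 = not flagged.
import pandas as pd

df = pd.DataFrame({
    "unit":    [1, 2, 3, 4, 5, 6, 7, 8],
    "machine": [1, 0, 1, 0, 0, 1, 0, 1],
    "human":   [1, 0, 0, 0, 1, 1, 0, 1],
})

# 2x2 agreement table for this category (rows = human call, cols = machine call)
print(pd.crosstab(df["human"], df["machine"]))

# simple percent agreement (does not correct for chance agreement)
print(f"Percent agreement: {(df['machine'] == df['human']).mean():.1%}")
```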
 

Bev D

Heretical Statistician
Leader
Super Moderator
As usual this is less about the math than the study design. (No amount of mathematical manipulations can save a flawed study)
A few things to keep in mind:
  • Inspectors can be affected by defect rate while a Vision system usually isn’t. This will drive how many good units are used in the study vs how many bad units.
  • Time spent on each unit affects both systems. Too little time leads to misses and false positives for inspectors and can lead to misses in vision systems.
  • In general it is the disagreement that is important and not the amount of agreement. This drives how many ‘bad’ units are included in the study. Under sampling bad units can lead to great (statistical) agreement but that will be very misleading…
  • You will learn more from marginally good and marginally bad units than from very good and obviously bad units.
I typically set up such a study to have a set of known good and bad units of all types. Then I run the study with the inspectors (2 passes) and the vision system (2 passes). I compare the inspectors to the inspectors and the system to the system. Then, if each is in agreement with itself (at least 99%), I compare each one to the truth using the first-pass results. This iteration tells me that the inspectors and the system can detect the truth reliably. THEN I compare the results for the system to the inspectors. In this case I look at the false accept and false reject rates of each - then I use thinking to determine if the system is good enough to replace the inspectors.
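If it helps, here is a rough sketch of the false accept / false reject calculation against the known truth (made-up numbers, not from a real study):

```python
# Made-up data: 1 = reject (defective), 0 = accept (good).
truth  = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]   # known status of each study unit
system = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]   # vision system, first pass

n_bad  = truth.count(1)
n_good = truth.count(0)

# false accept: a truly bad unit that the system passed
false_accepts = sum(t == 1 and s == 0 for t, s in zip(truth, system))
# false reject: a truly good unit that the system failed
false_rejects = sum(t == 0 and s == 1 for t, s in zip(truth, system))

print(f"False accept rate: {false_accepts / n_bad:.0%} ({false_accepts} of {n_bad} bad units)")
print(f"False reject rate: {false_rejects / n_good:.0%} ({false_rejects} of {n_good} good units)")
```

Run the same numbers for the inspectors against the same truth, then compare the two sets of rates.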

I think @Miner will have some valuable input as well
 

Jonathanlai928

Starting to get Involved
Thank you for that. It's beginning to get clearer; sorry if I wrote it in a confusing way.

I have decided to do something similar to what you said and use inter-rater reliability (Cohen's kappa) to compare.
 

Bev D

Heretical Statistician
Leader
Super Moderator
You’re welcome. Rely more on looking at the results and calculating the false accept and false reject rates than on the kappa score; that is really just a rough indicator.
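For what it's worth, a minimal sketch showing kappa alongside the raw disagreements (made-up calls; cohen_kappa_score from scikit-learn is one readily available implementation):

```python
# Made-up per-unit calls: 1 = reject, 0 = accept.
from sklearn.metrics import cohen_kappa_score

inspector = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
system    = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

print(f"Cohen's kappa: {cohen_kappa_score(inspector, system):.2f}")

# the individual disagreements are the more informative part
disagreements = [u for u, (i, s) in enumerate(zip(inspector, system), start=1) if i != s]
print(f"Units where the calls disagree: {disagreements}")
```

The kappa number alone won't tell you whether the disagreements are false accepts or false rejects; you still have to look at each disagreeing unit against the truth.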
 

Miner

Forum Moderator
Leader
Admin
@Bev D has provided excellent advice. The only thing I can add, which may not be relevant to this case, is that I have seen a lot of automated testers that actually make continuous measurements and then report them as a pass/fail result. In most cases, you can get access to the continuous results and perform a traditional continuous R&R study.
 

Bev D

Heretical Statistician
Leader
Super Moderator
If you are able to perform a continuous data analysis of repeatability on the vision system, that is great. Establishing repeatability before assessing reproducibility, method comparison or ‘accuracy’ in terms of false accept & false reject is essential to having reliable results. It can also be invaluable in diagnosing any inaccuracy…just remember that repeatability - while critical - is not sufficient for establishing false accept / false reject rates…which is your stated ultimate objective.
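As a sketch of that repeatability piece on continuous readings (made-up numbers, assuming the system measures each unit several times):

```python
# Made-up data: rows = units, columns = repeated readings of the same unit.
import numpy as np

readings = np.array([
    [10.02, 10.01, 10.03],
    [ 9.97,  9.98,  9.96],
    [10.10, 10.12, 10.11],
    [ 9.99, 10.00,  9.98],
])

within_unit_var = readings.var(axis=1, ddof=1)       # variance of the repeats for each unit
repeatability_sd = np.sqrt(within_unit_var.mean())   # pooled across units (equal repeats assumed)
print(f"Repeatability (pooled within-unit SD): {repeatability_sd:.4f}")
```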
 