Calculate average

mox

Hi everyone.

I have a question about calculating an average, but not the simple average sumOfElements/numOfElements.

I have two arrays, and I have to compare which one has greater accuracy (for some purposes).
a[6] = [99,97,99,97,99,97]
b[6] = [99,99,99,99,99,89]

If I calculate the sum of a[6] and divide by 6, I get 98.
If I calculate the sum of b[6] and divide by 6, I get 97.33.
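
For reference, here is a quick sketch of that plain mean in Python (a and b are just the arrays above):

    a = [99, 97, 99, 97, 99, 97]
    b = [99, 99, 99, 99, 99, 89]
    print(sum(a) / len(a))  # 98.0
    print(sum(b) / len(b))  # ~97.33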

What I actually need is for b[6] to have greater accuracy, because it has five elements of 99 and just one 89, while a[6] has three 99s and three 97s.

I need to calculate an average that shows which array has more of the higher elements.

I don't know if I have explained it well.
 
How about considering the Standard Deviation of the sets? Set A has an s.d. of \(\sqrt{\tfrac{6}{5}} \approx 1.0954\), whereas set B has an s.d. of \(5\sqrt{\tfrac{2}{3}} \approx 4.0825\).
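
A minimal check of those figures in Python (using the sample standard deviation, which is what the formulas above assume):

    from statistics import stdev

    a = [99, 97, 99, 97, 99, 97]
    b = [99, 99, 99, 99, 99, 89]
    print(stdev(a))  # ~1.0954
    print(stdev(b))  # ~4.0825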
 
I would try to understand the reason for the "severe" outlier.
 
How about considering the Standard Deviation of the sets? Set A has an s.d. of \(\sqrt{\tfrac{6}{5}} \approx 1.0954\), whereas set B has an s.d. of \(5\sqrt{\tfrac{2}{3}} \approx 4.0825\).

I have tried this, but I am not sure it's what I wanted. Let me show a real example of what I am working on.

a[8] = 98.995697, 99.042831, 99.004173, 99.021362, 99.043694, 99.01416, 99.038483, 93.741989
b[8] = 99.007858, 98.984535, 99.043472, 99.029045, 96.799889, 98.611069, 97.092064, 98.569359

Looking with the naked eye, it's clear to me that a[8] has higher accuracy (which is what I want); by "high accuracy" I mean that a value of 100 is perfect accuracy, and a[8] has more values that are close to 100.
Standard deviation doesn't show that.
stddev(a[8])= 1.8671748066291356
stddev(b[8]) = 0.91532345375219248
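
The same check in Python, for reference (sample standard deviation again):

    from statistics import stdev

    a = [98.995697, 99.042831, 99.004173, 99.021362,
         99.043694, 99.01416, 99.038483, 93.741989]
    b = [99.007858, 98.984535, 99.043472, 99.029045,
         96.799889, 98.611069, 97.092064, 98.569359]
    print(stdev(a))  # ~1.8672 (pulled up by the single 93.74)
    print(stdev(b))  # ~0.9153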
 
We may be having language issues. Look up "accuracy" and "precision" in their English meanings.

As for measures of central tendency, there is usually not a single criterion of "goodness." It depends on what you want to do with that measure. A different way to say it is that any single number designed to summarize more than one differing number will LOSE some information that the different numbers provide collectively. Different measures of central tendency lose different information. So which information you want to retain determines which measure is relevant.
 
What I actually need is for b[6] to have greater accuracy, because it has five elements of 99 and just one 89, while a[6] has three 99s and three 97s.

I need to calculate an average that shows which array has more of the higher elements.

I don't know if I have explained it well.
No, you haven't.

It appears that what you mean by "accuracy" has nothing to do with how close the numbers are to one another, as we have been assuming (e.g. in suggesting standard deviation), but rather that your numbers all represent something like percentages of some goal (so that "accuracy" means 100%), and you are taking "greater accuracy" to mean "more of these percentages are closer to 100". Whether that makes sense, and how to take into account that the few that are lower may be much lower, depends on your application. What will it mean to you if one set fits your criterion? What do these numbers actually mean?

In particular, suppose that one set is {98,98,98} and another is {99,99,0}. Is the latter more accurate, in your opinion? Why?
 
No, you haven't.

It appears that what you mean by "accuracy" has nothing to do with how close the numbers are to one another, as we have been assuming (e.g. in suggesting standard deviation), but rather that your numbers all represent something like percentages of some goal (so that "accuracy" means 100%), and you are taking "greater accuracy" to mean "more of these percentages are closer to 100". Whether that makes sense, and how to take into account that the few that are lower may be much lower, depends on your application. What will it mean to you if one set fits your criterion? What do these numbers actually mean?

In particular, suppose that one set is {98,98,98} and another is {99,99,0}. Is the latter more accurate, in your opinion? Why?


I don't use the term "accuracy" in the math sense.
I am doing OCR (Optical Character Recognition) of letters, and those sets represent values of how accurately characters have been recognized.
100 means perfect accuracy, 0 means none.
The whole set represents a word, so if one set is {98,98,98} and another is {99,99,0}, then {99,99,0} is more accurate.
 
I'm not sure what it means to recognize a single letter 98% accurately; if it takes E as F, is that 75%? I'd expect any measure of accuracy to be an average over many characters, with each one being either yes or no.

How can getting one letter entirely wrong, whatever that means, be better than getting everything close?
 
I'm not sure what it means to recognize a single letter 98% accurately; if it takes E as F, is that 75%? I'd expect any measure of accuracy to be an average over many characters, with each one being either yes or no.

How can getting one letter entirely wrong, whatever that means, be better than getting everything close?
I know what I am looking for; it's not that simple.
Don't go into the background of what I am looking for; just understand what I am trying to achieve and what I am asking for.
 
I don't use the term "accuracy" in the math sense.
I am doing OCR (Optical Character Recognition) of letters, and those sets represent values of how accurately characters have been recognized.
100 means perfect accuracy, 0 means none.
The whole set represents a word, so if one set is {98,98,98} and another is {99,99,0}, then {99,99,0} is more accurate.
What does "accurate" mean in terms of a letter being recognized and how does that determine a numeric score? These are, as you have pointed out, your terms, not any standard meaning in math.

Using standard language, a letter is either recognized or not. So what do your numbers even mean?
 
What does "accurate" mean in terms of a letter being recognized and how does that determine a numeric score? These are, as you have pointed out, your terms, not any standard meaning in math.

Using standard language, a letter is either recognized or not. So what do your numbers even mean?
OCR works with probabilities of recognition; those numbers represent the probability of recognition.
If I have words whose characters have recognition probabilities {98,98,98,98} and {99,99,99,90}, it's more certain that the latter word {99,99,99,90} is correctly recognized, so I can't take a simple mean of the values. I have to compare those two sets in the manner I have described.
As I stated in the previous post, don't go into the background of what I need; just answer what I asked, because I can't go into an analysis of why this is so. I know why I am asking.
 
OCR works with probabilities of recognition; those numbers represent the probability of recognition.
If I have words whose characters have recognition probabilities {98,98,98,98} and {99,99,99,90}, it's more certain that the latter word {99,99,99,90} is correctly recognized, so I can't take a simple mean of the values. I have to compare those two sets in the manner I have described.
As I stated in the previous post, don't go into the background of what I need; just answer what I asked, because I can't go into an analysis of why this is so. I know why I am asking.
You do understand that you are trying to utilize a "free" service, where information must flow two ways (or more) without impediment.

My advice for you would be to:
  • go to a local university,
  • "hire" a professor of mathematics under a "secrecy agreement", and
  • offer him a project.
 
OCR works with probabilities of recognition; those numbers represent the probability of recognition.
If I have words whose characters have recognition probabilities {98,98,98,98} and {99,99,99,90}, it's more certain that the latter word {99,99,99,90} is correctly recognized, so I can't take a simple mean of the values. I have to compare those two sets in the manner I have described.
As I stated in the previous post, don't go into the background of what I need; just answer what I asked, because I can't go into an analysis of why this is so. I know why I am asking.
You did not say that your numbers were supposed to be probabilities. How could they be? A probability is DEFINED as a non-negative number that does not exceed 1. Furthermore, you did not specify what specifically determines that one array is better than another. You gave an example. It is impossible to create a reliable general rule from a small number of specific examples. You did NOT explain it well. Either you must explain in natural language what, in general, makes one array better than another, or you must provide a large number of examples.

In either case, people here may or may not be interested in doing commercial work for free. We are primarily interested in helping students, not for-profit ventures looking for free labor.
 
It's not a commercial project; I am doing it for myself, so I am not trying to get free work from someone in order to get rich from it! I mean, really...!
I asked a simple question and explained it. It's a simple matter; I am not hiding anything, nor am I doing anything in secrecy!
 
You have NOT explained it.

Which is more accurate and WHY?

(99, 95, 97, 97, 97, 97) versus (98, 96, 97, 97, 97, 97)
 
I did; look at post #7.
You gave an example. You did not give a general rationale. You are looking for a formula that will duplicate your general rationale. So you have to tell us what the rationale is. One or two examples will not do the trick.
 
I don't use the term "accuracy" in the math sense.
I am doing OCR (Optical Character Recognition) of letters, and those sets represent values of how accurately characters have been recognized.
100 means perfect accuracy, 0 means none.
The whole set represents a word, so if one set is {98,98,98} and another is {99,99,0}, then {99,99,0} is more accurate.
OCR works with probabilities of recognition; those numbers represent the probability of recognition.
If I have words whose characters have recognition probabilities {98,98,98,98} and {99,99,99,90}, it's more certain that the latter word {99,99,99,90} is correctly recognized, so I can't take a simple mean of the values. I have to compare those two sets in the manner I have described.
As I stated in the previous post, don't go into the background of what I need; just answer what I asked, because I can't go into an analysis of why this is so. I know why I am asking.
The reason background matters is that in order to meet your needs, we need to know exactly what your needs are. When people ask what kind of average is "best", for example, I always have to ask about their purpose, which typically depends on the context of their question, which will determine what matters most. That's all that's happening here. What's best can't be isolated from the specifics of what you are doing. And when people try to isolate a small part of a project, they usually don't get the right answer, because they are withholding information that they don't even realize is important.

So let's look at the facts as they have been gradually revealed. You have data representing the probability that each letter in a word will be recognized correctly; I'll suppose that is because, say, the software recognizes A 99% of the time but Z 90% of the time, and you are mapping each letter in a word to its probability. A word made up of more easily-recognized letters, and fewer easily-missed letters, is more likely to be correct; what you are asking for, then, is the probability that a word with a certain set of letter probabilities will be correctly recognized.

If that is what you want (and it would have been so much easier if you had said that at the beginning), then this is a basic probability problem. You are given p_i = P(letter a_i is correct), and want to know p = P(a_1 is correct and a_2 is correct and ... a_n is correct). Assuming the letters are recognized independently, that will just be the product of the probabilities, p = p_1 * p_2 * ... * p_n.

For your example, these are
  • {98,98,98,98}: 0.98*0.98*0.98*0.98 = 0.922368
  • {99,99,99,90}: 0.99*0.99*0.99*0.90 = 0.873269
So the former is in fact more likely to be recognized correctly, even though it has no 99's. Intuition is not always right. (That's why we have math ...)

Similarly, for your original example,
  • [99,97,99,97,99,97]: .99*.97*.99*.97*.99*.97 = 0.885566
  • [99,99,99,99,99,89]: .99*.99*.99*.99*.99*.89 = 0.846381
Again, the first is a little more likely to be correct (contrary to your guess), even though it has fewer 99's, because it has more high probabilities.
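
Here is a minimal Python sketch of that product rule (assuming the scores are percentages, so each is divided by 100; the name word_probability is just for illustration):

    from math import prod

    def word_probability(scores):
        # P(whole word correct) = product of the per-letter probabilities
        return prod(s / 100 for s in scores)

    print(word_probability([98, 98, 98, 98]))          # ~0.922368
    print(word_probability([99, 99, 99, 90]))          # ~0.873269
    print(word_probability([99, 97, 99, 97, 99, 97]))  # ~0.885566
    print(word_probability([99, 99, 99, 99, 99, 89]))  # ~0.846381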

Does that sound like what you are looking for?

By the way, do you still think that {99,99,0} is more accurate? (If the last letter can't ever be accurate, can the whole word be accurate?)
 
Maybe I should just compare the two sets by their descending values.
For example:
If I have two sets:
a[5]={99,98,99,98,98}
b[5]={99,90,99,99,99}
I order them in descending order:
a[5]={99,99,98,98,98}
b[5]={99,99,99,99,90}
and just compare a[1]>b[1], a[2]>b[2], ...
and see how many of the a's are higher or lower than the b's, and I will know which set has more of the higher values.
In this example b[5] wins 2:1, because b[3]>a[3], b[4]>a[4], and b[5]<a[5].
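
A small Python sketch of this idea, if it helps (the name compare_sorted is just illustrative; ties are ignored):

    def compare_sorted(a, b):
        a_wins = b_wins = 0
        for x, y in zip(sorted(a, reverse=True), sorted(b, reverse=True)):
            if x > y:
                a_wins += 1
            elif y > x:
                b_wins += 1
        return a_wins, b_wins

    print(compare_sorted([99, 98, 99, 98, 98], [99, 90, 99, 99, 99]))  # (1, 2): b wins 2:1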
 
Are you saying that you disagree with my analysis of your need?

My suggestion (just multiply) is considerably easier than what you propose here, and it has a clear justification. Math is not done by merely guessing at a way to combine numbers; it involves reasoning based on logic.

Please explain why you think your idea would give you an appropriate assessment for your purposes.
 