How Do I Determine The Degree Of Linearity In A Large Set Of Random Bytes?

The Grand Rascal · New member · Joined May 24, 2021 · Messages: 10
IMPORTANT: I am NOT a student, and this is NOT any form of, or in any way related to, any homework assignment. I am a private person, 62 years of age.

-----

I'm investigating the quality of the output of the "HotBits" random number server, at:

(URL removed)

HotBits derives its random bytes by measuring the interval between pairs of radioactive decay events. Supposedly, this guarantees randomness because their entropy source is governed directly by quantum mechanics.

I have gathered 168,656 (supposedly) random bytes from them, and wish now to test the quality of their randomness.

While I'm certain there must be many dozens of different tests for randomness, there is really only one that interests me: equal chance of selection. To be truly random, each possible value must have the same likelihood of occurrence.

So what I'd like to do, is to write a program in Commodore BASIC (I'm a C128 hobbyist) that will read all of the random bytes and test the frequency distribution of the byte values for linearity -- the more linear the distribution, the better!

My problem is that I don't know how to measure the linearity of a frequency distribution.

Is there some kind of simple formula, or a simple set of steps, that will provide me with this measurement?

Many thanks, in advance, for any information.

Range of possible values: 0 - 255, inclusive.
Number of values: 168,656.

--The Grand Rascal (a.k.a. "C128User")
 
What is the DEFINITION of LINEARITY of a frequency distribution?
 
"What is the DEFINITION of LINEARITY of a frequency distribution?"

Basically, it means a frequency distribution that is flat: each value occurs with the same frequency as every other value. If a frequency distribution table shows six possible values that each occur six times, that's a linear frequency distribution. Think of the frequency distribution of 1d6 (i.e., a single six-sided die) rolled 100 times. That's pretty much a linear (flat) frequency distribution -- no "bell curve."

Frankly, I'm surprised you ask; I thought the idea was evident from the way I described the problem. I apologize for any unclear or confusing verbiage.
 
No, that is NOT the definition of "linearity of a frequency distribution". You have said what it means for a frequency distribution to be linear, but "linearity" is a measure of HOW linear a frequency distribution is. We need to know how that measure is defined.

In mathematics, definitions are "working definitions" -- that is, the precise words of a definition are used in proofs and problems involving the object defined. The reason Subhotosh Kahn asked you to give the definition of "linearity of a frequency distribution" was to get you to think about the precise words in the definition.
 
OP, I think you want to show that the numbers follow a discrete uniform distribution (click).

Your basic program could build a histogram of how often each possible number occurs in a sample of data. If you plot the histogram, you'd expect it to approach a rectangle shape (horizontal at the top) as the amount of data that you input is increased.
 
OP, I think you want to show that the numbers follow a discrete uniform distribution (click).

Your basic program could build a histogram of how often each possible number occurs in a sample of data. If you plot the histogram, you'd expect it to approach a rectangle shape (horizontal at the top) as the amount of data that you input is increased.
I wanted OP to distinguish between linear and uniform.

To solve the problem, I would:
  1. Calculate the best-fit straight line through the given points and observe the slope of the line (how much 'm' can I tolerate?)

  2. Calculate the R² value to observe the goodness of fit to a straight line.
These can be done easily by importing the data into Excel, or by manually coding and calculating the "y = mx + b" and R² values.
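The slope-and-R² approach could be sketched like this in Python (an illustrative hand-rolled least-squares fit rather than Excel; the variable names are mine, not from the thread). One wrinkle worth noting: for a perfectly flat histogram the total variance is zero, so R² is undefined there, which is one reason uniformity is usually tested with chi-squared instead.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx

def r_squared(xs, ys, m, b):
    """Goodness of fit; undefined (division by zero) when ys is perfectly flat."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = list(range(256))            # byte values 0..255
ys = [2 * x + 1 for x in xs]     # a deliberately linear "histogram" for demonstration
m, b = fit_line(xs, ys)
print(round(m, 6), round(b, 6))  # 2.0 1.0
```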
 
I wanted OP to distinguish between linear and uniform.

To solve the problem, I would:
  1. Calculate the best-fit straight line through the given points and observe the slope of the line (how much 'm' can I tolerate?)

  2. Calculate the R² value to observe the goodness of fit to a straight line.
These can be done easily by importing the data into Excel, or by manually coding and calculating the "y = mx + b" and R² values.

That sounds like a very nice feature to implement/perform! I guess it would be good to calculate, mathematically, the expected R² value for a specific number of input samples, assuming the data is truly random. Then this could be compared to the observed value. I'd have to do some research if I were attempting this myself. @The Grand Rascal, does this make sense to you, and is it something that you could attempt (or be interested in)?
 
OP, I think you want to show that the numbers follow a discrete uniform distribution (click).

From what I read in the Wikipedia article, this sounds about right.

I'm not interested in a (visible) histogram, although it would certainly be necessary to (sort of) construct one in memory. I'd dimension an array -- "DIM N(255)" -- which would create an array of 256 elements, 0-255, and I'd store the frequency that each value occurs in the corresponding element.

Then I'd need some (hopefully simple!) way of determining that the distribution of all elements is reasonably uniform!

Hmmm... would a simple output of the Highest Value and the Lowest Value (I think you call that the "scatter") suffice, do you think?
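The array-tallying idea maps directly onto code. Here is a minimal Python sketch of the same logic (the real program would use DIM N(255) in Commodore BASIC; the sample data is made up for illustration), including the highest and lowest counts mentioned:

```python
def tally(data):
    """Count how often each byte value 0..255 occurs (like DIM N(255) in BASIC)."""
    counts = [0] * 256        # one bin per possible byte value
    for b in data:
        counts[b] += 1        # like N(B) = N(B) + 1
    return counts

counts = tally([7, 7, 7, 200, 200, 0])
print(max(counts), min(counts))   # highest and lowest frequencies: 3 0
```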
 
I'm not interested in a (visible) histogram, although it would certainly be necessary to (sort of) construct one in memory. I'd dimension an array -- "DIM N(255)" -- which would create an array of 256 elements, 0-255, and I'd store the frequency that each value occurs in the corresponding element.
This sounds good.
Then I'd need some (hopefully simple!) way of determining that the distribution of all elements is reasonably uniform!

Hmmm... would a simple output of the Highest Value and the Lowest Value (I think you call that the "scatter") suffice, do you think?
You've picked a reasonably difficult project! The problem is, how would you know what output to expect? People aren't very good at judging randomness.

After doing a search, I found a (fairly) simple test called the chi squared test. This doesn't seem to take into account the number of samples that you have, but you might be interested nonetheless. Take a look at the SECOND suggestion (from "Jon Reid") in this stackoverflow.com discussion (click). It should give you "the probability of the result being due to chance".

If you want to do this, then first you need to calculate X². Basically, you'd have to loop through all the elements of your array. For every element, let x = the difference between the observed and expected count. Then square x and divide it by the expected count. Add this number to a running total. At the end you'll have an amount for X², which you can plug into the calculator on this page, which will finally give the probability of the result being due to chance (the degrees of freedom, d, is 255: 256 bins minus one).
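The loop just described could be sketched like this in Python (illustrative only; the OP's actual program would be in Commodore BASIC, and the 256 bins mirror the DIM N(255) array):

```python
def chi_squared(counts):
    """X² statistic for a 256-bin histogram against a uniform expectation."""
    total = sum(counts)
    expected = total / 256                 # same expected count for every bin
    # For each bin: square the observed-minus-expected difference,
    # divide by the expected count, and accumulate a running total.
    return sum((obs - expected) ** 2 / expected for obs in counts)

# A perfectly uniform histogram scores 0:
print(chi_squared([2] * 256))              # 0.0
```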

For extra info see this page chi-squared_test(click).

Interestingly, as pointed out on stackoverflow, building a histogram won't detect a sequential pattern. That is, if the numbers in your sample go 0,1,2,3,...,254,255,0,1,2,...,255, etc., the histogram will still look perfectly uniform. But you should get an indication of whether or not the sample has a bias towards (or away from) some numbers.
 
Cubist wrote:

"For every element let x=the difference between the observed and expected value."

Sounds reasonable, except that I'm not completely sure what the "expected value" should be. Since the range of numbers is from 0 - 255, my guess is that this would be the midway point, or 128. Is that correct, or is there another number I should be measuring against? (And, how on Earth do I calculate this in BASIC... but that's my headache!)

(And BTW, thanks for all the help.)
 
[Scratches head thoughtfully]: Hmmm... Would a simple mean deviation from 128 be sufficient, do you think...???
 
...I'm not completely sure what the "expected value" should be.

We need the expected value for the "height of the histogram" in the array "DIM N(255)". In other words, what average value would you expect to find within the elements of N after reading in all the random bytes?

It might help you to think about what would happen if you fed 512 random bytes into your program (512 = 2*256) rather than the full set of 168,656. Then, if you output the 256 "N" values to the screen, what number would you expect to see (on average)? Clue: it wouldn't be 128,128,128,....,128,128

BTW: I think you can probably figure this out, that's why I'm not telling you directly!

(And BTW, thanks for all the help.)

You're welcome.
 
I had to do a LOT of thinking on this, and I'm still not completely sure I have the right answer! Is "658.8125" what you had in mind...???
 
Aha! I got it! Yes, now that I think about it, a simple mean deviation from 658.8125 (rounded up to 659)* will quite suffice.

*(There's no Earthly point including the decimal, since the frequency count of each value must be an integer anyway. It does mean that there's no way for the mean deviation to ever be zero, but then I'm not expecting it to be. I'm not trying to prove or measure "randomness" -- even if that's what that amounts to -- but to measure the degree to which each value is equally likely to occur. The closer to zero the mean deviation gets, the more closely the bytes meet that criterion, and for my purposes that should be perfectly adequate.)

A mean deviation will also be significantly easier to program (and what is even better, much easier for me to understand!). I'm mulling over some possible code in my mind even as I write this... I'll keep you posted! :)
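The mean-deviation test being described could be sketched like this in Python (an illustration only; the real program would be in Commodore BASIC, and 659 is the rounded expected count from the thread):

```python
def mean_deviation(counts, expected=659):
    """Average absolute difference between each bin's count and the expected count."""
    return sum(abs(c - expected) for c in counts) / len(counts)

# A perfectly flat histogram (each of the 256 values occurring 659 times)
# gives a mean deviation of 0:
print(mean_deviation([659] * 256))   # 0.0
```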
 
It's 03:30 in the morning, and I just had to pop in here to address a niggling concern which abruptly occurred to me:

Will it make any difference whether I subtract 659 from each value's frequency, or subtract each value's frequency from 659? That is, does --
(FREQ - 659) = (659 - FREQ) ?

My common sense tells me that it shouldn't make any difference whatsoever -- especially since I will be taking the absolute value of the result! -- but, given that subtraction is NOT a commutative operation, I thought I'd just better check in here to be sure... since math (or even just simple arithmetic!) doesn't always accord with "common sense."
 
It's 03:30 in the morning, and I just had to pop in here to address a niggling concern which abruptly occurred to me:
So, like me, you also struggle to sleep when something interesting is buzzing around your brain! Sorry that I wasn't around to answer more quickly so that you could sleep!

Will it make any difference whether I subtract 659 from each value's frequency, or subtract each value's frequency from 659?

One method yields the negative of the other method. So, for example, you'll either get 2 or -2. So, to answer your question, have a look at the next step (highlighted in bold)...

If you want to do this, then first you need to calculate X². Basically, you'd have to loop through all the elements of your array. For every element, let x = the difference between the observed and expected count. Then square x and divide it by the expected count. Add this number to a running total. At the end you'll have an amount for X², which you can plug into the calculator on this page, which will finally give the probability of the result being due to chance (the degrees of freedom, d, is 255: 256 bins minus one).

Do you know what happens when you square a negative number? (If you're unsure then search the internet for "square of a negative number", or ask here)
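The point about the two subtraction orders can be checked directly. A tiny Python sketch (the bin count 661 is made up for illustration):

```python
freq = 661                    # a hypothetical bin count
a = freq - 659                # one order of subtraction:  2
b = 659 - freq                # the other order:          -2

print(a == -b)                # True: each result is the negative of the other
print(abs(a) == abs(b))       # True: absolute values agree either way
print(a * a == b * b)         # True: squares agree either way too
```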
 
Yes, I do know the result of squaring a negative number.*

It's the foundation, heart, and soul of the "standard deviation," which frankly I have never fully or properly understood. I consider it a lot more effort than it's worth; a simple "mean deviation" is more logical and surely easier to understand.

As for the negative numbers... I guess you missed where I said that "I will be taking the absolute value of the result;" there is a simple snippet of BASIC code that will handle that:
Code:
X = ABS(X)

"Looky, Maw! No more negative numbers!" :)

Just thinking about "standard deviation" gives me a headache (which always makes me nervous, since I'm subject to migraines).

---------
*The result of square rooting (is that proper wording?) a negative number is a lot more interesting, however. At least, "i" think so. :)
 
It's the foundation, heart, and soul of the "standard deviation," which frankly I have never fully or properly understood...
I don't think I've ever given it that much thought; I just accepted all this stuff at college. Too much thinking can be a dangerous pastime :)
...I consider it a lot more effort than it's worth
Just let the trusty Z80 in your Commodore handle all the hard work!

a simple "mean deviation" is more logical and surely easier to understand.
Why not? You could always expand the functionality later if you want to. The only problem with the mean deviation is, as I said in post #9, knowing what number to expect. In other words, what kind of output value would lead you to the conclusion that the data is random?
 
Well, as we established in messages #13 and #14, in the "ideal distribution" each value would occur 658.8125 times -- or 659, rounded up.

"[W]hat kind of output value would lead you to the conclusion that the data is random?"
A mean deviation from 659 that is as close to zero as possible would do it. How close? Well, you have a point: I don't really know. But as a reasoned guess, with 168,656 samples, I'd say that any M.D. under 17 (one ten-thousandth of the number of samples, rounded) would probably be really good. :)

(You know, finding the range would be helpful, too, since in an "ideal" [flat] distribution, this too would be small, and the smaller the better.)

I don't have ready physical access to my C128 at the moment, so actual coding will have to wait, although I am pondering some pseudocode...
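One way to check whether 17 is a sensible cutoff is to simulate: generate the same number of (pseudo)random bytes, compute the mean deviation, and see what values truly random data typically produce. A hedged Python sketch (the sample size 168,656 and the expected count 658.8125 come from the thread; the trial count and seed are arbitrary):

```python
import random

def mean_deviation(counts, expected):
    """Average absolute difference between each bin's count and the expected count."""
    return sum(abs(c - expected) for c in counts) / len(counts)

random.seed(12345)               # fixed seed so runs are reproducible
N = 168656                       # number of bytes in the OP's sample
EXPECTED = N / 256               # 658.8125 per bin

trials = []
for _ in range(5):               # a few simulated "truly random" data sets
    counts = [0] * 256
    for _ in range(N):
        counts[random.randrange(256)] += 1
    trials.append(mean_deviation(counts, EXPECTED))

typical = sum(trials) / len(trials)
print(f"typical mean deviation for random data: {typical:.1f}")
```

Comparing the observed mean deviation against this simulated typical value would be more informative than a fixed guess, since even perfectly random data never gives a mean deviation of exactly zero.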
 