Comparing Distributions of Datasets

Furn · Apr 5, 2019

I've been asked to find the best method for comparing the distributions of a number of datasets that have small sample sizes. Bonus points for a solution whose result is on a scale of 0-1, i.e. a value approaching 1 means the distribution is bordering on perfectly unequal and a value approaching 0 means it is bordering on perfectly equal.

Some examples within this dataset include:

Sample A: [10,1]
Sample B: [10,1,1]
Sample C: [4,4,3,2,2]

In other words, the method used should give A a value close to 1 (almost perfectly unevenly distributed), B a value close to 1 but further from 1 than A's, and C a value closer to 0 (quite an equal distribution).

I first thought of the Gini coefficient, which is precisely about distribution and gives values between 0 and 1. However, it seems the Gini has a 'small-sample bias' that limits its use here, where each of the datasets has between 1 and c. 10 values.
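
For reference, here is a minimal sketch of the Gini coefficient computed from the mean absolute difference, together with the commonly used n/(n-1) rescaling for small samples. The function names are hypothetical, the values are assumed to be non-negative, and returning 0 for a single-value dataset is only a convention chosen for this sketch. Whether the rescaling is adequate for this use case is an open question, not a settled answer.

```python
def gini(values):
    """Plain Gini coefficient: sum of |x_i - x_j| over all ordered pairs,
    divided by 2 * n^2 * mean (equivalently 2 * n * total)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


def gini_small_sample(values):
    """Gini rescaled by n / (n - 1), so that the most unequal case
    (one non-zero value, all others zero) scores exactly 1 for any n."""
    n = len(values)
    if n < 2:
        # Assumption for this sketch: a single value counts as perfectly equal.
        return 0.0
    return gini(values) * n / (n - 1)


samples = {"A": [10, 1], "B": [10, 1, 1], "C": [4, 4, 3, 2, 2]}
for name, data in samples.items():
    print(name, round(gini(data), 3), round(gini_small_sample(data), 3))
# A 0.409 0.818
# B 0.5 0.75
# C 0.16 0.2
```

On the three samples above, the rescaled value puts A closest to 1, B slightly lower, and C near 0, whereas the plain Gini ranks B above A because its maximum for n values is only (n-1)/n.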

I then considered the coefficient of variation; however, given that its results can go higher than 1, it also isn't well suited to this problem.
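
One possible way around the upper-bound issue, sketched here under stated assumptions: for non-negative values with a positive mean, the coefficient of variation based on the population standard deviation cannot exceed sqrt(n-1), so dividing by that bound gives a number between 0 and 1. The function name is hypothetical, and treating a single-value dataset as 0 is a convention chosen for this sketch.

```python
import math


def normalized_cv(values):
    """Population coefficient of variation divided by its maximum,
    sqrt(n - 1), which is reached when one value holds the whole total.
    Assumes non-negative values with a positive mean."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    if n < 2:
        # Assumption for this sketch: a single value counts as perfectly equal.
        return 0.0
    mean = sum(values) / n
    if mean <= 0:
        raise ValueError("mean must be positive")
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    return math.sqrt(variance) / mean / math.sqrt(n - 1)


samples = {"A": [10, 1], "B": [10, 1, 1], "C": [4, 4, 3, 2, 2]}
for name, data in samples.items():
    print(name, round(normalized_cv(data), 3))
# A 0.818
# B 0.75
# C 0.149
```

For the three samples this gives the ordering asked for (A highest, then B, then C near 0), though that alone does not show it behaves well across all datasets.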

Any pointers would be greatly appreciated!

Furn · Apr 5, 2019

To be a bit clearer about the example I gave:

Dataset A contains the values of 10 and 1
Dataset B contains the values of 10, 1, and 1
Dataset C contains the values of 4, 4, 3, 2, and 2

What I'm looking for is some way to summarise the distribution of each dataset as a number between 0 and 1. Dataset A would get a result close to 1, meaning it is quite unevenly distributed. Dataset C would get a result close to 0, as it is pretty evenly distributed. Dataset B would fall somewhere in between, although closer to 1 than to 0.
