Stamilarity

The stamilarity module provides tools to test the STAtistical siMILARITY of two samples, i.e. to test whether both samples were generated from the same distribution.

stamilarity.similar(*args, distrib=None, continuous=False)

Return the p-value of the hypothesis that all samples were drawn from the same distribution.

Unless distrib is given, we use the empirical distribution of the union of all the samples as the theoretical discrete distribution in our test's hypothesis.
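
For instance, such an empirical distribution could be computed along these lines (a sketch, not stamilarity's actual code; the empirical_distrib helper name is hypothetical):

>>> from collections import Counter
>>> def empirical_distrib(*samples):
...     # Relative frequency of each category over the pooled samples.
...     counts = Counter()
...     for sample in samples:
...         counts.update(sample)
...     total = sum(counts.values())
...     return {cat: n / total for cat, n in counts.items()}
...
>>> empirical_distrib([0, 1, 1], [1, 0])
{0: 0.4, 1: 0.6}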

If continuous is True, we use a Kolmogorov-Smirnov test.

If continuous is False, samples are treated as drawn from a categorical random variable.

If the samples appear to be drawn from a binary set, we use a binomial test. Otherwise we run a chi-square (χ²) test.
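
Roughly, that dispatch could look like the following sketch (an illustration only, not stamilarity's actual implementation). It relies on scipy.stats (scipy >= 1.7 for binomtest) and on the hypothetical empirical_distrib helper sketched above, and handles a single experimental sample for brevity:

>>> from collections import Counter
>>> from scipy import stats
>>> def dispatch_sketch(sample, other=None, distrib=None, continuous=False):
...     if continuous:
...         # Continuous data: two-sample Kolmogorov-Smirnov test.
...         return stats.ks_2samp(sample, other).pvalue
...     if distrib is None:
...         distrib = empirical_distrib(sample, other)  # pooled empirical distribution
...     counts = Counter(sample)
...     n = len(sample)
...     if len(distrib) == 2:
...         # Binary set: binomial test on one of the two categories.
...         cat = next(iter(distrib))
...         return stats.binomtest(counts[cat], n, distrib[cat]).pvalue
...     # Otherwise: chi-square test of observed vs. expected counts.
...     f_obs = [counts[cat] for cat in distrib]
...     f_exp = [n * p for p in distrib.values()]
...     return stats.chisquare(f_obs, f_exp).pvalue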

Parameters:

args : iterables

The experimental samples.

distrib : dict or None

The theoretical distribution, as a dict. If None, the empirical distribution will be computed from the union of the iterables in args.

continuous : bool

If True, use a statistical test suited to continuous distributions.

Returns:

p : float

The p-value of the hypothesis that the samples are drawn from the same distribution.

See also

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare

References

P-values are a frequently misunderstood concept. Please make sure you know how to interpret the results. A good starting point is [R1].

[R1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895822/

Examples

>>> import random
>>> import stamilarity

Binary distributions

Compare two samples from the same binary distribution:

>>> fair_coin1 = [1 if random.random()>.5 else 0 for i in range(10000)]
>>> random.seed(123)
>>> fair_coin2 = [1 if random.random()>.5 else 0 for i in range(10000)]
>>> p = stamilarity.similar(fair_coin1, fair_coin2)
>>> p > .05  # Fair coins
True

Compare a sample against a theoretical binary distribution:

>>> random.seed(789)
>>> biased_coin = [1 if random.random()>.6 else 0 for i in range(10000)]
>>> p = stamilarity.similar(biased_coin, distrib={1: .4, 0: .6})
>>> p > .05
True

Compare two dissimilar binary samples:

>>> p = stamilarity.similar(fair_coin1, biased_coin)
>>> p > .05  # Detect a biased coin
False

Categorical distributions

Compare two samples from the same categorical distribution:

>>> fair_dice1 = [random.choice(range(6)) for i in range(10000)]
>>> random.seed(456)
>>> fair_dice2 = [random.choice(range(6)) for i in range(10000)]
>>> p = stamilarity.similar(fair_dice1, fair_dice2)
>>> p > .05  # Fair dice
True

Compare a sample against a theoretical categorical distribution:

>>> biased_dice = [random.choice(range(6)) if random.random()>.1 else 0
...                for i in range(10000)]
>>> p = stamilarity.similar(biased_dice, distrib={0: .25, 1: .15, 2: .15,
...                                               3: .15, 4: .15, 5: .15})
>>> p > .05
True

Compare two dissimilar categorical samples:

>>> p = stamilarity.similar(fair_dice1, biased_dice)
>>> p > .05  # Detect an unfair die
False

When one category is vastly underrepresented (less than 5%), the user is warned.

>>> small_cat_sample = [random.choice(range(7)) for i in range(100)]
>>> p = stamilarity.similar(small_cat_sample, distrib={0: .16, 1: .16, 2: .16,
...                                                    3: .16, 4: .16, 5: .16,
...                                                    6: .04})
Some frequencies are too small, results of chisquare may be innacurate.
>>> p > .05
False
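
The 5% rule of thumb behind that warning is easy to check by hand; a minimal sketch (the too_small helper is hypothetical, not part of stamilarity):

>>> def too_small(distrib, threshold=.05):
...     # Categories whose expected relative frequency falls below the threshold.
...     return [cat for cat, p in distrib.items() if p < threshold]
...
>>> too_small({0: .16, 1: .16, 2: .16, 3: .16, 4: .16, 5: .16, 6: .04})
[6]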

Continuous distributions

Compare two samples from the same continuous distribution:

>>> sample1 = [random.random() for i in range(10000)]
>>> random.seed(4242)
>>> sample2 = [random.random() for i in range(10000)]
>>> p = stamilarity.similar(sample1, sample2, continuous=True)
>>> p > .05  # Same distrib
True

Compare multiple samples from the same continuous distribution:

>>> sample3 = [random.random() for i in range(10000)]
>>> sample4 = [random.random() for i in range(10000)]
>>> p = stamilarity.similar(sample1, sample2, sample3,
... sample4, continuous=True)
>>> p > .05  # Same distrib
True

Comparing a sample against a theoretical continuous distribution is not implemented yet.
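
As a possible workaround outside stamilarity, scipy's one-sample Kolmogorov-Smirnov test accepts a theoretical CDF directly; here against the uniform distribution sample1 was drawn from:

>>> from scipy import stats
>>> p = stats.kstest(sample1, stats.uniform.cdf).pvalue  # one-sample KS vs. U(0, 1)
>>> p > .05
True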

Compare two dissimilar samples:

>>> def bias():
...    # Redraw once when the value falls below .1, making low values rarer.
...    a = random.random()
...    if a < .1:
...        a = random.random()
...    return a
...
>>> biased_sample = [bias() for i in range(10000)]
>>> p = stamilarity.similar(sample1, biased_sample, continuous=True)
>>> p > .05  # Detect discrepancy
False

Comparing multiple samples, one of which is biased:

>>> p = stamilarity.similar(sample1, sample2, sample3, sample4,
... biased_sample, continuous=True)
>>> p > .05  # Detect anomaly
False
