Stamilarity
The stamilarity module provides tools to test the STAtistical siMILARITY of two samples, i.e. to test whether both samples were generated from the same distribution.
stamilarity.similar(*args, distrib=None, continuous=False)
Return the p-value of the hypothesis that all samples were sampled from the same distribution.
Unless distrib is given, we use the union of all the samples as the theoretical discrete distribution in our test’s hypothesis.
If continuous is True, we use a Kolmogorov-Smirnov test.
If continuous is False, samples are treated as drawn from a categorical random variable. If the samples appear to be drawn from a binary set, we use a binomial test; otherwise we run a chi-square (χ²) test.
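The dispatch described above (Kolmogorov-Smirnov for continuous data, binomial for binary data, chi-square otherwise) can be sketched for the two-sample case with scipy.stats. This is an illustrative sketch, not stamilarity's actual implementation; the function name `similar_sketch` is hypothetical.

```python
from collections import Counter

from scipy import stats


def similar_sketch(a, b, continuous=False):
    """Illustrative two-sample dispatcher: KS, binomial, or chi-square.

    Hypothetical sketch of the dispatch logic described above; not the
    actual stamilarity implementation.
    """
    if continuous:
        # Two-sample Kolmogorov-Smirnov test
        return stats.ks_2samp(list(a), list(b)).pvalue
    pooled = list(a) + list(b)
    categories = sorted(set(pooled))
    if len(categories) <= 2:
        # Binary data: binomial test of sample a's success count
        # against the success frequency in the pooled data
        success = categories[-1]
        k = sum(1 for x in a if x == success)
        p = pooled.count(success) / len(pooled)
        return stats.binomtest(k, n=len(a), p=p).pvalue
    # Categorical data: chi-square of observed counts in a against counts
    # expected under the pooled (empirical) distribution
    counts_a = Counter(a)
    counts_pooled = Counter(pooled)
    observed = [counts_a[c] for c in categories]
    expected = [counts_pooled[c] / len(pooled) * len(a) for c in categories]
    return stats.chisquare(observed, f_exp=expected).pvalue
```

Note that using the pooled samples as the reference distribution mirrors the "union of all the samples" behaviour described above.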
Parameters:
    *args : iterables
        The experimental samples.
    distrib : dict or None
        The theoretical distribution, as a dict mapping each value to its probability. If None, the empirical distribution is computed from the union of the iterables in args.
    continuous : bool
        If True, use a statistical test that works with continuous distributions.
Returns:
    p : float
        The p-value of the hypothesis that the samples are drawn from the same distribution.
See also
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare
References
P-values are very often a misunderstood concept. Please make sure you know how to interpret the results. A good starting point is [R1].
[R1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895822/
Examples
>>> import random
>>> import stamilarity
Binary distributions
Compare two samples from the same binary distribution:
>>> fair_coin1 = [1 if random.random() > .5 else 0 for i in range(10000)]
>>> random.seed(123)
>>> fair_coin2 = [1 if random.random() > .5 else 0 for i in range(10000)]
>>> p = stamilarity.similar(fair_coin1, fair_coin2)
>>> p > .05  # Fair coins
True
Compare a sample against a theoretical binary distribution:
>>> random.seed(789)
>>> biased_coin = [1 if random.random() > .6 else 0 for i in range(10000)]
>>> p = stamilarity.similar(biased_coin, distrib={1: .4, 0: .6})
>>> p > .05
True
Compare two dissimilar binary samples:
>>> p = stamilarity.similar(fair_coin1, biased_coin)
>>> p > .05  # Detect a biased coin
False
Categorical distributions
Compare two samples from the same categorical distribution:
>>> fair_dice1 = [random.choice(range(6)) for i in range(10000)]
>>> random.seed(456)
>>> fair_dice2 = [random.choice(range(6)) for i in range(10000)]
>>> p = stamilarity.similar(fair_dice1, fair_dice2)
>>> p > .05  # Fair dice
True
Compare a sample against a theoretical categorical distribution:
>>> biased_dice = [random.choice(range(6)) if random.random() > .1 else 0 for i in range(10000)]
>>> p = stamilarity.similar(biased_dice, distrib={0: .25, 1: .15, 2: .15, 3: .15, 4: .15, 5: .15})
>>> p > .05
True
Compare two dissimilar categorical samples:
>>> p = stamilarity.similar(fair_dice1, biased_dice)
>>> p > .05  # Detect an unfair die
False
When one category is vastly underrepresented (less than 5%), the user is warned.
>>> small_cat_sample = [random.choice(range(7)) for i in range(100)]
>>> p = stamilarity.similar(small_cat_sample, distrib={0: .16, 1: .16, 2: .16, 3: .16, 4: .16, 5: .16, 6: .04})
Some frequencies are too small, results of chisquare may be innacurate.
>>> p > .05
False
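The warning above presumably reflects the standard rule of thumb that the chi-square approximation degrades when a category's expected count is small. A minimal stdlib helper showing one way to flag such categories (the function name and the threshold of 5 are assumptions, not part of stamilarity):

```python
def low_expected_counts(distrib, n, threshold=5):
    """Flag categories whose expected count n * p falls below a
    rule-of-thumb threshold (commonly 5) for the chi-square test.

    distrib maps each category to its theoretical probability, as in
    the distrib argument above; n is the sample size. Hypothetical
    helper for illustration only.
    """
    return sorted(c for c, p in distrib.items() if n * p < threshold)
```

With the distribution from the example above and n = 100, category 6 has an expected count of 100 * .04 = 4, below the threshold, which is exactly the situation that triggers the warning.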
Continuous distributions
Compare two samples from the same continuous distribution:
>>> sample1 = [random.random() for i in range(10000)]
>>> random.seed(4242)
>>> sample2 = [random.random() for i in range(10000)]
>>> p = stamilarity.similar(sample1, sample2, continuous=True)
>>> p > .05  # Same distrib
True
Compare multiple samples from the same continuous distribution:
>>> sample3 = [random.random() for i in range(10000)]
>>> sample4 = [random.random() for i in range(10000)]
>>> p = stamilarity.similar(sample1, sample2, sample3,
...                         sample4, continuous=True)
>>> p > .05  # Same distrib
True
Comparing a sample against a theoretical distribution is not implemented yet.
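In the meantime, scipy's one-sample Kolmogorov-Smirnov test can serve as a workaround for comparing a continuous sample against a theoretical distribution. This uses scipy.stats directly and is not part of stamilarity:

```python
import random

from scipy import stats

random.seed(4242)
sample = [random.random() for _ in range(10000)]

# One-sample KS test of the sample against the uniform(0, 1) CDF
p_uniform = stats.kstest(sample, stats.uniform(loc=0, scale=1).cdf).pvalue

# A clearly non-uniform sample should yield a tiny p-value
skewed = [x ** 2 for x in sample]
p_skewed = stats.kstest(skewed, stats.uniform(loc=0, scale=1).cdf).pvalue
```

Any distribution with a known CDF can be passed in place of the uniform one.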
Compare two dissimilar samples:
>>> def bias():
...     a = random.random()
...     if a < .1:
...         a = random.random()
...     return a
...
>>> biased_sample = [bias() for i in range(10000)]
>>> p = stamilarity.similar(sample1, biased_sample, continuous=True)
>>> p > .05  # Detect discrepancy
False
Comparing multiple samples, one of which is biased:
>>> p = stamilarity.similar(sample1, sample2, sample3, sample4,
...                         biased_sample, continuous=True)
>>> p > .05  # Detect anomaly
False