Sampling Samples

sampling-samples

1566446641

If you had a set of p90 samples, how would you get the p90 of the original data?
Would you take the p90 of your p90 samples?  Would you take the p50 (median) of your set of p90 samples?
It turns out that neither is correct, and it's actually impossible to reliably recover original
percentiles from derived percentile data:

If your original data is grouped into one `[10, 10, 10, 10]` sample and many `[0]` samples, your p90 samples
would be `[10, 0, 0, 0, 0...]` and your `p90(p90(data))` would turn out to be `0` (your `p50(p90(data))` would
turn out to be `0` too).  With, for example, 20 `[0]` samples, the p90 of the original 24 values is `10`,
so both estimates miss it.
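
The unequal-group case can be checked with a short sketch. It assumes the nearest-rank percentile
definition and a group count of 20 for the `[0]` samples (the note only says "many"); the `percentile`
helper and variable names are illustrative, not from the original note.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (an assumed definition)."""
    ordered = sorted(values)
    # Rank of the smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Assumption: one [10, 10, 10, 10] group plus 20 single-value [0] groups.
groups = [[10, 10, 10, 10]] + [[0] for _ in range(20)]

all_values = [v for group in groups for v in group]
p90_samples = [percentile(group, 90) for group in groups]

overall_p90 = percentile(all_values, 90)   # p90 of the 24 original values
p90_of_p90s = percentile(p90_samples, 90)  # p90 of the 21 per-group p90s
p50_of_p90s = percentile(p90_samples, 50)  # median of the 21 per-group p90s

print(overall_p90, p90_of_p90s, p50_of_p90s)  # prints: 10 0 0
```

The per-group p90s throw away how many values each group contained, so the single large group counts
no more than any one of the `[0]` groups.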

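Even with equal-sized samples, p90 samples aren't useful for finding an overall p90.  If we had 10
sets of samples, `[1->10]`, `[11->20]`, etc. until `[91->100]` (each holding ten consecutive integers),
our p90 samples would be `[9, 19, ..., 99]` under the nearest-rank definition.  Our overall p90 would be 90,
but our `p90(p90(data))` would be 89.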
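
The equal-sized case can be reproduced the same way. This is a minimal sketch that again assumes the
nearest-rank percentile definition, which matches the 90 and 89 figures above; other definitions, such
as linear interpolation, can give different values for this data. The `percentile` helper is illustrative.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Ten equal-sized groups: [1..10], [11..20], ..., [91..100].
groups = [list(range(start, start + 10)) for start in range(1, 101, 10)]

all_values = [v for group in groups for v in group]
p90_samples = [percentile(group, 90) for group in groups]  # [9, 19, ..., 99]

overall_p90 = percentile(all_values, 90)
p90_of_p90s = percentile(p90_samples, 90)

print(overall_p90, p90_of_p90s)  # prints: 90 89
```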