app/notes/20190822-0404.md from albertyw/albertyw.com

Sampling Samples

sampling-samples

1566446641

If you had a set of p90 samples, how would you get the p90 of the original data?
Would you take the p90 of your p90 samples?  Would you take the p50 (median) of your p90 samples?
It turns out that neither is correct, and it's actually impossible to reliably recover the original
percentiles from derived percentile data:

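If your original data is grouped into one `[10, 10, 10, 10]` sample and many `[0]` samples, your p90 samples
would be `[10, 0, 0, 0, 0...]`, so your `p90(p90(data))` would turn out to be `0` (your `p50(p90(data))` would
turn out to be `0` too).  Yet as long as the 10s make up more than 10% of all the values (with twenty `[0]`
samples, for example), the true p90 of the data is `10`.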
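
As a rough check of the unequal-sample case, here is a minimal Python sketch. It assumes the nearest-rank definition of a percentile and picks twenty `[0]` samples arbitrarily; the `percentile` helper and the variable names are illustrative, not from any library.

```python
def percentile(values, q):
    """Nearest-rank percentile: the value at rank ceil(q * n / 100) in sorted order."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q * n / 100
    return ordered[rank - 1]

# One sample of four 10s plus twenty single-value [0] samples (the count 20 is an assumption).
samples = [[10, 10, 10, 10]] + [[0] for _ in range(20)]
all_values = [v for sample in samples for v in sample]
p90_per_sample = [percentile(sample, 90) for sample in samples]

true_p90 = percentile(all_values, 90)         # p90 of the original data
p90_of_p90s = percentile(p90_per_sample, 90)  # p90(p90(data))
p50_of_p90s = percentile(p90_per_sample, 50)  # p50(p90(data))
print(true_p90, p90_of_p90s, p50_of_p90s)     # prints "10 0 0"
```

Both aggregates of the p90 samples report `0`, even though a sixth of the underlying values are `10`.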

Even with equal-sized samples, p90 samples aren't useful for finding an overall p90.  If we had 10
sets of samples, `[1->10]`, `[11->20]`, etc. until `[91->100]`, our overall p90 would be 90, but our
`p90(p90(data))` would be 89 (using the nearest-rank definition of a percentile).
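
The equal-sized case can be checked the same way. This sketch again assumes the nearest-rank definition; interpolating definitions (such as the default in some numeric libraries) can give different numbers for this particular data, so treat the exact values as tied to that assumption. The `percentile` helper is illustrative.

```python
def percentile(values, q):
    """Nearest-rank percentile: the value at rank ceil(q * n / 100) in sorted order."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q * n / 100
    return ordered[rank - 1]

# Ten equal-sized samples: [1..10], [11..20], ..., [91..100].
samples = [list(range(start, start + 10)) for start in range(1, 101, 10)]
all_values = [v for sample in samples for v in sample]
p90_per_sample = [percentile(sample, 90) for sample in samples]  # [9, 19, ..., 99]

overall_p90 = percentile(all_values, 90)
p90_of_p90s = percentile(p90_per_sample, 90)
print(overall_p90, p90_of_p90s)  # prints "90 89"
```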