You're consulting for an online retailer who has handed you an enormous spreadsheet containing every conceivable measurement of each of their 700,000 customers. Each line in the file contains 600 quantitative features corresponding to one of their customers.
Overwhelmed by the sheer magnitude of information, you decide to do what seems like some modest data trimming by rejecting outliers. You make a histogram of each of the 600 quantitative measurements and label as an outlier any value that falls outside the inner 99% of the distribution. You then go through the list and eliminate every customer who has at least one outlying measurement.
How many customers do you expect to have left to analyze?
Assumptions
The probability that any single one of a customer's measurements is not an outlier is $0.99$. Treating the measurements as independent, the probability that all $600$ of their measurements are not outliers is $0.99^{600}$. This is the probability that any one customer would still be counted by the analysis. Taken over $700{,}000$ customers, the expected number of people that will be counted by the analysis is $0.99^{600} \times 700{,}000$, which is approximately $1684$.
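The arithmetic above is easy to check numerically. The following is a minimal sketch, assuming the 600 measurements are independent and that each has exactly a 1% chance of being flagged; the variable names are illustrative and do not come from the problem.

```python
# Quantities taken from the problem statement.
n_features = 600        # measurements per customer
n_customers = 700_000   # customers in the spreadsheet
p_keep_one = 0.99       # chance a single measurement is NOT an outlier

# Assumption: measurements are independent, so the probabilities multiply.
p_keep_all = p_keep_one ** n_features

# Expected number of customers who survive the trimming.
expected_remaining = p_keep_all * n_customers

print(f"P(customer survives) = {p_keep_all:.6f}")
print(f"Expected customers remaining = {expected_remaining:.1f}")
```

The survival probability comes out near 0.0024, so only about a quarter of one percent of the customers remain, even though each individual cut looked modest.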