The data's too big

Probability Level 4

You're consulting for an online retailer who has handed you an enormous spreadsheet containing every conceivable measurement of each of their 700,000 customers. Each line in the file contains 600 quantitative features corresponding to one of their customers.

Overwhelmed by the sheer magnitude of information, you decide to do what seems like some modest data trimming by rejecting outliers. You make a histogram of each of the 600 quantitative measurements and label as an outlier any value that falls outside the inner 99% of its distribution. You then go through the list and eliminate every customer for whom at least one measurement is an outlier.

How many customers do you expect to have left to analyze?

Assumptions

For simplicity, approximate the histogram of each of the 600 quantitative measurements as a uniform distribution between its minimum and its maximum value.
Assume there is no correlation between a customer being an outlier in one measurement and being an outlier in another.

The answer is 1683.51.

2 solutions

Trevor B.
Apr 1, 2014

The probability that any single one of a customer's measurements is not an outlier is $0.99$. Since the measurements are assumed to be uncorrelated, the probability that none of their $600$ measurements is an outlier is $0.99^{600}$. This is the probability that any one customer would still be counted by the analysis. Taken over $700{,}000$ customers, the expected number of people that will be counted by the analysis is $0.99^{600} \times 700{,}000$, which is approximately $\boxed{1684}$.
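The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch of the calculation; the variable names are illustrative and not part of the original problem.

```python
# Direct evaluation of the expected number of customers left after trimming.
n_customers = 700_000      # customers in the spreadsheet
n_measurements = 600       # quantitative features per customer
p_keep_one = 0.99          # chance one measurement lies in the inner 99%

# With uncorrelated measurements, a customer survives only if all 600 pass.
p_keep_all = p_keep_one ** n_measurements

# Expected survivors = number of customers times the survival probability.
expected_remaining = n_customers * p_keep_all

print(f"P(customer survives all checks) = {p_keep_all:.6f}")
print(f"Expected customers remaining    = {expected_remaining:.2f}")
```

The second printed value should agree with the stated answer of roughly 1684 customers, which is only about 0.24% of the original list.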

Kunal Das
Apr 28, 2014

Suppose there are initially N customers. After the first measurement is checked, 99% of the N customers are expected to be non-outliers.

When the second measure is checked, its outliers are expected to be spread randomly among the customers who were outliers and non-outliers with respect to the first measure. That means 99% of the outliers w.r.t. the first measure will not be outliers w.r.t. the second measure. Similarly, 99% of the non-outliers w.r.t. the first measure will not be outliers w.r.t. the second measure.

So we get N * (0.99)^2 customers who are non-outliers w.r.t. both measures.

If we apply the same pattern to all 600 measures, we get N * (0.99)^600 customers remaining who are non-outliers w.r.t. every measure.

In our case, N = 700,000. Thus the expected number of complete non-outliers is 700,000 * (0.99)^600 ≈ 1683.506.
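The step-by-step argument can also be sanity-checked by simulation. The sketch below is only an illustration under stated assumptions: each measurement is rescaled to a uniform value on [0, 1], so the inner 99% is the interval [0.005, 0.995], and the problem is scaled down to 20,000 customers and 100 measures (hypothetical sizes chosen so that pure Python runs quickly). The function name `simulate_survivors` is made up for this example.

```python
import random


def simulate_survivors(n_customers, n_measurements, seed=0):
    """Count simulated customers with no outlier in any measurement."""
    rng = random.Random(seed)   # seeded so the run is repeatable
    lo, hi = 0.005, 0.995       # inner 99% of a uniform [0, 1] measurement
    survivors = 0
    for _ in range(n_customers):
        # all() stops at the first outlier, i.e. the customer is eliminated.
        if all(lo <= rng.random() <= hi for _ in range(n_measurements)):
            survivors += 1
    return survivors


# Scaled-down sizes (an assumption for speed, not the original 700,000 x 600).
n_customers, n_measurements = 20_000, 100

simulated = simulate_survivors(n_customers, n_measurements)
expected = n_customers * 0.99 ** n_measurements   # N * (0.99)^M

print(f"simulated survivors: {simulated}")
print(f"expected survivors:  {expected:.1f}")
```

The simulated count should land close to N * (0.99)^M, differing only by random fluctuation. The same formula with N = 700,000 and 600 measures gives the answer above.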
