Data Analysis Foundation Quiz #1 Q5

macademianut · December 1, 2024, 10:14pm

Small bit of feedback: I think this question should be clarified with more details, since the “correct” answer only holds if the random variable is continuous and there was no indication of this in the question.

The question gives a list comprised of the following numbers:
1,8,12,17,19,20,21,23,23,24,28,28,36
We are then asked to compare A) The interquartile range, and B) 7. The correct answer is marked as A is greater.

For this question, I naturally interpreted the list as realizations of a discrete random variable X. In that case, the conventional definition of the quantile function is
Q(p) = \inf \{x \in \mathbb{R} : F(x) \geq p\}
This implies that Q(0.25) = 17, Q(0.75) = 24, for an IQR of exactly 7.
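
To make the arithmetic above concrete, here is a minimal sketch of the infimum definition applied to the empirical CDF of the list. The helper name `quantile_inf` is made up for illustration, and the sketch assumes 0 < p <= 1.

```python
import math
from fractions import Fraction

data = [1, 8, 12, 17, 19, 20, 21, 23, 23, 24, 28, 28, 36]

def quantile_inf(values, p):
    """Q(p) = inf{x : F(x) >= p}, with F the empirical CDF of `values`."""
    xs = sorted(values)
    # The smallest rank k with k/n >= p is ceil(p * n); the k-th order
    # statistic is the smallest x whose empirical CDF reaches p.
    k = math.ceil(p * len(xs))
    return xs[k - 1]

q1 = quantile_inf(data, Fraction(1, 4))
q3 = quantile_inf(data, Fraction(3, 4))
print(q1, q3, q3 - q1)  # 17 24 7
```

Using `Fraction` for p keeps `p * n` exact, so the ceiling is not affected by floating-point rounding.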

If X were continuous, however, then it can be shown that
IQR(X) \in (7, 16)
which is what I assume the marked answer intends.

Leaderboard · December 2, 2024, 5:56am

X is discrete, yes, but we follow ETS’ definition: see page 14 of https://www.ets.org/pdfs/gre/gre-math-conventions.pdf. Also, the “Quartile Calculator | Interquartile Range Calculator” page does not agree with you.

Continuous random variables in probability are out-of-scope for the most part on the GRE.
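
For readers who want to reproduce the marked answer, here is a rough sketch. It assumes the convention in question is the common "median of each half" rule, with the overall median left out of both halves when the count is odd. That assumption is consistent with the marked answer, but the linked ETS document is the authority on the exact wording. The function name `quartiles_by_halves` is made up for illustration.

```python
from statistics import median

data = [1, 8, 12, 17, 19, 20, 21, 23, 23, 24, 28, 28, 36]

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: for an odd count, the middle value belongs to neither half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3, q3 - q1)  # 14.5 21 26.0 11.5
```

Under this rule the IQR is 11.5, which is greater than 7.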

macademianut · December 2, 2024, 4:02pm

Got it thanks - I’ll look over that document for ETS conventions.

For others who also come across this question, note that an IQR of exactly 7 is the correct answer when we define quantiles more rigorously (in practice, most scientific libraries e.g. scipy, R, Julia would give you this answer too).

For the GRE, we should treat quantiles using the “high school” definition: if the percentile position does not land exactly on a data point, we interpolate between the two neighboring points. For example, take the list:
X = [1,2,3,4,5]
The 25th percentile is the 1.25th element of the list. A rigorous definition of quantiles would set Q(0.25) = 2, since we guarantee P(X \leq 2) \geq 0.25. The high school definition would set Q(0.25) = 1.5 (the average of the first two elements), which of course is not a valid quantile function since P(X \leq 1.5) = 0.2 < 0.25.
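
As a quick check of the two numbers above, here is a small sketch using Python's standard-library `statistics.quantiles`. The choice of library is an assumption for illustration and is not one of the libraries named in this thread. Its `"exclusive"` method interpolates to 1.5 on this list, while its `"inclusive"` method returns 2.0. The `ecdf` helper is made up for this example.

```python
from statistics import quantiles

xs = [1, 2, 3, 4, 5]

def ecdf(values, x):
    """Empirical CDF: the fraction of values that are <= x."""
    return sum(v <= x for v in values) / len(values)

q1_exclusive = quantiles(xs, n=4, method="exclusive")[0]
q1_inclusive = quantiles(xs, n=4, method="inclusive")[0]

print(q1_exclusive, ecdf(xs, q1_exclusive))  # 1.5 0.2
print(q1_inclusive, ecdf(xs, q1_inclusive))  # 2.0 0.4
```

The empirical CDF at 1.5 is 0.2, below 0.25, while at 2 it is 0.4.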

cylverixxx · December 2, 2024, 5:58pm

Most of what you’ve mentioned is correct, but for something like computing Q1 in [1,2,3,4,5,6], most definitions lead to 3.5. I agree that all values between 3 and 4 satisfy the condition, but since we want a unique output we interpolate between data points. This approach is certainly not exclusive to the GRE.

Scroll down to computing methods for discrete distributions

macademianut · December 2, 2024, 6:19pm

By Q1 do you mean the 1st quartile (25th percentile)? With the conventional definition,
Q(p) = \inf\{x \in \mathbb{R} : F(x) \geq p\}
and assuming X is discrete, we can see that in the empirical support of the given list,
F(1) = 0.2, F(2) = 0.4, F(3) = 0.6, ...
With a properly defined quantile function that satisfies Galois conditions, we must have Q(0.25) = 2. This is because F(x) = 0.2 \ \forall x \in [1,2) if X is discrete, and we can’t magically interpolate some x^* \in [1,2) s.t. F(x^*) = 0.25. At least, assuming the F we are discussing here is the empirical distribution.

cylverixxx · December 3, 2024, 12:23pm

A median of a distribution F : \mathbb{R} \to [0, 1] would be any m \in \mathbb{R} that satisfies:

F(m) \geq \frac{1}{2} \quad \text{and} \quad 1 - F(m^-) \geq \frac{1}{2}.

Now let Q(p) = \inf \{ x \in \mathbb{R} : F(x) \geq p \}. Fix any p \in (0, 1) and set q = Q(p). Let x_n \downarrow q. By right continuity of F we have F(x_n) \to F(q) \geq p. Furthermore, by definition of q we see that whenever x < q, then F(x) < p. So if x_n \uparrow q, then \lim F(x_n) \leq p which shows F(q^-) \leq p and thus 1 - F(q^-) \geq 1 - p.

In particular, m = Q(1/2) is always a median of F. Similarly, you could define that the “first quartile” is any q \in \mathbb{R} satisfying:

F(q) \geq \frac{1}{4} \quad \text{and} \quad 1 - F(q^-) \geq \frac{3}{4},

which is again satisfied by Q(1/4). Generally, this will give one way to obtain some notion of \text{IQR} via
\text{IQR} = Q(3/4) - Q(1/4).

However, the inequalities that the quartiles have to satisfy do not necessarily have unique solutions.

In particular, for a finite list of data points, you could assign a discrete distribution to the list and the above generalizes. Essentially, if you have a sample \{ X_1, …, X_n\}, then you are trying to find the IQR without knowing the actual distribution of X. In other words, you’re looking for an estimator of Q(3/4) - Q(1/4) from the sample. Some “good estimators” happen to be the discretized methods offered in the wiki (both the high school definition and yours are equally valid), each of which gives an estimate of the IQR of the underlying sampled distribution. One can imagine that a “good estimator” would at the very least be checked for whether its expectation converges to the underlying IQR and for how its error behaves as the sample size increases. As you’d know, it’s not far-fetched to imagine that interpolation also works as a way to improve the quality of an estimator.

Tldr; if you take a discrete distribution on something like {1,2,3,4} then any real number on the interval [2,3) should be a median, so you have as many medians as real numbers. Your “rigorous treatment” only really works nicely for continuous distributions with nice densities, and there’s no unique definition for the discrete case. The methodology you’re using is just the nearest rank definition of quantiles with some rounding, and that isn’t any more “rigorous” than the “high school” definition, so I’m not sure what your caveat was. In fact, the nearest rank definition isn’t even the optimal choice for finite lists of fewer than 100 elements, but that’s beside the point. Anyhow, different definitions lead to different answers, so a disparity in quartile values is no surprise: there are multiple candidate definitions, and none of them pins down a unique value.
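
The non-uniqueness of the median on {1,2,3,4} can be checked directly against the two inequalities. This is a minimal sketch that assumes the uniform empirical distribution on those four points. The helper names (`cdf`, `cdf_left`, `is_quantile`) are made up for illustration.

```python
from fractions import Fraction

def cdf(values, x):
    """F(x): the fraction of values <= x."""
    return Fraction(sum(v <= x for v in values), len(values))

def cdf_left(values, x):
    """F(x^-): the fraction of values strictly below x."""
    return Fraction(sum(v < x for v in values), len(values))

def is_quantile(values, q, p):
    """Check F(q) >= p and 1 - F(q^-) >= 1 - p."""
    return cdf(values, q) >= p and 1 - cdf_left(values, q) >= 1 - p

support = [1, 2, 3, 4]
half = Fraction(1, 2)
for m in (1.5, 2, 2.5, 2.9, 3.5):
    print(m, is_quantile(support, m, half))
```

Here 2, 2.5 and 2.9 all pass both inequalities, while 1.5 and 3.5 fail one of them.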

macademianut · December 3, 2024, 5:09pm

Hmm, interesting argument, and I agree with most of your points. But there are a couple of issues in your post, and if you think about it carefully you’ll find that it actually supports what I was saying. This is because the property of quantile functions (Galois conditions) that you are implicitly using in your argument leads to an a.s. unique way to define quantiles.

As you claimed and proved: A point q \in \mathbb{R} is the first quartile (25th percentile) of a distribution F iff
F(q) \geq \frac{1}{4}, 1- F(q^-) \geq \frac{3}{4}

Consider the list I offered above:
X = [1,2,3,4,5]
alongside the empirical distribution \hat{F}. For the first inequality, \hat{F}(q) \geq \frac{1}{4} \iff q \geq 2. For the second inequality, 1 - \hat{F}(q^-) \geq \frac{3}{4} \iff q \leq 2. To satisfy both inequalities, we must have the unique solution q = 2.
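
The two inequalities for this list can also be checked numerically. The sketch below scans a finite grid of candidate values (the quarter-step grid from 0 to 6 is an arbitrary choice for illustration), so it is a sanity check rather than a proof. The name `satisfies_first_quartile` is hypothetical.

```python
from fractions import Fraction

xs = [1, 2, 3, 4, 5]
p = Fraction(1, 4)

def satisfies_first_quartile(q):
    """Check F(q) >= 1/4 and 1 - F(q^-) >= 3/4 for the empirical CDF of xs."""
    f_at = Fraction(sum(v <= q for v in xs), len(xs))    # F(q)
    f_left = Fraction(sum(v < q for v in xs), len(xs))   # F(q^-)
    return f_at >= p and 1 - f_left >= 1 - p

# Candidates 0, 0.25, 0.5, ..., 6 as exact fractions.
candidates = [Fraction(k, 4) for k in range(25)]
solutions = [q for q in candidates if satisfies_first_quartile(q)]
print(solutions)  # [Fraction(2, 1)]
```

On this grid, q = 2 is the only candidate that passes both inequalities.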

You should see then why the conventional “rigorous” (meaning well-defined) definition of quantiles I mentioned is:
Q_{F}(p) = \inf\{x \in \mathbb{R} : F(x) \geq p\}
where we can also see that Q_{\hat{F}}(\frac{1}{4}) = 2, and Q : [0,1] \to \mathbb{R} is a well-defined function that outputs a unique value for any desired percentile p. In fact, your definition of a quantile is exactly equivalent to Q, in the sense that Q is the unique function that satisfies the inequalities in your proposed definition.

In order for interpolation to work, we must drop this conventional definition of Q, and thereby also relax the inequalities you proposed. The “high school” definition essentially pretends \hat{F} is continuous by interpolating between the points a specific way to avoid the messiness of discrete distributions. Whether we should do this is a different story, and not the purpose of my post.

Side note: As for viewing the interpolation method as an estimator as you suggested, I think that’s a really good way of thinking about it. Just note that this estimator is asymptotically biased and not consistent for any p “in between” points in the support; otherwise it coincides with Q. The Q-definition, viewed as an estimator, is consistent but may have finite-sample issues. In other words, you should be able to see why the definition of Q is the “convention” and what most computing implementations are based on, since it is the most conservative and guarantees that Q(p) is a valid quantile for any p.