1 Introduction
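A metric space is a nonempty set $M$ endowed with a metric, i.e., a function $d\colon M\times M\to[0,\infty)$ such that
- $d(x,y)=0$ if and only if $x=y$ (identity of indiscernibles),
- $d(x,y)=d(y,x)$ (symmetry), and
- $d(x,z)\le d(x,y)+d(y,z)$ (triangle inequality)
for all $x$, $y$, $z\in M$ [13].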
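For all $n\in\mathbb{Z}^+$, define $[n]=\{1,2,\ldots,n\}$. Given $n\in\mathbb{Z}^+$ and oracle access to a metric $d\colon[n]\times[n]\to[0,\infty)$, metric $1$-median asks for $\mathop{\mathrm{argmin}}_{y\in[n]}\sum_{x\in[n]}d(y,x)$, breaking ties arbitrarily. It generalizes the classical median selection on the real line and has a brute-force $\Theta(n^2)$-time algorithm. More generally, metric $k$-median asks for $c_1$, $c_2$, $\ldots$, $c_k\in[n]$ minimizing $\sum_{x\in[n]}\min_{1\le i\le k}d(x,c_i)$. Because $d$ defines $\binom{n}{2}=\Theta(n^2)$ nonzero distances, only $o(n^2)$-time algorithms are said to run in sublinear time [8]. For all $\alpha\ge 1$, an $\alpha$-approximate $1$-median is a point $p\in[n]$ satisfying
$$\sum_{x\in[n]}d(p,x)\le\alpha\cdot\min_{y\in[n]}\sum_{x\in[n]}d(y,x).$$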
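The definitions above can be made concrete with a minimal Python sketch. It assumes points are indexed from 0 to n-1 (rather than 1 to n) and that the metric is supplied as a callable `d(x, y)`; the function names and the toy metric are illustrative and not taken from the paper. The brute-force search issues on the order of n² distance queries.

```python
from typing import Callable

Metric = Callable[[int, int], float]

def distance_sum(d: Metric, n: int, p: int) -> float:
    """Sum of distances from point p to all n points."""
    return sum(d(p, x) for x in range(n))

def brute_force_1_median(d: Metric, n: int) -> int:
    """Return a point minimizing the distance sum (first minimizer on ties)."""
    return min(range(n), key=lambda y: distance_sum(d, n, y))

def is_alpha_approximate(d: Metric, n: int, p: int, alpha: float) -> bool:
    """Check whether p's distance sum is within a factor alpha of the optimum."""
    optimum = min(distance_sum(d, n, y) for y in range(n))
    return distance_sum(d, n, p) <= alpha * optimum

# Toy metric (an assumption for illustration): four points on the real line.
coords = [0, 1, 2, 10]
line_metric = lambda x, y: abs(coords[x] - coords[y])
```

On this toy input the distance sums are 13, 11, 11 and 27, so point 1 is returned as a 1-median and point 0 is 1.2-approximate but not 1.1-approximate.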
For all , metric -median has a Monte Carlo -approximation -time algorithm [8, 9]. Guha et al. [7] show that metric -median has a Monte Carlo, -approximation, -time, -space and one-pass algorithm for all small as well as a deterministic, -approximation, -time, -space and one-pass algorithm. Given points in with , the Monte Carlo algorithms of Kumar et al. [10] find a -approximate -median in time and a -approximate solution to metric -median in time. All randomized -approximation algorithms for metric -median take time [11, 7]. Chang [2] shows that metric -median has a deterministic, -approximation, -time and nonadaptive algorithm for all constants , generalizing the results of Chang [1] and Wu [15]. On the other hand, he disproves the existence of deterministic -approximation -time algorithms for all constants and [3, 4].
In social network analysis, the closeness centrality of a point $v$ is the reciprocal of the average distance from $v$ to all points [14]. So metric $1$-median asks for a point with the maximum closeness centrality. Given oracle access to a graph metric, the Monte Carlo algorithms of Goldreich and Ron [6] and Eppstein and Wang [5] estimate the closeness centrality of a given point and those of all points, respectively.
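As a small illustration of this equivalence, the sketch below computes closeness centrality exactly by querying all distances from a point. It assumes 0-indexed points, at least two points (so that the average distance is nonzero), and a toy metric on the real line; none of these names come from the paper.

```python
from typing import Callable

def closeness_centrality(d: Callable[[int, int], float], n: int, v: int) -> float:
    """Reciprocal of the average distance from v to all n points (needs n >= 2)."""
    total = sum(d(v, x) for x in range(n))
    return n / total

# Toy metric (an assumption for illustration): four points on the real line.
coords = [0, 1, 2, 10]
line_metric = lambda x, y: abs(coords[x] - coords[y])

scores = [closeness_centrality(line_metric, 4, v) for v in range(4)]
# The point of maximum closeness centrality minimizes the distance sum.
most_central = max(range(4), key=lambda v: scores[v])
```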
All known sublinear-time algorithms for metric -median are either deterministic or Monte Carlo, the latter having a positive probability of failure. For example, Indyk's Monte Carlo -approximation algorithm outputs with a positive probability a solution without approximation guarantees. In contrast, we show that metric -median has a randomized algorithm that always outputs a -approximate solution in expected time for all constants . So, excluding the known deterministic algorithms (which are Las Vegas only in the degenerate sense), this paper gives the first Las Vegas approximation algorithm for metric -median with an expected sublinear running time. Note that deterministic sublinear-time algorithms for metric -median can be -approximate but not -approximate for any constant [1, 4]. So our approximation ratio of beats that of any deterministic sublinear-time algorithm. Inheriting Indyk's algorithm, our algorithm outputs a -approximate -median in time with probability for all constants .
Below is our high-level and inaccurate sketch of proof, where , are small constants:
- (i)
Run Indyk's algorithm to find a probably -approximate -median, . Then let be the average distance from to all points.
- (ii)
For all , denote by the open ball with center and radius . Use the triangle inequality (with details omitted here) to show to be a solution no worse than the points in , i.e.,
| | | (1) |
- (iii)
Take a uniformly random bijection . Then observe that
| | | | | (2) |
| | | | | (3) |
where the first (resp., second) inequality follows from the injectivity of (resp., the triangle inequality).
- (iv)
Assume for simplicity. So by inequalities (1)–(3), if the following inequality holds, then it serves as a witness that is -approximate:
| | | (4) |
To guarantee outputting a -approximate -median, output only when inequality (4) holds. Restart from item (i) whenever inequality (4) is false.
More details of item (iv) follow: For a -median of , it will be easy to show
| | | (5) |
When in item (i) is indeed -approximate,
| | | (6) |
Assuming , inequalities (5)–(6) make inequality (4) hold with high probability as long as is highly concentrated around its expectation. The need for such concentration is why we restrict the radius of the codomain of to be in item (iii): large distances ruin concentration bounds. To account for the points in , our witness for the approximation ratio of actually differs slightly from inequality (4), unlike in item (iv).
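The restart strategy in item (iv) follows a generic pattern: repeat a randomized proposal until a check certifies it, so that the output is always certified and only the running time is random. The sketch below shows that pattern only. The callables `propose` and `certify` are hypothetical stand-ins for items (i) and (iv); the actual witness inequality is not implemented here.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def las_vegas(propose: Callable[[random.Random], T],
              certify: Callable[[T], bool],
              rng: random.Random) -> Tuple[T, int]:
    """Repeat a randomized proposal until a check certifies the candidate."""
    attempts = 0
    while True:
        attempts += 1
        candidate = propose(rng)
        if certify(candidate):  # the witness holds, so the output is safe
            return candidate, attempts
        # Otherwise restart with fresh randomness.

# Toy stand-ins (hypothetical): propose a random integer, certify multiples of 7.
result, attempts = las_vegas(lambda r: r.randrange(100),
                             lambda c: c % 7 == 0,
                             random.Random(0))
```

Whatever the random choices are, `result` passes the check; the randomness affects only `attempts`.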
2 Definitions and preliminaries
For a metric space $([n],d)$, $x\in[n]$ and $R>0$, define
$$B(x,R)=\{y\in[n]\mid d(x,y)<R\}$$
to be the open ball with center $x$ and radius $R$. For brevity,
| | |
The pairs in are ordered.
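A minimal sketch of the open ball, read as the set of points at distance strictly less than the radius from the center. Points are 0-indexed and the toy metric is an assumption for illustration.

```python
from typing import Callable, List

def open_ball(d: Callable[[int, int], float], n: int,
              center: int, radius: float) -> List[int]:
    """Points whose distance to `center` is strictly smaller than `radius`."""
    return [y for y in range(n) if d(center, y) < radius]

# Toy metric (an assumption for illustration): four points on the real line.
coords = [0, 1, 2, 10]
line_metric = lambda x, y: abs(coords[x] - coords[y])

tight = open_ball(line_metric, 4, 1, 1)    # excludes points at distance exactly 1
wider = open_ball(line_metric, 4, 1, 2)
```

Because the inequality is strict, the ball of radius 1 around point 1 contains only point 1 itself, while the ball of radius 2 also picks up points 0 and 2.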
An algorithm with oracle access to is denoted by and may query on any for . In this paper, all Landau symbols (such as , , and ) are w.r.t. . The following result is due to Indyk.
Fact 1 ([8, 9]).
For all , metric -median has a Monte Carlo -approximation -time algorithm with a failure probability of at most .
Henceforth, denote Indyk's algorithm in Fact 1 by Indyk median. It is given , and oracle access to a metric . By convention, denote the expected value and the variance of a random variable by and , respectively.
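Chebyshev's inequality ([12]).
Let $X$ be a random variable with a finite expected value and a finite nonzero variance. Then for all real $k>0$,
$$\Pr\left[\,\left|X-\mathrm{E}[X]\right|\ge k\sqrt{\mathrm{var}(X)}\,\right]\le\frac{1}{k^2}.$$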
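Chebyshev's inequality can be checked exactly on a small finite distribution. The sketch below uses a fair six-sided die, a toy distribution chosen for illustration, and avoids square roots by comparing squared deviations against k² times the variance.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Exact mean and variance of a finite distribution {value: probability}."""
    mu = sum(p * x for x, p in pmf.items())
    var = sum(p * (x - mu) ** 2 for x, p in pmf.items())
    return mu, var

def tail_probability(pmf, k):
    """Pr[|X - E[X]| >= k * sqrt(var(X))], computed without square roots."""
    mu, var = mean_and_variance(pmf)
    return sum(p for x, p in pmf.items() if (x - mu) ** 2 >= k * k * var)

# Fair six-sided die (a toy distribution chosen for illustration).
die = {Fraction(x): Fraction(1, 6) for x in range(1, 7)}
for k in (Fraction(1), Fraction(3, 2), Fraction(2)):
    # The tail probability never exceeds the Chebyshev bound 1/k^2.
    assert tail_probability(die, k) <= 1 / (k * k)
```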
4 Probability of termination in any iteration
This section analyzes the probability of running line 7 in any particular iteration of the while loop of Las Vegas median. The following lemma uses an easy averaging argument.
Lemma 5.
| | |
and, therefore,
| | |
Proof.
Clearly,
| | |
Then use line 4 of Las Vegas median. ∎
Henceforth, assume without loss of generality; otherwise, find a -median by brute force. So by Lemma 5. Define
| | | (9) |
to be the average distance in .
Lemma 6.
.
Proof.
By equation (9) and the triangle inequality,
| | | | |
| | | | |
| | | | |
Obviously, the average distance from to the points in is at most that from to all points, i.e.,
| | | (11) |
Inequalities (4)–(11) and line 4 of Las Vegas median complete the proof. ∎
To analyze the probability that the condition in line 6 of Las Vegas median holds, we shall derive a concentration bound for
| | |
whose expected value and variance are examined in the next four lemmas.
Lemma 7.
With expectations taken over ,
| | | (12) |
Proof.
For each , is a uniformly random size- subset of by line 5 of Las Vegas median. Therefore,
| | | | |
| | | | |
| | | | | (14) |
where the second (resp., last) equality follows from the identity of indiscernibles (resp., equation (9) and Lemma 5). Finally, use equations (4)–(14), the linearity of expectation and Lemma 5. ∎
Clearly,
| | | | |
| | | | |
| | | | |
| | | | | (16) |
where the last equality follows from the linearity of expectation and the separation of pairs according to whether .
Lemma 8.
With expectations taken over ,
| | |
Proof.
Pick any distinct , . By line 5 of Las Vegas median,
| | |
is a uniformly random size- subset of . So
| | | | |
| | | | |
| | | | |
Clearly,
| | | | |
| | | | |
| | | | |
In summary,
| | | | |
| | | | |
| | | | |
| | | | |
Together with Lemma 5 and equation (9), this completes the proof. ∎
Lemma 9.
With expectations taken over ,
| | | (17) |
Proof.
By line 5 of Las Vegas median, is a uniformly random size- subset of for each . Therefore,
| | | | |
| | | | |
| | | | |
| | | | |
For all , ,
| | | (19) |
where the first inequality follows from the triangle inequality.
By equations (9) and (4)–(19), the left-hand side of inequality (17) cannot exceed the optimal value of the following problem, called max square sum:
Find for all , to maximize
| | | (20) |
subject to
| | | (21) |
| | | (22) |
Above, constraint (21) (resp., (22)) mimics equation (9) (resp., inequality (19) and the non-negativeness of distances). Appendix A bounds the optimal value of max square sum from above by
| | |
This evaluates to be at most by Lemma 5. ∎
Recall that the variance of any random variable $X$ equals $\mathrm{E}[X^2]-(\mathrm{E}[X])^2$.
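As a quick sanity check of this identity, the sketch below computes the variance of a fair six-sided die (a toy distribution chosen for illustration) both from the definition and as the mean of the square minus the square of the mean, using exact rational arithmetic.

```python
from fractions import Fraction

def expectation(pmf, f=lambda x: x):
    """Exact expectation of f(X) for a finite distribution {value: probability}."""
    return sum(p * f(x) for x, p in pmf.items())

# Fair six-sided die (a toy distribution chosen for illustration).
die = {x: Fraction(1, 6) for x in range(1, 7)}
mean = expectation(die)
variance_by_definition = expectation(die, lambda x: (x - mean) ** 2)
variance_by_identity = expectation(die, lambda x: x * x) - mean ** 2
```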
Lemma 10.
With variances taken over ,
| | |
Proof.
By equations (4)–(16) and Lemmas 8–9,
| | |
This and Lemma 7 imply
| | |
Finally, invoke Lemma 6. ∎
Lemma 11.
For all ,
| | |
where the probability is taken over .
Proof.
Use Chebyshev's inequality and Lemmas 7 and 10. ∎
Let be a -median of , i.e.,
| | |
breaking ties arbitrarily. So by the averaging argument,
| | | (23) |
Lemma 12.
| | |
Proof.
We have
| | |
Clearly, . ∎
Lemma 13.
For all sufficiently large ,
| | |
Proof.
We have
| | | | |
| | | | |
| | | | |
| | | | |
where the first inequality (resp., the first equality) follows from the triangle inequality (resp., line 4 of Las Vegas median). By Lemmas 6 and 12,
| | | (25) |
By inequalities (4)–(25) and Lemma 5, . ∎
Lemma 14.
For all sufficiently large ,
| | |
Proof.
By the triangle inequality,
| | | | |
| | | | |
| | | | |
Now sum up the above with the inequality in Lemma 12. ∎
Lemma 15.
For all sufficiently large and with probability greater than ,
| | | (26) |
where the probability is taken over and the internal coin tosses of Indyk median in line 3 of Las Vegas median.
Proof.
By Lemma 11 with ,
| | | (27) |
with probability at least . By Fact 1 and line 3 of Las Vegas median,
| | | | | (28) |
| | | | | (29) |
with probability at least . Now by the union bound, inequalities (27)–(29) hold simultaneously with probability at least . It remains to derive inequality (26) from inequalities (27)–(29) for all sufficiently large .
Line 4 of Las Vegas median, inequalities (28)–(29) and Lemma 14 give
| | | (30) |
This and inequality (27) imply
| | | | | (31) |
| | | | |
| | | | |
Clearly, for all sufficiently large . So inequality (31) implies, for all sufficiently large and after laborious calculations,
| | | | |
| | | | |
This implies inequality (26) for all sufficiently large (note that by line 1 of Las Vegas median). ∎
Lemma 15 and lines 6–7 of Las Vegas median show the probability of termination in any iteration to be . Because the proof of Lemma 15 implies that inequalities (26)–(29) hold simultaneously with probability in any iteration of Las Vegas median, it happens with probability that in the first iteration, is returned in line 7 (because of inequality (26)) and is -approximate (because of inequality (28)). So Las Vegas median outputs a -approximate -median with probability in the first iteration. In summary, we have the following.
Lemma 16.
The first iteration of the while loop of Las Vegas median outputs a -approximate -median with probability .
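A per-iteration termination probability that is bounded below by a constant keeps the expected number of iterations bounded. As a hedged illustration, suppose (an idealization, not a statement from the paper) that every iteration terminates independently with the same probability p. The number of iterations is then geometric with mean 1/p, which the sketch below confirms by summing the series exactly.

```python
from fractions import Fraction

def expected_iterations_partial(p: Fraction, terms: int) -> Fraction:
    """Partial sum of sum_{t>=1} t * p * (1-p)^(t-1), which converges to 1/p."""
    return sum(t * p * (1 - p) ** (t - 1) for t in range(1, terms + 1))

# Idealized per-iteration termination probability (an assumption).
p = Fraction(1, 4)
approx = expected_iterations_partial(p, 200)
```

With p = 1/4 the partial sums increase toward 4, so a constant success probability per iteration translates into a constant expected number of restarts.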
5 Putting things together
We now show that metric -median has a Las Vegas -approximation algorithm with an expected running time for all constants . Our algorithm also outputs a -approximate -median in time with probability .
Theorem 17.
For each constant , metric -median has a randomized algorithm that (1) always outputs a -approximate solution in an expected time and that (2) outputs a -approximate solution in time with probability .
Proof.
By Lemma 4, Las Vegas median outputs a -approximate -median at termination. To prevent Las Vegas median from running forever, find a -median by brute force (which obviously takes time) after steps of computation.
By Fact 1, line 3 of Las Vegas median takes time. Line 5 takes time by the Knuth shuffle. Clearly, the other lines also take time. Consequently, each iteration of the while loop of Las Vegas median takes time. By Lemma 15 and lines 6–7, Las Vegas median runs for at most iterations in expectation. So its expected running time is .
Having shown each iteration of Las Vegas median to take time, establish condition (2) of the theorem with Lemma 16. ∎
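The Knuth shuffle mentioned in the proof generates a uniformly random permutation in linear time. The sketch below is a standard version on the indices 0 to n-1; how line 5 of Las Vegas median consumes the permutation is not shown here, and the function name is illustrative.

```python
import random
from typing import List

def knuth_shuffle(n: int, rng: random.Random) -> List[int]:
    """Return a uniformly random permutation of 0..n-1 using n-1 swaps."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
        perm[i], perm[j] = perm[j], perm[i]
    return perm

pi = knuth_shuffle(8, random.Random(1))
```

Each position i is swapped with a uniformly chosen position at or before it, so every one of the n! orderings is equally likely, given an ideal random source.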
By Fact 1, Indyk median satisfies condition (2) in Theorem 17. But it does not satisfy condition (1).
We briefly justify the optimality of the ratio of in Theorem 17. Let be a randomized algorithm that always outputs a -approximate -median. Furthermore, denote by (resp., ) the output (resp., the set of queries as unordered pairs) of , where is the discrete metric (i.e., and for all distinct , ). Without loss of generality, assume for all by adding dummy queries. So knows that
| | | (32) |
Furthermore, assume that never queries for the distance from a point to itself.
In the sequel, consider the case that . By the averaging argument, there exists a point involved in at most queries in . Clearly, cannot exclude the possibility that for all satisfying . In summary, cannot rule out the case that
| | | | | (33) |
Equations (32)–(33) contradict the guarantee that is -approximate. In summary, any randomized algorithm that always outputs a -approximate -median must always make at least queries given oracle access to the discrete metric.
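The argument relies on the algorithm being unable to distinguish the discrete metric from another metric that agrees with it on every queried pair. The sketch below illustrates the flavor of such an alternative. The specific choices are assumptions made for illustration and are not taken from the text above: one point, called `hub`, is touched by no query, and its distance to every other point is 1/2 instead of 1.

```python
from fractions import Fraction
from itertools import product

def is_metric(d, n):
    """Brute-force check of the three metric axioms on points 0..n-1."""
    pts = range(n)
    for x, y in product(pts, repeat=2):
        if (d(x, y) == 0) != (x == y) or d(x, y) != d(y, x):
            return False
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in product(pts, repeat=3))

def discrete(x, y):
    """Discrete metric: 0 on the diagonal, 1 elsewhere."""
    return 0 if x == y else 1

def perturbed(x, y, hub=0):
    """Hypothetical alternative: distances touching `hub` shrink to 1/2."""
    if x == y:
        return Fraction(0)
    return Fraction(1, 2) if hub in (x, y) else Fraction(1)

def distance_sum(d, n, p):
    return sum(d(p, x) for x in range(n))

n = 6
# Ratio between a non-hub point's distance sum and the hub's distance sum.
ratio = distance_sum(perturbed, n, 1) / distance_sum(perturbed, n, 0)
```

In this toy instance the ratio is (2n-3)/(n-1), which equals 9/5 for n = 6 and tends to 2 as n grows, so outputting any non-hub point would be far from optimal under the alternative metric.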
Appendix A Analyzing max square sum
Max square sum has an optimal solution, denoted , because its feasible solutions (i.e., those satisfying constraints (21)–(22)) form a closed and bounded subset of . (Recall from elementary mathematical analysis that a continuous real-valued function on a closed and bounded subset of has a maximum value, where .) Note that must be feasible to max square sum. Below is a consequence of constraint (21).
Lemma A.1.
| | | (34) |
Proof.
Clearly,
| | |
Furthermore, the left-hand side of inequality (34) is an integer. ∎
Lemma A.2.
| | |
Proof.
Assume otherwise. Then
| | | | |
| | | | |
| | | | |
| | | | |
So by constraint (22) (and the feasibility of to max square sum),
| | |
Consequently, there exist distinct , satisfying
| | | (35) |
By symmetry, assume . By inequality (35), there exists a small real number such that increasing by and simultaneously decreasing by will preserve constraints (21)–(22). I.e., the solution defined below is feasible to max square sum:
| | | (39) |
Clearly, objective (20) w.r.t. exceeds that w.r.t. by
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
where the inequality holds because and .
In summary, is a feasible solution achieving a greater objective (20) than the optimal solution does, a contradiction. ∎
We now bound the optimal value of max square sum.
Theorem A.3.
The optimal value of max square sum is at most
| | |
Proof.
W.r.t. the optimal (and thus feasible) solution , objective (20) equals
| | | | |
| | | | |
where if is true and otherwise, for any predicate . Now invoke Lemma A.2. ∎