Ching-Lueh Chang
Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. Email: clchang@saturn.yzu.edu.tw
Supported in part by the Ministry of Science and Technology of Taiwan under grant 105-2221-E-155-047-.
Abstract
Let be a metric space. We analyze the expected value and the variance of for a uniformly random permutation of , leading to the following results:
• Consider the problem of finding a point in with the minimum sum of distances to all points. We show that this problem has a randomized algorithm that (1) always outputs a -approximate solution in expected time and that (2) inherits Indyk's [9, 10] algorithm to output a -approximate solution in time with probability , where .
• The average distance in can be approximated in time to within a multiplicative factor in with probability , where .
• Assume to be a graph metric. Then the average distance in can be approximated in time to within a multiplicative factor in with probability , where .
1 Introduction
A metric space is a nonempty set endowed with a metric, i.e., a function such that
For all , define . Given and oracle access to a metric , metric 1-median asks for , breaking ties arbitrarily. It generalizes the classical median selection on the real line and has a brute-force -time algorithm. More generally, metric $k$-median asks for , , , minimizing . Because defines nonzero distances, only -time algorithms are said to run in sublinear time [9]. For all , an -approximate 1-median is a point satisfying
For all , metric 1-median has a Monte Carlo -approximation -time algorithm [9, 10]. Guha et al. [8] show that metric $k$-median has a Monte Carlo, -approximation, -time, -space and one-pass algorithm for all small as well as a deterministic, -approximation, -time, -space and one-pass algorithm. Given points in with , the Monte Carlo algorithms of Kumar et al. [11] find a -approximate 1-median in time and a -approximate solution to metric $k$-median in time. All randomized -approximation algorithms for metric $k$-median take time [12, 8]. Chang [3] shows that metric 1-median has a deterministic, -approximation, -time and nonadaptive algorithm for all constants , generalizing the results of Chang [2] and Wu [16]. On the other hand, he disproves the existence of deterministic -approximation -time algorithms for all constants and [4, 5].
In social network analysis, the closeness centrality of a point is the reciprocal of the average distance from to all points [15]. So metric 1-median asks for a point with the maximum closeness centrality. Given oracle access to a graph metric, the Monte Carlo algorithms of Goldreich and Ron [7] and Eppstein and Wang [6] estimate the closeness centrality of a given point and those of all points, respectively.
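To make the link between closeness centrality and 1-median concrete, here is a minimal Python sketch. The function names, the convention that the average runs over all points (the point itself included), and the example on the real line are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, Sequence


def closeness_centrality(p: int, points: Sequence[int],
                         d: Callable[[int, int], float]) -> float:
    """Reciprocal of the average distance from p to all points.

    Assumption: the average is taken over all points, p included.
    """
    total = sum(d(p, q) for q in points)
    if total == 0:
        return float("inf")  # degenerate one-point space
    return len(points) / total


def most_central(points: Sequence[int],
                 d: Callable[[int, int], float]) -> int:
    """A point of maximum closeness centrality, i.e., a 1-median (brute force)."""
    return max(points, key=lambda p: closeness_centrality(p, points, d))


# Hypothetical example: four points on the real line with d(x, y) = |x - y|.
points = [0, 1, 2, 10]
line_metric = lambda x, y: abs(x - y)
best = most_central(points, line_metric)
```

Maximizing the reciprocal of the average distance is the same as minimizing the sum of distances, which is why the two problems coincide.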
All known sublinear-time algorithms for metric 1-median are either deterministic or Monte Carlo, the latter having a positive probability of failure. For example, Indyk's Monte Carlo -approximation algorithm outputs with a positive probability a solution without approximation guarantees. In contrast, we show that metric 1-median has a randomized algorithm that always outputs a -approximate solution in expected time for all . So, excluding the known deterministic algorithms (which are Las Vegas only in the degenerate sense), this paper gives the first Las Vegas approximation algorithm for metric 1-median with an expected sublinear running time. Note that deterministic sublinear-time algorithms for metric 1-median can be -approximate but not -approximate for any constant [2, 5]. So our approximation ratio of beats that of any deterministic sublinear-time algorithm. Inheriting Indyk's algorithm, our algorithm outputs a -approximate 1-median in time with probability for all .
Indyk [9, 10] gives a Monte Carlo -time algorithm that approximates the average distance in any metric space to within a multiplicative factor in , for all . Barhum, Goldreich and Shraibman [1] improve Indyk's time complexity of to . This paper gives a Monte Carlo -time algorithm that approximates the average distance in to within a multiplicative factor in , for all . For all , we present a Monte Carlo -time algorithm approximating the average distance of any graph metric to within a multiplicative factor in . But for general metrics, we do not know whether the running time of Barhum, Goldreich and Shraibman can be improved to .
2 Definitions and preliminaries
For a metric space ,
(1)
(2)
breaking ties arbitrarily in equation (2). So is the average distance in , and is a 1-median.
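As a reference point for the two quantities just defined, the following brute-force sketch computes an average distance and a 1-median with quadratically many distance queries. Two conventions here are assumptions rather than the paper's: points are labeled 0, ..., n-1, and the average is normalized over all n² ordered pairs. The function names are hypothetical.

```python
from typing import Callable, List


def distance_sums(n: int, d: Callable[[int, int], float]) -> List[float]:
    """For each point p, the sum of distances from p to all points."""
    return [sum(d(p, q) for q in range(n)) for p in range(n)]


def brute_force_1_median(n: int, d: Callable[[int, int], float]) -> int:
    """A point with the minimum sum of distances; ties go to the smallest index."""
    sums = distance_sums(n, d)
    return min(range(n), key=sums.__getitem__)


def average_distance(n: int, d: Callable[[int, int], float]) -> float:
    """Average of d over all ordered pairs (assumed normalization: n * n)."""
    return sum(distance_sums(n, d)) / (n * n)


# Hypothetical example: points 0..3 placed at these coordinates on a line.
coords = [0, 1, 2, 10]
d_line = lambda i, j: abs(coords[i] - coords[j])
median = brute_force_1_median(4, d_line)
avg = average_distance(4, d_line)
```

The sublinear-time algorithms discussed in this paper are useful precisely because they avoid this exhaustive enumeration.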
An algorithm with oracle access to is denoted by and may query on any for . In this paper, all Landau symbols (such as , , and ) are w.r.t. . The following result is due to Indyk.
Fact 1.
For all , metric 1-median has a Monte Carlo -approximation -time algorithm with a failure probability of at most .
Henceforth, denote Indyk's algorithm in Fact 1 by Indyk median. It is given , and oracle access to a metric . The following fact on estimating the average distance is due to Barhum, Goldreich and Shraibman.
This and Lemma 5 imply . So the left-hand side of inequality (5) is at least . ∎
Lemma 7.
For all and in each iteration of the while loop of Las Vegas median,
(7)
where the probability is taken over , , , and the random coin tosses of Indyk median.
Proof.
By Fact 1 and line 2 of Las Vegas median, the first condition within in equation (7) holds with probability at least over the random coin tosses of Indyk median. By Lemma 6, the second condition holds with probability at least over , , , . In summary, the first two conditions hold simultaneously with probability at least (note that the random coin tosses of Indyk median are independent of , , , ). Finally, the first two conditions together imply the third by inequality (3) and the easy fact that
∎
Theorem 8.
For all , metric 1-median has a randomized algorithm that (1) always outputs a -approximate solution in expected time and (2) outputs a -approximate solution in time with probability .
Proof.
By Lemma 7, each execution of lines 4–5 of Las Vegas median returns with probability . So the expected number of iterations is . By Fact 1, line 2 takes time. Line 3 takes time by the Knuth shuffle. Clearly, lines 4–5 take time. In summary, the expected running time of Las Vegas median is . To prevent Las Vegas median from running forever, find a 1-median by brute force (which obviously takes time) after steps of computation. By Lemma 3, Las Vegas median is -approximate.
By Lemma 7, is -approximate and is also returned in line 5 with probability in the first (in fact, any) iteration. Finally, the previous paragraph has shown each iteration to take time. ∎
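The step from a per-iteration success probability to an expected iteration count in the proof above is the standard geometric bound. As a sketch, write $T$ for the number of iterations and $p>0$ for a lower bound on the probability that an iteration returns, regardless of earlier iterations; both symbols are introduced here for illustration and do not reproduce the paper's concrete bound. Then

```latex
\[
\operatorname{E}[T]
  \;=\; \sum_{t=1}^{\infty} \Pr[\,T \ge t\,]
  \;\le\; \sum_{t=1}^{\infty} (1-p)^{t-1}
  \;=\; \frac{1}{p}.
\]
```

In particular, a constant lower bound on $p$ gives a constant expected number of iterations, so the expected running time is dominated by the cost of a single iteration.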
By Fact 1, Indyk median satisfies condition (2) in Theorem 8. But it does not satisfy condition (1).
We now justify the optimality of the ratio of in Theorem 8. Let be a randomized algorithm that always outputs a -approximate 1-median. Furthermore, denote by (resp., ) the output (resp., the set of queries as unordered pairs) of , where is the discrete metric (i.e., and for all distinct , ). Without loss of generality, assume for all by adding dummy queries. So the queries in witness that
(8)
Assume without loss of generality that never queries for the distance from a point to itself.
In the sequel, consider the case that . By the averaging argument, there exists a point involved in at most queries in (note that each query involves two points). Because every function with
satisfies the triangle inequality, cannot exclude the possibility that for all satisfying . In summary, cannot rule out the case that
(9)
Equations (8)–(9) contradict the guarantee that is -approximate. Consequently, the case that should never happen. The next theorem summarizes the above.
Theorem 9.
Metric 1-median has no randomized algorithm that always outputs a -approximate solution and that makes fewer than queries with a positive probability given oracle access to the discrete metric, for any constant .
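The adversary argument behind Theorem 9 can be illustrated with a small Python sketch. It locates a point touched by the fewest queries (the averaging argument) and builds a function that agrees with the discrete metric on every queried pair but shortens the unqueried distances at that point. The shortened value 1/2, the function names and the example query set are assumptions made for illustration; the paper's exact perturbation is not reproduced in the text above.

```python
from collections import Counter
from itertools import permutations


def least_involved_point(n, queries):
    """Return (point, count) for a point touching the fewest queried pairs."""
    count = Counter()
    for u, v in queries:
        count[u] += 1
        count[v] += 1
    p = min(range(n), key=lambda x: count[x])
    return p, count[p]


def perturbed_discrete_metric(queries, p, short=0.5):
    """Discrete metric, except unqueried pairs touching p get distance `short`.

    Assumption: `short` = 1/2 is an illustrative choice, not the paper's.
    """
    asked = {frozenset(q) for q in queries}

    def d(u, v):
        if u == v:
            return 0.0
        if p in (u, v) and frozenset((u, v)) not in asked:
            return short
        return 1.0

    return d


def satisfies_triangle_inequality(n, d):
    """Exhaustively check d(u, w) <= d(u, v) + d(v, w) for distinct u, v, w."""
    return all(d(u, w) <= d(u, v) + d(v, w)
               for u, v, w in permutations(range(n), 3))


# Hypothetical run: 6 points and 4 queried pairs.
n = 6
queries = [(0, 1), (0, 2), (1, 2), (3, 4)]
p, touched = least_involved_point(n, queries)
d_alt = perturbed_discrete_metric(queries, p)
```

Because every nonzero distance lies in {1/2, 1}, the triangle inequality holds automatically, so the queried answers cannot distinguish `d_alt` from the discrete metric.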
Lemmas 4 and 6 yield the following estimation of the average distance.
Theorem 10.
Given , and oracle access to a metric , a real number in can be found in time with probability .
with probability . The Knuth shuffle picks , , , in time. Then the left-hand side of relation (10) can be calculated in time. ∎
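The proof above relies on drawing a prefix of a uniformly random permutation with the Knuth shuffle and evaluating distances on it. The following Python sketch shows one way such a permutation-based statistic can be computed: shuffle the points, pair up consecutive entries, and average the distances of those disjoint pairs. The pairing scheme, the normalization and all names are assumptions for illustration; the exact statistic in relation (10) is not reproduced here.

```python
import random
from typing import Callable, List


def knuth_shuffle(n: int, rng: random.Random) -> List[int]:
    """Uniformly random permutation of 0..n-1 (Knuth / Fisher-Yates shuffle)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def paired_distance_mean(n: int, d: Callable[[int, int], float],
                         num_pairs: int, rng: random.Random) -> float:
    """Mean of d over `num_pairs` disjoint pairs taken from a random permutation."""
    if not 1 <= num_pairs <= n // 2:
        raise ValueError("need 1 <= num_pairs <= n // 2")
    perm = knuth_shuffle(n, rng)
    total = sum(d(perm[2 * i], perm[2 * i + 1]) for i in range(num_pairs))
    return total / num_pairs


# Hypothetical usage with the discrete metric, where every pair has distance 1.
rng = random.Random(0)
discrete = lambda u, v: 0.0 if u == v else 1.0
estimate = paired_distance_mean(10, discrete, 5, rng)
```

Each pair is a uniformly random pair of distinct points, so every term has expectation equal to the average distance over distinct pairs. This sketch shuffles all n entries for simplicity; generating only the needed prefix of the permutation would reduce the cost.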
Note that the estimation of the average distance in Theorem 10 has only one-sided error. The time complexity (resp., approximation ratio) in Theorem 10 is better (resp., worse) than that in Fact 2.
4 Estimating the average distance of a graph metric
Throughout this section, take any less than a small constant, e.g., . Define
(11)
(12)
where is as in equation (2). As , by equation (11).
As in line 1 of average distance in Fig. 2, let be a uniformly random permutation. Clearly,
(15)
where the last equality follows from the linearity of expectation and the separation of pairs according to whether . The next three lemmas analyze the variance of
By equations (1) and (4)–(19), the left-hand side of inequality (17) cannot exceed the optimal value of the following problem, called max square sum:
Find for all , to maximize
(20)
subject to
(21)
(22)
Above, constraint (21) (resp., (22)) mimics equation (1) (resp., inequality (19) and the nonnegativity of distances). Appendix A bounds the optimal value of max square sum from above by
This evaluates to at most
∎
Recall that the variance of any random variable $X$ equals $\operatorname{E}[X^2]-(\operatorname{E}[X])^2$.
We now arrive at an efficient estimation of the average distance on a graph.
Theorem 17.
Given , and oracle access to a graph metric , a real number in can be found in time with probability .
Proof.
Let be an undirected unweighted graph inducing the distance function . Then pick , with , i.e., is a furthest pair of vertices of . Find a simple shortest - path, denoted , in . By equation (12),
(23)
Now,
(24)
where the first inequality (resp., the second equality) follows from the triangle inequality (resp., being a shortest - path). (Footnote 4: It is easy to verify that if and otherwise.) By inequalities (23)–(24),
(25)
Because is a graph metric, for all distinct , . So by equation (12),
for all sufficiently large . (Footnote 5: If , then . Otherwise, for all .) Finally, recall that . By equation (11),
(28)
for all sufficiently large . By inequalities (27)–(28), Lemma 16 with and recalling that ,
(29)
for all sufficiently large . Consequently, the output of line 2 of average distance in Fig. 2 is in with probability . Line 1 takes time by the Knuth shuffle. Clearly, line 2 also takes time. ∎
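The proof uses two properties of graph metrics: distinct vertices are at distance at least 1, and a furthest pair is joined by a shortest path. The sketch below builds a graph metric from an adjacency list via breadth-first search and checks these on a small example. It is only meant for constructing test oracles (the paper's algorithm accesses the metric through an oracle rather than the graph), and all names and the example graph are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in an undirected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_metric(adj):
    """Shortest-path distance function of a connected graph, via one BFS per vertex."""
    table = {u: bfs_distances(adj, u) for u in adj}
    if any(len(row) != len(adj) for row in table.values()):
        raise ValueError("graph must be connected")
    return lambda u, v: table[u][v]


def furthest_pair(adj, d):
    """A pair of vertices at maximum distance (brute force over all pairs)."""
    return max(((u, v) for u in adj for v in adj), key=lambda uv: d(*uv))


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
d_graph = graph_metric(adj)
far = furthest_pair(adj, d_graph)
```

On the path graph the furthest pair consists of the two endpoints, and the shortest path between them visits every vertex.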
The time complexity of in Theorem 17 is independent of . But for general metrics, we do not know whether the time complexity of in Fact 2 can be improved to .
Appendix A Analyzing max square sum
Max square sum has an optimal solution, denoted , because its feasible solutions (i.e., those satisfying constraints (21)–(22)) form a closed and bounded subset of . (Recall from elementary mathematical analysis that a continuous real-valued function on a closed and bounded subset of has a maximum value, where .) Note that must be feasible to max square sum. Below is a consequence of constraint (21).
Lemma A.1.
(30)
Proof.
Clearly,
Furthermore, the left-hand side of inequality (30) is an integer. ∎
Lemma A.2.
Proof.
Assume otherwise. Then
So by constraint (22) (and the feasibility of to max square sum),
Consequently, there exist distinct , satisfying
(31)
By symmetry, assume . By inequality (31), there exists a small real number such that increasing by and simultaneously decreasing by will preserve constraints (21)–(22). I.e., the solution defined below is feasible to max square sum:
(35)
Clearly, objective (20) w.r.t. exceeds that w.r.t. by
where the inequality holds because and .
In summary, is a feasible solution to max square sum achieving a greater objective (20) than the optimal solution does, a contradiction. ∎
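The exchange step in this proof can be checked with a one-line calculation. As a sketch, assume that objective (20) is a sum of squares of the variables (as the name max square sum suggests; the objective itself is not reproduced in the text above), let $a \ge b$ be the two values being modified and let $\delta > 0$ be the shift. These symbols are introduced here for illustration only. The objective then changes by

```latex
\[
(a+\delta)^2 + (b-\delta)^2 - \left(a^2 + b^2\right)
  \;=\; 2\delta\,(a-b) + 2\delta^2
  \;>\; 0 .
\]
```

So moving mass from the smaller value to the larger one strictly increases a sum of squares, which is what contradicts optimality.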
where if is true and otherwise, for any predicate . Now invoke Lemma A.2. ∎
References
[1] K. Barhum, O. Goldreich, and A. Shraibman. On approximating the average distance between points. In Proceedings of the 10th International Workshop on Approximation and the 11th International Workshop on Randomization, and Combinatorial Optimization, pages 296–310, 2007.
[2] C.-L. Chang. Deterministic sublinear-time approximations for metric 1-median selection. Information Processing Letters, 113(8):288–292, 2013.
[3] C.-L. Chang. A deterministic sublinear-time nonadaptive algorithm for metric 1-median selection. Theoretical Computer Science, 602:149–157, 2015.
[4] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. In Proceedings of the 22nd International Computing and Combinatorics Conference, pages 131–142, Ho Chi Minh City, Vietnam, 2016. Full version at https://arxiv.org/abs/1509.05662.
[5] C.-L. Chang. A lower bound for metric 1-median selection. Journal of Computer and System Sciences, 84:44–51, 2017.
[6] D. Eppstein and J. Wang. Fast approximation of centrality. Journal of Graph Algorithms and Applications, 8(1):39–45, 2004.
[7] O. Goldreich and D. Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
[8] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[9] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.
[11] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.
[12] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1–3):35–60, 2004.
[13] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, UK, 1995.
[14] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 3rd edition, 1976.
[15] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[16] B.-Y. Wu. On approximating metric 1-median in sublinear time. Information Processing Letters, 114(4):163–166, 2014.