Deterministic metric 1-median selection with very few queries
(Footnote 1: Part of this paper appears in Proceedings of the 27th International Computing and Combinatorics Conference (COCOON 2021).)
Ching-Lueh Chang
(Footnote 2: Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. clchang@saturn.yzu.edu.tw)
Abstract
Given an $n$-point metric space $(M,d)$, metric 1-median asks for a point $p\in M$ minimizing $\sum_{x\in M} d(p,x)$. We show that for each computable function $f\colon\mathbb{Z}^+\to\mathbb{Z}^+$ satisfying $f(n)=\omega(1)$, metric 1-median has a deterministic, $o(n)$-query, $o(f(n)\cdot\log n)$-approximation and nonadaptive algorithm. Previously, no deterministic $o(n)$-query $o(n)$-approximation algorithms were known for metric 1-median. On the negative side, we prove each deterministic $O(n)$-query algorithm for metric 1-median to be not $(\delta\log n)$-approximate for a sufficiently small constant $\delta>0$. We also refute the existence of deterministic $o(n)$-query $O(\log n)$-approximation algorithms.
1 Introduction
An $n$-point metric space $(M,d)$ is a size-$n$ set $M$ endowed with a distance function $d\colon M\times M\to[0,\infty)$ such that
• $d(x,y)=0$ if and only if $x=y$,
• $d(x,y)=d(y,x)$, and
• $d(x,y)+d(y,z)\ge d(x,z)$ (triangle inequality)
for all $x$, $y$, $z\in M$ [16]. Metric 1-median asks for a point $p\in M$ minimizing $\sum_{x\in M} d(p,x)$. Clearly, it has a brute-force $O(n^2)$-time algorithm. Furthermore, it generalizes the classical median selection [6] and can be generalized further to metric $k$-median clustering. In social network analysis, metric 1-median asks for an actor with the maximum closeness centrality [17]. For all $c\ge 1$, a $c$-approximate 1-median of $(M,d)$ is a point $p\in M$ satisfying $\sum_{x\in M} d(p,x)\le c\cdot\min_{q\in M}\sum_{x\in M} d(q,x)$. By convention, a $c$-approximation algorithm for metric 1-median must output a $c$-approximate 1-median of $(M,d)$. A query inspects $d(x,y)$ for some $x$, $y\in M$. An algorithm is nonadaptive if its $i$th query is independent of the answers to the first $i-1$ queries, for all $i>1$. Write $d_G$ for the distance function induced by an undirected graph $G$.
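As an illustration of these definitions, the brute-force algorithm and the notion of an approximation ratio can be sketched in Python. This is a minimal sketch, not code from the paper: the names `make_oracle`, `brute_force_1_median` and `approximation_ratio` are hypothetical, the points are assumed to be the integers $0,\ldots,n-1$, and the metric is assumed to be given as a distance matrix.

```python
def make_oracle(dist):
    """Wrap a distance matrix so that every inspected pair is recorded (hypothetical helper)."""
    queried = set()

    def d(x, y):
        queried.add((min(x, y), max(x, y)))  # a query inspects d(x, y)
        return dist[x][y]

    return d, queried


def brute_force_1_median(n, d):
    """Return (point, cost) minimizing the sum of distances to all n points."""
    best_point, best_cost = None, float("inf")
    for p in range(n):
        cost = sum(d(p, x) for x in range(n))
        if cost < best_cost:
            best_point, best_cost = p, cost
    return best_point, best_cost


def approximation_ratio(n, d, p):
    """Ratio of p's distance sum to the optimum (assumes the optimum is positive)."""
    _, opt = brute_force_1_median(n, d)
    return sum(d(p, x) for x in range(n)) / opt


# Assumed toy instance: five points on a line at positions 0, 1, 2, 3, 10.
positions = [0, 1, 2, 3, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
d, queried = make_oracle(dist)
median, cost = brute_force_1_median(len(positions), d)
```

On this toy instance the point at position 2 is the 1-median with distance sum 12, whereas the outlying point at position 10 has distance sum 34 and is therefore only a $(34/12)$-approximate 1-median. The brute-force search inspects every pair of points, which is exactly what an algorithm with very few queries cannot afford.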
Indyk [11, 12] gives a Monte Carlo -time -approximation algorithm for metric 1-median, where . His time complexity is optimal w.r.t. . When restricted to , metric 1-median has a Monte Carlo -time -approximation algorithm [14]. The more general $k$-median clustering in metric spaces has streaming approximation algorithms [10], requires time for -approximations [15] and is inapproximable to within unless [13]. For and graph metrics, a well-studied problem is to find the average distance from a query point to a finite set of points [1, 8, 9].
Deterministic -query computation is almost completely understood for metric 1-median: For all constants , the best approximation ratio achievable by deterministic -query and -query algorithms is and , respectively [2, 4, 18]. The same holds with “query” replaced by “time” and regardless of whether the algorithms can be adaptive [2, 4]. In contrast, we study the largely unknown deterministic - or -query computation. An -query algorithm enjoys the strength of ignoring a fraction of points.
It is folklore that every point is an -approximate 1-median. Surprisingly, this is the current best upper bound for deterministic -query algorithms. In particular, no deterministic -query -approximation algorithms are known for metric 1-median. Instead, we give a deterministic, -query, -approximation and nonadaptive algorithm for each computable function satisfying . So, e.g., metric 1-median has a deterministic -query -approximation algorithm for the very slowly growing inverse Ackermann function . Our main technical discovery is that a -approximate 1-median of (where denotes restricted to ) is an -approximate 1-median of , for all and . When is a uniformly random set of a sufficiently large size, an approximate solution to metric $k$-median clustering for is a good one for with high probability [7]. But our discovery is for any and is new.
Chang [3] shows that metric 1-median has a deterministic, -time, -query, -approximation and nonadaptive algorithm, for all . So deterministic -query algorithms can be -approximate for each . Currently, the best lower bound against deterministic -query algorithms is that they cannot be -approximate [4]. So there is a huge gap between Chang's [3] approximation ratio of and the current best lower bound. We close the gap by showing each deterministic -query algorithm for metric 1-median to be not -approximate for a sufficiently small constant (depending on the algorithm). Our approach, sketched below, adversarially answers the queries of a deterministic -query algorithm Alg:
(I) Start with the complete graph on $M$.
(II) Mark all edges in an $O(1)$-regular expander graph as permanent.
(III) Repeat the following:
  (1) Upon receiving a query $(x,y)$, find a shortest $x$-$y$ path $P$ and answer by the length of $P$.
  (2) Mark all edges of $P$ as permanent.
  (3) For each vertex $v$ incident to too many permanent edges, remove all non-permanent edges incident to $v$.
Intuitively, item (III3) keeps degrees small, thus forcing the output of Alg to have a large average distance to other points. Because item (III1) answers a query by the length of a shortest path $P$, items (III2)–(III3) must preserve all edges of $P$ (by marking them as permanent and not removing them) for consistency in answering future queries. Items (I) and (III1)–(III3) follow Chang's [4] paradigm. To prove a lower bound against Alg, we shall make the output of Alg a lot worse than a 1-median, presumably by identifying or planting a vertex with a sufficiently small average distance to other points. However, Chang fails in this respect. We overcome his problem by item (II), which allows a vertex to have an average distance to other vertices.
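The sketched adversary, items (I)–(III), can be illustrated by the following Python sketch. It is a simplification under stated assumptions rather than the adversary analyzed in Section 3: the class name `Adversary` and the parameter `max_permanent_degree` (standing for "too many") are hypothetical, shortest paths are found by plain BFS with ties broken by smallest vertex label, and the example uses a cycle as a stand-in for the expander purely to keep the instance tiny.

```python
from collections import deque


class Adversary:
    """Answers distance queries following items (I)-(III) of the sketch (hypothetical helper)."""

    def __init__(self, n, backbone_edges, max_permanent_degree):
        self.n = n
        # (I) Start with the complete graph on the points 0..n-1.
        self.adj = {v: set(range(n)) - {v} for v in range(n)}
        self.permanent = {v: set() for v in range(n)}
        self.cap = max_permanent_degree
        # (II) Mark all edges of a connected backbone graph as permanent.
        for u, v in backbone_edges:
            self._mark(u, v)

    def _mark(self, u, v):
        self.permanent[u].add(v)
        self.permanent[v].add(u)

    def shortest_path(self, s, t):
        """BFS shortest s-t path in the current graph (connected via the backbone)."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            if v == t:
                break
            for u in sorted(self.adj[v]):
                if u not in parent:
                    parent[u] = v
                    queue.append(u)
        path = [t]
        while path[-1] != s:
            path.append(parent[path[-1]])
        return path[::-1]

    def answer(self, x, y):
        # (III1) Answer by the length of a shortest x-y path.
        path = self.shortest_path(x, y)
        # (III2) Mark all edges of that path as permanent.
        for u, v in zip(path, path[1:]):
            self._mark(u, v)
        # (III3) Strip non-permanent edges from vertices with too many permanent edges.
        for v in range(self.n):
            if len(self.permanent[v]) > self.cap:
                for u in self.adj[v] - self.permanent[v]:
                    self.adj[u].discard(v)
                self.adj[v] = set(self.permanent[v])
        return len(path) - 1


n = 8
cycle = [(i, (i + 1) % n) for i in range(n)]  # assumed stand-in for the expander
adv = Adversary(n, cycle, max_permanent_degree=3)
queries = [(0, 4), (0, 2), (0, 3)]
answers = [adv.answer(x, y) for (x, y) in queries]
```

In this run the first two queries are answered by 1. The second answer pushes vertex 0 above the permanent-degree threshold, so its non-permanent edges are removed and the third query is answered by 2. Since every edge on an answering path is permanent and never removed, each earlier answer still equals the corresponding distance in the final graph.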
An extension of our lower bound forbids each deterministic -query algorithm for metric 1-median to be -approximate for some computable function satisfying . In particular, deterministic -query -approximation algorithms do not exist. Previously, the best lower bound against deterministic -query algorithms is folklore and forbids to be -approximate for some . (Footnote 3: For a sketch of proof, answer all queries of by and put all points not involved in the queries to be extremely close to one another but extremely far away from 's output and from the points involved in the queries.) So previous works do not yet refute the existence of deterministic -query -approximation algorithms, where is the very slowly growing inverse Ackermann function.
Chang's [5] adversarial method shows that metric 1-median has no deterministic -query -approximation algorithms that make each point involved in queries to . But his adversary is rather naïve and does not seem to yield any unconditional lower bound such as ours.
2 Upper bound
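Take an $n$-point metric space $(M,d)$ and a nonempty $S\subseteq M$. Define
$$p^*=\mathop{\mathrm{argmin}}_{p\in M}\sum_{x\in M}d(p,x)\quad\text{and}\quad s^*=\mathop{\mathrm{argmin}}_{s\in S}\sum_{y\in S}d(s,y)$$
to be a 1-median of $(M,d)$ and $(S,d|_{S\times S})$, respectively, breaking ties arbitrarily. Furthermore, pick $u$ and $v$ independently and uniformly at random from $S$. So
$$\mathbb{E}\left[d(u,v)\right]=\frac{1}{|S|^2}\sum_{a,b\in S}d(a,b)$$
is the average distance in $S$.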
Lemma 1.
Proof.
We have
□
Lemma 2.
Proof.
By the optimality of ,
Clearly,
□
For all ,
(1)
The next two lemmas constitute our main discovery.
Lemma 3.
For all and satisfying and , is an -approximate 1-median of .
Theorem 6 ([3]).
For all constants , metric 1-median has a deterministic, -time, -query, -approximation and nonadaptive algorithm.
Below is our main theorem.
Theorem 7.
For each computable function $f\colon\mathbb{Z}^+\to\mathbb{Z}^+$ satisfying $f(n)=\omega(1)$, metric 1-median has a deterministic, $o(n)$-query, $o(f(n)\cdot\log n)$-approximation and nonadaptive algorithm.
Proof.
Take any of size . Applying Theorem 6 to , an -approximate 1-median of can be found deterministically and nonadaptively with queries. By Lemma 5 (with ), is an -approximate 1-median of . □
Taking a very slowly growing $f$ (e.g., the iterated logarithm or the inverse Ackermann function), Theorem 7 allows deterministic $o(n)$-query algorithms to be very close to being $O(\log n)$-approximate.
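The structure of the proof of Theorem 7 (solve the problem on a small subset and output the answer for the whole space) can be sketched as follows. This is only an illustration under stated assumptions: in place of the algorithm of Theorem 6, the sketch swaps in an exact brute-force search over the subset, and the function name `one_median_of_subset` is hypothetical. The subset is assumed to consist of distinct points.

```python
def one_median_of_subset(subset, d):
    """Return (a 1-median of the subset w.r.t. distances inside it, number of queries).

    Brute force over the subset only; this is a stand-in for the algorithm
    of Theorem 6, not that algorithm itself.
    """
    subset = list(subset)
    # Nonadaptive: the queried pairs are fixed before any answer is seen.
    pairs = [(s, t) for i, s in enumerate(subset) for t in subset[i + 1:]]
    table = {}
    for s, t in pairs:
        table[(s, t)] = table[(t, s)] = d(s, t)

    def cost(s):
        return sum(table[(s, t)] for t in subset if t != s)

    return min(subset, key=cost), len(pairs)


# Assumed toy instance: six points on a line; only the first three are ever queried.
positions = [0, 1, 2, 3, 10, 11]
d = lambda x, y: abs(positions[x] - positions[y])
point, num_queries = one_median_of_subset([0, 1, 2], d)
```

Here three queries suffice and the output is the point at position 1, which is not an exact 1-median of all six points but is returned without ever inspecting a distance to the other three. How much is lost by ignoring most points in this way is what the lemmas of this section quantify.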
3 Lower bound
Fix any deterministic -query algorithm Alg, where . Then take a constant , where is such that -regular expander graphs exist. By padding, assume the number of Alg's queries to be exactly . Adversary Adv in Fig. 1 answers the queries of Alg. All graphs are assumed to be undirected.
As a remark, whenever an edge of a graph is marked as permanent, that edge is considered to be permanent in all graphs. For example, an edge of marked as permanent in line 3 of Adv is considered to be permanent in lines 11–13, even though the latter processes rather than . Similarly, although an edge marked as permanent by line 8 comes from by line 6, it is considered to be permanent in lines 11–13 as well.
Lemma 8.
For all , is a subgraph of .
Proof.
By line 1, is a subgraph of . Assume as induction hypothesis that is a subgraph of . By line 3 and the induction hypothesis, all edges of are permanent edges of . By lines 9–14, all permanent edges of are in . □
Lemma 9.
For all , Adv's answer to the th query of Alg equals .
Proof (included for completeness).
Let be Adv's answer to the th query. By lines 6–7, . (Footnote 4: As is an expander, by Lemma 8.) By lines 9–14, is a subgraph of , implying . In summary, .
By line 7, is the length of . As is in by line 6, all edges of are permanent edges of by lines 8–14. So by lines 9–14, exists in for all . (Footnote 5: Note that once an edge is marked as permanent, it cannot be removed by line 12.) Therefore, the length of is at least (in fact, at least for all ). In summary, . □
Lemma 10.
For each , each run of line 8 marks as permanent at most two edges incident to .
Proof (included for completeness).
In line 6, has at most two edges incident to . □
Let be the set of edges ever marked as permanent, and . Denote by the output of Alg with all queries answered by Adv. By padding dummy queries, assume without loss of generality that Alg queries for the distance between and each point in .
By lines 7–8, Adv answers each query of Alg by the length of a path whose edges are all in . So for all , the answer to the th query is at least . Therefore, by Lemma 9, where . This and the assumption that Alg queries for all distances between and the points in give
(5)
Consider the instant when the number of permanent edges incident to a vertex exceeds . By Lemma 10, is incident to at most permanent edges at time . Then lines 9–14 remove from all non-permanent edges incident to (and will not put them back to for any ). So no more edges incident to will be marked as permanent after time . In summary, has degree at most in . In the above argument, can be any vertex whose number of incident permanent edges ever exceeds . So has maximum degree at most . (Footnote 6: Clearly, a vertex whose number of incident permanent edges never exceeds will have degree in .) So for all , at most vertices in can be within distance (inclusive) from . Taking for a small constant depending on , . I.e., at least vertices are of distance greater than from in . So
For all and when line 6 picks , has at most one non-permanent edge.
Proof (included for completeness).
Write . Assume for contradiction that and are both non-permanent when line 6 picks from , for some . By line 1, has the edge . But by the optimality of in line 6, cannot have the edge . So there exists such that line 12 runs with in the th iteration of the loop in lines 4–15. (Footnote 7: Let be the smallest index such that does not have . Line 9 initializes to be , which has . So line 12 must remove from . This happens only by running line 12 with .) Being non-permanent when line 6 picks from , and must have remained non-permanent throughout the first iterations (including the th iteration) of the loop in lines 4–15 (because of the irreversibility of permanence). Therefore, when line 12 runs with in the th iteration of the loop in lines 4–15, or must be removed from . By symmetry, assume to not have . By lines 9–14 and as , cannot have , either. As is picked from by line 6, must have (which is on ), a contradiction. □
As is -regular by line 2, line 3 marks edges as permanent by the handshaking lemma. By Corollary 15, at most edges are ever marked as permanent by line 8. To sum up, has at most edges. So by the handshaking lemma, the average degree in is at most . This and Markov's inequality imply that at most vertices have degrees at least in . As , at most vertices have degrees at least in . □
Each query increases cnt by at most two in lines 4–11. Lines 15–18 may also increase cnt. Lines 6, 10, and 17 set to be cnt for some . □
Lemma 21.
If Alg is -approximate for metric 1-median, where , then Sim is a tame -query -approximation algorithm for metric 1-median.
Proof.
By Lemma 19, Sim simulates Alg with an injective renaming of points. So, inheriting from Alg, Sim is -approximate and makes queries. By Lemma 20 and lines 12 and 19 of Sim, Sim is tame. □
Theorem 22.
Each deterministic -query algorithm for metric 1-median fails to be -approximate for some computable function satisfying .
Proof.
By Lemma 21, assume Alg to be tame without loss of generality (otherwise, prove the theorem against Sim instead of Alg). Let be Alg's output when the queries are answered by Adv with (resp., ) substituted by (resp., ). By Lemma 11 with (resp., ) substituted by (resp., ),
(8)
where is a graph on as in Adv. By Lemmas 16–17 with (resp., ) substituted by (resp., ), there exists satisfying
(9)
Equations (8)–(9) and the triangle inequality imply
(10)
Recall that . Put all points in extremely close to : For all distinct , , and
(15)
It is not hard to see that is induced by the weighted graph obtained in the following way: (1) Add all vertices in to . (2) Add an edge between each and each neighbor (in ) of . (3) Connect any two vertices in by an edge of weight , all other edge weights being .
As Alg is tame, for all , implying by equation (15). So by Lemma 9, Adv answers queries consistently with .
We have
(17)
As Alg is tame, . By equation (10), . (Footnote 8: For proving the theorem, we may assume without loss of generality. So is nonzero.) So . Now,
This and equations (3.1)–(17) show to be no better than -approximate for some constant . Clearly, . So taking completes the proof except that may be uncomputable. Fortunately, has codomain by equation (15). (Footnote 9: Any graph on a subset of induces distances in . But equations (3.1)–(17) forbid as a distance.) So we may pretend as if is Alg's worst-case query complexity w.r.t. metrics with codomain . This makes , and thus , computable. □
Corollary 23.
Metric 1-median has no deterministic -query -approximation algorithms.
Corollary 24.
Metric 1-median has no deterministic -query algorithms with an asymptotically best approximation ratio.
Proof.
Take any deterministic -query algorithm . By Theorem 22, there exists a computable forbidding to be -approximate. But Theorem 7 asserts the existence of a deterministic -query -approximation algorithm. □
Appendix A Distances in expanders
It is well-known that an -regular expander graph on exists. I.e., there exist constants and such that
(i) is -regular, and
(ii) for each of size at most , at least edges of are in .
Lemma 25.
For each nonempty of size at most ,
Proof.
For each ,
So is the set of vertices at level of the BFS tree rooted at . (Footnote 10: Generalize BFS in the obvious way to allow the root to be a set of vertices.)
Now fix any . Because edges cannot cross non-adjacent levels of a BFS tree, . By item (ii) (with replaced by and noting that has size at most ), at least edges of are in . In summary, at least edges are in (and are thus incident to a vertex in ). As is -regular, therefore, . Hence
where the last equality uses the convergence of . □
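The BFS levels used in this proof, with the root generalized to a set of vertices, can be computed as in the following sketch. The function name `bfs_levels` is hypothetical, and the example graph (the 3-dimensional hypercube, which is 3-regular) is an assumed toy instance chosen only because its levels are easy to verify by hand.

```python
from collections import deque


def bfs_levels(adj, roots):
    """Return a list whose i-th entry is the set of vertices at distance exactly i from `roots`."""
    dist = {r: 0 for r in roots}
    queue = deque(dist)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    levels = [set() for _ in range(max(dist.values()) + 1)]
    for v, k in dist.items():
        levels[k].add(v)
    return levels


# Assumed toy graph: the 3-dimensional hypercube on vertices 0..7 (3-regular).
adj = {v: {v ^ 1, v ^ 2, v ^ 4} for v in range(8)}
levels_single = bfs_levels(adj, {0})
levels_pair = bfs_levels(adj, {0, 7})
```

From the single root 0 the level sizes are 1, 3, 3, 1, and from the root set {0, 7} every other vertex lies at level 1. Every edge joins vertices in the same level or in adjacent levels, which is the property the proof combines with the expansion condition to make the levels grow geometrically.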
Appendix B Acknowledgments
The author is supported by the Ministry of Science and Technology of Taiwan under grant 110-2221-E-155-012-.
References
[1] P. Bose, A. Maheshwari, and P. Morin. Fast approximations for sums of distances, clustering and the Fermat–Weber problem. Computational Geometry, 24(3):135–146, 2003.
[2] C.-L. Chang. A lower bound for metric 1-median selection. Journal of Computer and System Sciences, 84:44–51, 2017.
[3] C.-L. Chang. Metric 1-median selection with fewer queries. In Proceedings of the 2017 International Conference on Applied System Innovation, pages 1056–1059, 2017.
[4] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. ACM Transactions on Computation Theory, 9(4):1–23, 2018. Article 20.
[5] C.-L. Chang. A note on metric 1-median selection. In Proceedings of the 23rd International Computer Symposium, pages 457–459, Yunlin, Taiwan, 2018.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2001.
[7] A. Czumaj and C. Sohler. Sublinear-time approximation algorithms for clustering via random sampling. Random Structures & Algorithms, 30(1–2):226–256, 2007.
[8] D. Eppstein and J. Wang. Fast approximation of centrality. Journal of Graph Algorithms and Applications, 8(1):39–45, 2004.
[9] O. Goldreich and D. Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
[10] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[11] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.
[13] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 731–740, 2002.
[14] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.
[15] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1–3):35–60, 2004.
[16] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 3rd edition, 1976.
[17] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[18] B. Y. Wu. On approximating metric 1-median in sublinear time. Information Processing Letters, 114(4):163–166, 2014.