On ultrametric 1-median selection

Ching-Lueh Chang (Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan. Email: clchang@saturn.yzu.edu.tw)
Abstract

Consider the problem of finding a point in an ultrametric space with the minimum average distance to all points. We give this problem a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm for all $\epsilon>0$.

1 Introduction

A metric space is a nonempty set $M$ endowed with a distance function $d\colon M\times M\to[0,\infty)$ satisfying

  • $d(x,y)=0$ if and only if $x=y$,

  • $d(x,y)=d(y,x)$, and

  • $d(x,z)\leq d(x,y)+d(y,z)$ (triangle inequality)

for all $x$, $y$, $z\in M$. With the triangle inequality strengthened to

\[ d(x,z)\leq\max\{d(x,y),\,d(y,z)\}, \]

we call $(M,d)$ an ultrametric space and $d$ an ultrametric (a.k.a. non-Archimedean metric or super-metric). The mathematical community studies ultrametrics extensively.
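To make the definition concrete, the following minimal sketch checks the axioms above by brute force on a finite point set. The names `is_ultrametric`, `xor_distance` and `line_distance` are illustrative assumptions, not part of the paper; `xor_distance` is a toy ultrametric on non-negative integers chosen only because it is easy to compute.

```python
from itertools import product

def is_ultrametric(points, d):
    """Brute-force check of the metric axioms with the strong triangle inequality."""
    points = list(points)
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # distances are non-negative
            return False
        if (d(x, y) == 0) != (x == y):       # d(x, y) = 0 iff x = y
            return False
        if d(x, y) != d(y, x):               # symmetry
            return False
    # strong triangle inequality: d(x, z) <= max(d(x, y), d(y, z))
    return all(d(x, z) <= max(d(x, y), d(y, z))
               for x, y, z in product(points, repeat=3))

def xor_distance(x, y):
    """Toy ultrametric (assumption): position of the highest differing bit, plus one."""
    return (x ^ y).bit_length()

def line_distance(x, y):
    """Ordinary distance on the line: a metric, but not an ultrametric."""
    return abs(x - y)

print(is_ultrametric(range(8), xor_distance))   # strong triangle inequality holds
print(is_ultrametric(range(8), line_distance))  # fails, e.g. for x=0, y=1, z=2
```

The line metric fails because $d(0,2)=2>\max\{d(0,1),d(1,2)\}=1$, which illustrates how much stronger the ultrametric inequality is than the ordinary triangle inequality.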

Given an $n$-point metric space $(M,d)$, metric 1-median asks for a point in $M$, called a 1-median, with the minimum average distance to all points. Metric 1-median is a special case of the classical $k$-median clustering and a generalization of the classical median selection [3]. It can also be interpreted as finding the most important point because social network analysis often measures the importance of an actor $v$ by $v$'s closeness centrality, defined to be $v$'s average distance to all points [8]. Not surprisingly, metric 1-median is extensively studied, e.g., in the general [5, 6], Euclidean [7], streaming [4] and deterministic [2] cases. Indyk [5, 6] has the currently best upper bound for metric 1-median:

Theorem 1 ([5, 6]).

Metric 1-median has a Monte Carlo $O(n/\epsilon^{2})$-time $(1+\epsilon)$-approximation algorithm for all $\epsilon>0$.

The greatest strengths of Theorem 1 are the sublinear time complexity (of $O(n/\epsilon^{2})$) and the optimal approximation ratio (of $1+\epsilon$), where "sublinear" means "$o(n^{2})$" by convention because there are $\Theta(n^{2})$ distances. Furthermore, except for the dependence of the time complexity on $\epsilon$, all parameters in Theorem 1 are easily shown to be optimal [1, Sec. 7].

Chang [1, Sec. 6] uses Indyk's [6, Sec. 6.1] technique to give a Monte Carlo algorithm for metric 1-median with time complexity independent of $n$ but at the cost of a worse approximation ratio:

Theorem 2 ([1, Sec. 6]).

For all $\epsilon>0$, metric 1-median has a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(2+\epsilon)$-approximation algorithm with success probability greater than $1-\epsilon$.

Let ultrametric 1-median be metric 1-median restricted to ultrametric spaces. The approximation ratio of $2+\epsilon$ in Theorem 2 cannot be improved to $2-\epsilon$ even if we require the success probability only to be a small constant [1, Sec. 7]. In contrast, this paper gives a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm for ultrametric 1-median. So our algorithm has the optimal approximation ratio (of $1+\epsilon$) and a time complexity (of $O((\log^{2}(1/\epsilon))/\epsilon^{3})$) independent of $n$.

2 Algorithm

For all $n\in\mathbb{Z}^{+}$, $[n]\stackrel{\text{def.}}{=}\{1,2,\ldots,n\}$ by convention. Let $([n],d)$ be an ultrametric space, OPT a 1-median of $([n],d)$ and $\epsilon>0$. Order the points in $[n]$ as $p_{1}=\text{OPT}$, $p_{2}$, $\ldots$, $p_{n}$ so that

\[ 0=d(\text{OPT},p_{1})\leq d(\text{OPT},p_{2})\leq\cdots\leq d(\text{OPT},p_{n}). \tag{1} \]

Furthermore, let

\[ r^{*}\stackrel{\text{def.}}{=}\frac{1}{n}\cdot\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \tag{2} \]

be the average distance from a 1-median to all points. Because the brute-force algorithm for ultrametric 1-median takes $\Theta(n^{2})$ time and we want an $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time algorithm, assume $\epsilon\geq n^{-2/3}$ without loss of generality. Furthermore, assume $\epsilon\leq 0.0001$ without loss of generality. (Footnote: It is easy to see that if our result holds when $\epsilon=0.0001$, then it also holds for all $\epsilon>0.0001$.)
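For reference, the brute-force baseline mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `brute_force_one_median` is hypothetical, the distance `d` is passed as a Python callable on $[n]$, and ties are broken in favor of the smallest index.

```python
def brute_force_one_median(n, d):
    """Return (OPT, r*) for the space ([n], d) using Theta(n^2) distance evaluations.

    OPT is a point of [n] = {1, ..., n} minimizing the total distance to all points,
    and r* is its average distance to all points, as in Eq. (2).
    """
    best_point, best_total = None, float("inf")
    for x in range(1, n + 1):
        total = sum(d(x, y) for y in range(1, n + 1))
        if total < best_total:               # ties keep the smallest index
            best_point, best_total = x, total
    return best_point, best_total / n

# Toy ultrametric on [4] (an assumption for illustration): highest differing bit.
opt, r_star = brute_force_one_median(4, lambda x, y: (x ^ y).bit_length())
print(opt, r_star)  # 2 1.5
```

On this toy instance the total distances from the points 1, 2, 3, 4 are 7, 6, 6, 9, so the point 2 is returned with $r^{*}=6/4=1.5$.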

Lemma 3.

For all $1\leq\ell\leq n$,

\[ \sum_{i=1}^{n}\,d(p_{\ell},p_{i})\leq\left(1+\frac{\ell-1}{n-\ell+1}\right)\sum_{i=1}^{n}\,d(\text{OPT},p_{i}). \]

Proof.

We have

\begin{align*}
\sum_{i=1}^{n}\,d(p_{\ell},p_{i})
&= \sum_{i=1}^{\ell-1}\,d(p_{\ell},p_{i})+\sum_{i=\ell+1}^{n}\,d(p_{\ell},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,\max\{d(\text{OPT},p_{\ell}),\,d(\text{OPT},p_{i})\}+\sum_{i=\ell+1}^{n}\,\max\{d(\text{OPT},p_{\ell}),\,d(\text{OPT},p_{i})\} \\
&\stackrel{(1)}{\leq} \sum_{i=1}^{\ell-1}\,d(\text{OPT},p_{\ell})+\sum_{i=\ell+1}^{n}\,d(\text{OPT},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,d(\text{OPT},p_{\ell})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&= \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=\ell}^{n}\,d(\text{OPT},p_{\ell})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&\stackrel{(1)}{\leq} \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=\ell}^{n}\,d(\text{OPT},p_{j})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&\leq \sum_{i=1}^{\ell-1}\,\frac{1}{n-\ell+1}\cdot\sum_{j=1}^{n}\,d(\text{OPT},p_{j})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}) \\
&= \frac{\ell-1}{n-\ell+1}\cdot\sum_{i=1}^{n}\,d(\text{OPT},p_{i})+\sum_{i=1}^{n}\,d(\text{OPT},p_{i}).
\end{align*}

∎
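As a sanity check rather than a proof, the bound of Lemma 3 can be verified numerically on small instances. The sketch below assumes a toy ultrametric given by the highest differing bit of integer labels; `check_lemma3` and `xor_distance` are hypothetical names introduced only for this illustration.

```python
import random

def xor_distance(x, y):
    """Toy ultrametric on distinct non-negative integers (assumption for illustration)."""
    return (x ^ y).bit_length()

def check_lemma3(points, d):
    """Check the inequality of Lemma 3 for every rank ell on a finite ultrametric space."""
    n = len(points)
    totals = {x: sum(d(x, y) for y in points) for x in points}
    opt = min(points, key=totals.get)                  # a 1-median
    # p_1 = OPT, then points in order of nondecreasing distance from OPT, as in Eq. (1)
    ordered = sorted(points, key=lambda p: d(opt, p))
    for ell, p in enumerate(ordered, start=1):
        bound = (1 + (ell - 1) / (n - ell + 1)) * totals[opt]
        if totals[p] > bound + 1e-9:                   # small slack for floating point
            return False
    return True

labels = random.Random(0).sample(range(1024), 50)      # 50 distinct labels
print(check_lemma3(labels, xor_distance))              # True
```

The check returns True on any valid ultrametric input because the lemma holds for every $\ell$; the bound is only useful for small $\ell$, where the factor $1+(\ell-1)/(n-\ell+1)$ is close to 1.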

In short, Lemma 3 says that $p_{\ell}$ is an approximate 1-median for all small $\ell$. Below is the key to the proof of Theorem 1.

Fact 4 ([6, Sec. 6.1]).

Pick $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ independently and uniformly at random from $[n]$, where $k\in\mathbb{Z}^{+}$. Then for all $a$, $b\in[n]$ satisfying $\sum_{j=1}^{n}\,d(b,p_{j})>(1+\epsilon)\,\sum_{j=1}^{n}\,d(a,p_{j})$,

\[ \Pr\left[\sum_{j=1}^{k}\,d(b,\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(a,\boldsymbol{v}_{j})\right]<\exp\left(-\frac{\epsilon^{2}k}{64}\right). \]

The following lemma uses Indyk's [6, Sec. 6.1] technique that Chang [1, Sec. 6] uses to prove Theorem 2.

Lemma 5.

Pick $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ as in Fact 4, where $k=\lceil 10^{9}(\log(1/\epsilon))/\epsilon^{2}\rceil$. Let $x_{1}$, $x_{2}$, $\ldots$, $x_{h}\in[n]$, where $h=\lceil 10^{9}(\log(1/\epsilon))/\epsilon\rceil$, and

\[ t=\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j}), \tag{3} \]

breaking ties arbitrarily. Then

\[ \Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})\leq(1+\epsilon)\cdot\min_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j})\right]>1-\epsilon. \]
Proof.

Let

\[ i^{*}=\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j}), \tag{4} \]

breaking ties arbitrarily. Then

\begin{align*}
&\Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\min_{i=1}^{h}\,\sum_{j=1}^{n}\,d(x_{i},p_{j})\right] \\
&\stackrel{(4)}{=} \Pr\left[\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right] \\
&\stackrel{(3)}{=} \Pr\left[\left(\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{t},\boldsymbol{v}_{j})=\min_{i=1}^{h}\,\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\right)\right] \\
&\leq \Pr\left[\left(\sum_{j=1}^{n}\,d(x_{t},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{t},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\leq \Pr\left[\exists i\in[h],\,\left(\sum_{j=1}^{n}\,d(x_{i},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\leq \sum_{i=1}^{h}\,\Pr\left[\left(\sum_{j=1}^{n}\,d(x_{i},p_{j})>(1+\epsilon)\cdot\sum_{j=1}^{n}\,d(x_{i^{*}},p_{j})\right)\land\left(\sum_{j=1}^{k}\,d(x_{i},\boldsymbol{v}_{j})\leq\sum_{j=1}^{k}\,d(x_{i^{*}},\boldsymbol{v}_{j})\right)\right] \\
&\stackrel{\text{Fact 4}}{<} \sum_{i=1}^{h}\,\exp\left(-\frac{\epsilon^{2}k}{64}\right) \\
&= h\cdot\exp\left(-\frac{\epsilon^{2}k}{64}\right) \\
&< \epsilon,
\end{align*}

where the second inequality uses $t\in[h]$. ∎
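The last step of the chain is stated without detail. One way to verify it, assuming $\log$ denotes the natural logarithm and using the standing assumption $\epsilon\leq 0.0001$, is sketched below; the intermediate bounds are deliberately crude and are not taken from the paper.

```latex
\begin{align*}
\frac{\epsilon^{2}k}{64}
  &\geq \frac{10^{9}}{64}\,\log\frac{1}{\epsilon}
   = 15625000\,\log\frac{1}{\epsilon}
  &&\Longrightarrow\quad
  \exp\left(-\frac{\epsilon^{2}k}{64}\right)\leq\epsilon^{15625000}, \\
h &\leq \frac{10^{9}\log(1/\epsilon)}{\epsilon}+1
   \leq \frac{2\cdot 10^{9}\log(1/\epsilon)}{\epsilon}
   \leq \frac{2\cdot 10^{9}}{\epsilon^{2}}
  &&\text{(using } \log(1/\epsilon)\leq 1/\epsilon\text{)}, \\
h\cdot\exp\left(-\frac{\epsilon^{2}k}{64}\right)
  &\leq 2\cdot 10^{9}\cdot\epsilon^{15624998}
   = \epsilon\cdot\left(2\cdot 10^{9}\cdot\epsilon^{15624997}\right)
   < \epsilon
  &&\text{(since } \epsilon\leq 10^{-4}\text{)}.
\end{align*}
```

The large constant $10^{9}$ in $h$ and $k$ therefore leaves a very wide margin in this step.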

In short, Lemma 5 says how to find a $((1+\epsilon)\kappa)$-approximate 1-median from $\{x_{1},x_{2},\ldots,x_{h}\}$ with probability greater than $1-\epsilon$, where $\kappa$ is the best approximation ratio among $x_{1}$, $x_{2}$, $\ldots$, $x_{h}$. Note that computing $t$ in Eq. (3) requires no knowledge of the ordering $p_{1}$, $p_{2}$, $\ldots$, $p_{n}$.

$h\leftarrow\lceil 10^{9}(\log(1/\epsilon))/\epsilon\rceil$;
$k\leftarrow\lceil 10^{9}(\log(1/\epsilon))/\epsilon^{2}\rceil$;
Pick $\boldsymbol{u}_{1}$, $\boldsymbol{u}_{2}$, $\ldots$, $\boldsymbol{u}_{h}$, $\boldsymbol{v}_{1}$, $\boldsymbol{v}_{2}$, $\ldots$, $\boldsymbol{v}_{k}$ independently and uniformly at random from $[n]$;
$t\leftarrow\mathop{\mathrm{argmin}}_{i=1}^{h}\,\sum_{j=1}^{k}\,d(\boldsymbol{u}_{i},\boldsymbol{v}_{j})$, breaking ties arbitrarily;
return $\boldsymbol{u}_{t}$;

Figure 1: Algorithm approx. median for ultrametric 1-median
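The pseudocode in Fig. 1 translates almost line by line into the following Python sketch. The function name `approx_median`, the parameters `c` and `rng`, and the use of the natural logarithm are assumptions made for illustration. The default `c = 10**9` matches the constant in Fig. 1 but is far too large to run in practice, so the usage example passes a small `c`, for which the guarantees proved in this paper are not claimed.

```python
import math
import random

def approx_median(n, d, eps, c=10**9, rng=random):
    """Sketch of Fig. 1: return a candidate point of [n] chosen via random sampling.

    n   -- number of points; the space is [n] = {1, ..., n}
    d   -- callable giving the (ultra)metric distance between two points
    eps -- accuracy parameter, 0 < eps < 1
    c   -- constant in h and k (10**9 in Fig. 1; smaller values are an assumption)
    rng -- object with a randint(a, b) method, e.g. random or random.Random(seed)
    """
    h = math.ceil(c * math.log(1 / eps) / eps)         # number of candidates u_i
    k = math.ceil(c * math.log(1 / eps) / eps ** 2)    # number of witnesses v_j
    candidates = [rng.randint(1, n) for _ in range(h)]
    witnesses = [rng.randint(1, n) for _ in range(k)]
    # t = argmin_i sum_j d(u_i, v_j); min() breaks ties by first occurrence
    return min(candidates, key=lambda u: sum(d(u, v) for v in witnesses))

# Usage on a toy ultrametric (assumption): highest differing bit of the labels.
point = approx_median(64, lambda x, y: (x ^ y).bit_length(), eps=0.1, c=5,
                      rng=random.Random(1))
print(1 <= point <= 64)  # True
```

The sketch evaluates `d` exactly $h\cdot k$ times and never looks at the whole space, which is why the running time does not depend on $n$.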
Lemma 6.

Algorithm approx. median in Fig. 1 outputs a $((1+\epsilon)(1+2\epsilon))$-approximate 1-median with probability greater than $1-2\epsilon$.

Proof.

With $h$ and $\boldsymbol{u}_{1}$, $\boldsymbol{u}_{2}$, $\ldots$, $\boldsymbol{u}_{h}$ as in approx. median,

\begin{align*}
&\Pr\left[\exists i\in[h],\,\boldsymbol{u}_{i}\in\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\}\right] \\
&= 1-\Pr\left[\forall i\in[h],\,\boldsymbol{u}_{i}\notin\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\}\right] \\
&= 1-\left(1-\frac{\lceil\epsilon n\rceil}{n}\right)^{h} \\
&> 1-\epsilon. \tag{6}
\end{align*}

When there exists $1\leq i\leq h$ satisfying $\boldsymbol{u}_{i}\in\{p_{1},p_{2},\ldots,p_{\lceil\epsilon n\rceil}\}$, Lemma 3 asserts the existence of a $(1+2\epsilon)$-approximate 1-median in $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$. So Eqs. (2)–(6) force $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$ to contain a $(1+2\epsilon)$-approximate 1-median with probability greater than $1-\epsilon$. By Lemma 5 (with $\{x_{i}\}_{i=1}^{h}$ substituted by $\{\boldsymbol{u}_{i}\}_{i=1}^{h}$), approx. median outputs a $((1+\epsilon)\kappa)$-approximate 1-median with probability greater than $1-\epsilon$ if $\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots,\boldsymbol{u}_{h}\}$ contains a $\kappa$-approximate 1-median, for all $\kappa>0$. Now take $\kappa=1+2\epsilon$. ∎
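Two quantitative steps in the proof above are left to the reader. The following sketch fills them in under the standing assumption $\epsilon\leq 0.0001$ and with $\log$ read as the natural logarithm; the intermediate bounds are one possible route, not necessarily the one the author had in mind.

```latex
% Inequality (6): the probability that no candidate lands among the first
% \lceil \epsilon n \rceil points is tiny.
\begin{align*}
\left(1-\frac{\lceil\epsilon n\rceil}{n}\right)^{h}
  \leq (1-\epsilon)^{h}
  \leq \exp(-\epsilon h)
  \leq \exp\left(-10^{9}\log\frac{1}{\epsilon}\right)
  = \epsilon^{10^{9}}
  < \epsilon.
\end{align*}

% Lemma 3 with 1 \leq \ell \leq \lceil \epsilon n \rceil gives ratio at most 1 + 2\epsilon,
% because \ell - 1 < \epsilon n and n - \ell + 1 > (1 - \epsilon) n.
\begin{align*}
\frac{\ell-1}{n-\ell+1}
  < \frac{\epsilon n}{(1-\epsilon)n}
  = \frac{\epsilon}{1-\epsilon}
  \leq 2\epsilon
  \qquad\text{for } \epsilon\leq\tfrac{1}{2}.
\end{align*}
```

The overall failure probability of less than $2\epsilon$ in Lemma 6 then follows from a union bound over the two failure events, each of probability less than $\epsilon$.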

Theorem 7.

Ultrametric 1-median has a Monte Carlo $O((\log^{2}(1/\epsilon))/\epsilon^{3})$-time $(1+\epsilon)$-approximation algorithm with success probability greater than $1-\epsilon$.

Proof.

Invoke Lemma 6 (with $\epsilon$ substituted by $\epsilon/4$) and calculate the running time of approx. median. ∎
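The calculation behind this proof can be spelled out as follows. The sketch assumes, as is implicit in the paper, that one distance evaluation and one random sample each take constant time, and it writes $\epsilon'=\epsilon/4$ for the substituted parameter (a symbol introduced here only for convenience).

```latex
\begin{align*}
\text{ratio:}\quad
  & (1+\epsilon')(1+2\epsilon')
    = \left(1+\frac{\epsilon}{4}\right)\left(1+\frac{\epsilon}{2}\right)
    = 1+\frac{3\epsilon}{4}+\frac{\epsilon^{2}}{8}
    \leq 1+\epsilon
    \quad\text{for } \epsilon\leq 2, \\
\text{success probability:}\quad
  & 1-2\epsilon' = 1-\frac{\epsilon}{2} > 1-\epsilon, \\
\text{running time:}\quad
  & O(h\cdot k)
    = O\left(\frac{\log(1/\epsilon')}{\epsilon'}\cdot\frac{\log(1/\epsilon')}{\epsilon'^{2}}\right)
    = O\left(\frac{\log^{2}(1/\epsilon)}{\epsilon^{3}}\right).
\end{align*}
```

The running time is dominated by the $h\cdot k$ distance evaluations needed to compute $t$; the sampling step takes only $O(h+k)$ time.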

References

  • [1] C.-L. Chang. Some results on approximate 1-median selection in metric spaces. Theoretical Computer Science, 426:1–12, 2012.
  • [2] C.-L. Chang. Metric 1-median selection: Query complexity vs. approximation ratio. ACM Transactions on Computation Theory, 9(4):20:1–20:23, 2018.
  • [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
  • [4] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
  • [5] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428–434, 1999.
  • [6] P. Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2000.
  • [7] A. Kumar, Y. Sabharwal, and S. Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM, 57(2):5, 2010.
  • [8] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.