Run-Length FM-Index

Advent Calendar 2019 13

12 Immutable@takoshi
14 toptree@niuez_

FM-index count\mathrm{count} suffix array inverse suffix array locate\mathrm{locate} extract\mathrm{extract}

HISAT2arXiv Hyper Collocation

FM-Index id:echizen_tm FM-Index run-length FM-index


Veli Mäkinen and Gonzalo Navarro. 2005. Succinct suffix arrays based on run-length encoding. Nordic J. of Computing 12, 1 (March 2005), 40-66.

0-based rank\mathrm{rank} select\mathrm{select}

rankc(X,i)\mathrm{rank}_c(X, i) :XX X[0..i]X[0..i] cc
selectc(X,i)\mathrm{select}_c(X, i):XXi+1i + 1cc

FM-Index

FM-index

FM-index 2

LL: Burrows-Wheeler LL
rankc(L,i)\mathrm{rank}_c(L, i): LLL[0..i)L[0..i)cc
accessc(L,i)\mathrm{access}_c(L, i) : L[i]L[i]
CC: cccc

TT Burrows-Wheeler LL TT suffix arraySASA

L[p]=if  SA[p]>0  then  T[SA[p]1]  else  T[n1]L[p] = \mathrm{if}\;SA[p]>0\;\mathrm{then}\;T[SA[p]-1]\;\mathrm{else}\; T[n-1]

SASA (Okanohara & Sadakane, 2009)

FM-index FF F[p]=T[SA[p]]F[p] = T[SA[p]] Suffix array FFTTT=mississippi$T = \mathrm{mississippi\$}LLFFTT suffix

Gyazo

LL wavelet tree CCcount\mathrm{count}mmσ\sigmaO(mlogσ)O(m\log\sigma)

let mut s = 0;
let mut e = n;
for i in (0..n).rev() {
let c = P[i];
s = rank(c, L, s) + C[c];
e = rank(c, L, e) + C[c];
if s >= e {
break;
}
}
return e - s

suffix prefix [s,e)[s, e)

count\mathrm{count} rankc(L,p)+C[c]\mathrm{rank}_c(L, p) + C[c] L[p]=cL[p] = c LL LF-mappingLF(p)LF(p) LF-mapping LL FF T[5]=sT[5] = \mathrm{s} L[8]L[8] F[10]F[10] LF(8)=10LF(8) = 10 inverse LF-mapping Inverse LF-mapping FM-index selectc(L,qC[c])\mathrm{select}_c(L, q - C[c])

Run-Length FM-Index


FM-index LL使nn Burrows-Wheeler LLLL (run-length encoding) run-length FM-index


LL run L=xxxzyy$xxL = \mathrm{xxxzyy\$xx}LLxxx,z,yy,$,xx\mathrm{xxx}, \mathrm{z}, \mathrm{yy}, \mathrm\$, \mathrm{xx}5 run
Run-length FM-index LL LL run (run head) run (run length) SS BB run head L=xxxzyy$xxL = \mathrm{xxxzyy\$xx} S=xzy$xS = \mathrm{xzy\$x} SS rank\mathrm{rank} select\mathrm{select} FM-index LL wavelet tree
run length [3,1,2,1,2][3, 1, 2, 1, 2] ll 111 l1l - 1 00 BB L=xxxzyy$xxL = \mathrm{xxxzyy\$xx} B=100110110B = 100110110BBLLBB11LL run BB
run length run head [1,3,2,2,1][1, 3, 2, 2, 1] BBBB'B=110010101B' = 110010101
Gyazo

SS cc SS cc CSC_S CS[$]=0, CS[x]=1, CS[y]=6, CS[z]=8C_S[\$]=0,\ C_S[\mathrm x]=1,\ C_S[\mathrm y]=6,\ C_S[\mathrm z]=8

FM-index L,CL, CS,B,B,CSS, B, B', C_S run-length FM-index

LF-Mapping

FM-index count\mathrm{count} rankc(L,p)+C[c]\mathrm{rank}_c(L, p) + C[c] run-length FM-index

ppLL run run cc

rankc(L,p)+C[c]=select1(B,CS[c]+rankc(S,rank1(B,p)))\mathrm{rank}_c(L, p) + C[c] = \mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p)))


pp: run x\mathrm x LLrun head
rank1(B,p)\mathrm{rank}_1(B, p): run run head SS
rankc(S,rank1(B,p))\mathrm{rank}_c(S, \mathrm{rank}_1(B, p)): LL run cc run
Gyazo

rankc(L,p)\mathrm{rank}_c(L, p) LL run cc C[c]C[c] LL FF cc rankc(L,p)+C[c]\mathrm{rank}_c(L, p) + C[c] run ccFF + 1
BB' 1cc CS[c]C_S[c] cc rankc(S,rank1(B,p))\mathrm{rank}_c(S, \mathrm{rank}_1(B, p)) run cc run 1 CS[c]+rankc(S,rank1(B,p)C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p) select\mathrm{select}

Gyazo

p{0,,n}p \in \{0, \ldots, n\} cc rankc(L,p)+C[c]\mathrm{rank}_c(L, p) + C[c] L[p]L[p]

L[p]cL[p] \neq c
rankc(L,p)+C[c]=select1(B,CS[c]+rankc(S,rank1(B,p)))\mathrm{rank}_c(L, p) + C[c] = \mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p)))

L[p]=cL[p] = c
rankc(L,p)+C[c]=select1(B,CS[c]+rankc(S,rank1(B,p)))+pselect1(B,rank1(B,p))\mathrm{rank}_c(L, p) + C[c] = \mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p))) + p - \mathrm{select}_1(B, \mathrm{rank}_1(B, p))

L[p]=cL[p] = c L[p]L[p] L[p]=S[rank1(B,p+1)1]L[p] = S[ \mathrm{rank}_1(B, p + 1) - 1] SS

L[p]cL[p] \neq c pp run LL pp run pp'rankc(L,p)=rankc(L,p)\mathrm{rank}_c(L, p)=\mathrm{rank}_c(L, p'), rank1(B,p)=rank1(B,p)\mathrm{rank}_1(B, p) = \mathrm{rank}_1(B, p')

rankc(L,p)+C[c]\mathrm{rank}_c(L, p) + C[c]
=rankc(L,p)+C[c]= \mathrm{rank}_c(L, p') + C[c]
=select1(B,CS[c]+rankc(S,rank1(B,p)))= \mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p')))
=select1(B,CS[c]+rankc(S,rank1(B,p))) = \mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p)))


L[p]=cL[p] = c select1(B,CS[c]+rankc(S,rank1(B,p)))\mathrm{select}_1(B', C_S[c] + \mathrm{rank}_c(S, \mathrm{rank}_1(B, p))) ppcc run ccFF + 1 pp run run head LLpp'ccFFqq'
pp run run head p=select1(B,rank1(B,p))p' = \mathrm{select}_1(B, \mathrm{rank}_1(B, p))ppp' - p pp LF-mapping

Gyazo

Inverse LF-Mapping

inverse LF-mapping selectc(L,qC[c])\mathrm{select}_c(L, q - C[c]) S,B,B,CSS, B, B', C_S count\mathrm{count} 便
count\mathrm{count} LF-mapping F[q]cF[q] \neq c F[q]=cF[q] = c

q{0,,n1}q \in \{0, \ldots, n-1\}
selectc(L,qC[c])=select1(B,selectc(S,rank1(B,q+1)1CS[c]))+qselect1(B,rank1(B,q+1)1)\mathrm{select}_c(L, q - C[c]) = \mathrm{select}_1(B, \mathrm{select}_c(S, \mathrm{rank_1}(B', q + 1) - 1 - C_S[c])) + q - \mathrm{select}_1(B', \mathrm{rank_1}(B', q + 1) - 1 )
c=F[q]c = F[q]

Gyazo


FF qq run selectc(L,qC[c])=p\mathrm{select}_c(L, q - C[c]) = p FF LL run q,pq', p' B[0..q]B'[0..q] 1 rank1(B,q+1)\mathrm{rank}_1(B', q+1) cc 1 CS[c]C_S[c] run LL p=select1(B,selectc(S,rank1(B,q+1)1CS[c]))p' = \mathrm{select}_1(B, \mathrm{select}_c(S, \mathrm{rank_1}(B', q + 1) - 1 - C_S[c])) run offsetqq=qselect1(B,rank1(B,q+1)1)q - q' = q - \mathrm{select}_1(B', \mathrm{rank_1}(B', q + 1) - 1 )

c (=F[q])c\ (= F[q]) FM-index C[c]q<C[c+1]C[c] \leq q < C[c + 1] cc Run-length FM-index rank1\mathrm{rank}_1 調C[c]=rank1(B,C[c]+1)1C[c] = \mathrm{rank}_1(B', C[c] + 1) - 1 CS[c]rank1(B,q+1)1<CS[c+1]C_S[c] \leq \mathrm{rank}_1(B', q + 1) - 1 < C_S[c+1] cc


Rust FM-index run-length FM-index


LLSS wavelet matrix (Claude & Navarro, 2012) rsdic Navarro and Providel (2012)

LLB,BB, B'

artificial pp1p1 - ppp 1 12345671234567

Gyazo

10,000,000 pp FM-index run-length FM-index

result
p (FM-index) (byte) (run-length FM-index) (byte) run /
0.5 2182592 3007432 0.3359
0.9 1473872 1585144 0.1303
0.99 1010064 827208 0.0167
0.999 946088 650976 0.0017

Run-length FM-index

count\mathrm{count} https://github.com/ajalab/fm-index/blob/master/benches/count.rs 256 256
run-length FM-index FM-index 3 ~ 4 1,000,000 1 μs 2σ\sigmalogσ\log\sigma

Gyazo

Gyazo


Run-length FM-index BBBB' run Gagie et al (2019) run-length FM-index BB predecessor search

locate\mathrm{locate} run rrO(r)O(r) suffix array O(n)O(n) https://github.com/nicolaprezza/r-index advent calendar


FM-index run-length FM-index

twitter PR



Veli Mäkinen and Gonzalo Navarro. 2005. Succinct suffix arrays based on run-length encoding. Nordic J. of Computing 12, 1 (March 2005), 40-66.
Daisuke Okanohara and Kunihiko Sadakane. 2009. A Linear-Time Burrows-Wheeler Transform Using Induced Sorting. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE '09), 90-101.
Francisco Claude and Gonzalo Navarro. 2012. The wavelet matrix. In Proceedings of the 19th international conference on String Processing and Information Retrieval (SPIRE'12), 167-179.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. 2018. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '18), 1459-1477.