Abstract:
The set of all q-ary strings that do not contain repeated substrings of length $\leqslant3$
(i.e., that do not contain substrings of the form aa, abab,andabcabc) constitutes a code cor
recting an arbitrary number of tandem-duplication mutations of length $\leqslant3$. In other words,
any two such strings are non-confusable in the sense that they cannot produce the same string
while evolving under tandem duplications of length $\leqslant3$. We demonstrate that this code is
asymptotically optimal in terms of rate, meaning that it represents the largest set of non-con
fusable strings up to subexponential factors. This result settles the zero-error capacity problem
for the last remaining case of tandem-duplication channels satisfying the “root-uniqueness”
property.
Keywords:tandem duplication, tandem repeat, duplication error, repetition error, sticky in sertion, DNA storage, error correction, zero-error capacity, constrained code, square-free string.