Abstract:
In Natural Language Processing (NLP), tokenization is a critical pre-processing step that significantly influences model performance. The choice of tokenizer is crucial, especially because contemporary large language models are expensive to train. Our study investigates various subword-level tokenizers, examining their strengths and limitations. Based on our analysis, we propose a practical approach for comparing these tokenizers, taking into account factors such as tokenization effectiveness, vocabulary size, and tokenization speed. The paper reviews current tokenizer evaluation methods and contributes a new evaluation dataset. This paper thus aims to help researchers choose and train the most appropriate tokenizer for their tasks, especially when training resources are limited. Our objective is to empower the research community to make well-informed decisions about tokenizer selection and thereby improve the quality of their language models.
Key words and phrases: NLP, LLM, tokenizer, tokenization, optimization, benchmark, dataset.