This paper addresses the limitations of visual text generation in text-to-image synthesis. Although recent efforts have adapted the Stable Diffusion model to the visual text generation task, they are constrained to the specific languages seen during training and offer limited controllability, such as reflecting the font of the text or placing it in desired regions. In addition, geometric transformations have been restricted, leading to failures when generating text with arbitrary geometry, and the alignment between objects and texts has been largely neglected. In this paper, we propose attaching additional modules to the Stable Diffusion model to extend its capabilities toward controllable multilingual visual text generation. Our approach simultaneously enables multilingual generation, accommodates arbitrary geometry and font styles, and aligns texts with images at desired locations. Moreover, our model generalizes to new languages using only a small amount of training data for each language. Experiments on multilingual benchmarks demonstrate state-of-the-art performance through user studies and OCR accuracy.