Unsupervised Statistical Segmentation of Japanese Kanji Strings
Ando, Rie; Lee, Lillian
Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method that segments Japanese kanji sequences into words based solely on character $n$-gram counts from an unannotated corpus. Its performance was often better than that of rule-based morphological analyzers across a variety of standard and novel error metrics.
computer science; technical report