the hardest part in training model in foreign languages is to get correctly labeled dataset. I worked with pretrain model on Polish language documents and based on this experience it is relatively good if you are using some text similarity measures. There are some examples/pretrain models with Korean/English/French language