Unicode support & localization

Ivan · October 25, 2013, 10:34pm

I just committed the first (and hopefully the last, for now) part of Unicode support (the UTF-8 variety).

H3 encoding handling
During loading, all H3 texts will be converted to UTF-8. Since there is no reliable way to detect the encoding, users of localized versions must select the correct encoding via the launcher.
Mods (or more precisely - JSON files) must also use UTF-8 instead of the H3 encoding.
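
To illustrate the conversion step described above, here is a minimal Python sketch (not VCMI's actual C++ code). The function name `h3_text_to_utf8` is hypothetical, and CP1251 is used as the example because that is the code page of the Russian version.

```python
# Hypothetical helper for illustration only; VCMI itself is written in C++.
def h3_text_to_utf8(raw: bytes, h3_encoding: str) -> bytes:
    """Decode a raw H3 string with the user-selected code page, re-encode as UTF-8."""
    text = raw.decode(h3_encoding, errors="replace")
    return text.encode("utf-8")


# "Замок" as it would be stored by a CP1251 (Russian) version of the game.
raw = "Замок".encode("cp1251")

# Correct launcher setting: the text survives the round trip.
print(h3_text_to_utf8(raw, "cp1251").decode("utf-8"))

# Wrong launcher setting: the bytes are still decodable, but the result is mojibake.
print(h3_text_to_utf8(raw, "cp1252").decode("utf-8"))
```

The second call shows why the setting matters: a wrong single-byte code page rarely raises an error, it just silently produces the wrong letters.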

Support for localized H3 versions
If the correct encoding is set in the launcher, VCMI should work with all European languages.
Chinese - due to the way fonts were organized in the H3 version, it would still need that “Chinese fonts” mod. True type fonts would also need changes (see below).
Any right-to-left languages - I don’t know if H3 was ever released in these languages, but without a proper text rendering engine they just won’t work.

Localizations support
We already have some strings that must be handled - e.g. monster threat levels and some new strings in the game settings menu. I’d like to move them into one (or several, if needed) JSON files and request them in code via some IDs.
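
A rough sketch of what "strings in a JSON file, requested by ID" could look like, written in Python for brevity. The IDs, the file content and the `StringTable` class are all made up for illustration; the real layout was not decided in this thread.

```python
import json

# Hypothetical file content and IDs, not the actual VCMI format.
TRANSLATIONS_JSON = """
{
    "vcmi.threatLevel.0": "Effortless",
    "vcmi.settings.fullscreen": "Fullscreen"
}
"""


class StringTable:
    """Maps string IDs to translated text loaded from a JSON document."""

    def __init__(self, json_text: str):
        self._strings = json.loads(json_text)

    def get(self, string_id: str) -> str:
        # Fall back to the ID itself so a missing entry is visible but not fatal.
        return self._strings.get(string_id, string_id)


table = StringTable(TRANSLATIONS_JSON)
print(table.get("vcmi.threatLevel.0"))
print(table.get("vcmi.missing.string"))
```

Returning the ID for a missing entry is one possible choice; it makes untranslated strings easy to spot in the UI.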

The proper way to translate them is to use tools like gettext - boost-locale already provides a nice C++-like API for this. However, right now this does not seem to be a high-priority task.

True type fonts
My current “mod” should work with UTF-8 already, but it does not cover the whole Unicode range. Actually, there are no (free) fonts that have good European characters as well as Asian ones (Chinese, for example).

For proper support of multiple languages VCMI would need some logical “merging” of multiple font files - some work for the future, I suppose.

For this reason I’d like to integrate the mod into the VCMI source code and turn it into a configuration option. This includes uploading font files to trunk - right now I use Liberation fonts, which don’t have license issues like the Times New Roman that we used a long time ago.
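
The logical “merging” of font files mentioned above could be done as a fallback chain: for each character, take the first font that has a glyph for it. A minimal Python sketch under that assumption follows; the file names and coverage ranges are invented, and a real font would report its coverage as many separate ranges rather than one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FontFile:
    name: str
    coverage: range  # code points the font has glyphs for (simplified to one range)


# Hypothetical fonts; names and ranges are illustrative only.
FONT_CHAIN = [
    FontFile("european.ttf", range(0x0020, 0x0530)),  # Latin, Greek, Cyrillic
    FontFile("cjk.ttf", range(0x4E00, 0xA000)),       # CJK Unified Ideographs
]


def pick_font(char: str, chain=FONT_CHAIN) -> Optional[FontFile]:
    """Return the first font in the chain that covers the character, or None."""
    code = ord(char)
    for font in chain:
        if code in font.coverage:
            return font
    return None


for ch in "źЖ城":
    font = pick_font(ch)
    print(ch, font.name if font else "no glyph")
```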

gotar · October 26, 2013, 4:07pm

There are 2 ways: either use some existing detection software like enca, or create a database of encodings using some specific string from the H3 resources (people would have to send you samples). As the number of released language versions is strictly limited, the second approach should work best. It only requires choosing some string that differs in every single language.
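
The second approach can be sketched as a lookup table keyed by the raw bytes of one agreed-upon string. This is a Python illustration only; the sample words below are placeholders, not strings taken from real H3 resources.

```python
# Placeholder samples: NOT real H3 strings, just words that differ per language.
KNOWN_SAMPLES = {
    "Castle".encode("cp1252"): "cp1252",  # English
    "Замок".encode("cp1251"): "cp1251",   # Russian
    "Zamek".encode("cp1250"): "cp1250",   # Polish
}


def detect_encoding(sample: bytes, default=None):
    """Look up the raw bytes of the reference string in the sample database."""
    return KNOWN_SAMPLES.get(sample, default)


print(detect_encoding("Замок".encode("cp1251")))
print(detect_encoding(b"unknown edition"))
```

The lookup compares raw bytes, so no decoding is needed before the encoding is known. An edition that is missing from the table simply falls through to the default.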

Ivan · October 26, 2013, 7:43pm

A database would help, but collecting it will take time… So at least for now a selectable encoding is probably the better idea. Later, if auto-detection is requested, we can make one.

enca - I don’t think it would help much. It seems to be oriented towards detecting the encoding when the language is known. In our case we don’t know the language, but we do know the possible encodings.
Besides, its list of supported languages does not sound promising. Especially if you merge languages that use almost identical alphabets (e.g. Ukrainian/Russian/Belarusian use the same encoding with 1-2 language-specific letters).

gotar · October 30, 2013, 4:09pm

Indeed…

Note, you don’t need to detect the language, just the encoding. If any of the language-specific letters are present, they will be detected; if not, it doesn’t matter whether it’s Ukrainian or Russian, since both have the same characters at the same KOI-8[RU] positions, which correspond to the same Unicode code points.

To sum this up: converting to Unicode doesn’t require a language hint; that is needed for transliteration only.

So instead of H3 string samples, just a list of possible charsets should be enough - knowing whether they used ISO or something else, one might easily create a list of available chars and compare it to the data.
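
One way to read "create a list of available chars and compare it to the data" is: decode the data with every candidate charset and keep those where all non-ASCII bytes turn into letters. A Python sketch under that assumption follows; the candidate list and the letters-only rule are my simplifications, not something agreed in this thread.

```python
import unicodedata

# Assumed candidate list; the real set of H3 code pages may be different.
CANDIDATES = ["cp1250", "cp1251", "cp1252"]


def plausible_encodings(data: bytes, candidates=CANDIDATES):
    """Keep the charsets in which every non-ASCII byte decodes to a letter."""
    result = []
    for enc in candidates:
        try:
            text = data.decode(enc)  # strict: undefined bytes rule the charset out
        except UnicodeDecodeError:
            continue
        high = [ch for ch in text if ord(ch) > 127]
        if all(unicodedata.category(ch).startswith("L") for ch in high):
            result.append(enc)
    return result


print(plausible_encodings("Zażółć gęślą jaźń".encode("cp1250")))
print(plausible_encodings("Замок".encode("cp1251")))
```

For the Polish sample this narrows the list to a single charset, but for the short Russian word several candidates survive, because those bytes are letters in all three code pages. So a check like this can shrink the list without always settling it.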

Ivan · October 30, 2013, 5:29pm

Or just detect the language from the system locale and keep a database “language code -> H3 encoding”. This should work in almost all situations - after all, it is unlikely that the H3 language will differ from the system language.
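
A small Python sketch of such a “language code -> H3 encoding” table. The entries are limited to pairings that come up in this thread (CP1251 for Russian, CP1250 for Polish, CP1252 for English and Spanish, and the Russian version for Ukrainian users); the function names and the CP1252 default are assumptions.

```python
import locale

# Assumed table; only pairings mentioned in this thread are listed.
LANGUAGE_TO_H3_ENCODING = {
    "en": "cp1252",
    "es": "cp1252",
    "pl": "cp1250",
    "ru": "cp1251",
    "uk": "cp1251",  # no Ukrainian H3, players use the Russian version
}


def encoding_for_language(lang_code: str, default: str = "cp1252") -> str:
    """Map a two-letter language code to the expected H3 encoding."""
    return LANGUAGE_TO_H3_ENCODING.get(lang_code.lower(), default)


def guess_h3_encoding(default: str = "cp1252") -> str:
    """Guess the H3 encoding from the system locale, e.g. 'pl_PL' -> 'cp1250'."""
    loc = locale.getlocale()[0]  # may be None, or a long name on Windows
    if not loc:
        return default
    return encoding_for_language(loc.split("_")[0], default)


print(encoding_for_language("pl"))
print(guess_h3_encoding())
```

Locale names are platform-dependent (Windows may report something like "English_United States"), so anything unrecognized falls back to the default, and the result should only ever be a hint that the user can override.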

100% detection is hard to reach anyway - for example, using string samples is not that easy with the Russian versions (~3 different translations), while language codes may fail in cases like Ukrainian - H3 was not translated to it, so everyone here just uses the Russian version.

gotar · November 1, 2013, 10:57am

That depends on how many languages are supported by H3…

~: locale -a | cut -f1 -d_ | uniq | wc -l
218

I couldn’t find the list of available translations anywhere, only the localization packs for Linux edition.

It doesn’t matter as long as these translations were made using the same charset. For example, if you see character 248 (the soft sign, provided H3 used KOI-8R, and “°” in ASCII) you guess it’s гражданский_шрифт; if you see char 188 - it must be Polish “ź” (as this char represents borders in both ASCII and KOI-8R).

Detecting an encoding in general is a hard task (including statistical analysis of char codes), but having a strictly specified (and short) list is often enough to make it as simple as searching for a single character that is unique to the charset.
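
The two byte values from the example above can be checked with a few lines of Python. Two assumptions: “ASCII” is taken to mean the DOS code page 437, and ISO-8859-2 is used for Polish because that is the charset where byte 188 is “ź”. The helper name `describe` is made up.

```python
import unicodedata


def describe(byte: int, encoding: str) -> str:
    """Return the Unicode name of the character a single byte decodes to."""
    return unicodedata.name(bytes([byte]).decode(encoding))


# Show how bytes 188 and 248 are interpreted by each charset.
for enc in ("cp437", "koi8_r", "iso8859_2"):
    for byte in (188, 248):
        print(f"{enc:10} {byte}: {describe(byte, enc)}")
```

Byte 248 is a Cyrillic letter only in KOI8-R and byte 188 is a Latin letter only in ISO-8859-2, which is the kind of single-character signature described above.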

It doesn’t matter in my approach - detect the encoding, not the language. Since a single encoding covers multiple languages, it should greatly simplify the process. Although I might be wrong if there is some huge number of editions available; hard to tell without the list.

Ivan · November 1, 2013, 3:33pm

It does matter when the same term was translated differently. The teal color, for example, was translated differently each time - бирюзовый, чайный, сизый.
Finding a term that everyone translates in the same way is not hard. But creating a database for such an approach will take quite a while.

H3 uses CP1251 for the Russian version.

Distinguishing Russian CP1251 from Polish CP1250 is not a problem. But distinguishing CP1250 from CP1252 is much harder - they have a lot of characters that are used only by some languages, so any frequency tests will fail here, and checks for some specific characters won’t help much either.
For example, “ż” is specific to Polish - it won’t detect any other languages that use CP1250. Polish “ż” from CP1250 (which is #191 according to the wiki) is also Spanish “¿” in CP1252.

CP1250 vs CP1252 is my main problem in terms of charset detection.
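
The CP1250 vs CP1252 ambiguity is easy to reproduce with Python's built-in codecs. This is a small sketch; `differing_bytes` is a made-up helper name.

```python
def differing_bytes(enc_a: str, enc_b: str):
    """List the byte values 128-255 that decode differently in the two charsets."""
    diff = {}
    for b in range(128, 256):
        raw = bytes([b])
        a = raw.decode(enc_a, errors="replace")
        c = raw.decode(enc_b, errors="replace")
        if a != c:
            diff[b] = (a, c)
    return diff


diff = differing_bytes("cp1250", "cp1252")
print(len(diff), "of 128 high bytes differ")
print(191, diff[191])

# The same bytes are valid text in both code pages:
raw = "¿Qué?".encode("cp1252")
print(raw.decode("cp1252"), "vs", raw.decode("cp1250"))
```

Both decodings succeed without errors, so only knowledge of the language (or a dictionary) can tell which reading was intended.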

Turnam · November 2, 2013, 3:23pm

Some people prefer to watch movies in their original version, even if that version isn’t in their native language. Same thing for video games. You’ll find people playing with the English version even if the system language is not English.

Inversely, I’ve seen people playing a translated version of a game even though the original version was in their own language, to help them practice a foreign language a bit even while they’re playing. Granted, that’s not a frequent occurrence, but it does happen too.

Ivan · November 2, 2013, 3:42pm

FYI - I’m one of those people ^
One more factor you haven’t mentioned is the questionable quality of some translations.

Anyway, this discussion is only about the auto-detected value - there will always be some way to override it. In fact, the current implementation is a manually selectable value.

gotar · November 3, 2013, 11:37am

Ivan, you’re right, distinguishing the CP125x encodings would be too hard to be worth coding, and depending on the system locale might serve as some kind of hint until some database (or a final solution) is created.

I have only LC_CTYPE and LC_COLLATE set to pl_PL, and that’s what enca used (I prefer the original English strings, and it happens that I play only the English H3), but the vast majority of users feel comfortable only with their native language and would probably have the appropriate H3 - but even if not (e.g. when no translation was made), CP125x encodings gracefully fall back to ASCII when no national characters are used.
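
The ASCII fallback is simple to demonstrate: text without national characters decodes to the same string under every CP125x code page, so a wrong guess is harmless for it. A short Python sketch, with a made-up helper name:

```python
CP125X = ["cp1250", "cp1251", "cp1252"]


def decodes_identically(raw: bytes, encodings=CP125X) -> bool:
    """True if the bytes give the same text in every listed code page."""
    return len({raw.decode(enc, errors="replace") for enc in encodings}) == 1


print(decodes_identically(b"Castle"))               # plain ASCII text
print(decodes_identically("żeglarz".encode("cp1250")))  # contains a national character
```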

Cases where an X-language speaker uses a Y-language localized H3 other than English (CP1252) would probably be too rare to care about.

oceanking7 · January 22, 2014, 2:07pm

I saw that you made a Russian translation
for VCMI’s internationalization.
Can you show us how to make a translation,
so that we can translate it into our own language?

thanks

Ivan · January 22, 2014, 3:48pm

What exactly do you want to translate?
Heroes 3 itself - AFAIK there is already a Chinese version of H3, so you can use it. VCMI should work with localized versions of H3; otherwise, it is a bug that we must fix.
As for Chinese specifically, you also need to:

Install “chinese fonts” mod
In Launcher settings select Simplified Chinese encoding.

VCMI - we have quite few strings that need translation. Check this: github.com/vcmi/vcmi/blob/maste … slate.json
If you translate all the English strings from that file and upload the translated file here, I can turn it into a separate mod.

Some of the VCMI mods - if you tell me which mods you want to translate, I can generate files with all the translatable strings from those mods.

oceanking7 · January 23, 2014, 12:39am

thanks
i’ll try to do it