Peter Burkimsher, 2018-06-19
peterburk@gmail.com
Introduction

> Have you ever used a computer in another language? What does the 關機 button do?

> I have 100 localisations; that must be enough. Text looks cleaner, let's get rid of the icons.

If you like pretty graphs and want to change the world, this article is for you! You'll find 9 graphs about the languages supported by famous brands, 2 datasets of bilingual strings, and 2 more graphs about Chinese dialects.

I did this side project because I'm looking for a job, hopefully in New Zealand or Canada, possibly Australia. I'm currently working for a memory card manufacturer in Taiwan, logging and analysing testing data about microSD cards. I've been a foreigner all my life, and I'm trying to find a country to call home. Please tell me where I'm welcome!

There are a lot of graphs. If you work for one of these companies, you can send them to a director! I can also make some graphs for you and your company (e.g. Coca-Cola, Pepsi, McDonald's, or other multinational brands).
Microsoft Windows
Maltese (655,560), Maori (642,700), Norwegian Nynorsk (576,996), Icelandic (438,800), Luxembourgish (424,100), Cherokee (302,000), Irish (279,700), Scottish Gaelic (64,400), Inuktitut (35,000), Catalan - Valencian (13,000).
The bias towards European languages is probably because of pressure from the EU. This is not a bad thing, per se; it should be a call for other regional groups (ASEAN, ECOWAS, ODECA, UNASUR, Arab League, Pacific Union) to exert the same pressure that the EU does.
To give all languages equal priority (I'm looking at you, 13,000 Valencian speakers) would require translating Windows to 2670 languages.
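A count like this presumably comes from taking the smallest supported language as a threshold and counting every language with at least that many speakers. Here is a minimal sketch of that idea. The function name and the two "Hypothetical" entries are made up for illustration; the other figures are the ones listed above, and the real count needs the full population dataset.

```python
def languages_at_or_above(populations: dict[str, int], threshold: int) -> list[str]:
    """Return the languages with at least `threshold` speakers, sorted by name."""
    return sorted(
        name for name, speakers in populations.items() if speakers >= threshold
    )


# A tiny sample: four figures from the list above plus two invented entries.
populations = {
    "Maltese": 655_560,
    "Scottish Gaelic": 64_400,
    "Inuktitut": 35_000,
    "Catalan - Valencian": 13_000,
    "Hypothetical language A": 20_000,  # invented for illustration
    "Hypothetical language B": 5_000,   # invented for illustration
}

# The smallest supported language sets the bar for "equal priority".
threshold = populations["Catalan - Valencian"]
required = languages_at_or_above(populations, threshold)
print(len(required), "languages to support:", required)
```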
Most of those sound small, but Norwegian? It turns out there are two dialects: Bokmål and Nynorsk. Both are supported by Windows.
I got my population data from JoshuaProject, which doesn't separate those dialects. Therefore I used the Wikipedia value of 86.3% Bokmål to estimate the populations speaking each dialect.
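The split is simple arithmetic, sketched below. The function name and the total of 1,000,000 are placeholders for illustration, not the JoshuaProject figure; only the 86.3% share comes from the text above.

```python
def split_population(total: int, first_share: float) -> tuple[int, int]:
    """Split a combined population into two groups, given the first group's share."""
    first = round(total * first_share)
    # Subtract instead of rounding twice, so the two parts always sum to the total.
    return first, total - first


BOKMAL_SHARE = 0.863  # the Wikipedia value quoted above

# Placeholder total; substitute the combined Norwegian figure from the dataset.
bokmal, nynorsk = split_population(1_000_000, BOKMAL_SHARE)
print(f"Bokmål: {bokmal:,}  Nynorsk: {nynorsk:,}")
```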
Dividing dialects means that values may vary between graphs. For example, Windows separates English (International) and English (United States).
I modified the data, subtracting the population of English speakers in the United States.
Microsoft calls it English (International), and English (United Kingdom) in other places. I'm biased because my passport is British, so I showed favouritism and made my native dialect look larger.
Apple macOS
Ubuntu
Google Translate
Assamese (India), Cherokee, Dari, Inuktitut (Latin, Canada), K'iche' (Guatemala), Kinyarwanda, Konkani (India), Odia (India), Quechua (Peru), Setswana (South Africa), Tatar (Russia), Turkmen, Uyghur, Valencian, Wolof
On a more optimistic note, if you're a programmer, you can easily be the first to make a machine translation tool for these languages!
You can use the Windows Strings dataset above as a bilingual dictionary to get started.
Some languages are supported by Google Translate but not Windows. If you're a developer at Microsoft, have a close look at these and decide whether to support them.
Google Translate not Microsoft Windows (19):
Cebuano, Chichewa, Corsican, Esperanto, Frisian, Haitian Creole, Hawaiian, Hmong, Javanese, Latin, Malagasy, Myanmar (Burmese), Pashto, Samoan, Shona, Somali, Sundanese, Yiddish.
It gets worse when you consider dialects. Punjabi can be written in Gurumukhī (India) or Shahmukhi scripts (Pakistan, Arabic). Google Translate does not support the Shahmukhi script for Punjabi, which might upset 160 million people.
Dialects are not shown on the graphs unless they're supported, because I want the red bars to appear larger.
Some dialects are already supported by Google Translate, such as Chinese (Simplified) and Chinese (Traditional), so it is reasonable to request that others be added. This includes English (United Kingdom), which is personally important to me.
Instead of translating "喜愛" as "favorite", I'd rather make you happy by adding the "u" - making "favourite".
Microsoft Windows Dialects not Google Translate (16):
Bangla (Bangladesh), Bangla (India), Chinese (Hong Kong SAR), Central Kurdish, English (United Kingdom), French (Canada), Norwegian Nynorsk (Norway), Portuguese (Brazil), Punjabi (Arabic), Serbian (Cyrillic, Bosnia and Herzegovina), Serbian (Cyrillic, Serbia), Serbian (Latin, Serbia), Spanish (Mexico).
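Lists like "Google Translate not Microsoft Windows" are set differences. The sketch below shows the comparison on a handful of names taken from the lists above. The two sets are small illustrative subsets, not the full data, and `only_in` is a helper name invented here.

```python
def only_in(a: set[str], b: set[str]) -> list[str]:
    """Return the names present in `a` but missing from `b`, sorted alphabetically."""
    return sorted(a - b)


# Illustrative subsets only; the real sets have around a hundred entries each.
google_translate = {"Cebuano", "Latin", "French", "Spanish"}
windows = {"French", "Spanish", "French (Canada)", "Spanish (Mexico)"}

print("Google Translate not Windows:", only_in(google_translate, windows))
print("Windows not Google Translate:", only_in(windows, google_translate))
```

The comparison is by exact name, so a dialect such as "French (Canada)" counts as missing even when plain "French" is supported, which is the distinction the dialect list above relies on.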
Wikipedia
International Baccalaureate

Figure 8 - IB Languages over 10 million
Assyrian, Maori, Dhivehi (Maldives), Icelandic, Dzongkha (Bhutan), Irish, Classical Greek, Latin.
I'd also like to mention that the IBO is recruiting a Tamil examiner.
The language was supported in 2016 but was not offered in 2017, probably as a result of not finding a suitable candidate.
The statistical bulletins of May and November 2017 are made of screenshots, which makes copy-pasting impossible.
The 2016 data included real text in the PDF, so I used the November 2016 and May 2017 exams to reduce the amount of transcribing I had to do.
The exams are offered twice a year because the summer holiday is at a different time in the southern hemisphere.
Some languages are only available during the May exam session, so students needing to re-sit would be required to wait a whole year instead of only 6 months.
Bible
Conclusion

The world is a very diverse place. Recommending that people replace icons with text and "translate those into the 72 languages that Gmail supports" will isolate a lot of the market who must use a second language to operate a computer. I'm a fan of skeuomorphic design, even though it's no longer the trend. Adding only a few more languages can make a lot of people very happy. Installing Windows language packs is hard, but they support more languages than macOS. Changing from video to photo mode on a Chinese-language iPhone was easy in iOS 6, but became more difficult for me when the icons were replaced by text.

For future work, I'd like to make graphs of iOS and Android localisations, extract the strings from other apps (e.g. Office), and estimate future populations of these languages based on demographic data.
Appendix 1: Developer documentation

The title, i2018n, is a pun on the term "i18n", which is short for "internationalisation" (there are 18 letters between i and n).

I extracted the strings from shell32.dll using a scraper I built using sample code from StackOverflow. For some languages, I couldn't install the language pack using lpksetup.exe, so I reinstalled Windows from the ISO and used a slightly different scraper. If you download too many ISOs, Microsoft will ban your IP for 3 hours or more, giving Error 715-123130. I used a spare computer to work around that problem. I then had 157 GB of ISOs, 1.59 GB of language pack CAB files, and 11.45 GB for each virtual machine.

Installing language packs took a whole day, unattended. Reinstalling Windows from ISOs for the other 38 languages took 2 whole days of manual interaction. Arabic is written right to left, and this can be really confusing. The Windows installer for Thai is in English. The fonts for Japanese and Chinese make ASCII characters look awful. Setting a password is optional. Some of the ISOs from Microsoft are Win10_1511, others are Win10_1803. Some languages let you select a version (Home, Professional, Ultimate), others don't. Running virtual machines uses a lot of power, and your MacBook Pro will get pretty hot. German and French have different keyboard layouts, which makes typing a username or product key a bit harder. Thankfully there's always an English (ASCII) keyboard option.

There are 107 text files, one for each language. The file names correspond to the integer value of the language ID. My data scrape is missing 3 languages, because the links are broken on MSDN: 1074 Setswana (South Africa), 7194 Serbian (Cyrillic, Bosnia and Herzegovina), 10266 Serbian (Cyrillic, Serbia). All other language packs are available, although you must remove everything before download.windowsupdate.com to make the URL work.
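If you want to make sense of the integer file names, here is a minimal sketch. It assumes they are standard Windows language identifiers, where the low 10 bits hold the primary language and the next 6 bits hold the sublanguage (regional variant). The function name `split_langid` is made up for this example.

```python
def split_langid(langid: int) -> tuple[int, int]:
    """Split a 16-bit Windows language ID into (primary language, sublanguage)."""
    primary = langid & 0x3FF             # low 10 bits: the language itself
    sublanguage = (langid >> 10) & 0x3F  # next 6 bits: the regional variant
    return primary, sublanguage


# The three IDs missing from the scrape, as listed above.
for langid in (1074, 7194, 10266):
    primary, sub = split_langid(langid)
    print(f"{langid}.txt = 0x{langid:04X}: primary 0x{primary:02X}, sublanguage {sub}")
```

Under that assumption, the two Serbian IDs share the same primary language value and differ only in the sublanguage, which is a handy way to group dialects of one language.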
There are also 5 languages that are no longer used: 1158 K'iche' (Guatemala), 3098 Serbian (Cyrillic, Serbia), 2074 Serbian (Latin, Serbia), 2141 Inuktitut (Latin, Canada), 3076 Chinese (Hong Kong SAR). Although there are 65536 possible keys, only 4410 actually have values.

A word of warning: parsing and aligning the data is not trivial. There are many special characters, mixed right-to-left scripts, and stray newlines that confuse most programming languages, including my first parser in Python. If you don't believe me, just try to open 16821.txt in TextWrangler and sort it. Building the table above required making a parser that can grep for the string keys I need.

I extracted the icons from shell32.dll and imageres.dll by changing the DLL's file extension to .7z, opening it with Keka, going to the hidden .rsrc folder, looking for the icon I wanted, and converting each ICO file to a PNG.
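To give an idea of what aligning two language files involves, here is a minimal sketch. It assumes each entry starts with the integer string key, then a tab, then the text; that layout is a guess for illustration and may not match the real files. Lines that don't start with a key are treated as stray newlines and folded back into the previous entry. The keys and sample strings below are invented.

```python
import re

# Assumed entry layout: "<integer key><TAB><text>". This is a guess, not the real format.
KEY_LINE = re.compile(r"^(\d+)\t(.*)$")


def parse_strings(text: str) -> dict[int, str]:
    """Parse one language file into {key: string}, folding stray newlines."""
    strings: dict[int, str] = {}
    current = None
    for line in text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            current = int(match.group(1))
            strings[current] = match.group(2)
        elif current is not None and line.strip():
            # No key on this line: treat it as a continuation of the last entry.
            strings[current] += " " + line.strip()
    return strings


def align(source: dict[int, str], target: dict[int, str]) -> list[tuple[int, str, str]]:
    """Pair up the strings that share a key in both languages."""
    shared = sorted(source.keys() & target.keys())
    return [(key, source[key], target[key]) for key in shared]


# Invented sample data; the second English entry contains a stray newline.
english = "4161\tControl Panel\n4162\tShut\ndown\n"
chinese = "4161\t控制台\n4162\t關機\n9999\t其他\n"
pairs = align(parse_strings(english), parse_strings(chinese))
for key, en, zh in pairs:
    print(key, en, zh)
```

This sketch would still misread a continuation line that happens to start with digits and a tab, and it says nothing about file encoding, so treat it as a starting point only.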
Appendix 2: Chinese Dialects

After you've tried learning Chinese, everything else seems easy. As part of Pingtype, I've gathered a lot of bilingual English/Chinese data. As well as the train timetable, Bible, restaurant menu, 200 Christian song lyrics, and movie subtitles on the website, I have much more on my local machine, including bilingual English/台語 songs. Please contact me if you're interested in that.

In addition to Mandarin, I'm personally interested in Minnan/Hokkien/Taiwanese/台語 and Hakka, because my girlfriend's mum speaks 台語 and her dad speaks Hakka. The difference between English (United States) and English (United Kingdom) is relatively minor, although we joke about the words we misunderstand. That's not true of Chinese dialects. Cantonese and Taiwanese are not mutually intelligible with Mandarin.