How much can you squeeze in 140? Arabic vs English vs French vs Chinese
(This is a loose and abridged translation of the original piece by Grégoire Fleurot, published in French on slate.fr)
Since the demonstrations in Iran last year, Twitter has become an increasingly important tool for activists needing to “get the message out”. Since its key feature is a limit of 140 characters per message, whatever the alphabet we use, it is interesting to compare languages and find out which one lets you squeeze the most information into 140.

Arabic: Gold – English: Silver – French: Bronze
In English and 130 characters you can say:
«I just came back from Tahrir square. Everyone there was calling for Mubarak to leave. Peaceful atmosphere, no policemen to be seen»
The French equivalent message needs more space and uses the entire 140:
«Je viens de rentrer de la Place Tahrir. Tout le monde y réclamait le départ de Moubarak. Ambiance pacifique, il n’y a pas de policier en vue»
Whilst the Arabic translation only needs 93:
«لقد رجعت للتو من ميدان التحرير. الجميع تطالب برحيل مبارك. جو هادئ. لايوجد شرطي في مرمى البصر»
On the other hand if you wanted to make use of the full 140 in Arabic you could then say:
“عدت للتو من ميدان التحرير. الجميع يطالبون برحيل مبارك على الفور. وهناك أناس من جميع الأعمار والفئات الاجتماعية والقرى والمدن. مصر كلّها هنا.”
Which becomes a longer 197 in English:
“I just came back from Tahrir Square. Everyone there was calling for Mubarak to leave immediately. There were people of all ages and social classes, from cities and villages. All of Egypt was there.”
Whilst the French equivalent becomes a whopping 218:
«Je viens de rentrer de la Place Tahrir. Tout le monde y réclamait le départ immédiat de Moubarak. Il y avait des gens de tous âges et de toutes les classes sociales, des villes et des villages. Toute l’Egypte était là.»
So Arabic appears to be more concise, followed by English, whilst French is... well... more verbose.
The main reason for these differences lies in the structure of Arabic words: short vowels are usually not written, so most words consist of 3 to 6 consonants. Arabic also makes frequent use of nominal sentences, which tend to be more compact than verbal ones. A classic example of the difference, in English: “The more, the merrier” as opposed to “The more we are, the merrier we are”.
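The counts quoted above are characters, not bytes. A minimal sketch of how one might reproduce them in Python, assuming that one Unicode code point counts as one character (this is how `len()` works on Python strings, and only an approximation of Twitter's own counting rules):

```python
# Count characters as Unicode code points, the way len() does in Python.
# Assumption: one code point = one character of the 140 limit.

messages = {
    "English": "I just came back from Tahrir square. Everyone there was "
               "calling for Mubarak to leave. Peaceful atmosphere, no "
               "policemen to be seen",
    "French": "Je viens de rentrer de la Place Tahrir. Tout le monde y "
              "réclamait le départ de Moubarak. Ambiance pacifique, il "
              "n’y a pas de policier en vue",
    "Arabic": "لقد رجعت للتو من ميدان التحرير. الجميع تطالب برحيل مبارك. "
              "جو هادئ. لايوجد شرطي في مرمى البصر",
}


def char_count(text: str) -> int:
    """Number of Unicode code points in the message."""
    return len(text)


counts = {lang: char_count(msg) for lang, msg in messages.items()}

# Print the languages from most to least concise.
for lang, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{lang}: {n} characters")
```

Swapping in the longer 140-character Arabic message and its translations gives the second comparison in the same way.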
English wins tie-breaker thks 2 SMS Txt talk
However “Shakespeare’s language” has another advantage besides its inherent grammar. The dominance of English as the lingua franca of the internet has led to the widespread use of abbreviations and acronyms, which other languages have not necessarily embraced so widely. We are all familiar with the classics: “for” = 4, “to” = 2, “be” = b, “are” = r, etc.
This is where English gets its edge. For instance in Arabic the word «government» (10 characters) is written with 7 signs: الحكومة whereas the English abbreviation «gov» only uses 3 characters.
So we can shrink our Tahrir message to 86 without altering its *understandability*:
«Bck from Tahrir sqre. Every1 was callin 4 Mubarak 2 go. Peaceful atmosphere, no police»
However the Platinum winner of the information/data ratio is the Chinese language:
A good example is the following piece in 139 Mandarin characters from wikipedia …:
1960年代中苏关系破裂,社会主义阵营解散,共和国从此走上了完全独立发展的道路,并积极与在亚、非、拉三大洲的发展中国家建立和发展友好关系,并陆续得到了英国、法国和以色列等西方国家的承认。但是美国仍然承认在台湾地区的中華民國政府为中国的中央政府,对中华人民共和国采取孤立封锁政策。
… which takes more than 500 characters to translate into English:
“In the 1960s the Sino-Soviet split led to the dissolution of the socialist camp. The People's Republic then embarked on a completely independent development path and actively worked to establish and develop friendly relations with developing countries in Asia, Africa and Latin America. It was gradually recognised by Britain, France, Israel and other Western countries. The United States, however, still recognised the ROC government in Taiwan as the central government of China, and pursued a policy of isolating and blockading the PRC.”

Would love to see the Finnish translations as I suspect they would win as the longest.
I have occasionally translated a long (>140c) English tweet into Chinese (via Google Translate, deliberately), checked the back-translation (also via GT), then tweeted it in Chinese together with the GT link (ideally, shortened), so that my followers — if they can be bothered, of course — get the full message by clicking through. It works, but the main problem is that URLs (URL shorteners in particular) don’t support Unicode fully, so trying to encode things like bit.ly/… => translate.google.com/#zh-CN|en|此是試 doesn’t always work. Still, there must be an opportunity here for someone to write an app that does it all automatically, as the process is so simple.
just kidding, but russian version (not a direct translation though, but without sms lingo) “Пришел с Тахрира. Там все требовали ухода Мубарака. Всё спокойно, полиции нет.” – 79 chars
Now that I think about it, can you find any studies about comprehension time for the various languages? Does it take meaningfully longer to read a more compressed language?
Must not get distracted by this question. I’ve got to edit my gorram dissertation.
@LeLassiezFaire
Neat. Looking here: http://en.wikipedia.org/wiki/SMS#Message_size SMS providers use 7-bit, 8-bit, and 16-bit alphabets. “Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual short message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters (including spaces). GSM 7-bit alphabet support is mandatory for GSM handsets and network elements,[27] but characters in languages such as Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g. Russian, Serbian, Bulgarian, etc.) must be encoded using the 16-bit UTF-16 character encoding (see Unicode). Routing data and other metadata is additional to the payload size.”
So, the answer is as ever “it depends”
Your 93-character Arabic message would need to be a multi-part SMS, though Twitter counts by the character rather than the byte. The question then becomes: is it possible to employ any kind of linguistic compression on the Arabic to overcome the technical doubling of each character?
OK, so notwithstanding that what really matters – I insist – is how much natural language with real words you can fit in 140 real characters, let’s assume we want to be real nerds.
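The SMS limits quoted above all follow from the same 140-byte payload. A rough sketch of the arithmetic in Python, assuming a 6-byte header per part for multi-part messages (a common value for concatenated SMS, but an assumption here) and ignoring GSM escape characters that take two 7-bit slots:

```python
import math

PAYLOAD_BITS = 140 * 8  # one SMS carries 140 bytes = 1120 bits


def single_sms_capacity(bits_per_char: int) -> int:
    """Characters that fit in one SMS for a given alphabet width."""
    return PAYLOAD_BITS // bits_per_char


def sms_parts(n_chars: int, bits_per_char: int, header_bytes: int = 6) -> int:
    """Number of SMS parts needed for a message of n_chars characters.

    header_bytes is the assumed per-part overhead of a multi-part message.
    """
    if n_chars <= single_sms_capacity(bits_per_char):
        return 1
    per_part = (PAYLOAD_BITS - header_bytes * 8) // bits_per_char
    return math.ceil(n_chars / per_part)


for width in (7, 8, 16):
    print(f"{width}-bit alphabet: {single_sms_capacity(width)} characters per SMS")

print("130 English characters (7-bit):", sms_parts(130, 7), "part(s)")
print("93 Arabic characters (16-bit):", sms_parts(93, 16), "part(s)")
```

Under these assumptions the 130-character English message fits in a single SMS, while the 93-character Arabic one needs two parts.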
Indeed the UTF-8 page on the most authoritative source of knowledge these days does clarify things: http://en.wikipedia.org/wiki/UTF-8
Yes, it says: “the first 128 characters (US-ASCII) need ONLY ONE byte (basically the Latin alphabet). The next 1,920 characters need TWO bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets.”
So in UTF-8 Arabic needs two bytes per letter, against one for unaccented Latin letters.
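To see the difference in bytes, here is a small Python sketch that encodes the word «government» from the article in English and Arabic, plus the Chinese 政府 taken from the Mandarin passage. The helper name `encoded_sizes` is made up for this illustration:

```python
def encoded_sizes(text: str) -> dict:
    """Character count and encoded size of text in UTF-8 and UTF-16."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte-order mark is added.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


samples = {
    "English": "government",
    "Arabic": "الحكومة",
    "Chinese": "政府",
}

for lang, word in samples.items():
    print(lang, encoded_sizes(word))
```

The Arabic word is shorter in characters than the English one but larger in UTF-8 bytes, and each Chinese character takes three bytes in UTF-8.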
However, let’s not stop there. I was starting to get all excited and was about to go back to my college courses and throw big words in the air like Shannon Theorem, Fourier Transform or Huffman Coding when I stumbled across this paper comparing Huffman coding compression on English and Arabic texts: http://www.scipub.org/fulltext/jcs/jcs212885-888.pdf
Results show that the average message length and the efficiency of compression on Arabic text is better than the compression on English text. Arabic wins again.
It is also interesting to note that the compression ratio (= compressed size / uncompressed size) decreases (i.e. compression is better) as the file size increases. This is expected because when the file size increases, so will the frequency of the symbols: therefore we expect to have better compression on large files.
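For the curious, here is a minimal Huffman coding sketch in Python that computes the average code length and the compression ratio as defined above. It is not the paper's implementation: it assumes 8 bits per symbol for the uncompressed text and ignores the cost of storing the code table, and the sample string is made up.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Map each symbol of text to the length (in bits) of its Huffman code."""
    if not text:
        raise ValueError("text must not be empty")
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie-breaker, {symbol: depth in the tree so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def average_code_length(text: str) -> float:
    """Average number of bits per symbol after Huffman coding."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[s] * lengths[s] for s in freq) / len(text)


def compression_ratio(text: str, bits_per_symbol: int = 8) -> float:
    """Compressed size / uncompressed size, ignoring the code table."""
    return average_code_length(text) / bits_per_symbol


sample = "aaaabbc"  # toy input, not taken from the paper
print(huffman_code_lengths(sample))
print(compression_ratio(sample))
```

Running it on real English and Arabic texts of similar size would be one way to check the paper's claim for yourself.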
Feel better?
Regarding the Mandarin example at the end… perhaps that’s why China sees sites like Twitter as so much more of a threat…
True about the sms-part (it will be half or quarter of the available size), but twitter doesn’t honour that…
I shared this on my google reader feed, and one of my friends commented:
Kazriko Redclaw – That’s all well and good, but when converted to unicode and sent over SMS, how big is 97 arabic characters? How about the 139 Mandarin characters? I’d think there’s more than 8 bits of data for every character in the list. SMS is only a 6-bit format IIRC
Can you address the question of the *fundamental* translation of all of these languages to bits? In comparison, you may even want to provide visual representations of the bitstream of each to show the technical compression as well as the linguistic compression of each language.