The Other School of Economics

How much can you squeeze in 140? Arabic vs English vs French vs Chinese

(This is a loose and abridged translation of the original piece by Grégoire Fleurot, published in French on slate.fr)

Since the demonstrations in Iran last year, Twitter has become an increasingly important tool for activists needing to “get the message out”. Because its key feature is a 140-character limit on messages, whatever the alphabet used, it is interesting to compare languages and find out which one lets you squeeze the most information into 140.


Arabic: Gold – English: Silver – French: Bronze

In English and 130 characters you can say:

«I just came back from Tahrir square. Everyone there was calling for Mubarak to leave. Peaceful atmosphere, no policemen to be seen»

The French equivalent message needs more space and uses the entire 140:

«Je viens de rentrer de la Place Tahrir. Tout le monde y réclamait le départ de Moubarak. Ambiance pacifique, il n’y a pas de policier en vue»

Whilst the Arabic translation only needs 93:

«لقد رجعت للتو من ميدان التحرير. الجميع تطالب برحيل مبارك. جو هادئ. لايوجد شرطي في مرمى البصر»

On the other hand, if you wanted to make use of the full 140 in Arabic, you could say:

“عدت للتو من ميدان التحرير. الجميع يطالبون برحيل مبارك على الفور. وهناك أناس من جميع الأعمار والفئات الاجتماعية والقرى والمدن. مصر كلّها هنا.”

Which becomes a longer 197 in English:

“I just came back from Tahrir Square. Everyone there was calling for Mubarak to leave immediately. There were people of all ages and social classes, from cities and villages. All of Egypt was there.”

Whilst the French equivalent becomes a whopping 218:

«Je viens de rentrer de la Place Tahrir. Tout le monde y réclamait le départ immédiat de Moubarak. Il y avait des gens de tous âges et de toutes les classes sociales, des villes et des villages. Toute l’Egypte était là.»

So Arabic appears to be the most concise, followed by English, whilst French is... well... more verbose.
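The counts above are easy to reproduce. Here is a minimal Python sketch, assuming that each Unicode code point counts as one character after NFC normalisation (an assumption about how the 140 limit is applied, not something the original piece specifies). The names `TWEET_LIMIT` and `tweet_length` are made up for the illustration.

```python
import unicodedata

TWEET_LIMIT = 140  # the limit quoted in the article


def tweet_length(text: str) -> int:
    """Count one character per Unicode code point.

    Assumption: the text is NFC-normalised first, so that an accented
    letter such as 'é' counts once however it was typed.
    """
    return len(unicodedata.normalize("NFC", text))


# The three short messages quoted in the article.
messages = {
    "English": ("I just came back from Tahrir square. Everyone there was calling "
                "for Mubarak to leave. Peaceful atmosphere, no policemen to be seen"),
    "French": ("Je viens de rentrer de la Place Tahrir. Tout le monde y réclamait "
               "le départ de Moubarak. Ambiance pacifique, il n’y a pas de policier en vue"),
    "Arabic": "لقد رجعت للتو من ميدان التحرير. الجميع تطالب برحيل مبارك. جو هادئ. لايوجد شرطي في مرمى البصر",
}

for language, text in messages.items():
    used = tweet_length(text)
    print(f"{language}: {used} characters, {TWEET_LIMIT - used} to spare")
```

The Arabic figure may come out a character off from the one quoted, depending on how the text was typed or copied.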

The main reason for these differences lies in the structure of Arabic words: short vowels are usually left unwritten, leaving 3 to 6 consonants per word. Arabic also makes frequent use of nominal sentences, which tend to be more compact than verbal ones. A classic example in English: “The more, the merrier” as opposed to “The more we are, the merrier we are”.

English wins tie-breaker thks 2 SMS Txt talk

However, “Shakespeare’s language” has another advantage besides its inherent grammar. The dominance of English as the lingua franca of the internet has led to the widespread use of abbreviations and acronyms, which other languages have not necessarily embraced so widely. We are all familiar with the classics: “for” = 4, “to” = 2, “be” = b, “are” = r, etc.

This is where English gets its edge. For instance, the word «government» (10 characters in English) is written with 7 characters in Arabic: الحكومة, whereas the English abbreviation «gov» uses only 3.

So we can shrink our Tahrir message to 86 without altering its *understandability*:

«Bck from Tahrir sqre. Every1 was callin 4 Mubarak 2 go. Peaceful atmosphere, no police»
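The substitutions above amount to a word-for-word lookup table. Here is a rough Python sketch using only abbreviations quoted in this piece. The `TXT_SPEAK` table and the `abbreviate` function are hypothetical names, and real txt talk also rephrases and drops words, which this does not attempt.

```python
import re

# Abbreviations quoted in the article (for/to/be/are) plus a few lifted from
# its shortened Tahrir tweet. The table is an illustration, not a standard.
TXT_SPEAK = {
    "for": "4", "to": "2", "be": "b", "are": "r",
    "everyone": "every1", "back": "bck", "square": "sqre", "calling": "callin",
}

_WORD = re.compile(r"[A-Za-z]+")


def _swap(match: re.Match) -> str:
    """Return the abbreviation for a word, keeping an initial capital."""
    word = match.group()
    short = TXT_SPEAK.get(word.lower())
    if short is None:
        return word  # not in the table: leave the word untouched
    return short.capitalize() if word[0].isupper() else short


def abbreviate(text: str) -> str:
    """Replace every whole word found in TXT_SPEAK."""
    return _WORD.sub(_swap, text)


message = ("I just came back from Tahrir square. Everyone there was calling "
           "for Mubarak to leave. Peaceful atmosphere, no policemen to be seen")
short = abbreviate(message)
print(short)
print(len(message), "->", len(short))
```

Mechanical substitution alone takes the 130-character message down to 119; getting to 86 takes the human rephrasing shown in the shortened tweet.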

However, the Platinum winner for information per character is Chinese:

A good example is the following passage from Wikipedia, in 139 Chinese characters …:

1960年代中苏关系破裂,社会主义阵营解散,共和国从此走上了完全独立发展的道路,并积极与在亚、非、拉三大洲的发展中国家建立和发展友好关系,并陆续得到了英国、法国和以色列等西方国家的承认。但是美国仍然承认在台湾地区的中華民國政府为中国的中央政府,对中华人民共和国采取孤立封锁政策。

… which would be translated into 490 English characters:

“1960 was the year of the Sino-Soviet split, the dissolution of the socialist camp. The Chinese Republic embarked on a completely separate development path and actively worked to establish and develop friendly relations with countries in Asia, Africa, and Latin America. It was gradually recognised by Britain, France, Israel and other Western countries. The United States were still recognising the ROC government in Taiwan against the Chinese central government, to counter a PRC blockade.”
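To put the quoted figures side by side, here is a small Python sketch that expresses each version's length relative to the shortest one. It uses only the character counts stated in this piece; the `expansion` helper is a hypothetical name, and two text samples are of course far too few to rank languages in general.

```python
# Character counts quoted in the article for two sets of equivalent texts.
quoted_counts = {
    "Tahrir message": {"Arabic": 140, "English": 197, "French": 218},
    "Wikipedia passage": {"Chinese": 139, "English": 490},
}


def expansion(counts: dict) -> dict:
    """Length of each version relative to the shortest version."""
    shortest = min(counts.values())
    return {lang: round(n / shortest, 2) for lang, n in counts.items()}


for label, counts in quoted_counts.items():
    print(label, expansion(counts))
```

On these figures, English needs about 1.4 times as many characters as Arabic, French about 1.6 times, and English about 3.5 times as many as Chinese.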

Original piece in French from Grégoire Fleurot in slate.fr


10 Comments

  • Would love to see the Finnish translations as I suspect they would win as the longest.

  • I have occasionally translated a long (>140c) English tweet into Chinese (via Google Translate, deliberately), checked the back-translation (also via GT), then tweeted it in Chinese together with the GT link (ideally, shortened), so that my followers — if they can be bothered, of course — get the full message by clicking through. It works, but the main problem is that URLs (URL shorteners in particular) don’t support Unicode fully, so trying to encode things like bit.ly/… => translate.google.com/#zh-CN|en|此是試 doesn’t always work. Still, there must be an opportunity here for someone to write an app that does it all automatically, as the process is so simple.

  • just kidding, but russian version (not a direct translation though, but without sms lingo) “Пришел с Тахрира. Там все требовали ухода Мубарака. Всё спокойно, полиции нет.” – 79 chars :)

  • Now that I think about it, can you find any studies about comprehension time for the various languages? Does it take meaningfully longer to read a more compressed language?

    Must not get distracted by this question. I’ve got to edit my gorram dissertation.

  • @LeLassiezFaire

    Neat. Looking here: http://en.wikipedia.org/wiki/SMS#Message_size SMS providers use 7-bit, 8-bit, and 16-bit alphabets. “Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual short message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters (including spaces). GSM 7-bit alphabet support is mandatory for GSM handsets and network elements, but characters in languages such as Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g. Russian, Serbian, Bulgarian, etc.) must be encoded using the 16-bit UTF-16 character encoding (see Unicode). Routing data and other metadata is additional to the payload size.”

    So, the answer is as ever “it depends” :) Your 93 character arabic message would need to be a multi-part SMS, though twitter counts by the character rather than the byte. The question then becomes, is it possible to employ any kind of linguistic compression on the Arabic to overcome the technical doubling of each character?

  • Ok, so notwithstanding that what really matters – I insist – is how much natural language with real words you can fit in 140 real characters, let’s assume we want to be real nerds.

    Indeed the UTF-8 page on the most authoritative source of knowledge these days does clarify things: http://en.wikipedia.org/wiki/UTF-8
    Yes, it says: “the first 128 characters (US-ASCII) need ONLY ONE byte (basically the Latin alphabet). The next 1,920 characters do need TWO bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets.”
    So Arabic does need more bytes per character than English.

    However, let’s not stop there. I was starting to get all excited and was about to go back to my college courses and throw big words in the air like Shannon Theorem, Fourier Transform or Huffman Coding when I stumbled across this paper comparing Huffman coding compression on English and Arabic texts: http://www.scipub.org/fulltext/jcs/jcs212885-888.pdf

    The results show that both the average message length and the compression efficiency are better for Arabic text than for English text. Arabic wins again.

    It is also interesting to note that the compression ratio (= compressed size / uncompressed size) decreases (i.e. compression is better) as the file size increases. This is expected because when the file size increases, so will the frequency of the symbols: therefore we expect to have better compression on large files.

    Feel better?

  • Regarding the Mandarin example at the end… perhaps that’s why China sees sites like Twitter as so much more of a threat… :)

  • True about the sms-part (it will be half or quarter of the available size), but twitter doesn’t honour that…

  • I shared this on my google reader feed, and one of my friends commented:

    Kazriko Redclaw – That’s all well and good, but when converted to unicode and sent over SMS, how big is 97 arabic characters? How about the 139 Mandarin characters? I’d think there’s more than 8 bits of data for every character in the list. SMS is only a 6-bit format IIRC

    Can you address the question of the *fundamental* translation of all of these languages to bits? In comparison, you may even want to provide visual representations of the bitstream of each to show the technical compression as well as the linguistic compression of each language.
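The byte-level questions raised in the comments can be checked directly. Below is a minimal Python sketch that reports, for a given text, its character count, its size in UTF-8 and UTF-16, and how many SMS parts it would need. The 160 and 70 limits are the ones quoted above from Wikipedia; the 153 and 67 characters per part of a multi-part message are not in the quote and reflect the usual overhead of a concatenation header. As a loudly stated simplification, pure-ASCII text is treated as fitting the GSM 7-bit alphabet, which is not exactly true of the real alphabet. `encoded_sizes` is a hypothetical helper name.

```python
import math


def encoded_sizes(text: str) -> dict:
    """Report the size of a message at several layers."""
    utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit code units

    # Simplifying assumption: pure-ASCII text fits the GSM 7-bit alphabet.
    # The real GSM alphabet differs from ASCII for a handful of characters.
    if text.isascii():
        alphabet, single, per_part, units = "GSM 7-bit", 160, 153, len(text)
    else:
        alphabet, single, per_part, units = "16-bit", 70, 67, utf16_units

    # One message if it fits; otherwise each part loses room to a header.
    parts = 1 if units <= single else math.ceil(units / per_part)

    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": 2 * utf16_units,
        "sms_alphabet": alphabet,
        "sms_parts": parts,
    }


# Words and names taken from the article.
for sample in ("government", "الحكومة", "中华人民共和国"):
    print(sample, encoded_sizes(sample))
```

Under these assumptions, a 93-character Arabic message needs two SMS parts while the 130-character English one fits in a single part. In UTF-8, each Arabic letter takes two bytes and each Chinese character three, so the per-character advantage shrinks once everything is counted in bytes.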
