Google Says ...

An unofficial, unaffiliated source of comment and opinion on statements from Google, Google employees, and Google representatives. In no way is this site owned by, operated by, or representative of Google, Google's point of view, policies, or statements.

Wednesday, August 02, 2006

How do you say that again?

Back in April, Google Research announced their beta Arabic-English translation tools. I performed a simple translation test, one which exemplifies just how difficult it is for people to use online translation tools. Let me share an anecdote with you first before I reveal the test and my results.

A few years ago, I had a very popular Web site. I called it Parma Endorion. Originally created in the fall of 1996, Parma Endorion was just a collection of a handful of essays I wrote about randomly selected topics concerning J.R.R. Tolkien's Middle-earth. From 1996 to 1998, I received hundreds of emails from teachers, librarians, and students around the world asking me how they could print out the essays (I had deliberately made this difficult to do). It finally dawned on me that I should stop being so intellectually proprietary and let people print the essays.

So in 1998 I redesigned the site to work more like a book (parma is the Elvish word for "book" in Tolkien's invented languages). Well, my email exploded with thank yous for a while. And then something wonderful happened. People started linking to Parma Endorion all over the place. Problem was, new research was showing me that the essays needed serious updating. And my readers wanted a sequel to Visualizing Middle-earth that I didn't have time to write. So in 2001 I arranged with Matt Tinaglia to update Parma Endorion and offer the third edition as a free eBook. All this is pretty well documented.

As the original essays had been translated into Polish and Italian, I thought it would be cool to translate the eBook. I contacted various overseas Tolkien groups and asked for help. I used a well-known online translation tool to write my letters of invitation. I translated the letters sentence-by-sentence, translating them back to English, repeating the process as often as necessary -- changing words where the translations didn't work -- until I found some consistency.
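The translate-and-translate-back loop can be sketched in a few lines of Python. This is only a rough illustration: the two word tables are invented stand-ins for whatever online tool is being used (deliberately lossy, so that "ensayo" comes back as "rehearsal"), and the function names are hypothetical, not part of any real translation service.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Toy word-for-word tables standing in for a real translation tool.
# They are invented for illustration and deliberately lossy:
# "ensayo" comes back as "rehearsal" instead of "essay".
EN_TO_ES: Dict[str, str] = {
    "the": "el", "essay": "ensayo", "article": "artículo",
    "is": "es", "short": "corto",
}
ES_TO_EN: Dict[str, str] = {
    "el": "the", "ensayo": "rehearsal", "artículo": "article",
    "es": "is", "corto": "short",
}

Translator = Callable[[str], str]


def word_for_word(text: str, table: Dict[str, str]) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(table.get(word, word) for word in text.lower().split())


def to_spanish(text: str) -> str:
    return word_for_word(text, EN_TO_ES)


def to_english(text: str) -> str:
    return word_for_word(text, ES_TO_EN)


def round_trip(sentence: str, forward: Translator,
               backward: Translator) -> Tuple[str, str, bool]:
    """Translate out and back; report whether the same words came back."""
    translated = forward(sentence)
    returned = backward(translated)
    stable = returned.lower().split() == sentence.lower().split()
    return translated, returned, stable


def first_stable(candidates: Iterable[str], forward: Translator,
                 backward: Translator) -> Optional[Tuple[str, str]]:
    """Return the first rewording that survives the round trip, if any."""
    for sentence in candidates:
        translated, _, stable = round_trip(sentence, forward, backward)
        if stable:
            return sentence, translated
    return None
```

Calling `first_stable(["the essay is short", "the article is short"], to_spanish, to_english)` skips the first wording and settles on the second. Keep in mind that a stable round trip only shows the tool is consistent with itself. It does not show that the intended sense of a word survived the trip.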

It's a brutal technique but it works for the most part. However, I just didn't pay close enough attention to the Spanish language translations. I can actually read Spanish to some degree. I used to read the Miami Herald's Spanish edition every day, so I really had no excuse for what happened. But I didn't notice that the tool had translated "fans" (as in devoted readers of Tolkien's books) to "ventiladores" (ventilators -- those rotating things that push air around). Well, when the Spanish ventiladores stopped laughing, one of them wrote back and said, "Yes, it's obvious you need help, and we'll be glad to help you."

Thus was born the Spanish translation of Parma Endorion.

So, what is the significance of all that? English has a fairly standardized spelling system. We have differences between the United States and areas of the British Commonwealth. For example, we write "color" and they write "colour". But for the most part English is very standardized. One area of notable exception, however, is the rendering of certain foreign languages into English. Arabic names in particular have multiple renderings in English. Take the terrorist organization Hezbollah. Or is that Hizballa? Or was that supposed to be Hizbullah? You can see the variation in spellings just by switching news sources in your Web browser.

So what does Google do with these kinds of variations in spelling? I decided to type a phrase into the English to Arabic tool using a variation of Hizballah's name. When I ran the Arabic output back through the Arabic to English tool, it returned the exact English meaning of my phrase, but with a different spelling: "Hizbullah".

I wondered what their authoritative source for that spelling might be. I suppose there may be a book of translation standards running around, but even the Israeli news media cannot agree on how to render the name into English. The Jerusalem Post uses Hizbullah (like Google) and Haaretz uses Hezbollah. So what does Google do to normalize English spellings of non-English words?
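To make the normalization question concrete, here is a rough Python sketch of one crude way to treat the spellings above as the same name. It is purely an assumption for illustration and says nothing about what Google actually does. The `skeleton` function and its rules (drop vowels, collapse doubled letters, drop a trailing "h") are invented here, and a heuristic this blunt would also merge unrelated words in real use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def skeleton(name: str) -> str:
    """Reduce a romanized name to a rough consonant skeleton.

    Hypothetical heuristic: lowercase, keep letters only, drop vowels,
    collapse repeated consonants, and drop a trailing "h".
    """
    consonants = [c for c in name.lower() if c.isalpha() and c not in "aeiou"]
    collapsed: List[str] = []
    for c in consonants:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    key = "".join(collapsed)
    if len(key) > 1 and key.endswith("h"):
        key = key[:-1]
    return key


def group_variants(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group spellings that share the same skeleton."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[skeleton(name)].append(name)
    return dict(groups)


SPELLINGS = ["Hezbollah", "Hizballa", "Hizbullah", "Hizballah"]
GROUPS = group_variants(SPELLINGS)
```

All four spellings land in a single group here. Which spelling to display for that group is a separate editorial choice, and that choice is exactly the part I was wondering about.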

While it may seem to some people that I'm getting lost in the details, the SEO world should pay some attention to what Google does with its translation technologies. Spellings are only one aspect of the challenges that face translators. Idiom -- the way we form phrases -- causes an even bigger headache. Today we say "I'm down with that" to mean "that's cool by me" which used to be "I'm okay with that" which replaced "I most heartily agree" which was subsequent to "I approve in all aspects".

When you think about how the Google tool translates Web pages, which may or may not use non-standard idiom, or obsolete idiom, think about how phrase normalization will become very important for Google. If they can accurately render whole passages of text into foreign languages (better than the older tool we're all so familiar with), Google will have taken online translation to a new level.

Furthermore, the significance of idiomatic translation is that a core relevance standard can be established. Think of a universal language that underlies all of our human languages. Linguists have been seeking the means of tying all human languages together for decades, perhaps centuries. They are not really close to doing that, but once they achieve a Unified Human Language Theory, they'll be able to offer translators new techniques and tools for determining what unusual passages may mean.

"may mean" is in itself significant. Language doesn't simply rely upon words and spelling and expressions. For example, nearly every human language -- if not indeed all of them -- incorporates metaphor to some degree. That is, we can use the phrase "hatching of the ugly duckling" to refer to the birth of something other than a duck. If you're a native Turkish language user and you have to translate a paper that uses "hatching of the ugly duckling", how do you determine what that expression is really referring to?

One of my Tolkien essays, "Is your canon on the loose", was translated into Hebrew a couple of years ago. The translator could not replicate the pun in my title ("canon" refers to the authoritative body of texts used for Tolkien research, but the title is styled on the popular idiomatic expression, "He is a loose cannon" -- "canon" and "cannon" are pronounced exactly the same way). My translator sensed the connection but did not fully appreciate it, and after I explained how the joke worked, he said, "We have no similar expression in Hebrew".

After giving the matter some thought, he chose to retitle the essay (with my approval, as well as my permission, after conferring with me and getting my opinion) "Choir of a thousand voices" (which is a much closer rendering of the actual Hebrew title than he could come to my original meaning). It was an appropriate choice for a twisted metaphor, as the core meaning for both expressions is quite the same thing. "Is your canon on the loose" refers to the fact that Tolkien canon discussions are very fluid -- I freely admit to changing contexts and canons on a frequent, mind-boggling basis. I have to because no two people use the same parameters.

So online translation tools are going to hit the same walls that human translators hit, and how those tools make their choices will be very important to search engine optimization specialists. Why is that? Because even though it's not presently possible for search engines to truly practice semantic indexing, that is exactly what they hope to achieve some day.

Semantic indexing would allow a search engine to capture a user's query in any language, any jargon, any idiomatic context and find all relevant documents -- in any language, any jargon, any idiomatic context. Isn't placing the world's collective knowledge at your fingertips one of Google's stated objectives? Hey, whether they can or cannot do it really doesn't matter right now. They are trying to do it.

Hopefully, they'll take whatever lessons they learn from these translation projects and apply them to working with user queries and page indexing.

But beware what you ask for. Today, the real estate for Web page design and optimization is wide open. If you cannot dominate one expression, you can dominate another, similar expression and then brand that similar expression into your targeted market. Teach them to search for the phrases you dominate and you blow your competitor out of the water while he is still gloating over his number 1 ranking for the phrases you cannot touch.

If Google creates the Universal Translation Tool, they'll be able to substitute one expression for another in resolving queries. One might hope they would allow for an exact find search anyway (how else would one find an exact passage one wants to retrieve?). But if that day ever comes, optimizing for specific expressions will take a back seat to optimizing for concepts.
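As a toy picture of what substituting one expression for another might look like, the Python sketch below resolves a query by concept rather than by wording, with an exact-phrase switch for the case where you want one specific passage. Everything in it is assumed for illustration: the concept table, the sample documents, and the `search` function are hypothetical, and no real search engine is claimed to work this way.

```python
from typing import Dict, List, Set

# Hypothetical concept table: several expressions, one underlying idea.
CONCEPTS: Dict[str, Set[str]] = {
    "agreement": {
        "i'm down with that",
        "i'm okay with that",
        "i most heartily agree",
        "i approve in all aspects",
    },
}

# Invented sample documents.
DOCS: Dict[str, str] = {
    "a": "She said, I'm down with that.",
    "b": "I most heartily agree, he wrote.",
    "c": "Nobody replied.",
}


def concepts_in(text: str) -> Set[str]:
    """Return every concept whose expressions appear in the text."""
    lowered = text.lower()
    return {
        concept
        for concept, phrases in CONCEPTS.items()
        if any(phrase in lowered for phrase in phrases)
    }


def search(query: str, documents: Dict[str, str],
           exact: bool = False) -> List[str]:
    """Match by literal phrase when exact is True, otherwise by concept."""
    if exact:
        needle = query.lower()
        return sorted(d for d, text in documents.items()
                      if needle in text.lower())
    wanted = concepts_in(query)
    return sorted(d for d, text in documents.items()
                  if wanted & concepts_in(text))
```

With these toy documents, `search("I'm okay with that", DOCS)` matches both "a" and "b" even though neither contains those words, while the same query with `exact=True` matches nothing. In a world like that, ranking for one particular phrase matters much less than being recognized for the concept behind it.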

Concept optimization is in its barest infancy right now. Our methodologies are clumsy and rely mostly on brute force. Things will remain that way until the search engines become more sophisticated in language analysis and translation. When that happens, the rules will become more clear.

Until then, we'll have to keep an eye on things, standing watch over the camp, monitoring their progress, staying apace with their technological developments, matching their innovations with our improvisations.

We have to stay in touch on the issue.

Got that?


Elena Temnova said...

As far as I know, some modern translation systems can manage the problem of national dialects (e.g. British and American English, European Spanish and Latin American Spanish, etc.). Users can choose between these variants, but this feature is supported only by full-function desktop versions of translation software, whereas the online translators are much more restricted in their functionality.

This problem has two aspects.

Firstly, there is the correct interpretation of the source text. A translation program must surely recognize all the orthographic variants of a word, such as "colour" and "color", but some words that are written and spelled in the same way have different translations depending on the country, for example the well-known "fall" in American and in British English. Some smart software can automatically recognize not only the language (it's not difficult to tell English from German or French by comparing the text with a list of the most frequently used words) but also the national variant, provided the text in question is long enough and contains some specific words.

Secondly, the program has to generate the target text using the rules of the required national variant of the target language (selected by default or chosen by the user).

2:23 AM  
