Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.
This fairly describes my roller coaster experience of curiosity, wonder and disappointment over the past few weeks, as I’ve worked alongside security researchers in an effort to understand how “lorem ipsum,” the common placeholder text on countless Web sites, could be transformed into so many apparently geopolitical and startlingly modern phrases when translated from Latin to English using Google Translate. (If you have no idea what “lorem ipsum” is, skip ahead to the brief primer below.)
Admittedly, this blog post would make more sense if readers could fully replicate the results described below using Google Translate. However, as I’ll explain later, something important changed in Google’s translation system late last week that currently makes the examples I’ll describe impossible to reproduce.
It all started a few months back when I received a note from Lance James, head of cyber intelligence at Deloitte. James pinged me to share something discovered by FireEye researcher Michael Shoukry and another researcher who wished to be identified only as “Kraeh3n.” They noticed a bizarre pattern in Google Translate: When one typed “lorem ipsum” into Google Translate, the default results (with the system auto-detecting Latin as the language) returned a single word: “China.”
Capitalizing the first letter of each word changed the output to “NATO” — the acronym for the North Atlantic Treaty Organization. Reversing the words in both lower- and uppercase produced “The Internet” and “The Company” (the “Company” with a capital “C” has long been a code word for the U.S. Central Intelligence Agency). Repeating and rearranging the word pair with a mix of capitalization generated even stranger results. For example, “lorem ipsum ipsum ipsum Lorem” generated the phrase “China is very very sexy.”
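For readers who want to repeat this kind of probing against any translation service, the capitalization, ordering and repetition variants described above can be enumerated programmatically. What follows is only a minimal sketch: the function names and the `max_len` parameter are illustrative assumptions, not anything the researchers used, and the script only generates the input strings; it does not contact Google Translate.

```python
from itertools import product


def case_variants(word):
    """Return the lowercase and capitalized forms of a word."""
    return [word.lower(), word.capitalize()]


def phrase_variants(words, max_len=3):
    """Yield every sequence of 1..max_len words drawn (with repetition)
    from `words`, in every lowercase/capitalized combination."""
    forms = [form for w in words for form in case_variants(w)]
    for n in range(1, max_len + 1):
        for combo in product(forms, repeat=n):
            yield " ".join(combo)


if __name__ == "__main__":
    # Print the one- and two-word variants of the word pair.
    for phrase in phrase_variants(["lorem", "ipsum"], max_len=2):
        print(phrase)
```

Raising `max_len` to 5 covers inputs such as “lorem ipsum ipsum ipsum Lorem,” at the cost of a list that grows quickly with each extra word.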
Kraeh3n said she discovered the strange behavior while proofreading a document for a colleague, a document that had the standard lorem ipsum placeholder text. When she began typing “l-o-r..e..” and saw “China” as the result, she knew something was strange.
“I saw words like Internet, China, government, police, and freedom and was curious as to how this was happening,” Kraeh3n said. “I immediately contacted Michael Shoukry and we began looking into it further.”
And so the duo started testing the limits of these two words using a mix of capitalization and repetition. Below is just one of many pages of screenshots taken from their results:
The researchers wondered: What was going on here? Has someone outside of Google figured out how to map certain words to different meanings in Google Translate? Was it a secret or covert communications channel? Perhaps a form of communication meant to bypass the censorship erected by the Chinese government with the Great Firewall of China? Or was this all just some coincidental glitch in the Matrix?
For his part, Shoukry checked in with contacts in the U.S. intelligence industry, quietly inquiring if divulging his findings might in any way jeopardize important secrets. Weeks went by and his sources heard no objection. One thing was for sure: the results were subtly changing from day to day, and it wasn’t clear how long these two common but obscure words would continue to produce the same results.
“While Google translate may be incorrect in the translations of these words, it’s puzzling why these words would be translated to things such as ‘China,’ ‘NATO,’ and ‘The Free Internet,’” Shoukry said. “Could this be a glitch? Is this intentional? Is this a way for people to communicate? What is it?”
When I met Shoukry at the Black Hat security convention in Las Vegas earlier this month, he’d already alerted Google to his findings. Clearly, it was time for some intense testing, and the clock was already ticking: I was convinced (and unfortunately, correct) that much of it would disappear at any moment.
A BRIEF HISTORY OF LOREM IPSUM
Search the Internet for the phrase “lorem ipsum,” and the results reveal why this strange phrase has such a core connection to the lexicon of the Web. Its origins in modernity are murky, but according to multiple sites that have attempted to chronicle the history of this word pair, “lorem ipsum” was taken from a scrambled and altered section of “De finibus bonorum et malorum” (translated: “On the Ends of Good and Evil”), a 1st-Century B.C. Latin text by the great orator Cicero.
According to Cecil Adams, curator of the Internet trivia site The Straight Dope, the text from that Cicero work was available for many years on adhesive sheets in different sizes and typefaces from a company called Letraset.
“In pre-desktop-publishing days, a designer would cut the stuff out with an X-acto knife and stick it on the page,” Adams wrote. “When computers came along, Aldus included lorem ipsum in its PageMaker publishing software, and you now see it wherever designers are at work, including all over the Web.”
This pair of words is so common that many Web content management systems deploy it as default text. Case in point: Lorem Ipsum even shows up on healthcare.gov. According to a story published Aug. 15 in the Daily Mail, more than a dozen apparently dormant healthcare.gov pages carry the dummy text.
FURTHER TESTING
Things began to get even more interesting when the researchers started adding other words from the Cicero text from which the “lorem ipsum” bit was taken, including: “Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit . . .” (“There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain …”).
Adding “dolor” and “sit” and “consectetur,” for example, produced even more bizarre results. Translating “consectetur Sit Sit Dolor” from Latin to English produces “Russia May Be Suffering.” “sit sit dolor dolor” translates to “He is a smart consumer.” An example of these sample translations is below:
Latin is often dismissed as a “dead” language, and whether or not that is fair or true, it seems pretty clear that there should not be Latin words for “cell phone,” “Internet” and other mainstays of modern life in the 21st Century. However, this incongruity helps to shed light on one possible explanation for such odd translations: Google Translate simply doesn’t have enough Latin texts available to have thoroughly learned the language.
In an introductory video titled Inside Google Translate, Google explains how the translation engine works, the sources of the engine’s intelligence, and its limitations. According to Google, its Translate service works “by analyzing millions and millions of documents that have already been translated by human translators.” The video continues:
“These translated texts come from books, organizations like the United Nations, and Web sites from all around the world. Our computers scan these texts looking for statistically significant patterns. That is to say, patterns between the translation and the original text that are unlikely to occur by chance. Once the computer finds a pattern, you can use this pattern to translate similar texts in the future. When you repeat this process billions of times, you end up with billions of patterns, and one very smart computer program.”
Here’s the rub:
“For some languages, however, we have fewer translated documents available, and therefore fewer patterns that our software has detected. This is why our translation quality will vary by language and language pair.”
Still, this doesn’t quite explain why Google Translate would include so many references specific to China, the Internet, telecommunications, companies, departments and other odd couplings in translating Latin to English.
In any case, we may never know the real explanation. Just before midnight, Aug. 16, Google Translate abruptly stopped translating the word “lorem” into anything but “lorem” from Latin to English. Google Translate still produces amusing and peculiar results when translating Latin to English in general.
A spokesman for Google said the change was made to fix a bug with the Translate algorithm (aligning ‘lorem ipsum’ Latin boilerplate with unrelated English text) rather than a security vulnerability.
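To see how aligning placeholder text with unrelated English could produce outputs like “China,” consider a deliberately tiny co-occurrence model. This is a toy sketch under stated assumptions, not Google’s algorithm: the `PAIRS` data is invented for illustration, and production statistical translation systems use far more elaborate alignment and language models. The point is only that if “lorem ipsum” repeatedly sits where real English text sits on a sibling page, a pattern-counting system will learn that pairing.

```python
from collections import Counter, defaultdict

# Hypothetical "parallel" pages: the source side is leftover placeholder
# text, the target side is unrelated English from the finished page.
PAIRS = [
    ("lorem ipsum", "china"),
    ("lorem ipsum dolor", "china news"),
    ("lorem ipsum", "the internet"),
    ("dolor sit amet", "company news today"),
]


def cooccurrence_table(pairs):
    """Count how often each source word appears alongside each target word."""
    table = defaultdict(Counter)
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                table[s][t] += 1
    return table


def best_guess(table, word):
    """Return the target word seen most often alongside `word`.
    Words never seen in training are passed through unchanged."""
    if word not in table:
        return word
    return table[word].most_common(1)[0][0]


if __name__ == "__main__":
    table = cooccurrence_table(PAIRS)
    for word in ["lorem", "ipsum", "dolor", "cicero"]:
        print(word, "->", best_guess(table, word))
```

The pass-through branch for unseen words mirrors the behavior described above after the fix, where “lorem” simply comes back as “lorem.”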
Kraeh3n said she’s convinced that the lorem ipsum phenomenon is not an accident or chance occurrence.
“Translate [is] designed to be able to evolve and to learn from crowd-sourced input to reflect adaptations in language use over time,” Kraeh3n said. “Someone out there learned to game that ability and use an obscure piece of text no one in their right mind would ever type in to create totally random alternate meanings that could, potentially, be used to transmit messages covertly.”
Meanwhile, Shoukry says he plans to continue his testing for new language patterns that may be hidden in Google Translate.
“The cleverness of hiding something in plain sight has been around for many years,” he said. “However, this is exceptionally brilliant because these templates are so widely used that people are desensitized to them, and because this text is so widely distributed that no one bothers to question why, how and where it might have come from.”
It looks like Google is still giving some interesting translations for Latin to English. Here are some of the more interesting translations I found while copying in blocks of Lorem Ipsum text:
vestibulum lacus = programming language
eros = United States
felis commodo pellentesque = gas commodity trade
pharetra purus = Mexican immigration
tortor ut = password
The section:
“Suspendisse rutrum, eros id condimentum consectetur, metus nulla blandit tortor, eget euismod mi tortor sit amet nisl”
Translates into:
“Stress official website, to improve the main ie, the United States, the jobs most exciting home base, the development of site content, my macro will not be published.”
Krebs… I’ve watched Google Translate for many years. I hung around at their little-known forum (now closed) and answered random questions from people who mostly would never post there again. I even got a nice thank-you mail from a guy at Google Translate once.
And the things I’ve seen there… statistical translation is an ever surprising beast. At one time, “Sarkozy Sarkozy Sarkozy” translated from French to English would become “Bush defeats Blair”. If you translated the Irish national anthem, you would get a bork-up of a literal translation and the British national anthem. (One Irishman was convinced that there was no way “God save the Queen” would appear in there unless it was a deliberate attempt of someone at Google to make fun of the Irish). Sometimes, the translations would be shocking in the bizarre sense they made. At one point, the word “elado” (misspelling of the Spanish word helado, ice cream) would be translated to “ais krimh” in English!
It’s not surprising at all that the lorem ipsum text would have bizarre translations. If you have an “en” site and a “de” site, but the “de” site only has placeholder text (something that happens very commonly, I bet), that would confuse the hell out of the statistical translator. And in response to confusion, it can get VERY creative.
But there’s no conspiracy here, any more than there was a conspiracy at Google to make fun of the Irish – or the Estonians, or the Israelis, or the Turks, or the Indonesians who all at one time came with such outraged accusations at offensive and weird translations.
“You’re a detective now, son. You’re not allowed to believe in coincidence anymore.” – The Dark Knight Rises
Is this not just a cigar posing as a cigar? Why can’t these just be a large number of pages that were written in English, but have undone pages full of “lorem ipsum” text left on their non-English pages? Then pulled into Googly’s translate function and treated as authoritative? Perhaps this is (in this community) just a case of confirmation bias/furtive fallacy?
Google Translate also has problems with Pig Latin. Thomas Jefferson’s “Hoggibus, piggibus et shotam damnabile grunto” is translated as “Hogg, piggibus damning shots and grunts,” though Pig Latin has substantially changed since Jefferson’s time.
And it wasn’t that long ago that Google refused to autocomplete “Islam is” when it was happy to autocomplete every other religion to all sorts of insults.
Speaking of FireEye / Mandiant, it wasn’t able to protect one of its clients, Community Health Systems, from China.
http://www.reuters.com/article/2014/08/18/us-community-health-cybersecurity-idUSKBN0GI16N20140818
P.S. Latin is commonly used in medical technology as well as the legal profession.
Of course, it’s extensively used for taxonomy of species too. Latin is used an awful lot for a dead language.
What’s really interesting is that less than a century ago they would teach it in high schools as an elective course.
Less than a century ago indeed! I learned it in high school in the early 1960’s.
Try “two decades ago” – I graduated a US public high school in 1993 and took three years of Latin.
> “Speaking of FireEye / Mandiant, it wasn’t able to protect one of its clients, Community Health Systems, from China.”
No. This is not correct at all.. FireEye was brought in after the fact to investigate the breach and data-theft! CHS were not FireEye technology customers when they were breached.
It appears I tarnished the good name of FireEye / Mandiant. I had read a few articles which suggested that CHS had been protected by FireEye, but it now appears that the reporters mixed up their sources. FireEye / Mandiant is being used in forensics. Heartbleed is now the prime suspect.
Interesting: Michael just pointed out that GT still translates this, only backwards. Change GT so that it’s set to translate from English to Latin, and then type “China” without the quotes. You get “Lorem ipsum dolor.”
If you really want to back this research up, do a page scrape of the number of lorem-ipsum sites (or any other site using your chosen translation string), and see what comes out in the text surrounding the string. As others above have said, Google uses statistical translation and machine learning to translate things. I’ve seen some pretty amusing results trying to translate from English to French, let alone from boilerplate to anything. Garbage in, garbage out, as they say.
Exactly. In fact, as of this moment, the following English->Latin translations come from GTranslate:
placeholder -> Lorem ipsum
placeholder text -> consectetuer adipiscing elit
Garbage in, indeed.
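The page-scrape check suggested above could be sketched roughly as follows. This is only an illustration: it assumes the page text has already been downloaded and stripped of HTML, and the function name, the `needle` default and the `window` size are all hypothetical choices. It counts the words that appear within a few words of each occurrence of the placeholder string.

```python
import re
from collections import Counter


def context_words(text, needle="lorem ipsum", window=5):
    """Count words appearing within `window` words on either side of
    each occurrence of `needle` (matching is case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    target = needle.lower().split()
    n = len(target)
    counts = Counter()
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            before = words[max(0, i - window):i]
            after = words[i + n:i + n + window]
            counts.update(before + after)
    return counts


if __name__ == "__main__":
    sample = "China trade news. Lorem ipsum dolor sit amet. Contact the company."
    for word, count in context_words(sample, window=3).most_common():
        print(word, count)
```

Run over a large set of pages, a tally like this would show whether words such as “China” or “Internet” really do cluster around the placeholder text.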
Let’s throw some fuel on the fire! Several years ago Stratfor (let’s call them some “spooky” folks) suffered a breach and a lot of emails and documents were dumped on the internet for public viewing. Lo and behold, we find lorem ipsum related text in various of those dumped documents.
For example – http://wikileaks.org/gifiles/docs/52/5269468_re-military-portal-tagging-problem-.html
If we take strings of this text and put them in Google translate we come up with interesting results. For example:
“Integer aliquet libero a est porta vitae adipiscing mauris pulvinar” translates as “The complete economic freedom from the gate of the ecological environment of China”
I love waking up in the morning to the smell of conspiracy.
It’s meant to baffle: hang up by the heels.
Sed ut pastrami occaecat jerky landjaeger chuck meatball, sunt venison eu ad. Tail reprehenderit cillum exercitation pastrami cupidatat occaecat kevin nostrud cow quis brisket leberkas. Sirloin pig beef, jerky landjaeger venison tail ham sed. T-bone jerky exercitation ut elit excepteur proident irure fatback.
Ask people who do machine translation and they’ll just laugh at this story. This is just an example of security paranoia seeing patterns and conspiracies and failing to understand how capricious statistical classifiers get in the tail.
What a fun article! But I agree, the translation is most likely just side effects of machine learning. My favorite Google Translate oddity is the translation of names. I translate TONS of academic articles to and from other languages to English and have loved seeing these oddities like how a single initial will translate to an entire name, or a very long Indian name will be rendered as something like “Sam” or “George”.
I work in Machine Translation, and the behavior you are seeing is simply the effect of a very VERY undertrained statistical model.
Google Translate is looking for the closest statistical match for the bigram “Lorem Ipsum” (or any of the others you give), and has what are possibly millions of VERY low probability matches for it – and no high probability ones. It’s probably picking some winning sequence with an incredibly low occurrence in the training data that happens to be 10^-15 or so higher than the next nearest neighbor. In other words, you are looking at direct statistical noise.
The regular occurrence of China is probably just due to some bias in the underlying training data. Many modern SMT (statistical Machine Translation) engines are trained on newswire data. Google probably spidered a wide variety of news sources looking for parallel translations, and found a ton of data that had the lorem ipsum placeholder that seemed correlated in some way with real text.
By the way, the “Lorem Ipsum” sequence is frequently known as “greeking”.
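The “statistical noise” point in the comment above can be made concrete with a few lines of Python. The scores below are invented for illustration (the figures are assumptions, not measurements from any real system): every candidate is extremely unlikely, the gaps between them are on the order of 10^-15, and a bare argmax still has to pick one.

```python
def pick_translation(scores):
    """Return the candidate with the highest score, as a bare argmax would."""
    return max(scores, key=scores.get)


def nudge(scores, candidate, delta):
    """Return a copy of `scores` with `delta` added to one candidate."""
    updated = dict(scores)
    updated[candidate] += delta
    return updated


# Hypothetical scores for one input: all tiny, separated by almost nothing.
SCORES = {
    "China": 1.0e-9,
    "NATO": 1.0e-9 - 1.0e-15,
    "the Internet": 1.0e-9 - 2.0e-15,
}

if __name__ == "__main__":
    print(pick_translation(SCORES))
    # A change far too small to be meaningful is enough to flip the winner.
    print(pick_translation(nudge(SCORES, "NATO", 2.0e-15)))
```

With margins this thin, a small change in the training data or the input (a capital letter, a repeated word) could plausibly flip the winner, which would be consistent with results that shift from day to day.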
So this is why I did so badly on my Latin homework 🙂
This Google dork returns some interesting results:
site:*.gov “Lorem ipsum dolor sit”
Not sure I’m on the conspiracy bandwagon, as my tinfoil hat is in the shop for repairs. With that said, it would be an interesting way to hide messages in plain sight. There have to be at least 100 better ways to do it, though, if someone was so inclined. Why not use the same LI text in HTML comments?
I suspect, as others have suggested, that a Google coder is having fun. Very interesting, though.
Nah, this isn’t the result of deliberate code — that would be the “Hello world!” Easter egg noticed in 2010.
Really, it’s just a machine learning algorithm being fed bad data (lots of placeholder text “detected” as Latin by the language sensing algorithm). It’s very easy to make a machine learning data cluster/tree fall into a degenerate state and output completely unexpected associations.
Go to this site: http://www.lipsum.com/ generate any “lorem ipsum” paragraph. Go to google translate and check the results…. SCARY.
Example:
Original:
Etiam placerat non urna in semper. Etiam molestie condimentum massa cursus suscipit. Vestibulum id purus arcu. Duis sit amet viverra arcu, vel tristique mauris. Fusce libero orci, tempor vitae blandit vitae, viverra consequat libero. Donec ut odio arcu. Phasellus et lectus id arcu laoreet suscipit a non turpis. Integer fringilla mauris a purus blandit, rutrum rhoncus tellus suscipit.
Translation:
Even the real estate does not specialize in always. Even the employee to improve mass market commodity. This game is pure alcohol. To be very important, timely, emotional, or more comfortable place to start. Chase played a role, long-life, exciting life, education and development of children. I just hate the airlines. Many factors, performances and undertakes a variety of this bow from the non-ugly. It’s going to can help you to create a pure afternoon, the official website of a wide range of skills to raise up.
These translations are the direct result of the DEF CON 22 badge challenge, nothing more. http://elegin.com/dc22/ http://potatohatsecurity.tumblr.com/post/94565729529/defcon-22-badge-challenge-walkthrough
EDIT: The above post should read: These translations were actively taken advantage of by the DEF CON 22 badge challenge, which seems to have exacerbated the issue.
I think you’ll find the Latin phrase in the DEFCON challenge was actual Latin. So I don’t think DEFCON appears to be related to these issues we’re seeing with Google translator.
Right, because
Lorem ipsum dolor s
is latin for
Pussycat Dolls.
L057’s ‘Latin’ poem – http://defcon.org/1057/FissilingualElucidation/
Translation image – http://elegin.com/dc22/ipsum_translate.png
http://potatohatsecurity.tumblr.com/post/94565729529/defcon-22-badge-challenge-walkthrough
Step 6 email says
+++
Well done!
Find 1o57, and hand him a note- written on blue paper….
On the note must be your name(s) / team name – and this phrase:
perfer et obdura; dolor hic tibi proderit olim
Congratulations, you have earned a spot … but I’ve said too much…
Include an email 🙂
The line “perfer et obdura; dolor hic tibi proderit olim” actually translates from Latin to English as:
“Be patient and tough; someday this pain will be useful to you.”
Yes but both emailing PussycatDolls@1o57.uk and visiting http://defcon.org/1057/NATO146483093172709523 were clues.
‘PussycatDolls’ in the email address as well as ‘NATO’ in the URL came directly from the funny translation you can see in the image I reference above. The phrase at the end came from a response to an email sent to JaneLeeves@curious.codes, NOT the Google translation.
Brian, you’ve discovered a new language: Web Latin! Try this: go to http://www.lipsum.com/, generate some Lorem Ipsum, paste it into Google Translate, and enjoy.
Wow, playing Non Sequitur just got a lot easier!
After watching how Google Translate works, I notice that most Web templates now include Lorem Ipsum as default text. Could it be that when site owners replace the placeholder with their own words, Google thinks the new text is a translation of the old? And since China has so many people, maybe this is why those words appear. Think about it.
Testing Google Translate against Bacon Ipsum (baconipsum.com) produces some genuinely weird results…
Great article Brian. Really interesting stuff. Hope all is well.
Mad Libs at the expense of compute power.
GT has an “improve this translation” option, and it’s possible to game this by inserting any text you want as a supposed “improvement”. The more obscure the text (such as the multiple repetitions of lorem) the more likely that someone “improved” the translation.
The sudden change in GT behavior may well have come about because someone at Google cleaned house (but not very efficiently).
To access the “improve this translation” option, click on the translated text for whatever you choose to translate; at the bottom of the list of alternative translations is the option to add your own.
KOS never ceases to amaze me! Modern Day parlour games!
One more reason to use http://baconipsum.com, as if we needed another.
Dan Sanders +1
Came to post the same. No conspiracy here. Just Google cleverness backfiring.
The root of “all evil” may reside in the 3rd party app Google uses to translate these phrases. I don’t know if they created it themselves, or if the back end is linked to another outside source for translation. Google would then probably be hit up for a small fee or give the 3rd party some sort of compensation.
Couldn’t this just be due to the Chinese having widely pirated a copy of some Western publishing software that uses the lorem ipsum placeholder convention?
Who is surprised that some enterprising person put online a document (or two or three) that translates Lorem etc. from the Latin (I took four long years of it) to _____ (fill in the language)? Doh.
This is probably a free spammer tool to generate text that most spam detectors will let through, no?
Interesting how many people here in the comments try to cover up this bizarre blunder. The most interesting thing is that these seemingly random phrases are all about geopolitics, not about more common stuff like sport, love, games or children.
This is not a coincidence. Someone uses lorem ipsum as a cipher to talk about world geopolitics. Google’s translation algos just caught these talks by accident.
The algorithm explains it all. You can reverse engineer commonalities to identify properties of the equation which can explain the behavior. The key word being “behavior” since the equation is to also use behavior as its learning mechanism. The geo-political could be a natural result of the global population utilizing the service.
It still “works” from English to Latin 🙂
NATO – Lorem Ipsum
sexy – Lorem ipsum
Sexy – Lorem Ipsum
China – Lorem ipsum dolor
@NN, now I have this song “I’m too sexy for your translator” bouncing around my head.
Fascinating!
The only other language I speak well is Esperanto. (Seriously. I worked at the central office of the World Esperanto Association in Rotterdam for six months.) Here are some Latin->Esperanto results:
Lorem ipsum = Lorem ipsum
Ipsum lorem = la Lorem (“The Lorem”)
lorem lorem ipsum ipsum = Lorem Lorem ipsum ĝi (“Lorem Lorem ipsum it”)
ipsum ipsum ipsum = tre tre tre (“very very very”)
Lor = koloro (“color”)
ipsu = la menciita (“the mentioned”)
More strangeness:
eros sit pharetra purus = Perl is a pure IP
eros sit Pharetra purus = the United States is pure quiver
eros Sit pharetra purus = The Mexican immigration policies
eros sit pharetra purus sequi = Mexican immigration to the United States is
eros sit pharetra purus sequi nesciunt perspiciatis = Mexican immigration in the United States is extremely painful encounter
eros sit pharetra Purus sequi Nesciunt perspiciatis = They must follow the criteria of Mexican immigration
Eventually, someone will discover an old message that says Putin wants to invade Ukraine.