Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.
This fairly describes my roller coaster experience of curiosity, wonder and disappointment over the past few weeks, as I’ve worked alongside security researchers in an effort to understand how “lorem ipsum,” common placeholder text on countless Web sites, could be transformed into so many apparently geopolitical and startlingly modern phrases when translated from Latin to English using Google Translate. (If you have no idea what “lorem ipsum” is, skip ahead to the brief primer below, under “A Brief History of Lorem Ipsum.”)
Admittedly, this blog post would make more sense if readers could fully replicate the results described below using Google Translate. However, as I’ll explain later, something important changed in Google’s translation system late last week that currently makes the examples I’ll describe impossible to reproduce.
It all started a few months back when I received a note from Lance James, head of cyber intelligence at Deloitte. James pinged me to share something discovered by FireEye researcher Michael Shoukry and another researcher who wished to be identified only as “Kraeh3n.” They noticed a bizarre pattern in Google Translate: When one typed “lorem ipsum” into Google Translate, the default results (with the system auto-detecting Latin as the language) returned a single word: “China.”
Capitalizing the first letter of each word changed the output to “NATO” — the acronym for the North Atlantic Treaty Organization. Reversing the words in both lower- and uppercase produced “The Internet” and “The Company” (the “Company” with a capital “C” has long been a code word for the U.S. Central Intelligence Agency). Repeating and rearranging the word pair with a mix of capitalization generated even stranger results. For example, “lorem ipsum ipsum ipsum Lorem” generated the phrase “China is very very sexy.”
Kraeh3n said she discovered the strange behavior while proofreading a document for a colleague, a document that had the standard lorem ipsum placeholder text. When she began typing “l-o-r..e..” and saw “China” as the result, she knew something was strange.
“I saw words like Internet, China, government, police, and freedom and was curious as to how this was happening,” Kraeh3n said. “I immediately contacted Michael Shoukry and we began looking into it further.”
And so the duo started testing the limits of these two words using a mix of capitalization and repetition. Below is just one of many pages of screenshots taken from their results:
The researchers wondered: What was going on here? Has someone outside of Google figured out how to map certain words to different meanings in Google Translate? Was it a secret or covert communications channel? Perhaps a form of communication meant to bypass the censorship erected by the Chinese government with the Great Firewall of China? Or was this all just some coincidental glitch in the Matrix?
For his part, Shoukry checked in with contacts in the U.S. intelligence industry, quietly inquiring if divulging his findings might in any way jeopardize important secrets. Weeks went by and his sources heard no objection. One thing was for sure, the results were subtly changing from day to day, and it wasn’t clear how long these two common but obscure words would continue to produce the same results.
“While Google Translate may be incorrect in the translations of these words, it’s puzzling why these words would be translated to things such as ‘China,’ ‘NATO,’ and ‘The Free Internet,’” Shoukry said. “Could this be a glitch? Is this intentional? Is this a way for people to communicate? What is it?”
When I met Shoukry at the Black Hat security convention in Las Vegas earlier this month, he’d already alerted Google to his findings. Clearly, it was time for some intense testing, and the clock was already ticking: I was convinced (and unfortunately, correct) that much of it would disappear at any moment.
A BRIEF HISTORY OF LOREM IPSUM
Search the Internet for the phrase “lorem ipsum,” and the results reveal why this strange phrase has such a core connection to the lexicon of the Web. Its origins in modernity are murky, but according to multiple sites that have attempted to chronicle the history of this word pair, “lorem ipsum” was taken from a scrambled and altered section of “De finibus bonorum et malorum” (translated: “On the Ends of Good and Evil”), a 1st-Century B.C. Latin text by the great orator Cicero.
According to Cecil Adams, curator of the Internet trivia site The Straight Dope, the text from that Cicero work was available for many years on adhesive sheets in different sizes and typefaces from a company called Letraset.
“In pre-desktop-publishing days, a designer would cut the stuff out with an X-acto knife and stick it on the page,” Adams wrote. “When computers came along, Aldus included lorem ipsum in its PageMaker publishing software, and you now see it wherever designers are at work, including all over the Web.”
This pair of words is so common that many Web content management systems deploy it as default text. Case in point: Lorem Ipsum even shows up on healthcare.gov. According to a story published Aug. 15 in the Daily Mail, more than a dozen apparently dormant healthcare.gov pages carry the dummy text.
FURTHER TESTING
Things began to get even more interesting when the researchers started adding other words from the Cicero text from which the “lorem ipsum” bit was taken, including: “Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit . . .” (“There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain …”).
Adding “dolor” and “sit” and “consectetur,” for example, produced even more bizarre results. Translating “consectetur Sit Sit Dolor” from Latin to English produces “Russia May Be Suffering.” “sit sit dolor dolor” translates to “He is a smart consumer.” An example of these sample translations is below:
Latin is often dismissed as a “dead” language, and whether or not that is fair or true, it seems pretty clear that there should not be Latin words for “cell phone,” “Internet” and other mainstays of modern life in the 21st Century. However, this incongruity helps to shed light on one possible explanation for such odd translations: Google Translate simply doesn’t have enough Latin texts available to have thoroughly learned the language.
In an introductory video titled Inside Google Translate, Google explains how the translation engine works, the sources of the engine’s intelligence, and its limitations. According to Google, its Translate service works “by analyzing millions and millions of documents that have already been translated by human translators.” The video continues:
“These translated texts come from books, organizations like the United Nations, and Web sites from all around the world. Our computers scan these texts looking for statistically significant patterns. That is to say, patterns between the translation and the original text that are unlikely to occur by chance. Once the computer finds a pattern, you can use this pattern to translate similar texts in the future. When you repeat this process billions of times, you end up with billions of patterns, and one very smart computer program.”
Here’s the rub:
“For some languages, however, we have fewer translated documents available, and therefore fewer patterns that our software has detected. This is why our translation quality will vary by language and language pair.”
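To make the pattern-finding idea concrete, here is a minimal sketch of how a system that only counts co-occurrences could learn a spurious translation from a handful of mismatched document pairs. It is a deliberately crude stand-in, not Google's actual algorithm; the toy corpus and the function names `learn_word_table` and `best_guess` are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented toy "parallel corpus": (source text, English text) pairs.
# These pairs are illustrative assumptions, not real training data.
PARALLEL_PAIRS = [
    ("lorem ipsum", "china news"),
    ("lorem ipsum", "china trade"),
    ("lorem ipsum dolor", "china internet policy"),
]

def learn_word_table(pairs):
    """Count how often each source word appears alongside each target word."""
    table = defaultdict(Counter)
    for source, target in pairs:
        for s_word in source.lower().split():
            for t_word in target.lower().split():
                table[s_word][t_word] += 1
    return table

def best_guess(table, word):
    """Return the most frequent co-occurring target word, or the word itself."""
    candidates = table.get(word.lower())
    if not candidates:
        return word  # nothing learned: echo the input unchanged
    return candidates.most_common(1)[0][0]

table = learn_word_table(PARALLEL_PAIRS)
print(best_guess(table, "lorem"))  # the word seen most often next to "lorem"
print(best_guess(table, "amet"))   # never seen, so it is returned as-is
```

With so few pairs, whatever English word happens to sit beside the placeholder text most often wins, which is the kind of low-data artifact the video's caveat describes.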
Still, this doesn’t quite explain why Google Translate would include so many references specific to China, the Internet, telecommunications, companies, departments and other odd couplings in translating Latin to English.
In any case, we may never know the real explanation. Just before midnight, Aug. 16, Google Translate abruptly stopped translating the word “lorem” into anything but “lorem” from Latin to English. Google Translate still produces amusing and peculiar results when translating Latin to English in general.
A spokesman for Google said the change was made to fix a bug with the Translate algorithm (aligning ‘lorem ipsum’ Latin boilerplate with unrelated English text) rather than a security vulnerability.
Kraeh3n said she’s convinced that the lorem ipsum phenomenon is not an accident or chance occurrence.
“Translate [is] designed to be able to evolve and to learn from crowd-sourced input to reflect adaptations in language use over time,” Kraeh3n said. “Someone out there learned to game that ability and use an obscure piece of text no one in their right mind would ever type in to create totally random alternate meanings that could, potentially, be used to transmit messages covertly.”
Meanwhile, Shoukry says he plans to continue his testing for new language patterns that may be hidden in Google Translate.
“The cleverness of hiding something in plain sight has been around for many years,” he said. “However, this is exceptionally brilliant because these templates are so widely used that people are desensitized to them, and because this text is so widely distributed that no one bothers to question why, how and where it might have come from.”
This is fascinating. Nice work Brian!
This Google Translate “bug” was also used as part of the most recent Defcon badge contest.
See the write up here: http://potatohatsecurity.tumblr.com/post/94565729529/defcon-22-badge-challenge-walkthrough
Input → Translated to:
lorem ip → Internet ip
Lorem ipsum dolor si → Let’s see if
Lorem ipsum do → We give
Lorem ipsum dolor s → Pussycat Dolls
lorem ipsum ama → The Free Love
Lorem ipsum dolor sit amet → It can be used
Lorem ipsum dolor sit ame → Our goal is to ame
Lorem ipsum dolor sit → Our goal is to
lorem ipsum ips → vehicle dimensions
lorem ipsum lor → Free of pain
lorem ipsum lo → China, elsewhere
lorem ipsum lorem → Free Internet
lorem ipsum amat → China loves
Lorem Ipsum → NATO
And you’re right…it doesn’t work anymore.
I feel that this is the most relevant answer, and could very well be the entire reason. Google might have been happy to play along with the Defcon badge game.
Except….Lorem Ipsum showed up in emails leaked from Stratfor.
https://wikileaks.org/gifiles/docs/35/3509974_re-fwd-lorem-test-1-.html
I can still translate Lorem Ipsum to English, but the “main” words like lorem and ipsum lost their meanings.
http://pastebin.com/h8tp3FAx
“Just remember a very important in order to create a nice pollutants. poisoned
floating platform, and it’s a lot of value. When fellowship
homes, and the great gods into labor over the mountains, ridiculous mouse will be born.
No need to say products and services, and the main pain. But it is funny, but the timing
retail price.”
I seem to remember Anonymous trying to come up with a way to game Google Translate in its early days. But that was so long ago that I don’t remember any specifics. Something like the Hungarian phrasebook skit on Monty Python’s Flying Circus.
+Rep for anyone able to bring the Hungarian-English dictionary skit into the discussion.
At the moment, if you translate English to Latin, weird things still occur. For example, NATO in English translates to Lorem Ipsum in Latin.
Yes. It looks like Google blocked Latin to English for several other words. You can still translate them using English to Latin.
This is interesting. If you translate the “Latin” phrase Curabitur enim adipiscing aliquot to English you get Dinner for downturn. Then start deleting characters from the “Latin” phrase and look what you get. This is some sort of phrase book.
Here’s another oddity at the moment. The “Latin” phrase aliquot has translated into different English words over time. At first it was the word Environment, then the word Certain, then the word Some. So there appears to be some sort of timing element to this.
Still working here in Indonesia as of 18-August at 1408
In my view, the ‘who’ isn’t important. It’s the ‘what.’ Fascinating to think of the amount of effort that went into creating each alt meaning for every single change in letter/capitalization/punctuation/repetition. The creativity and determination behind the whole thing are insane.
I think this is a case of reading too much into an artifact of badly-fed machine learning algorithms. How many Web pages and texts out there have multiple languages available for machine comparison, where one of the languages is an actual Latin translation?
Consider CMSs which have the ability to present content in multiple languages based on the user. Suppose you enable the translations option, but don’t fill in translated versions of specific posts/articles/pages. Some CMSs will put lipsum placeholder text in the alternate language versions of these pages, which present to a crawler bot as alternatives of the same content, subject to the learning algorithms.
And it isn’t confined to just off-the-shelf CMSs. It’s *not* uncommon to see lipsum text show up on websites with multiple language options, when the person running the website has filled in page templates but never gone back to fill in the correct translations.
“But the lipsum placeholder isn’t marked as Latin text!” — well, that’s a red herring. In the majority case, because of lack of cultural focus on multilingualism, pages are not properly marked with Content-Language or other correct ISO language code hints. Google already performs content language guessing based on the fact that these hints are usually not present, and thus it’s likely to guess Latin for lipsum variants of content. Voila, you have the translation engine being fed “Latin translations” of content that is actually lipsum.
And why so much China? I know that I’ve seen lipsum placeholders all over otherwise Chinese-language websites. So all this is not surprising to me whatsoever (especially when you also get common results like “Internet marketing”).
Consult someone with a background in machine learning algos, and that person should be able to explain how the data learning I described above would lead to the artifacts you found quite easily. Occam’s razor definitely applies here.
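The language-guessing step described above can be sketched roughly as follows. This is a toy under stated assumptions: the two tiny vocabularies and the helper names `guess_language` and `page_language` are hypothetical, and a real detector relies on far richer statistics.

```python
# Tiny hand-picked vocabularies (assumptions for illustration only).
VOCAB = {
    "la": {"ipsum", "dolor", "sit", "amet", "consectetur", "est", "qui", "quia", "neque"},
    "en": {"the", "and", "is", "of", "to", "in", "for", "our", "with"},
}

def guess_language(text):
    """Guess a language code by counting words found in each vocabulary."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in VOCAB.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def page_language(declared_hint, text):
    """Trust an explicit language hint if present; otherwise fall back to guessing."""
    return declared_hint or guess_language(text)

# An unlabeled placeholder page gets guessed as Latin ...
print(page_language(None, "Lorem ipsum dolor sit amet, consectetur"))
# ... while a page that declares its language is taken at its word.
print(page_language("zh", "Lorem ipsum dolor sit amet"))
```

If the hint is missing, the placeholder page is filed under Latin and can then be paired with whatever real content sits at the same address in another language.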
Todd, if it’s so obvious, why has it taken you so many paragraphs to explain, and yet at the conclusion of your oh-so-insightful comment you haven’t really explained anything other than to recommend talking to an expert about it?
I gave background information, so that it’s possible for those who don’t know about the innards of Google Translate to understand how it learns to translate arbitrary text.
Good explanation Todd, thanks.
(I should point out that a researcher from FireEye should already know all of the above. I worked for one of their direct competitors, and that whole market niche is full of machine-learning systems.)
@Edward – “So there appears to be some sort of timing element to this”- There must be some in languages poorly sourced for translation. Because – “A document is translated according to the probability distribution that a string in the target language (for example, English) is the translation of a string in the source language (for example, French).” (https://en.wikipedia.org/wiki/Statistical_machine_translation)
There is no Latin translation available from Bing or Yandex. Why is that? Latin is not complex, but what is more important is that this is a “dead” language, in the sense that it is not developing with time.
Otherwise, consider this a bug, a temporary placeholder, a “Mercury Rising” game, a plot, a joke. It may even be the idea behind a new Brown book.
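The "timing element" follows fairly naturally from the probability-distribution description quoted above: when the evidence for a word is thin, a few new observations can change which candidate is most probable. A small sketch, with invented counts and a hypothetical helper `most_probable`:

```python
from collections import Counter

def most_probable(counts):
    """Return (translation, probability) for the highest relative frequency."""
    total = sum(counts.values())
    translation, count = counts.most_common(1)[0]
    return translation, count / total

# Hypothetical evidence counts for one source word; the numbers are invented.
evidence = Counter({"Environment": 3, "Certain": 2, "Some": 2})
print(most_probable(evidence))

# Two newly observed pairings are enough to change the winner.
evidence.update({"Certain": 2})
print(most_probable(evidence))
```

For a well-sourced language pair the counts would be large enough that a handful of new documents could not flip the result this easily.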
The Google Translate Latin-to-English translation of “Lorem ipsum” is now “lorem ipsum”; but interestingly Google Translate English-to-Latin of “internet” still results in “Lorem ipsum”.
Re: “it seems pretty clear that there should not be Latin words for ‘cell phone,’ ‘Internet’ and other mainstays of modern life in the 21st Century.”
I heard that Pope Paul VI set up the Latinitas Foundation, whose mission it is to promote the modern use and study of the Latin language. If I’m informed correctly, this foundation keeps a lexicon of Latin neologisms.
I have no idea how active this foundation is today and whether or not their Lexicon Recentis Latinitatis now has entries for “cell phone” and “Internet”, but it is not off-hand “pretty clear” to me that it does not.
A number of years ago there was a bit of attention given to the Lexicon Recentis Latinitatis just because they had introduced a Latin term for ‘UFO’: Res Inopinata Volans.
It seems quite likely that more down-to-earth manifestations of modern society are also represented in the work.
Although this organization presents itself as an order, and a noble one, it is also an intelligence network with close diplomatic relations to the UN and to various countries.
As they are closely related to the Vatican, whose official language is Latin, could it be that the Google scanner happened to come across papers from this organization, which could explain the “diplomatic keywords/topics” in the translations?
https://en.wikipedia.org/wiki/Sovereign_Military_Order_of_Malta
(note: the intelligence part of the organization is not covered in the wiki article though, but be careful with all the noise out there about this topic in specific..).
https://en.wikipedia.org/wiki/List_of_Permanent_Observers_of_the_Sovereign_Military_Order_of_Malta_to_the_United_Nations
https://en.wikipedia.org/wiki/Foreign_relations_of_the_Sovereign_Military_Order_of_Malta
Just throwing out an idea…
First you say “and another researcher who wished to be identified only as ‘Kraeh3n,’” but throughout your text you refer to Kraeh3n as “she.” If this researcher only wished to be identified in that way, how was the gender divulged?
@David,
The ‘she’ is just a qualifier. It does not necessarily mean Kraeh3n is a she. Just a word to use in place of saying Kraeh3n over and over. Kraeh3n could be a he or a she; we will never know.
I am wondering whether, when Google didn’t have a “correct response” from Translate, it simply found words on webpages that were in close proximity to the words searched.
Has anyone tried simply typing the words into Google and seeing if those words are found in a huge mass? Are most of the findings centered on Chinese-sponsored websites trying to use that phrase, since it is potentially quite common on the Internet?
I am thinking it may have been, or could still be, a crude way of using SEO here: it’s a word that can be used with very little competition, and it has a lot of commonly used letters.
The Only Latin I know is;
Ipso po Facto my father beat your father at dominoes.
Sounds more like something from the Illuminati or the Free Masons
Sounds like pending warnings from the Illuminati or the Free Masons about the pending demise of the US economy due to the Chinese government
It could also be an attempt at stock market manipulation from individuals with large investments in Aluminum manufacturers.
Those hats don’t make themselves.
My best guess, based on behaviors I have triggered inadvertently: documents (text and images) either stored deliberately or otherwise slurped up accidentally into Google Drive were automagically added to the Google uberbrains. If I were to conjecture further, Google is now working to undo some of the damage. At this point, are we observing a convergence of algorithmic data gathering combined with an algorithmic hive mind, and its consequence of a whole new level of machine intelligence à la Skynet? I think all bets are off. Google had a similar issue in the late aughts, when faux web pages and SEO pretty much turned Google search into muddy fields: info crawlers mining faux info for their faux pages and whatnot. I think this is a bigger problem, however.
lam hu enim dolor enim adi Secrre, Mr. Krebs
dolor mis sel abitur edica
Amet enisa dolosum Ipsum a secrre
enima dolor ipsum…
Enisae lam dolor adime dolor ab ipsum dolor ipsum
aquila dolor Enisa Adime dolor ab dolor ipsum ipsum
dolor rabitur enim adipisci aquiem…Curabitur enim aduot ipsu
mone abique ipsum a nebu dol malu opse abitur
Dolosum exercitationem cing losum
totusquem dolor ipsum abi abitem
Da aDolor ad Ipsum ipsum edictir ad enim uot adi missol ipsum amet
Contra quos omnis dicendum breviter existimo. Quamquam philosophiae quidem vituperatoribus satis responsum est eo libro, quo a nobis philosophia defensa et collaudata est, cum esset accusata et vituperata ab Hortensio.
My previous response is gibberish, you actually said something. Google translate comes up with this.
“Apply now for the Department for Home Secretary, Mr. Krebs
Department chair will be withdrawn from dire Beuca Currently many proofs of deceit by the Secretary unfortunately for him … Now, the company has endeavored to take away the pain from Network Eagle team had struggled to take away the pain from the actual data Department seeks delay for water … I’d advocate for the said rectors of the company will be withdrawn from the cloud pain praying SECURITY Deceitful practice encompasses ever dangerous Go, get the whole Network
Give Him the COURT ORDER to go to the heavens, you alone are the key
imber ipsa dol…ipsum dolor a dolo
Buo ipsum Cepao Ipsum Dolor Ipsum a dolor
Translate recognized your phrase as Portuguese. In Latin I get
rain the pain … the pain of deceit
Protected by the Onion Network from China
—–
Well, I’m not really trying to hide anything. Guess they can find out I visit this site if they wanted to. Assuming you are referencing Tor or “The Onion Router”. But who knows, it is a mystery to me.
Alternative theory: From my misspent career as a guerrilla warfare ad man (and my dalliance as a novelist) another possibility is that Google is having some fun with the tin-foil helmet crowd.
Sounds to me like the secret is inside Google, and I suspect some programmer having fun or bored.
Well, there is something people could do as a prank, which used to be done in the past: if there are multiple suggestions for a certain word, then Google Translate will add that to its suggestions.
I guess this is just a result of the same prank, I doubt Google used some high algo to get from lorem ipsum to China 🙂
It still works by translating from English to Latin. I found a bunch by running a list of NSA keywords through it:
internet
NATO
America
China
nuke
North Korea
cellphone
IRS
football
NAIA
SSL
Perl
Archives
site
Corporate Security
Security Consulting
Security Evaluation
Electronic Surveillance
Event Security
contacts
e-cash
market
credit card
package
Results here: https://i.imgur.com/UGMIPpE.png
I also tested with a conventional English word list for comparison. Here are those Lorem hits: http://pastebin.com/yh26U7iz
Sorry just have to!
https://www.youtube.com/watch?v=XbI-fDzUJXI
HA! Good one! +1
More than likely, Google Translate is capturing multilingual sites that have used placeholder text on pages that have yet to be translated. So, for example, an English site with most of its pages translated to French, but with a few that still have placeholder text, produces these wonky translation rules. Google is detecting the Lorem Ipsum as Latin, and captures what it thinks the translations are by comparing with the corresponding English page. I’m sure Google’s translation engine is just as confused, with zero patterns, hence the constantly evolving Google Translate results.
Did you see>>>>>>
Except….Lorem Ipsum showed up in emails leaked from Stratfor.
https://wikileaks.org/gifiles/docs/35/3509974_re-fwd-lorem-test-1-.html
Unless I’m misunderstanding how the Google Translate algorithms work, Google Translate is only capturing text that has translations. So, where would Google Translate get the translations from the Stratfor emails that ultimately become language rules? Google Translate is going to ignore any texts that don’t have correlating translations. It can’t learn translation rules without two texts, namely the source text and the translated text. Without them, it cannot build the rules it needs.
@Mark
What about the translate toolkit? You can upload your own translation and it does not have to be public.
enim Curabitur aliquot adipiscing enim …
In order to use Lorem Ipsum effectively, one would have to convert the lexicon into a plaintext dictionary and devise the mechanism of what the ciphertext key would be. Almost certainly this would be a permutational block cipher with the key based around the capitalization and arrangement of words. This means the dictionary’s scale could be exponentially increased, unless the dictionary was only based around a subset of Latin words and subsequent ordering and not the full Lorem Ipsum lexicon.
If one were using this to create a lexicon for plaintext/ciphertext, it wouldn’t be the best to use Google Translate as it changes regularly based upon the machine learning when ingesting huge amounts of data as points of translation. If what is being described is true here, then gaming of Google Translate would have to occur on a massively large scale – or – there would have to be tweaking done by a Google insider. Both seem far-fetched to me, but then again far-fetched things have actually happened in recent years.
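To put a number on how fast such a dictionary could grow, here is a rough sketch that enumerates every phrase built from a small word list when each word may be lowercase or capitalized. It only counts the possible code points; it does not implement any cipher, and the helper names `variants` and `all_phrases` are made up for illustration.

```python
from itertools import product

def variants(words):
    """Each word in both lowercase and capitalized form."""
    return [form for w in words for form in (w.lower(), w.capitalize())]

def all_phrases(words, length):
    """Every ordered phrase of `length` tokens drawn from the variants."""
    return [" ".join(p) for p in product(variants(words), repeat=length)]

words = ["lorem", "ipsum"]
for n in range(1, 5):
    # The count grows as (2 * len(words)) ** n.
    print(n, len(all_phrases(words, n)))
```

Two words already give 256 distinct four-token phrases, and every added word or position multiplies that, so each phrase would need its own stable translation for the scheme to work as a codebook.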
“If what is being described is true here, then gaming of Google Translate would have to occur on a massively large scale – or – there would have to be tweaking done by a Google insider.”
And yet, in response, a Google insider has really quickly been able to make the results (Latin to English) significantly change.
Not that far fetched, indeed.
All youse guys and gals, including Brian, need to find real work!
jucundus iucundus