MATLAB Community

MATLAB, community & more

Text Alignment, Dinosaurs, and Orators

When we’re creating new pages for our community apps, our designers sometimes add some sample text as part of the review process. More often than not, this means Lorem Ipsum.

Like me, you’ve probably wondered about the standard filler text that begins

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat

You might have even looked it up and learned that it is some sort of corrupted version of an essay by the Roman orator Cicero. I’m not much of a Latin scholar, but I remember my first clue that something was amiss was the very non-Latin word adipiscing. Nonummy isn’t much better.

Clearly some text got deranged along the way. But how did these strange letters come down to us across the ages? The original text, from the essay De Finibus Bonorum et Malorum (On the Extremes of Good and Evil), goes like this:

Neque porro quisquam est, qui dolorem ipsum, quia dolor sit, amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt, ut labore et dolore magnam aliquam quaerat voluptatem.

We may translate this roughly as “Nobody likes pain, but hey, sometimes it’s worth it.” You can see scraps of our current version hiding in there.

The theory is that some typesetter in the ancient past needed to do what we all need to do from time to time: make some random filler text. We assume he started with purest Cicero. Then he passed the idea on to his apprentice, and so on down the ages. But each generation of typesetters introduced some errors. Let’s call them mutations. Suddenly this is looking like an exercise in genetics!

Having said all that, we can now have some fun with MATLAB, specifically with the Bioinformatics Toolbox, which has functions designed to analyze sequence mutations. The people who wrote the toolbox were thinking more about protein and nucleotide sequences than Roman orators, but no matter! Code is code, and we can still use the algorithm.

I am going to use genetic alignment software to calculate how long ago our primal printer first set Cicero to type.

What follows is an alignment produced by nwalign, which uses the Needleman-Wunsch global alignment algorithm. The function expects to align amino acid sequences from proteins, but fortunately the two texts use only about as many distinct letters as there are amino acids. So with some minor transformations, we can run this.

[score,alignment] = nwalign(seq1,seq2,'ScoringMatrix',sm)
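The post doesn't show its "minor transformations," so here is one guess at what they might look like: strip everything but letters, then map each distinct letter onto one of the 20 standard amino acid codes. The names `clean`, `letters`, and `aminoAcids` are made up for this sketch, and dropping spaces and punctuation is an assumption, not necessarily what was done for the alignment shown here. The scoring matrix `sm` is not built here.

```matlab
% Hypothetical preprocessing sketch; the original transformation is not shown in the post.
cicero = ['Neque porro quisquam est, qui dolorem ipsum, quia dolor sit, amet, ' ...
          'consectetur, adipisci velit, sed quia non numquam eius modi tempora ' ...
          'incidunt, ut labore et dolore magnam aliquam quaerat voluptatem.'];
lorem  = ['Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam ' ...
          'nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat'];

% Assumption: keep letters only, lowercased (spaces and punctuation are dropped)
clean = @(s) lower(s(isletter(s)));
c = clean(cicero);
l = clean(lorem);

% Distinct letters used across both texts, in sorted order
letters = unique([c l]);

% The 20 standard one-letter amino acid codes
aminoAcids = 'ARNDCQEGHILKMFPSTWYV';
assert(numel(letters) <= numel(aminoAcids), 'Too many distinct letters to map.');

% Map the k-th distinct letter to the k-th amino acid code
[~, idxC] = ismember(c, letters);
[~, idxL] = ismember(l, letters);
seq1 = aminoAcids(idxC);   % Cicero, disguised as a protein
seq2 = aminoAcids(idxL);   % Lorem Ipsum, disguised as a protein

fprintf('%d distinct letters: %s\n', numel(letters), letters);
```

The resulting `seq1` and `seq2` are ordinary character vectors made of amino acid letters, which is the kind of input the nwalign call above expects.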

The scoring matrix is just used to determine how much of a penalty to apply for substitutions and deletions. I’m treating them all the same. Here’s what we get back. The original text is on the top in gray. The final Lorem Ipsum text is below in blue.

The vertical “bars” between the two lines show direct matches. A long string of deletions is obvious right at the beginning of the text. The dashes in the text indicate errors of one kind or another. So, starting from an original message length of 191 characters, we see that 55 mutations have been introduced (changes, insertions, or deletions). Some regions (such as the phrase “dolor sit amet”) are, as the geneticists say, highly conserved. Perhaps deletions in these regions are fatal to the organism.
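If you don't have the toolbox handy, the idea behind the alignment can be sketched in plain MATLAB. This is a bare-bones Needleman-Wunsch with uniform scores, not the nwalign implementation: the score values (+1 match, -1 mismatch, -1 gap) and the short test strings are illustrative choices of mine, and all variable names are hypothetical.

```matlab
% Minimal Needleman-Wunsch sketch with uniform scores (illustrative values only)
s1 = 'dolorem ipsum';   % stand-in for the "original" text
s2 = 'lorem ipsum';     % stand-in for the "mutated" text
matchScore = 1; mismatchScore = -1; gapScore = -1;

n = numel(s1); m = numel(s2);

% F(i+1,j+1) holds the best score for aligning s1(1:i) with s2(1:j)
F = zeros(n+1, m+1);
F(:,1) = (0:n)' * gapScore;
F(1,:) = (0:m)  * gapScore;
for i = 1:n
    for j = 1:m
        if s1(i) == s2(j), subScore = matchScore; else, subScore = mismatchScore; end
        F(i+1,j+1) = max([F(i,j) + subScore, F(i,j+1) + gapScore, F(i+1,j) + gapScore]);
    end
end
score = F(n+1, m+1);

% Trace back from the bottom-right corner to recover one optimal alignment
a1 = ''; a2 = '';
i = n; j = m;
while i > 0 || j > 0
    if i > 0 && j > 0
        if s1(i) == s2(j), subScore = matchScore; else, subScore = mismatchScore; end
    end
    if i > 0 && j > 0 && F(i+1,j+1) == F(i,j) + subScore
        a1 = [s1(i) a1]; a2 = [s2(j) a2]; i = i - 1; j = j - 1;   % match or substitution
    elseif i > 0 && F(i+1,j+1) == F(i,j+1) + gapScore
        a1 = [s1(i) a1]; a2 = ['-' a2]; i = i - 1;                % gap in s2 (deletion)
    else
        a1 = ['-' a1]; a2 = [s2(j) a2]; j = j - 1;                % gap in s1 (insertion)
    end
end

% Bars mark direct matches; every other column counts as a "mutation"
bars = repmat(' ', 1, numel(a1));
bars(a1 == a2) = '|';
nMutations = sum(a1 ~= a2);

disp(a1); disp(bars); disp(a2);
fprintf('score = %d, mutations = %d\n', score, nMutations);
```

For these two strings the sketch lines up `dolorem ipsum` over `--lorem ipsum`, with two deletions at the front and bars under everything else. Counting the non-matching columns is the same bookkeeping that produces the mutation count for the full texts.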

Let’s assume a mutation rate of 2.5 × 10⁻⁸ letters per letter per generation. This is typical for human genes. To get to a total mutation count of 55 out of 191 characters, we would need 11.5 million generations of typesetters. If we assume 20 years for an average typesetter generation, that puts our primal typesetter 230 million years ago, or squarely in the Triassic time period. That’s too early for famous dinosaurs like Tyrannosaurus rex, but shown below you can see a rare photo of Cicero hanging out with his pal, Coelophysis bauri.
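The arithmetic behind that date is easy to check. Here is a quick sketch, assuming the simplest possible model: mutations pile up linearly, with no letter getting hit twice. The variable names are mine.

```matlab
% Back-of-the-envelope dating of the primal typesetter (simple linear model)
nMutations  = 55;       % mutations counted in the alignment
nChars      = 191;      % length of the original message
ratePerGen  = 2.5e-8;   % assumed mutations per letter per generation
yearsPerGen = 20;       % assumed length of a typesetter generation

fractionMutated = nMutations / nChars;            % share of letters that changed
generations     = fractionMutated / ratePerGen;   % generations needed at that rate
yearsAgo        = generations * yearsPerGen;

fprintf('Fraction mutated: %.3f\n', fractionMutated);
fprintf('Generations:      %.2f million\n', generations / 1e6);
fprintf('Years ago:        %.0f million\n', yearsAgo / 1e6);
```

This works out to roughly 11.5 million generations and 230 million years, matching the figures above.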

It’s less fun, but we can slice this problem the other way around. What would the mutation rate have to be if the first printer appeared around the same time as Gutenberg, say 1450? There’s only time for 28 generations between now and the beginning of printing in Europe. That implies a rate of 0.0102 letters per letter per generation, almost two letters per generation. A species with that level of mutation might have some serious viability problems.
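Running the same numbers in reverse is just as short. This sketch assumes exactly 28 generations and the same simple linear model; the variable names are mine.

```matlab
% Mutation rate implied by a Gutenberg-era start (simple linear model)
nMutations  = 55;    % mutations counted in the alignment
nChars      = 191;   % length of the original message
generations = 28;    % generations of typesetters since about 1450

impliedRate   = (nMutations / nChars) / generations;   % per letter per generation
lettersPerGen = impliedRate * nChars;                  % expected changes per generation

fprintf('Implied rate:          %.4f per letter per generation\n', impliedRate);
fprintf('Letters changed/gen:   %.2f\n', lettersPerGen);
```

With exactly 28 generations the rate comes out near 0.0103, a hair above the 0.0102 quoted above; presumably the difference is just rounding in the generation count. Either way, that is nearly two letters changed per generation.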

But quasi-Latin nonsense text isn’t DNA, so no harm done. Now get back to work.
