Protein Folding: Mid-Contest Analysis

Posted by Matthew Simoneau,

Mid-Contest Analysis

Bowen Kerins,
Friday, November 8, 3:00 pm

Below is an analysis of some of the over 500 entries submitted to the MATLAB Protein Folding Contest so far.


Great names!

Typical of these contests are some fantastic and bizarre entry names. With over 500 entries, there are a lot to choose from, like:

  • Pat ‘n’ Rob’s entry, “Super faster master blaster”. According to the entry, it’s got go faster stripes, and wider wheel arches.
  • Philip Top’s entry, “Zigging Zig-Zag”. It has “Less Zigs with something Added.” Take off every zig, perhaps?
  • Per Rutquist’s entries, “More Dum” and “Ow”. Per has taken the lead three times by removing obsolete lines from others’ code.
  • Colin Ross’s “Quirky” series of entries, which includes names like “Who Quirks, Wins”, “A Quirk Too Far?”, “Quirk with a Twist of Lemon”, and the ever-popular “Vive la Quirk!”

Our Test Case

Throughout this analysis we will refer to a test case we’ve created here. This test case is not in either the sample test suite or the actual suite, but can give some insight into what entries are doing.

testa = [zeros(2,1);ones(3,1);zeros(4,1);...

This test case can be folded very nicely:


This is the best possible score for this sequence, although there is more than one way to arrange it. We’ll see how the entries perform with this small example. The chains in the test and contest suites have as many as 100 amino acids in a particular protein sequence.

Early entries: Spirals and Cheating

We’ll be focusing on many of the entries considered “corner points”, entries that significantly improved the overall score. The first was Stijn Helsen’s “turn around”, which took a spiral approach:


The entry ignores the colors, just trying to stuff everything into as tight a package as possible. It took first place easily at the time.

An even smaller package was achieved, briefly, by E. Brian Welch, who noticed that our contest entry checker did not verify that all turns were taken at right angles. His entry does not produce a very interesting plot, but sure scores very well:


This isn’t the first time E. Brian Welch has found a way to “cheat” the contest. We plugged the hole quickly.
Brian took the lead again, this time legitimately, by submitting an entry that added a “back-n-forth” test. This marked the beginnings of what we see in almost every entry currently: try several different methods, score each, and take the one with the best score. It also marked the first
entry we can find that tests to see if the “reverse” sequence would score better than the given one.

Brian was able to improve the score for our test case:


Brian’s entry still completely ignored the colors of the chain, just placing them in a square back-and-forth arrangement.

More methods, more entries, even more lines…

As the contest continued, many more methods emerged, and if a method made the score drop it was quickly (and I mean quickly!) absorbed by virtually all new entries from that point forward. Eleven hours after Brian Welch’s entry, which checked in at 44 lines, Cory Sharp submitted an
entry of 429 lines that significantly beat all other entries to that point:


Cory’s entry included fourteen different methods for particular situations, including two new ones he called “Snake Disc” and “RLE Snake”. Each of these improved the overall score, and makes significant headway on our test case. It’s clear that entries were now working on each particular sequence, trying to identify which amino acids to bring together.

By 3 pm on Thursday, the number of methods and the number of lines had increased even further, as Stijn Helsen retook the lead by resubmitting Cory’s “RLE Snake” algorithm along with some significant improvements to one of the earlier methods called “zigzag3”:


Stijn’s entry took the lead, but also broke the record for length at a seemingly overwhelming 657 lines.

On the exact opposite end of the spectrum was Kevin McGill, who created a new algorithm he called “just pleats”. It was very fast, and while it did not take the lead, it came remarkably close while taking less than 1/100 the CPU time of the leaders:


On our test case, this is the best yet; no algorithm has done better.

Later, this code was adapted by Carlos Colon in his “quickfix” series. Carlos incorportated Kevin’s algorithm with some of his own to form “quickfixx11”. Incorporating “quickfix11” with the other methods led to “quickfix19”, an entry that took the lead handily, and promptly led many others to make new entries based on it. “quickfix19” checks in at a gargantuan 994 lines of code, which may have led Nicke to submit the entry “quickfix19_4 (on a diet)”.

We did a little more investigation of this phenomenon. Read A Contest Story for more information.

One other remarkable note is that no entry has yet been able to produce the best possible fold for our sample sequence. We’re pretty curious to find out what will happen next.

The details: What methods?

One thing we noticed about the entries being submitted is that many different methods are tried for each test case. Some are very effective, some not so effective, some fast, some slow. We decided to take a closer look at one “corner point” entry and find out which method is being used
to return the final answer for all 106 entries in the sample test suite.

The entry used is Colin Ross’s “Never Too Quirky” (entry #8402). According to Colin, the significant gain was made when he noticed that one of the methods was not being scored in reverse.

We edited Colin’s file to output a string indicating the method used for returning a solution. Looking through the 710 lines of this file, we identified the 17 different methods it contains. We identified each by a string based on the code we saw, and tried to identify the original
contributor. Our apologies if we made any mistakes in attribution:

  • ‘small’ — sequence with 2 or fewer acids
  • ‘two green’ — sequence with exactly 2 greens
  • ‘dum’ — hardcoded solution by Justine Tester
  • ‘zigzag1’ — by Stijn Helsen
  • ‘zigzag2’ — by Claus Still
  • ‘zigzag3’ — by Stijn Helsen
  • ‘jgh’ — by Philip Top
  • ‘spiral diamond’ — by Philip Top
  • ‘zigzag diamond’ — by Philip Top
  • ‘new ebw’ — by Brian Welch
  • ‘interleave spiral’ — by D. Holmes
  • ‘snake disc’ — by Cory Sharp
  • ‘rle snake’ — by Cory Sharp
  • ‘snake square’ — by Jack Snoeyink
  • ‘msolver1’ — by Carlos Colon
  • ‘msolver2’ — by Carlos Colon
  • ‘msolver3’ — by Kevin McGill

We ran Colin’s solver and set up a bar graph for the number of times each method was used (creating “seemethods.m”, a modified version of “runcontest.m”).

figure methodset = {'small','two green','dum','zigzag1','zigzag2','zigzag3','jgh',...
'spiral diamond','zigzag diamond','new ebw','interleave spiral','snake disc',...
'rle snake','snake square','msolver1','msolver2','msolver3'};
methodlist = seemethods;
[dummy,whichmethod] = ismember(methodlist,methodset);
for ix = 1:length(methodset)
  methodcount(ix) = sum(whichmethod == ix);
barh(methodcount)                        % Horizontal bar graph
set(gca, 'YTick', 1:length(methodcount));
set(gca, 'YTickLabel', methodset);       % Put method names on y-axis
set(gca,'Ydir','reverse')                % ... from top to bottom.
set(gca,'TickLength',[0 0])              % Get rid of tick marks
set(gcf,'Position',[20 260 780 420],'Color','w')

This bar graph shows that Kevin’s and Stijn’s contribute the most to the scores, but a lot of different methods are paying dividends. Curiously, some of the seventeen methods we identified were not used at all, and others were only used once! Again, this speaks to the type of work done by Per Rutquist and others, removing code that has been pushed aside by other work.

As the contest continues and the solutions continue to evolve, we’ll keep an eye out. If you notice something interesting happening, please post on the contest thread and tell everyone.

Thanks for all your effort and skill in what has quickly become a mishmash of modules!

See you in the queue,
The MATLAB Contest Team


Comments are closed.