w-shingling finds similarities in 6 Legend of Zelda Game Boy games
I am playing Dr. Mario in my living room, frantically destroying viruses. The "Win!" message appears and I press start to begin level 20. The board fills up with viruses, and I wait until it's done to start. And I wait. I wait...what is going on here? Looks like the game has soft locked!
There is only thing to do - reverse engineer the game board initialization routine and figure out why this has happened.
Fast forward a few nights of fun and the reason has become clear. #gbdev is the first to know about the reason behind it. Someone brings up a good point: is this fixed in a later revision? It hadn't even occurred to me there was a second revision of Dr. Mario, but there is! My interest in finding similarity between two files begins here.
The pain of finding similarities in two files by hand
Like anything, we need a base case to work from. How do we know our similarity checker is actually working? Before I had the idea of writing any software, I naively went ahead and opened the two files in a binary diff program. Of course if anything is offset by even one byte, the program will report a massive block of different bytes. So in order for me to check the whole file, which lucky me is only 32kibibytes (not a typo), I align the data by removing and adding bytes until there are no more differences. Using this method I was able to find every single piece of data that development team had changed between revisions, and define what they were.
Unfortunately this process took me two nights.
Around the same time, the developer of Aevilia, ISSOtm, began to disassemble the game in order to compile it with RGBDS, like what the people behind pokered did, because why not. This is when a great idea hit us: why not make a dual Dr.Mario/Tetris game, like what Nintendo released for the SNES? In order to do this effectively, the games would have to shrink to 16kibibytes each. The best way to do that? Remove the similarities between games.
![]() |
| The SNES intro screen. I think a Game Boy version would've been fun! |
At this point it had clicked that I wanted a tool to find similarities between Game Boy games.
Sif, Manber fingerprinting, W-shingling? What?
Before writing software I ask: does it exist, and does it meet my requirements? A quick search and I couldn't find anything that resembled what I was looking for. There were fortunately a few academic papers describing finding similarities between files, in particular, one by Udi Manber from 1994. This paper details how to create fingerprints, how to select fingerprints and generally how the whole system of comparing things should work. It's a very cool read. The paper describes a prototype program called sif. Out of curiosity, I emailed Udi to see if he still had the source, in hopes of using the program. To my surprise he replied that the source was not, and probably never will be, public-ready. He did point me to another group that had figured out the same method of comparing two documents, and that they were calling it w-shingling. The issue with their system though, is that it operated at the text level. I needed something that was data agnostic.So I implemented sif in Rust, and called it rif.
Where's my Zelda...
Armed with rif and my manual Dr. Mario comparisons, I tested it out a few times and it seems to work as expected. The difference is about 3% between the two revisions. Sweet!For a "stress" test, someone recommended I try it with the assortment of Zelda versions and revisions. There are about 6 versions in total, so it would be interesting to see how the differ between release times and revisions.
$ ./rif gb/Legend\ of\ Zelda\,\ The\ -\ Link\'s\ Awakening\ \(USA\,\ Europe\).gb gb/Legend\ of\
Zelda\,\ The\ -\ Link\'s\ Awakening\ \(*
73.99991% : gb/Legend of Zelda, The - Link's Awakening (Canada).gb
62.69528% : gb/Legend of Zelda, The - Link's Awakening (France).gb
94.14732% : gb/Legend of Zelda, The - Link's Awakening (Germany).gb
99.99774% : gb/Legend of Zelda, The - Link's Awakening (USA, Europe) (Rev A).gb
100% : gb/Legend of Zelda, The - Link's Awakening (USA, Europe) (Rev B).gb
We can make a few assumptions from this:
- France's version is most likely the last version in the 1st release phase (December 1993, which turns out to be true!). German was second and USA/EU was first. This (well, hopefully!) means France should have the most stable version of the game.
- Because a difference of 38% is so high, we should look to see if they did any major or minor changes to code to the France version.
- Since the chronology matches that of the differences between the games, we can analyze the changes they did over time with confidence.
- Rev A and B for USA/EU do not seem to make significant changes.
We need to be careful though. Some cartridge dumpers use random values to fill in empty spaces, causing a lot of random noise. This is one limitation of rif. For example, if wrote a document, created a copy, and replaced the copy's spaces with letters, rif would report that the two documents are not similar. A recommended fix is to remove all padding and whitespace when doing comparisons. To me this is not an ideal solution - we should not have to prepare documents before-hand.
Another issue is - at least for game comparison - if game A is smaller than game B, then there is a good chance that all of game A's fingerprints will be found in game B's fingerprints. The result will be something like 99% or 100%. One way to correct this is to pad game A in order to generate more fingerprints. It could be seen as a feature though, in case someone is curious how much of A can be reconstructed in B.
Here is an example, using Dr. Mario and Tetris 2:
$ ./rif gb/Dr.\ Mario\ \(World\).gb gb/Tetris\ 2\ \(USA\).gb
100% : gb/Dr. Mario (World).gb
99.429756% : gb/Tetris 2 (USA).gb
At first glance I would think "Wow! Tetris 2 is just like Dr. Mario!!", but the reality is, Tetris 2 contains 99% of the fingerprints Dr. Mario has.
It's because of results like these too, that the size of the "window" used to create the fingerprints must be adjusted, as well as the amount of fingerprints selected. For small 32kibibyte Game Boy games, I am using a 20 byte window and 6 bits of random selection bits from the hash, which gives me about 5000 fingerprints per 32kibibyte game.
The nice thing about this algorithm is you can calculate these fingerprints offline for later use too. In particular it can be used to check for plagiarism between documents or even be used for network data caching.
The next step would be to do a cross comparison of all Game Boy games in existence, and see what insight we can gain from it.
A person could also build an effective personal search engine. Pass it a document and it will return similar files with "pointers" into them if the user wanted to dig deep to see exactly where they are similar.
rif source is not yet available. As I keep using it, I find things that could be done better or more generally. You can find the fingerprinting package for Rust. Be warned, it is not optimized in the slightest! It is however a fairly clean and flexible implementation.


One technique for dealing with size differences between the files is known as "min-hashing". Instead of comparing all the hashes, you compare a "randomly" selected small percentage by comparing only those hashes with minimum hash value. See https://github.com/BartMassey/simhash for an old C implementation of this idea as instantiated by Menassas's "shingleprinting" algorithm, with references and stuff: you should be able to find it in your random Linux distro. Please feel free (obviously) to steal code and/or ideas for your Rust implementation.
ReplyDelete