Write it three times
I have a theory that when writing software, you will do your best work only after you have written it for the third time. My hypothesis is that the first time you write it, you don’t really know what the problem is and therefore are just stumbling in the dark trying to find your way. The second time you understand the problem domain, but you haven’t worked out the best way to implement it yet. And then the third time you not only know the problem domain but also understand the implementation constraints, and therefore can create some of your best work.
To illustrate this, let’s look back at a problem I’ve solved three times. This problem is about scheduling people at an UnPlugFest. The Bluetooth SIG holds testing events three or so times a year. At these events a few hundred people all gather in a large ballroom, typically at a swanky hotel someplace on the planet, and then test against each other. When I first attended these events, as an engineer, it was frustrating that our schedules for the day were only available on the morning of that day. We couldn’t plan ahead, and we couldn’t ask for a specific test partner, either because we knew there was a problem that needed to be tested or because we’d already tested with them, found a problem, fixed it, and wanted to do a quick retest.
The reason these schedules were published so late is that they were being done manually on a spreadsheet. Yes, somebody would go back to their room at the end of the day and manually create a new schedule for the next day. This was not only inefficient; it was a database problem, not a spreadsheet problem.
To resolve this, I offered to implement a scheduling system for a future event. I needed actual data to run tests against, and I needed time to work out how to implement this. After all, it is just “speed dating” for engineers, it can’t be that hard! Actually, after three times, it wasn’t hard at all.
I broke out the text editor, spun up Apache, PHP and MySQL. These weren’t great choices, but it was what I could get running in a few hours. The PHP code I wrote was terrible, but it worked. Not only did it work, but we could create a new schedule in about 10 minutes, rather than 5 hours. An improvement certainly, but not earth shattering. Another little improvement was that we could also publish this schedule directly to participants’ computers. This did require a local area network to be installed at each event, but that wasn’t too much work.
But I was not happy with this implementation. It was slow, clunky, and often stopped working for no apparent reason. OK, there was a very good reason, and it mostly involved me having written some very bad code. This became a meme after a colleague of mine had shirts made for the next event with the phrase “Has Robin fixed the scheduler?”. But remember the hypothesis: the first time is just there to prove it works. Let’s just call it a prototype.
Between events I started again from scratch on the second attempt. This time I changed a few things. Gone were Apache and PHP, replaced with Python, but I kept MySQL. This time it was a much better system. We even created a live scrolling schedule displayed up on a big screen at one end of the ballroom with the current test matches and the tables they were scheduled at. This version did a much better job at optimising people’s schedules. Instead of just getting things working, it also scored test matches based on how much the two sets of engineers could test, and then assigned those matches to tables based on how little people had to move. If one platform had just moved table, then it tried really hard to stop them moving the next time things changed.
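As a rough illustration of the scoring and table-assignment idea described above, here is a minimal sketch. It is not the actual scheduler code: the greedy pairing strategy, the function names (`score`, `pair_platforms`, `assign_tables`) and the data layout are all assumptions made for the example, and it assumes there are at least as many tables as matches.

```python
from itertools import combinations


def score(a, b):
    """Score a match by how many features both platforms can test (assumed metric)."""
    return len(a & b)


def pair_platforms(features):
    """Greedily pick the highest-scoring matches; each platform is used at most once.

    `features` maps a platform name to the set of things it can test (hypothetical layout).
    """
    ranked = sorted(
        combinations(sorted(features), 2),
        key=lambda pair: score(features[pair[0]], features[pair[1]]),
        reverse=True,
    )
    used, matches = set(), []
    for a, b in ranked:
        # Skip platforms that are already matched and pairs with nothing in common.
        if a in used or b in used or score(features[a], features[b]) == 0:
            continue
        matches.append((a, b))
        used.update((a, b))
    return matches


def assign_tables(matches, current_table, tables):
    """Seat each match at a table one of its platforms already occupies, if it is free.

    Assumes len(tables) >= len(matches).
    """
    free = list(tables)
    assignment, pending = {}, []
    for a, b in matches:
        for platform in (a, b):
            table = current_table.get(platform)
            if table in free:
                assignment[(a, b)] = table
                free.remove(table)
                break
        else:
            # Neither platform can stay put; seat this match once the others are placed.
            pending.append((a, b))
    for match in pending:
        assignment[match] = free.pop(0)
    return assignment
```

A greedy pass like this is simple to reason about but does not guarantee the best overall schedule; it only shows how a score and a "move as little as possible" rule can be combined.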
Of course, people being people, they wanted some improvements. A few platforms wanted fixed tables, so they never had to move. Others wanted to run interoperability tests alongside the UnPlugFests, at the same venues. Interoperability testing is testing done between different implementations of new upcoming specifications. This is one thing that the Bluetooth SIG does that really improves the quality of the specifications when they are published. These all added complexity to the system, and even though the system ran better than before, it was still a little clunky to set up, and difficult to use. But I now knew the problem domain pretty well and how to do the implementation. Time to rewrite it for the final time, and make that shirt history.
I again sat down with a completely clean text editor and started again. This time it was Python and SQLite. I did consider using a compiled language like C, but I also wanted this system to be run by a non-engineer at events. I wasn’t going to be babysitting the system at each event for the rest of my life, and this final rewrite was a way to achieve this. Although I must admit the air miles were great and the room upgrades at the hotels were lovely.
This time, I started by rewriting the scheduling algorithm itself. I’d gotten it down from 20 minutes in PHP to about 30 seconds in Python. This time I wanted it to be a lot quicker. Instead of doing complex queries in an external database process, I just held everything in memory and only used the SQLite database to save state, so that the state could be reloaded when the process restarted. This took the rescheduling time from 30 seconds to just a couple of seconds. Yes, I could probably have got this down to a few hundred milliseconds, but that wasn’t a requirement and would just be considered technical masturbation.
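The "hold everything in memory, use SQLite only to save state" approach can be sketched roughly as follows. This is an illustrative assumption, not the real implementation: the `Scheduler` class, the single-row `state` table and the JSON snapshot format are all invented for the example.

```python
import json
import sqlite3


class Scheduler:
    """Keeps the schedule in a plain dict; SQLite only stores a snapshot (hypothetical design)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        # A single-row table holding the whole state as one JSON document.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS state ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), data TEXT NOT NULL)"
        )
        row = self.db.execute("SELECT data FROM state WHERE id = 1").fetchone()
        # Reload the previous state if the process was restarted, else start empty.
        self.matches = json.loads(row[0]) if row else {}

    def set_match(self, table, a, b):
        """Update the in-memory schedule, then persist a snapshot."""
        self.matches[str(table)] = [a, b]  # JSON object keys are always strings
        self.save()

    def save(self):
        # The connection used as a context manager commits on success.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO state (id, data) VALUES (1, ?)",
                (json.dumps(self.matches),),
            )
```

All reads and scheduling decisions work on `self.matches` directly, so the database is never on the hot path; it is only touched when state changes and once at start-up.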
Next, I implemented the public interface. Instead of making everything wait on the database and using multiple threads, I went for a simple single-threaded model where each query just gets the state from memory. This was really quick. It helped that the web pages created were just plain HTML with no JavaScript. It worked, and reduced the number of dependencies. It also meant that any web browser could use the system, including text-based browsers like w3m that some of the engineers would run.
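A single-threaded server that renders plain HTML straight from in-memory state needs nothing beyond the Python standard library. The sketch below is only an assumption of what such a server could look like; the `SCHEDULE` data, `render_schedule`, `ScheduleHandler`, `serve` and the port number are hypothetical and not taken from the real system.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory state: table number -> the two platforms testing there.
SCHEDULE = {1: ("Platform A", "Platform B"), 2: ("Platform C", "Platform D")}


def render_schedule(schedule):
    """Build a plain HTML table from the in-memory schedule."""
    rows = "".join(
        f"<tr><td>{table}</td><td>{escape(a)}</td><td>{escape(b)}</td></tr>"
        for table, (a, b) in sorted(schedule.items())
    )
    return (
        "<html><body><table>"
        "<tr><th>Table</th><th>Platform</th><th>Partner</th></tr>"
        f"{rows}</table></body></html>"
    )


class ScheduleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every request is answered from memory; there is no database query here.
        body = render_schedule(SCHEDULE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """HTTPServer handles one request at a time, so no locking is needed."""
    HTTPServer((host, port), ScheduleHandler).serve_forever()
```

Calling `serve()` starts the server; because the page is a static table with no scripts, it should display the same way in a graphical browser and in a text-based one such as w3m.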
At the first event this was run at, there was no downtime. It ran perfectly. Most people didn’t notice anything different, as the user interface had mostly stayed the same. However, on the second day, somebody suggested a new feature. You know those questions that start, “You know what would be really great is if we could do…”. That new feature was implemented by the end of the day. The code base was good. It worked. It did the job required. It was easily extensible. And it was now time to hand it over.
The Bluetooth SIG’s staff now run the scheduler. I spent a week at their headquarters doing the handover. I think they were surprised how simple and elegant the code was. They were also surprised that I didn’t use any frameworks. Last time I saw the system, it looked mostly identical to what I had created. Obviously, they removed the if username == 'robin': is_admin = True line of code, but added a couple of their own special users.
In summary, it took three attempts to create this system. The first time was really just a prototype. It just about worked, but also sometimes didn’t. The second time, the system worked well, but wasn’t implemented very efficiently. The third time, the system worked great, and had an elegant implementation.
Great courage is required to turn around to your boss and say that the code you’ve been working on for the last few days, weeks, or even months is crap and needs to be rewritten. Doing that twice is brave. But at the end, you will have a much better system. Just make the first iteration very quick and dirty and hacky. Learn stuff. Break it. Fix it. Then rewrite it, again.