At Honeycommb, we have to deliver hundreds of applications to both our developer account and our customers' accounts. When we first started, we deployed manually from our laptops and could ship up to 10 apps a week without it being a massive burden. As we ramped up, we found ourselves spending too much time on deploying and not enough time on building.
Version 1.0
Our first attempt at scaling up depended on fastlane and Travis CI. When we deployed, we would queue one job per app, so each customer required two jobs: one for iOS and one for Android.
This served us well for a few years, but we eventually ran into a concurrency limit. No app depended on any other, so every job could have run in parallel, but we only had 4 workers. They were constantly maxed out, and the time to deploy all apps was approaching 24 hours.
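To make the bottleneck concrete, here is a rough back-of-the-envelope model of a fixed worker pool draining a queue of independent jobs. The customer count and per-job duration below are hypothetical placeholders, not our real figures, and the model assumes every job takes the same amount of time.

```python
import math


def deploy_hours(customers, workers, minutes_per_job, platforms=2):
    """Estimate wall-clock hours to drain a queue of independent build jobs."""
    jobs = customers * platforms        # one iOS job + one Android job per customer
    rounds = math.ceil(jobs / workers)  # each worker handles one job at a time
    return rounds * minutes_per_job / 60


if __name__ == "__main__":
    # Hypothetical inputs: 100 customers, 28-minute jobs, 4 workers.
    print(f"{deploy_hours(customers=100, workers=4, minutes_per_job=28):.1f} hours")
```

With these made-up inputs the estimate lands a little over 23 hours. Since the jobs are independent, the only lever in this model is the number of workers.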
As much as we would prefer to never ship any bugs, there were times when we needed a quick turnaround to fix high-priority issues in production. Waiting 24 hours for a deploy was unacceptable in an emergency.
Another issue was that auto-submitting iOS apps required the worker to wait until the build was processed. During this time, the worker sat idle until Apple had finished processing the build.
Version 2.0
Taking what we learned from 1.0, we made some changes to make things faster and more cost efficient.
At the time, a new company, Buildkite, was starting up, and it had some interesting characteristics that we wanted to explore.
- Buildkite let you run builds on your own infrastructure, meaning you could use any hardware you wanted. Travis only let us select from the prebuilt options it supplied.
- Buildkite agents could be installed anywhere. We had older MacBooks in the office that developers no longer used, and they seemed to be prime candidates for CI machines.
For Android builds, we chose to move our workload to AWS using https://github.com/buildkite/elastic-ci-stack-for-aws. We would provision several spot instances (because they're cheaper) to scale up for the Android app deploy. We could deploy all the Android apps in a little over an hour rather than the 24 hours it used to take.
For iOS builds, we moved the workload onto our leftover MacBooks. The other change we made was to have a single machine dedicated to uploading to Apple. The general workflow was as follows:
- Workers 1-n are responsible for building IPAs and saving them as artifacts.
- A dedicated upload worker takes the IPA and delivers it to Apple, then waits for Apple to finish processing so we can auto-submit the application.
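The workflow above could be expressed as a generated Buildkite pipeline along these lines. This is only a sketch of the idea, not our actual pipeline: the queue names, app identifiers, artifact paths, and the two shell scripts are all hypothetical. It relies on standard Buildkite step attributes (`key`, `depends_on`, `agents`, `artifact_paths`) and the `buildkite-agent artifact download` command.

```python
import json


def ios_pipeline(app_ids, build_queue="ios-build", upload_queue="ios-upload"):
    """Return a Buildkite pipeline dict with one build and one upload step per app."""
    steps = []
    for app_id in app_ids:
        build_key = f"build-{app_id}"
        ipa_path = f"build/{app_id}.ipa"  # hypothetical output location
        # Any agent on the build queue can pick this up and store the IPA.
        steps.append({
            "label": f"Build {app_id}",
            "key": build_key,
            "command": f"./scripts/build_ipa.sh {app_id}",  # hypothetical script
            "agents": {"queue": build_queue},
            "artifact_paths": ipa_path,
        })
        # Only the dedicated upload machine listens on the upload queue, so
        # waiting on Apple never blocks a build worker.
        steps.append({
            "label": f"Upload {app_id}",
            "key": f"upload-{app_id}",
            "depends_on": build_key,
            "command": [
                f"buildkite-agent artifact download {ipa_path} .",
                f"./scripts/upload_and_submit.sh {ipa_path}",  # hypothetical script
            ],
            "agents": {"queue": upload_queue},
        })
    return {"steps": steps}


if __name__ == "__main__":
    # JSON is accepted by `buildkite-agent pipeline upload`.
    print(json.dumps(ios_pipeline(["acme", "globex"]), indent=2))
```

Routing by queue is what keeps the roles separate: build agents never see an upload step, and the upload agent never spends time compiling.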
We had looked into using AWS Mac instances, but it was much cheaper for us to run on the MacBooks we had already purchased.
At the time of writing, we can get through hundreds of apps in a little over 3 hours instead of the previous 24.
Future
Scaling our new infrastructure is much simpler now since it's just a horizontal scaling problem. We can just add more machines to improve our delivery time.
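As a rough illustration of what horizontal scaling means here, the sketch below estimates how many build machines are needed to finish a batch within a target window. The job count and per-job duration are hypothetical, and the model deliberately ignores the dedicated upload worker and Apple's processing time, so treat the result as a lower bound.

```python
import math


def machines_needed(jobs, minutes_per_job, target_hours):
    """Smallest number of build machines that finishes `jobs` within the target."""
    # How many jobs a single machine can run back to back inside the window.
    jobs_per_machine = math.floor(target_hours * 60 / minutes_per_job)
    if jobs_per_machine < 1:
        raise ValueError("target window is shorter than a single job")
    return math.ceil(jobs / jobs_per_machine)


if __name__ == "__main__":
    # Hypothetical inputs: 300 jobs of 20 minutes each, 3-hour target.
    print(machines_needed(jobs=300, minutes_per_job=20, target_hours=3))
```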