The unreasonable effectiveness of algorithms in boosting team happiness

Distributing shifts fairly and efficiently between a team of people has been a long standing challenge. With the rise of remote working and the distribution of teams across different time zones, this challenge has become exponentially more difficult. It’s an awful problem to solve by hand, but can be very interesting and rewarding when you do it with advanced constraint programming algorithms!

At balena, we realize that the difference between a bad schedule and a great one could mean the difference between someone working late when they didn’t necessarily have to versus spending extra time with their family. If both outcomes can satisfy the scheduling constraints, why not try hard to find the schedule that results in the best outcome for the team? It’s for this reason we devote so much time and energy to scheduling, and ultimately to maximize the happiness of our team.

I’m Alida, a team data engineer at balena, and in this post I’m taking you on a deep dive into our fully automated process that resolves all the constraints and ultimately schedules our support team.

The challenge

At balena, we practise support-driven development; you can read more about this philosophy at the end of this blog post. This means that we don’t outsource our customer support; it’s handled by my fellow engineers, who work from a wide variety of time zones, and across flexible working hours. This enables us to offer customer support for 16 hours every weekday, from 8 am to midnight London time (UTC+1 during daylight saving time, UTC otherwise), every week of the year. Two support agents are on duty simultaneously during these hours, to make sure all customer queries are addressed as swiftly as possible, and to serve as each other's backup. This amounts to \(16\times5\times2=160\) support hours to be covered per week. We also provide support at weekends but this is a simpler, voluntary setup not currently covered by the automated system.

This presents a substantial scheduling challenge: how do we distribute the support load fairly amongst our engineers, whilst still allowing focus on their core within in the company?

In short, we achieve this via the use of constraint programming with the following goal: maximizing support scheduling fairness and efficiency, while minimizing pain.

We combine the personal preferences of each agent, along with soft constraints (e.g. don’t ask the team to work out of preferred hours), and hard constraints (e.g. all shifts must be covered), and feed them into our core scheduling algorithm in order to produce a schedule that satisfies all requirements in the most considerate way.

Where do we start?

Scheduling is exactly the type of problem that a machine solves better than a human.

To understand why, we first need to appreciate the scale of that problem. Let us consider a simplified situation where we divide the 16 hours into exactly 4 four-hour shifts, yielding 8 shifts per day when multiplying by the 2 parallel tracks. This yields a total of 40 shifts per week. Our growing number of support agents is currently around 45, including the new engineers onboarding this month. The 40 shifts can be filled by 45 agents in \(45^{40}\sim 1.3 \times 10^{66}\) ways! This is reduced by a factor of 2 when removing the symmetry of simultaneous shifts. We actually started out with a constant shift length of 4 hours, but now the shift lengths can vary from 2 to 8 hours (a lunch break is allowed!) based on personal preference. Such freedom in shift length increases the number of possibilities again.

There are a staggering number of schedules that could be created, but due to agents' availability limits, not all of these would be viable. Finding by hand even a single solution that satisfies all the constraints is already very difficult. Not to mention tedious, repetitive, and prone to errors.

If this task were to be carried out manually, and assuming that person survives the boredom, there will almost certainly be some degree of suspicion on the part of the scheduler's colleagues as to whether everyone was treated fairly, and whether there might have been some bias on the part of the scheduler to favor their friends, etc. Even if these suspicions happen to be completely unfounded, their existence can still create tension within the team.

The realisation that some solutions are more desirable than others adds another layer to the problem. The most important consideration here is the agents' availability, which they specify per hour as one of the following: unavailable, available (preferred) and available (non-preferred). An agent is never scheduled for a shift when unavailable, and we strive to keep shifts within agents' preferred hours as far as possible, using non-preferred hours only when absolutely necessary. Furthermore, if an agent has had a heavy support load in the past, we also try to lighten their load for the next week's schedule by shifting the weight to other agents. These considerations, amongst others detailed below, constitute a score for “agent pain” and enable the algorithm to determine whether one solution is "better" than another.

Optimally, one would like to find the "best" schedule every time, a feat which is impossible for a human scheduler to accomplish in any reasonable time frame. Given enough time and computing resources, an algorithm can actually find the best schedule, but even if it can just find a schedule that is better than many others, it will still outperform a human scheduler by far. Our algorithm is currently run for several hours at a time, but tends to find significant improvements only in the first half-hour, with tiny improvements thereafter, essentially flat-lining after a couple of hours. Score-versus-time curves obtained by running the scheduler for four different weeks, are shown in the figure below, truncated at 3 hours.

Pain scores achieved by scheduler for 4 different scheduling weeks

Hence the “unreasonable effectiveness” of a cold, impersonal scheduling algorithm in maximizing happiness and optimizing the use of agents’ time, much more effectively than even the most fair-minded, empathic person. In fact, it maximizes my own happiness as well; instead of working on a boring, labor intensive process that repeats every week, I get to work on building and optimizing scheduling algorithms!

Now, let’s look at exactly how this is accomplished. The code is publicly available on GitHub, you're welcome to try it out!

The model

In order to obtain a solution to our scheduling problem through an algorithm, a mathematical representation (a “model”) of the problem needs to be created first, which can then be coded into a form that is understood by the constraint solver. Any solution to this model represents a viable support schedule for a given week, while the optimal solution represents the best schedule.

Before we dive into the model parameters, I want to dedicate a short section to “agent availability”, which is represented by the symbol \(\mathcal{H}_n\) in the parameter list below. Even though agent availability is not the only input to the scheduler, it is so fundamental to maximizing happiness that it deserves special attention.

Agent availability

In order for us to minimize our support agents' pain, we need to know when exactly they prefer to work. To this end, we have set up a team model in Google Sheets, where agents enter their preferences.

  • The default time zone from where they work, as well as temporary overrides for business and personal trips.
  • Their unique availability for every hour, Monday to Friday, in their local time zone: unavailable, available (preferred) or available (non-preferred), as shown below.
  • They also enter their preferred support shift length here, for example 5 hours per shift. This technically not part of their “availability”, but also an important preference that is represented by the parameter \(p_n\) in the model below.

Personal working hours and shift length preferences

In addition to these preferences, the following data is also incorporated:

  • Agents' time off data from the calamari.io leave management system, to avoid booking agents when they are on holiday.
  • Existing events from the agents' Google Calendars, to avoid clashes.
  • Possible opt-out exceptions for certain agents: from time to time, an agent may be excused from support duty for a few weeks to focus on other urgent work.

Most of the data processing in the team model spreadsheet is performed with customized Google Apps Script functions. This involves a lot of date, time and time zone calculations, which makes extensive use of the Moment and Moment Timezone JavaScript libraries.

Model parameters

The global model parameters (with defaults where applicable) are:

  • \(N\) (number of active support agents for the week)
  • \(n \in \mathbb{Z}:~n \in [1, N]\) (counter for specific agent)
  • \(M=5\) (number of consecutive days to schedule)
  • \(m \in \mathbb{Z}:~m \in [1, M]\) (counter for specific day)
  • \(T = 2\) (number of simultaneous support tracks)
  • \(t \in \mathbb{Z}: t \in [1, T]\) (counter for specific track)
  • \(S=8\), \(E=24\) (balena support start and end times)
  • \( d_{\mathrm{min}} = 2 \), \( d_{\mathrm{max}} = 8 \) (shift length limits)

For each agent, there is also the following set of parameters:

  • \(\mathcal{H}_n\) (domain constructed from ranges of available hours)
  • \(p_n\) (preferred shift duration)
  • \(a_n\) (historical average support hours per week)
  • For each day \(m\) and each track \(t\), solving the model requires finding appropriate values for:
    • \(s_{nmt}\) (shift start time)
    • \(e_{nmt}\) (shift end time)
    • \(d_{nmt}\) (shift duration)
    • \(i_{nmt}\) (interval constructed from the 3 variables above)

Hard constraints

The "hard constraints" are hard limits in the model, so that a solution is only valid if all of these are satisfied. These are:

  1. Each interval \(i_{nmt}\) is contained in the availability domain \(\mathcal{H}_n\), since an agent can only do support when actually available.

  2. For any given day \(m\) and track \(t\), the intervals \(i_{nmt}\) may not overlap, so that the next agent takes over just as the previous one goes off support.

  3. For any given day \(m\) and track \(t\), \(\sum_{n=1}^N d_{nmt} = E - S\).

    This simply means that all the shift durations must add up to the 16 support hours we cover, so that there are no gaps in the schedule.

  4. The value of each shift duration is constrained to \(d_{\mathrm{min}} \leq d_{nmt} \leq d_{\mathrm{max}}\) if agent \(n\) was assigned, or \(d_{nmt}=0\) if agent \(n\) was not assigned.

Soft constraints ("pain")

The model also contains several "soft constraints", which do not exclude certain schedules like the hard constraints, but rather define which schedules are better than others. Our model currently has 5 soft constraints, which are expressed mathematically as \(S_1,...,S_5\) , the sum of which can be regarded as the total pain associated with a specific schedule, i.e.

$$\mathrm{Pain} = S_1 + S_2 + S_3 + S_4 + S_5$$

Finding the best solution to the model means finding the schedule with the lowest corresponding pain value. The soft constraints, together with their corresponding mathematical expressions, are listed below. Note that each term is summed over all agents, days and simultaneous tracks, so that it represents the pain for the week's schedule as a whole.

  1. Penalty for total number of non-preferred hours used during assigned shift intervals. This is considered to be the most pronounced type of pain, since non-preferred hours fall outside an agent's normal working hours:
    $$S_1 = 8 \sum_{t=1}^T \sum_{m=1}^M \sum_{n=1}^N \left( \text{number of non-preferred hours in }i_{nmt} \right)$$

  2. Penalty for deviating from agents' shift length preferences (assigning an agent a shift longer than the preferred length is penalized more heavily than assigning a shorter shift):
    $$S_2 = \sum_{t=1}^T \sum_{m=1}^M \sum_{n=1}^N \left\{ \begin{array}{ll} 3(p_n - d_{nmt})
    &\text{ if \(d_{nmt} < p_n\) and \(d_{nmt} \neq 0\)} \cr 4(d_{nmt} - p_n)
    &\text{ if \(d_{nmt} > p_n\)} \cr 0
    &\text{ if \(d_{nmt} = 0\)} \end{array} \right. $$

  3. Penalty for assigning too many hours per week to individual agents (in other words, not spreading the load):
    $$S_3 = 0.2 \sum_{n=1}^N \left(\sum_{t=1}^T \sum_{m=1}^M d_{nmt}\right)^2$$

  4. Penalty for assigning shifts to agents that have had a heavy past support load. The value of \(a_n\) for the agent with the lowest historical load is taken as a baseline, so that the cost for assigning shifts to this agent would be zero, while assigning any other agent would carry a non-zero historical cost.
    $$S_4 = 3\sum_{t=1}^T \sum_{m=1}^M \sum_{n=1}^N \left\{ \begin{array}{ll} a_n - {\rm min}\{a_1, ..., a_N\}
    &\text{ if \(d_{nmt} \neq 0\)} \cr 0
    &\text{ if \(d_{nmt} = 0\)} \end{array} \right.$$

  5. Penalty for the number of support hand-overs taking place. At the end of each shift, the agent going off support needs to add some notes and comments on existing customer queries that are being handled, so that the next agent can take over as seamlessly as possible. However, a hand-over does create a discontinuity in the support flow, therefore having too many is not desirable. If there are, for example, 4 consecutive shifts for a specific day and track, there are 3 hand-overs, hence the "minus 1" in the expression below:
    $$S_5 = 3\sum_{t=1}^T \sum_{m=1}^M \left[ \left(\sum_{n=1}^N \left\{ \begin{array}{ll} 1
    &\mbox{ if \(d_{nmt} \neq 0\)} \cr 0
    &\mbox{ if \(d_{nmt} = 0\)} \end{array} \right.\right) - 1\right]$$

Note that the expression above goes to zero for a given track and day if there is only one 16-hour shift, which would be the "perfect" solution without a hand-over, if this was the only consideration. Remember, however, that the maximum shift length is set to 8 hours. Furthermore, the agents' preferred shift lengths (implemented in \(S_2\)) are mostly shorter than this. These factors counterbalance the \(S_5\) term so that there has so far been at least 3, and usually more, agents assigned per 16-hour block.

You would have noted the seemingly arbitrary constant factors in each of the expressions above. These represent the weights of each term \(S_1,...,S_5\) respectively in calculating the total pain, and were set through experimentation to values that yield a realistic balance.

Maximizing happiness

The core scheduling algorithm takes as input all of the above, and provides as output the best possible schedule it can find within the optimisation time set by the user. The central part of the algorithm is a constraint solver, and we currently use the Google OR-tools CP-SAT solver, which is well suited to scheduling optimisation.

It is a CP (constraint programming) SATisfiability solver, which means that it finds a solution to a set of boolean constraints. It can handle only 2 variable types, integer and boolean, and it is mainly for use with linear constraints, although it can handle variable products as well. It also supports multi-threading.

Output, finalisation and record-keeping

Once the final schedule is produced by the algorithm, the shifts are saved in a JSON file, and an automated script writes the events to our support calendar, sending invites to all the assigned agents. This happens early in the week that precedes the assigned shifts. If an agent is unable to take up an assigned shift, I facilitate a swap or a new assignment, and update the calendar.

It can also happen that an agent is initially happy with an assigned shift, but is unable to take it up on the day due to a work-related or personal emergency, and a substitute is then found on an ad hoc basis. Any such last-minute changes are also maintained on the support calendar.

To keep the support history statistics accurate, we need to keep track of every hour of support service provided by every agent. This is achieved through a Google Apps Script function that is run every week, reading all the past week's actual shifts from the support calendar (which might be slightly different than the hours assigned originally), and recording it in a Google Sheet that contains our support data. Every non-preferred hour worked counts twice as much as a preferred hour. The updated historical average \(a_n\) for each agent is then used during the optimisation of the next schedule.

Next steps

In terms of constraint solvers, our initial solution made use of the Z3 Theorem Prover, which also worked well. However, for our particular problem, the Google OR-tools CP-SAT solver was able to find a solution with a lower pain score in less time, hence our current implementation of the latter. For the support scheduler in particular, these are the only two we have tested so far, although we have utilized the clasp solver for other purposes. Therefore, venturing further into the fascinating world of constraint programming tools and comparing their performance to that of these three solvers to make sure that we use the one best suited to our needs, is high on our list of priorities.

The support schedule is currently calculated on a week-by-week basis, with a given week's schedule being calculated and published early in the preceding week. We are considering a system where the main schedule is determined for a much longer period of time, with an optimizer running on a weekly basis to tweak the schedule (biased towards making minimal changes) based on changes in agent availability as related to leave periods, new calendar events and possible changes in working hour preferences.

Currently, the support load history per agent is simply used to distribute the support load for consecutive cycles, but we are planning to integrate this into a "team service" score per team member, which will increase each time a person "takes one for the team". Other aspects that should form part of this score are: flexibility offered in working hours, actual flexibility used, for example staying late (or getting up early) for a meeting, participation in interviewing job applicants, giving company-wide presentations and writing blog posts, to name but a few.

The main challenge here would be to set up the appropriate weights for the different terms, and to find the best way to balance these activities across the team, so that the "team service" load does not fall almost exclusively on a handful of individuals, like it does in many companies.

Conclusion

At balena, we place a high premium on the efficiency and happiness of both our team members and customers, and implementing constraint programming to schedule shifts has already made a significant contribution to having our support process run smoothly. We are actually using algorithms for scheduling other events as well; we hope to share more about this in a future blog post!

In many ways, we have but scratched the surface of possibilities, and believe that algorithmic optimisation can play a key role in multiplying the impact of our company in ways that may not even be apparent yet.

Still, you may ask, why do we put so much effort into scheduling our engineers’ valuable time for customer support, instead of simply outsourcing the support process? In other words, what is support-driven development and why is it so important to us?

We consider customer support as an invaluable source of data on our product, instead of an inconvenient necessity, which has unfortunately become the standard in modern times. A user has a much more unbiased perspective on the product than an employee within the company, therefore customer feedback can act as a stethoscope for diagnosing bugs and potential for improvement that would otherwise not have been obvious.

Since the engineers providing support have actually built the product, they provide high quality support, which makes our customers very happy. Furthermore, interacting directly with the customers keeps the engineers in touch with how the product is received in the outside world, and helps them to take ownership of product development. The customer empathy they acquire helps them to drive the product forward in a way that maximizes impact. It also provides our engineers with a shared context of the product, which has the natural consequence of keeping the remote team coordinated, without the need for micromanagement. You can read more in the Support Driven Development blog post we published several years ago, when we were still known as resin.io.

Give this a try

We encourage everyone who is interested to try this out. The code behind this guide is openly available on Github.

And always welcome applications under our Open Call as well!

Balena’s happy team at our recent annual summit, held in Barcelona this year

comments powered by Disqus
Terms of Service | Privacy Statement | Master agreement | Copyright 2019 Balena | All Rights Reserved