Why rate-limit?

Nearly all commercial cloud services that provide public APIs implement some form of rate limiting, primarily to protect their systems from abuse and over-utilization of resources.

How the current rate-limit works

The rate-limit system currently in place imposes a limit of 1000 requests per minute (roughly 16.6 requests per second) per access-token. We do allow users to exceed this limit slightly: up to 10 requests beyond the 16.6 r/s limit are placed in a queue and forwarded to our back-end at the enforced rate of 16.6 r/s, which adds artificial latency to those requests. If the 10-request queue is full as well, further requests are rejected with HTTP status 503.

Problems with the current rate-limit implementation

We have encountered a few problems with the implementation of rate-limiting described above. First and foremost, the HTTP status code 503 can be confusing, since it is not obvious whether the rate limit has been reached or the API is actually suffering an outage.

The same applies to the queuing of excess requests: it is not obvious whether the latency a client experiences is caused by the rate limit or by a problem with the network or the Fortnox back-end.

Finally, we have found the limit of 16.6 r/s to be a bit too high for a single access-token. Heavy utilization of the API today can affect the performance of the Fortnox web application.

New rate limit implementation

When designing our new implementation, we wanted to remedy all these issues. For starters, we will replace the current HTTP 503 (Service Temporarily Unavailable) with HTTP 429 (Too Many Requests), making it clear when the rate-limit is being enforced.
Secondly, we decided to set the new limit at 4 requests per second per access-token to avoid overloading our back-ends. This equals 240 requests per minute. Our load balancer tracks requests at millisecond granularity, so this limit corresponds to 1 request every 250 milliseconds.
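
Because a rate-limited request is answered with HTTP 429 rather than 503, a client can tell it apart from an outage and simply try again a little later. The following is a minimal sketch of such a retry loop, not an official client. The `send` callable is a hypothetical stand-in for whatever function performs the API request and returns its HTTP status code, and the starting delay of 0.25 seconds is an assumption based on the 250-millisecond spacing described above.

```python
import time

def call_with_retry(send, max_attempts=5, base_delay=0.25, sleep=time.sleep):
    """Call send() until it returns something other than 429.

    send is a hypothetical callable that performs one API request and
    returns the HTTP status code. The wait doubles after every 429.
    """
    delay = base_delay
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 429:
            return status          # success, or an error unrelated to rate limiting
        if attempt < max_attempts:
            sleep(delay)           # give the limiter time to free a slot
            delay *= 2
    return status                  # still 429 after the last attempt

# Usage with a fake sender that is rate-limited twice before succeeding.
responses = iter([429, 429, 200])
waits = []
print(call_with_retry(lambda: next(responses), sleep=waits.append))  # 200
print(waits)                                                         # [0.25, 0.5]
```

Passing `sleep` as a parameter keeps the example testable without actually waiting; real code would normally leave it at the default.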

We also disabled the queuing of excess requests, so requests are no longer held back and forwarded at a fixed pace, which added latency to the API and caused confusion. Instead of queuing, we apply a “burst” zone of up to 20 requests, which essentially allows “borrowing” future request slots from the rate-limit.

Suppose that the 20-slot burst zone is empty and 15 requests arrive simultaneously from a given access-token. All 15 requests will be forwarded immediately while 14 slots in the burst zone are marked as taken, after which 1 slot is freed every 250 milliseconds. (If there were 25 requests instead, 21 would be forwarded immediately, 20 slots would be marked as taken, and 4 requests would be rejected with status 429.)
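
The burst-zone behaviour can be reproduced with a small simulation. This is a simplified model written for illustration, assuming a leaky-bucket style counter; the class name `BurstLimiter` and its fields are hypothetical and do not describe the actual load balancer implementation.

```python
class BurstLimiter:
    """Simplified model of a per-token limit with a burst zone (illustrative only)."""

    def __init__(self, rate_per_second=4.0, burst=20):
        self.rate = rate_per_second   # slots freed per second (1 per 250 ms)
        self.burst = burst            # size of the burst zone
        self.taken = 0.0              # burst slots currently marked as taken
        self.last = None              # time of the last forwarded request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is forwarded."""
        if self.last is None:         # very first request: nothing borrowed yet
            self.last = now
            return True
        freed = (now - self.last) * self.rate
        taken = max(0.0, self.taken - freed + 1)
        if taken > self.burst:        # burst zone exhausted: would be a 429
            return False
        self.taken = taken
        self.last = now
        return True

# 15 simultaneous requests: all forwarded, 14 burst slots taken.
limiter = BurstLimiter()
print(sum(limiter.allow(0.0) for _ in range(15)), limiter.taken)  # 15 14.0

# 25 simultaneous requests: 21 forwarded, 4 rejected.
limiter = BurstLimiter()
print(sum(limiter.allow(0.0) for _ in range(25)))                 # 21
```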

Another example would be an application that makes 10 API requests within one second, once every 5 seconds. This application will never have API calls rejected, since the burst zone will be fully cleared within 2.5 seconds of those 10 requests.
This system ensures that applications doing short bursts of requests at semi-regular intervals will continue working just fine, while rate-limit responses are still sent correctly if the average request rate is too high.
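
The arithmetic behind this example can be written down as a quick check. The sketch below is a conservative rule of thumb under the figures stated above (4 r/s and a 20-slot burst zone), not an exact model of the limiter; the function name `fits_burst_pattern` is made up for this illustration.

```python
def fits_burst_pattern(requests_per_burst, interval_seconds,
                       rate_per_second=4.0, burst_slots=20):
    """Return (drain_seconds, ok) for a client that sends periodic bursts.

    drain_seconds is an upper bound on how long the burst zone needs to clear.
    ok is True when the burst fits in the zone (plus the one request that is
    always allowed) and the zone clears before the next burst starts.
    """
    drain_seconds = requests_per_burst / rate_per_second
    ok = (requests_per_burst <= burst_slots + 1
          and drain_seconds <= interval_seconds)
    return drain_seconds, ok

print(fits_burst_pattern(10, 5))   # (2.5, True)
print(fits_burst_pattern(30, 5))   # (7.5, False)
```

The first call is the example from the text: 10 requests clear in at most 2.5 seconds, well inside the 5-second interval. The second call fails on both counts, since 30 requests exceed the burst zone and would need 7.5 seconds to clear.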

The rate-limit scales with more tenants / access-tokens

Remember that the rate limit is based not on external IP addresses but on the access-token. If your integration is used by five different Fortnox tenants, each of those tenants has a unique access-token, and each token grants you 4 r/s. This effectively puts you at a global limit of 20 requests per second towards the Fortnox API, although each individual access-token is still limited to 4 r/s.
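
For an integration serving several tenants, one way to stay under the limit is to pace requests separately for each access-token. The following is a minimal client-side sketch under that assumption; `PerTokenPacer` and `reserve` are hypothetical names, and the 0.25-second interval simply mirrors the 4 r/s limit (it ignores the burst zone, so it is stricter than necessary).

```python
class PerTokenPacer:
    """Client-side pacing: at most one request per interval for each access-token."""

    def __init__(self, min_interval=0.25):
        self.min_interval = min_interval   # 250 ms between requests, i.e. 4 r/s
        self.next_slot = {}                # access-token -> earliest next send time

    def reserve(self, token, now):
        """Return the earliest time (seconds) a request for `token` may be sent."""
        send_at = max(now, self.next_slot.get(token, now))
        self.next_slot[token] = send_at + self.min_interval
        return send_at

pacer = PerTokenPacer()
print(pacer.reserve("tenant-a", 0.0))  # 0.0
print(pacer.reserve("tenant-a", 0.0))  # 0.25
print(pacer.reserve("tenant-b", 0.0))  # 0.0
```

The third call shows that a second token is not delayed by traffic on the first: each access-token has its own budget.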