App Engine Python 3 - What's the point of gunicorn?

From what I understand, gunicorn sits between an HTTP request and your app, and can be configured to spawn multiple workers within a single App Engine instance, with each worker containing a separate instance of your app (I think?).

If that's the case, what's the point of using gunicorn with more than one worker, instead of gunicorn with a single worker (on a smaller instance class), and letting the App Engine infrastructure take care of spinning up instances and load-balancing requests across them?

For example, what's the advantage of:

`gunicorn --workers 2` on an F2 instance

vs

`gunicorn --workers 1` on an F1 instance?

It seems gunicorn is just another layer of configuration for something that can already be done in your app's YAML file (whilst keeping gunicorn at 1 worker).
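For reference, the YAML-only setup described above would look something like this (a sketch; the runtime version and `main:app` module path are placeholders for whatever your service actually uses):

```yaml
# app.yaml — one gunicorn worker per instance, letting App Engine
# handle scaling by spinning up more instances
runtime: python312
instance_class: F1
entrypoint: gunicorn --bind=:$PORT --workers=1 main:app
```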

Furthermore, other problems arise when using more gunicorn workers, such as higher memory consumption on the instance:

https://www.googlecloudcommunity.com/gc/Serverless/Migrated-to-py3-and-Appengine-is-much-more-expens...

Is there something I'm missing?


By default a gunicorn worker only has a single thread (sync workers, see [1]). With a single worker your App Engine instance will only process one request at a time. If those requests perform a lot of RPCs, the instance is essentially waiting for those RPCs to complete, while during that time it could process other requests.
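The effect is easy to see with a toy simulation (plain Python, nothing App Engine-specific): each "request" spends ~100 ms blocked on a simulated RPC, and two concurrent handlers finish a batch in roughly half the wall-clock time of one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(_):
    # Simulate a handler that spends ~100 ms waiting on an RPC.
    time.sleep(0.1)

def serve(batch, concurrency):
    """Process `batch` requests with `concurrency` handlers; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(handle_request, range(batch)))
    return time.monotonic() - start

one = serve(4, concurrency=1)  # ~0.4 s: requests queue behind each other
two = serve(4, concurrency=2)  # ~0.2 s: the RPC wait time overlaps
```

The RPC wait doesn't need CPU, so a second handler can make progress during it; that's the whole argument for more than one worker (or thread) per instance.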

So that is the case for having more than one worker on an F1 instance. FWIW, the best results we have seen with F1 instances are with 2 workers per instance. I wouldn't add more: F1 instances are insanely slow, and overloading them makes all your requests take longer, which hurts throughput.

Each worker indeed loads a separate copy of your app, so more workers use (significantly) more memory. If you start gunicorn with the `--preload` flag [3], the app is loaded first and *then* the process is forked into the workers. This saves some memory, as the workers can share the read-only portion of the app's memory. Watch out, though, for dependencies that start threads: when preloading, make sure those threads are started *after* the workers have forked, as threads don't survive the fork.
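A sketch of that pattern in a gunicorn config file, using gunicorn's `post_fork` server hook (the `_heartbeat` function here is a made-up stand-in for whatever background work a dependency might do):

```python
# gunicorn.conf.py — preload the app, but start threads per worker
import threading
import time

preload_app = True
workers = 2

def _heartbeat():
    # Placeholder for a dependency's background thread.
    while True:
        time.sleep(60)

def post_fork(server, worker):
    # Runs in each worker process right after fork(). Threads started
    # before the fork would not survive it, so start them here instead.
    threading.Thread(target=_heartbeat, daemon=True).start()
```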

Now, gunicorn has an option to spawn multiple threads per worker. Threads share their worker's process memory, so in theory this should use less memory. That raises the question of what is better: one worker with two threads, or two workers with one thread each? I don't know the answer and have not experimented with it. The sync worker model is the simplest and I tend to gravitate toward it, but a single worker with multiple threads could be worth trying. I also found that keeping max_concurrent_requests at the recommended value of 10 works well for F1 instances [2].
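If you want to try the threaded variant, the entrypoint change is small (a sketch; setting `--threads` above 1 switches gunicorn to its threaded worker class, and `main:app` is again a placeholder):

```yaml
# app.yaml — one worker process with two threads sharing its memory
instance_class: F1
entrypoint: gunicorn --bind=:$PORT --workers=1 --threads=2 main:app
automatic_scaling:
  max_concurrent_requests: 10
```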

As for choosing a higher instance class and using more workers/threads: back when I experimented with this, I found that the larger instance classes just resulted in higher costs and not much better performance. Maybe this is different now. But unless you need the additional instance memory and/or CPU, I would stick with the lowest instance class. It all depends on your workload, of course.

[1]: https://docs.gunicorn.org/en/stable/design.html

[2]: https://cloud.google.com/appengine/docs/standard/reference/app-yaml?tab=python#scaling_elements

[3]: entrypoint: gunicorn --bind=:$PORT --workers=2 --preload main:app

After doing some of our own testing, we've come to similar conclusions.

Importantly, if your app is network-bound (waiting on lots of RPC calls rather than maxing out the CPU), there does not seem to be a benefit to using larger instance classes with more workers.

We'll have to experiment with multiple threads per worker to see what works for us (more threads per worker vs more workers with a single thread each).