Reproduced from serverfault.com.

Streaming zip in Django for large non-local files possible?

Posted on 2020-12-07 14:48:53

I've got a proxy written in Django which receives requests for certain files. After deciding whether the user is allowed to see the file the proxy gets the file from a remote service and serves it to the user. There's a bit more to it but this is the gist.

This setup works great for single files, but there is a new requirement: users want to download multiple files together as a zip. The files are sometimes small, but can also be really large (100 MB or more), and a single request can cover anywhere from 2 to 1000 files. The total can therefore become very large, and it is a burden to first fetch all those files, zip them and then serve them within the same request.

I read about the possibility of creating "streaming zips": a way to open a zip and then start sending the files in that zip until you close it. I found a couple of PHP examples, and in Python the django-zip-stream extension. They all assume locally stored files, and the Django extension also assumes the use of nginx.

There are a couple of things I wonder about in my situation:

  1. I don't have the files stored locally. I can get them with an async/await structure and serve them simultaneously. That would mean I always have two files in memory (the one I'm currently serving, and the next one I'm fetching from the source server).
  2. Unfortunately I don't have control over the web servers which will serve this. I can of course put an nginx container in front of it, but I don't think nginx could serve files that only exist in Python variables because I get them from the source server.
  3. Whether I do the zipping in Python or let nginx do it, I presume the CPU cycles needed for this would be substantial.
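To make the idea in point 1 concrete, here is a minimal sketch of a streaming zip that uses only the standard-library `zipfile` module. The names `stream_zip` and `_ChunkBuffer` are hypothetical, and the per-file chunk iterables stand in for whatever client fetches the data from the remote service. It assumes compression is not required and uses `ZIP_STORED`, so the CPU cost is mostly copying bytes and only about one chunk is held in memory at a time.

```python
import zipfile
from typing import Iterable, Iterator, List, Tuple


class _ChunkBuffer:
    """Write-only, non-seekable sink; zipfile writes here and we drain it."""

    def __init__(self) -> None:
        self._parts: List[bytes] = []

    def write(self, data) -> int:
        self._parts.append(bytes(data))
        return len(data)

    def flush(self) -> None:
        pass  # nothing to flush, data is handed out via pop()

    def pop(self) -> bytes:
        data = b"".join(self._parts)
        self._parts.clear()
        return data


def stream_zip(files: Iterable[Tuple[str, Iterable[bytes]]]) -> Iterator[bytes]:
    """Yield the bytes of a zip archive while its members are still arriving.

    `files` yields (archive_name, chunks) pairs; `chunks` is any iterable of
    bytes, for example the body of a remote download read piece by piece.
    """
    buf = _ChunkBuffer()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for name, chunks in files:
            info = zipfile.ZipInfo(name)  # compress_type defaults to ZIP_STORED
            # Sizes are unknown up front, so allow members larger than 2 GiB.
            with zf.open(info, mode="w", force_zip64=True) as dest:
                for chunk in chunks:
                    dest.write(chunk)
                    out = buf.pop()
                    if out:
                        yield out
            out = buf.pop()  # data descriptor written when the member closes
            if out:
                yield out
    out = buf.pop()  # central directory written when the archive closes
    if out:
        yield out
```

In a Django view, a generator like this could be passed to `django.http.StreamingHttpResponse` with `content_type="application/zip"`. The total size is not known in advance, so the response would be sent without a `Content-Length` header.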

Does anybody know whether streaming zips are a good idea with my setup of very large remote files? I'm a bit afraid that many requests will easily DoS our servers by hitting CPU or memory limits.

I can also build a queue which zips the files and sends an email to the user, but I'd like to keep the application as stateless as possible.

All tips are welcome!

Questioner: kramer65
Mario Orlandi 2020-12-12 07:35:49

This sounds to me like a perfect use case to be solved by queueing jobs and processing them in the background.

Advantages:

  1. since retrieving and zipping the files requires a variable (and possibly significant) time, that should be decoupled from the HTTP request/response cycle;
  2. multiple jobs will be serialized for execution in the task queue.

The second advantage is particularly desirable since you expect to receive multiple concurrent requests.
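As an illustration of how a queue serializes the work, here is a minimal in-process sketch using only the standard library. It is a stand-in, not a recommendation: in a real deployment the queue would typically be a separate task runner, and `start_worker` is a hypothetical name.

```python
import queue
import threading
from typing import Callable, Optional

Job = Callable[[], None]


def start_worker(jobs: "queue.Queue[Optional[Job]]") -> threading.Thread:
    """Run queued jobs one at a time on a single background thread."""

    def loop() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                job()
            except Exception as exc:
                # A failing job must not take the worker down with it.
                print(f"job failed: {exc}")
            finally:
                jobs.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```

The request handler only calls `jobs.put(...)` and returns immediately. Because a single worker drains the queue, at most one zip is being built at any time, no matter how many requests arrive.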

I would also consider using a “task” Django model with a FileField to be used as a container for the resulting zip file, so it will be statically and efficiently served by Nginx from the media folder. As an additional benefit, you can monitor what’s going on directly from the Django admin user interface.

I’ve used a similar approach in many Django projects, and it has proven to be quite robust and manageable; you might want to take a quick look at the following Django app I’m using for that: https://github.com/morlandi/django-task

To summarize:

  • write a “task” Model with a FileField to be used as a container for the zipped result
  • upon receiving a request, insert a new record in the “task” table, and a new job in the background queue
  • the background job is responsible for collecting resources and zipping them; this is common Python stuff
  • on completion, save the result in the FileField and send a notification to the user
  • the user follows the received URL to download the zip file as a static file
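The background job from the steps above could look roughly like the following sketch. It uses only the standard library: the `ZipTask` dataclass is a hypothetical stand-in for the “task” model (in Django, its `result_path` would be the FileField), and `run_zip_task`, `sources` and the `download.zip` file name are likewise assumptions made for illustration.

```python
import tempfile
import zipfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


@dataclass
class ZipTask:
    """Stand-in for the "task" record that tracks one zip job."""
    status: str = "pending"
    result_path: Optional[Path] = None
    error: str = ""


def run_zip_task(
    task: ZipTask,
    sources: Dict[str, Callable[[], Iterable[bytes]]],
    out_dir: Path,
    filename: str = "download.zip",
) -> ZipTask:
    """Fetch every source chunk by chunk and zip it into out_dir/filename."""
    task.status = "running"
    # Build under a temporary name so a half-written archive is never served.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, suffix=".part")
    tmp_path = Path(tmp_name)
    try:
        with open(fd, "wb") as tmp, zipfile.ZipFile(
            tmp, "w", compression=zipfile.ZIP_DEFLATED
        ) as zf:
            for name, fetch_chunks in sources.items():
                with zf.open(name, "w", force_zip64=True) as dest:
                    for chunk in fetch_chunks():
                        dest.write(chunk)
        final = out_dir / filename
        tmp_path.replace(final)
        task.result_path = final
        task.status = "done"
    except Exception as exc:
        tmp_path.unlink(missing_ok=True)  # requires Python 3.8+
        task.status = "failed"
        task.error = str(exc)
    return task
```

Once the status is `done`, the notification sent to the user would contain the URL under which the web server exposes `result_path`. A failed task keeps the error message so it can be inspected later.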