- vs PubSub
- Cloud Tasks and Pub/Sub may be used to implement message passing and asynchronous integration
- The core difference between Pub/Sub and Cloud Tasks is the notion of implicit vs explicit invocation
- Pub/Sub aims to decouple publishers of events and subscribers to those events
- Publishers do not need to know anything about their subscribers
- Pub/Sub gives publishers no control over the delivery of the messages save for the guarantee of delivery
- Pub/Sub supports implicit invocation: a publisher implicitly causes the subscribers to execute by publishing an event
- Cloud Tasks is aimed at explicit invocation where the publisher retains full control of execution
- Publisher specifies an endpoint where each message is to be delivered
- Cloud Tasks provides tools for queue and task management that are unavailable to Pub/Sub publishers (see the sketch after this list)
- Scheduling specific delivery times
- Delivery rate controls
- Configurable retries
- Access and management of individual tasks in a queue
- Task/message creation deduplication
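- A minimal sketch of these controls with the Python client (google-cloud-tasks), assuming placeholder project, location, queue, and worker URL values; the publisher picks both the target endpoint and the delivery time:

```python
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

# Publisher-controlled delivery time: 10 minutes from now.
schedule = timestamp_pb2.Timestamp()
schedule.FromDatetime(
    datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(minutes=10)
)

task = tasks_v2.Task(
    http_request=tasks_v2.HttpRequest(
        http_method=tasks_v2.HttpMethod.POST,
        url="https://worker.example.com/handle",  # explicit target endpoint
        body=b'{"order_id": 42}',
    ),
    schedule_time=schedule,
)

response = client.create_task(parent=parent, task=task)
print(response.name)  # server-assigned unique task name
```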
- vs Scheduler
- Cloud Tasks triggers actions based on how the individual task object is configured
- Cloud Scheduler triggers actions at regular fixed intervals
- Cloud Tasks initiates actions based on the amount of traffic coming through the queue
- Cloud Scheduler initiates actions on a fixed periodic schedule
- Each Cloud Task has a unique name, and can be identified and managed individually in the queue
- With the exception of the time of execution, each run of a Cloud Scheduler cron job is exactly the same as every other run of that cron job
- If the execution of a Cloud Tasks task fails, the task is retried until it succeeds
- If the execution of a Cloud Scheduler cron job fails, the failure is logged, and the job is not rerun until the next scheduled interval
- Pitfalls
- With the exception of tasks scheduled to run in the future, task queues are completely agnostic about execution order
- There are no guarantees or best effort attempts made to execute tasks in any particular order
- There are no guarantees that old tasks will execute unless a queue is completely emptied
- A number of common cases exist where newer tasks are executed sooner than older tasks, and the patterns surrounding this can change without notice
- Cloud Tasks aims for a strict "execute exactly once" semantic
- In situations where a design trade-off must be made between guaranteed execution and duplicate execution, the service errs on the side of guaranteed execution
- A non-zero number of duplicate executions do occur
- Developers should take steps to ensure that duplicate execution is not a catastrophic event (see the idempotent handler sketch after this list)
- In production, more than 99.999% of tasks are executed only once
- The most common source of backlogs in immediate processing queues is exhausting resources on the target instances
- If a user is attempting to execute 100 tasks per second on frontend instances that can only process 10 requests per second, a backlog will build
- This typically manifests in one of two ways, either of which can generally be resolved by increasing the number of instances processing requests
- Servers that are being overloaded can start to return backoff errors in the form of HTTP response code 503
- Cloud Tasks will react to these errors by slowing down execution until errors stop
- This can be observed by looking at the "enforced rate" field in the Cloud Console
- Overloaded servers can also respond with large increases in latency
- Requests remain open for longer
- Because queues run with a maximum concurrent number of tasks, this can result in queues being unable to execute tasks at the expected rate
- Increasing the max_concurrent_tasks for the affected queues can help in situations where the value has been set too low, introducing an artificial rate limit
- Increasing max_concurrent_tasks is unlikely to relieve any underlying resource pressure
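- Since a small fraction of tasks can execute more than once, workers should be idempotent; a minimal sketch, assuming a Flask HTTP target and an in-memory set standing in for a durable store, keyed on the X-CloudTasks-TaskName header that Cloud Tasks sets on HTTP target requests:

```python
from flask import Flask, request

app = Flask(__name__)
processed = set()  # stand-in: use a durable, shared store in production

@app.route("/handle", methods=["POST"])
def handle():
    task_name = request.headers.get("X-CloudTasks-TaskName", "")
    if task_name and task_name in processed:
        return "already processed", 200  # acknowledge the duplicate, skip the work
    do_work(request.get_data())
    processed.add(task_name)
    return "ok", 200

def do_work(payload: bytes) -> None:
    """Hypothetical worker logic; keep side effects idempotent where possible."""
```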
- Security
- Restrict queue management methods to a small set of people or entities
- For large organizations, use a service account to run software that enforces proper queue configuration
- Separate users and other entities into Queue Admins, Cloud Tasks Workers, and App Engine Deployers categories
- The Queue Admins group has permission to call Cloud Tasks queue management methods or to upload queue.yaml files
- The Queue Admins group is restricted to a very small set of users to reduce the risk of clobbering queue configuration
- The Cloud Tasks Workers group has permission to perform common interactions with Cloud Tasks, such as enqueuing and dequeuing tasks
- The Cloud Tasks Workers group is not allowed to call Cloud Tasks queue management methods
- The App Engine Deployers group, for projects that have App Engine apps, has permission to deploy the app
- App Engine Deployers are not permitted to upload queue.yaml files or make any Cloud Tasks API calls, which allows the queue admins to enforce the proper policies
- Users who are queue admins should not also be Cloud Tasks workers, since that would defeat the purpose of the separation
- Small projects and organizations can assign Cloud IAM roles directly to users to place them into the groups above
- Large projects and organizations can use Service Accounts to separate duties and responsibilities
- Queues
- Most standard App Engine apps use queue.yaml to configure queues in the App Engine Task Queue service
- For Java apps, the queue.xml file is used instead
- The Cloud Tasks API provides an App Engine-independent interface to the App Engine Task Queue service
- The Cloud Tasks API provides the ability to manage queues, including via the console or the gcloud command
- Queues created by the Cloud Tasks API are accessible from the App Engine SDK and vice versa
- To maintain compatibility, it is possible to use the configuration file used by the App Engine SDK, queue.yaml, to create and configure queues used via the Cloud Tasks API
- It is strongly recommended to use either the configuration file method or the Cloud Tasks API to configure queues, but not both
- If new to Cloud Tasks or App Engine, use the Cloud Tasks API exclusively to manage queues and avoid the use of queue.yaml and queue.xml altogether
- Cloud Tasks queue management methods give users more choice in creating, updating, and deleting queues (see the sketch after this list)
- Inspect the project's Admin Activity audit logs to retrieve the history of queue configuration changes, including queue creations, updates, and deletions
- Resuming many high-QPS queues at the same time can lead to target overloading
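- A minimal sketch of queue management through the Cloud Tasks API rather than queue.yaml, using the Python client with placeholder names and limits:

```python
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = "projects/my-project/locations/us-central1"

queue = tasks_v2.Queue(
    name=f"{parent}/queues/my-queue",
    rate_limits=tasks_v2.RateLimits(
        max_dispatches_per_second=10,
        max_concurrent_dispatches=5,
    ),
    retry_config=tasks_v2.RetryConfig(max_attempts=5),
)

created = client.create_queue(parent=parent, queue=queue)
print(created.name)
```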
- Scaling
- Queues or queue groups can become overloaded any time traffic increases suddenly, experiencing increased task creation latency, a higher task creation error rate, and a reduced dispatch rate
- To defend against this, establish controls in any situation where the create or dispatch rate of a queue or queue group can spike suddenly
- Google recommends a maximum of 500 operations per second to a cold queue or queue group, then increasing traffic by 50% every 5 minutes
- In theory, traffic can grow to roughly 740K operations per second after 90 minutes using this ramp-up schedule (see the sketch after this list)
- If tasks are created by an App Engine app, leverage App Engine traffic splitting to smooth traffic increases
- By splitting traffic between versions, requests that need to be rate-managed can be spun up over time to protect queue health
- When launching a release that significantly increases traffic to a queue or queue group, gradual rollout is, again, an important mechanism for smoothing the increases
- Gradually roll out instances such that the initial launch does not exceed 500 total operations to the new queues, increasing by no more than 50% every 5 minutes
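- The 740K figure is just the 500/50/5 schedule compounded; a quick sketch of the arithmetic:

```python
# 500/50/5: start at 500 ops/s to a cold queue or queue group, +50% every 5 minutes.
rate = 500.0
for minutes in range(0, 95, 5):
    print(f"t={minutes:3d} min: {rate:12,.0f} ops/s")
    rate *= 1.5
# After 90 minutes (18 increases): 500 * 1.5**18 is about 739,000 ops/s, i.e. ~740K
```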
- Newly created queues are especially vulnerable
- Groups of queues, for example [queue0000, queue0001, …, queue0199], are just as sensitive as single queues during the initial rollout stages
- For these queues, gradual rollout is an important strategy.
- Launch new or updated services that create high-TPS queues or queue groups in stages, such that the initial load is below 500 TPS and increases of 50% or less are staged 5 minutes or more apart
- When increasing the total capacity of a queue group, for example expanding [queue0000-queue0199] to [queue0000-queue0399], follow the 500/50/5 pattern
- For rollout procedures, new queue groups behave no differently than individual queues
- Apply the 500/50/5 pattern to the new group as a whole, not just to individual queues within the group
- For queue group expansions, gradual rollout is an important strategy
- When migrating a service to add tasks to the increased number of queues, gradually roll out instances such that the initial launch does not exceed 500 total operations to the new queues, increasing by no more than 50% every 5 minutes
- An existing queue group may be expanded because tasks are expected to be added to the queue group faster than the group can dispatch them
- If the source of traffic is App Engine, use traffic splitting
- If the names of the new queues are spread out evenly among the existing queue names when sorted lexicographically, traffic can be sent to those queues immediately, as long as no more than 50% of the interleaved queues are new and the traffic to each queue is less than 500 TPS
- This method is an alternative to traffic splitting and gradual rollout (see the naming sketch below)
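- One way to read "spread out evenly", as an illustration only: if the existing group uses the even numeric suffixes, new queues can take the odd suffixes, so sorted order alternates old and new names one-for-one:

```python
existing = [f"queue{i:04d}" for i in range(0, 400, 2)]  # queue0000, queue0002, ...
new = [f"queue{i:04d}" for i in range(1, 400, 2)]       # queue0001, queue0003, ...

# Sorted together, new names fall evenly between existing ones.
combined = sorted(existing + new)
assert combined[:4] == ["queue0000", "queue0001", "queue0002", "queue0003"]
```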
- When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful (see the sketch after this list)
- Instead of creating tasks from a single job, use an injector queue
- Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group
- The injector queue can be sped up over time, for example starting at 5 TPS and increasing by 50% every 5 minutes
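- A minimal sketch of the double-injection fan-out, assuming placeholder names and an HTTP worker; each task on the injector queue creates 100 tasks on the destination queue:

```python
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
dest = client.queue_path("my-project", "us-central1", "work-queue")

def handle_injector_task(batch_start: int) -> None:
    """Handler for one injector-queue task: fan out into 100 worker tasks."""
    for i in range(batch_start, batch_start + 100):
        client.create_task(
            parent=dest,
            task=tasks_v2.Task(
                http_request=tasks_v2.HttpRequest(
                    http_method=tasks_v2.HttpMethod.POST,
                    url="https://worker.example.com/handle",
                    body=str(i).encode(),
                )
            ),
        )
```

- At 5 TPS on the injector queue, roughly 500 tasks per second land on the destination queue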
- A name can be assigned to a task using the name parameter
- When a new task is created, Cloud Tasks assigns the task a unique name by default
- The name parameter introduces significant performance overhead, resulting in increased latencies and potentially increased error rates for named tasks
- These costs can be magnified significantly if tasks are named sequentially, such as with timestamps
- If assigning your own names, use a well-distributed prefix for task names, such as a hash of the contents (sketched below)
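- A sketch of the hash-prefix approach with the Python client; the project, location, and queue values are placeholders:

```python
import hashlib

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

def distributed_task_name(payload: bytes) -> str:
    # A hash prefix spreads names across the keyspace instead of clustering
    # them the way timestamps do.
    task_id = hashlib.sha256(payload).hexdigest()[:16]
    return client.task_path("my-project", "us-central1", "my-queue", task_id)

task = tasks_v2.Task(
    name=distributed_task_name(b'{"order_id": 42}'),
    # ... http_request, schedule_time, etc. as in the earlier sketch
)
```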
- Cloud Tasks can overload other services, such as App Engine and Datastore
- Network usage increases if dispatches from a queue increase dramatically in a short period of time
- If a backlog of tasks has accumulated, unpausing queues can potentially overload these services
- The recommended defense is the 500/50/5 pattern suggested for queue overload
- If a queue dispatches more than 500 TPS, increase traffic triggered by a queue by no more than 50% every 5 minutes
- Monitoring and logging metrics can be used to proactively monitor traffic increases
- Alerts can be used to detect potentially dangerous situations
- Unpausing or resuming high-TPS queues
- When a queue or a series of queues is unpaused or re-enabled, it resumes dispatching tasks
- If the queue has many tasks, the newly-enabled queue’s dispatch rate could increase dramatically from 0 TPS to the full capacity of the queue
- To ramp up, stagger queue resumes or control the queue dispatch rates (see the sketch below)
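- A minimal sketch of staggered resumes with the Python client; the queue names and the 10-second spacing are placeholders:

```python
import time

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

for i in range(200):
    name = client.queue_path("my-project", "us-central1", f"queue{i:04d}")
    client.resume_queue(name=name)
    time.sleep(10)  # spread the resumes so backlogs drain gradually
```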
- Bulk scheduled tasks
- Large numbers of tasks that are scheduled to dispatch at the same time can also introduce a risk of target overloading
- To start a large number of tasks at once, consider using queue rate controls to increase the dispatch rate gradually, or explicitly spin up target capacity in advance (see the sketch below)
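- A sketch of capping a queue's dispatch rate before a bulk run, assuming the Python client and placeholder values; the cap can then be raised gradually following the 500/50/5 pattern:

```python
from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "bulk-queue"),
    rate_limits=tasks_v2.RateLimits(max_dispatches_per_second=50),
)
client.update_queue(
    queue=queue,
    update_mask=field_mask_pb2.FieldMask(
        paths=["rate_limits.max_dispatches_per_second"]
    ),
)
```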
- Increased fan-out
- When updating services that are executed through Cloud Tasks, increasing the number of remote calls can create production risks
- Use gradual rollout or traffic splitting to manage ramp up
- Retries
- Code can retry on failure when making Cloud Tasks API calls
- When a significant proportion of requests are failing with server-side errors, a high rate of retries can overload queues even more and cause them to recover more slowly
- Google recommends capping the amount of outgoing traffic if a client detects that a significant proportion of requests are failing with server-side errors (see the backoff sketch below)
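- A minimal sketch of capped, jittered client-side retries around create_task, so retry traffic backs off instead of piling onto an already overloaded service; the retried error types are google.api_core exception classes for 503 and deadline errors:

```python
import random
import time

from google.api_core import exceptions
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

def create_with_backoff(parent: str, task: tasks_v2.Task, attempts: int = 5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            return client.create_task(parent=parent, task=task)
        except (exceptions.ServiceUnavailable, exceptions.DeadlineExceeded):
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay + random.uniform(0, delay))  # jittered backoff
            delay = min(delay * 2.0, 32.0)  # cap the delay growth
```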