cdap google-cloud-data-fusion google-cloud-platform

GCP Data Fusion is too slow at executing pipelines

Published on 2020-05-22 13:30:15

I understand that Data Fusion is a managed service built on CDAP, but the current 6.1.1 Enterprise edition is much slower than CDAP OSS (the one in the Google Cloud Marketplace). It takes approximately 3 minutes to provision the Dataproc nodes (whatever the compute profile is) and approximately 1.5 minutes to go from starting to running mode; only then does data start flowing through the nodes. Is there any way to optimize this and bring it up to speed?

Questioner: code tutorial
Viewed: 43
Edwin Elia 2020-03-08 10:40

The CDAP OSS image in the Google Cloud Marketplace runs pipelines in memory and is suggested only for development, because that execution engine cannot scale.

If you want to cut the Dataproc provisioning time, you can pre-provision a Dataproc cluster yourself and use the Remote Hadoop Provisioner compute profile to submit jobs to it, instead of having a new cluster created for each run.
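
As a rough illustration of the pre-provisioning step, here is a minimal Python sketch that only builds (and prints) a `gcloud dataproc clusters create` command for a long-lived cluster. The helper name `dataproc_create_command`, the cluster name `fusion-warm-cluster`, the region and the worker count are all hypothetical placeholders, not values from the question. The Remote Hadoop Provisioner profile itself is set up in Data Fusion and is not shown here.

```python
import shlex


def dataproc_create_command(cluster_name, region, num_workers=2):
    """Build (but do not run) a gcloud command that creates a Dataproc cluster.

    The function and its defaults are illustrative assumptions; adjust the
    flags to match your own project and sizing needs.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--num-workers", str(num_workers),
    ]


# Hypothetical names; replace with your own cluster name and region.
command = dataproc_create_command("fusion-warm-cluster", "us-central1")

# Print a shell-safe version of the command so it can be reviewed before use.
print(shlex.join(command))
```

With a cluster that is already running, each pipeline run should avoid the roughly 3 minute provisioning step described in the question, at the cost of paying for the cluster while it sits idle.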