In a PySpark project we call dataframe.foreachPartition(func), and inside func we make aiohttp calls to transfer data. What kind of monitoring tools can be used to track metrics such as data rate, throughput, and time elapsed? Can we use StatsD and Graphite or Grafana in this case (they're preferred if possible)? Thanks.
Here is my solution. I used PySpark's accumulators to collect the metrics (number of HTTP calls, payload sent per call, etc.) in each partition, then at the driver node assigned these accumulators' values to StatsD gauge variables and sent them to a Graphite server, where they are eventually visualized in a Grafana dashboard. It has worked well so far.
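For reference, here is a minimal sketch of this setup, assuming the `statsd` PyPI client, a local StatsD daemon on UDP port 8125 that relays to Graphite, and a hypothetical ENDPOINT URL; real code would likely batch rows and issue the posts concurrently (e.g. with asyncio.gather) rather than one at a time:

```python
import asyncio
import json
import time

import aiohttp
import statsd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("http-sink-metrics").getOrCreate()
sc = spark.sparkContext

# Accumulators: written to on the executors, readable only on the driver.
call_count = sc.accumulator(0)   # number of HTTP calls made
rows_sent = sc.accumulator(0)    # number of rows transferred
bytes_sent = sc.accumulator(0)   # approximate payload size in bytes

ENDPOINT = "http://example.com/ingest"  # hypothetical target service

def send_partition(rows):
    async def post_all():
        async with aiohttp.ClientSession() as session:
            for row in rows:
                body = json.dumps(row.asDict()).encode("utf-8")
                async with session.post(ENDPOINT, data=body) as resp:
                    await resp.read()
                # Record per-call metrics in the accumulators.
                call_count.add(1)
                rows_sent.add(1)
                bytes_sent.add(len(body))
    asyncio.run(post_all())

df = spark.range(1000)
start = time.time()
df.foreachPartition(send_partition)
elapsed = time.time() - start

# Back on the driver: push the accumulated totals to StatsD as gauges.
client = statsd.StatsClient("localhost", 8125, prefix="etl.http_sink")
client.gauge("calls", call_count.value)
client.gauge("rows", rows_sent.value)
client.gauge("bytes", bytes_sent.value)
client.gauge("elapsed_seconds", elapsed)
client.gauge("rows_per_second", rows_sent.value / elapsed if elapsed else 0)
```

The gauges have to be pushed from the driver after foreachPartition returns, because accumulator values are write-only on the executors and only become readable once control is back at the driver.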
Did you find an efficient way to extract counts from DataFrames, e.g. the number of rows loaded or saved?
Yes, as I mentioned, I used Spark's accumulators to accumulate the metrics in each partition (on the executors), and then at the driver node I can assign these accumulated metrics to StatsD gauge variables.