azure-cosmosdb azure-databricks pyspark pyspark-sql

How to overwrite/update a collection in Azure Cosmos DB from Databrick/PySpark

Posted on 2020-04-11 22:36:38

I have the following PySpark code, written in a Databricks notebook, that successfully saves the results of a Spark SQL query to Azure Cosmos DB with this line:

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()

The complete code is as follows:

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID1
 ,Sales.InvoiceNumber AS myinvoicenr1
FROM Sales
limit 4""")


## my personal cosmos DB
writeConfig3 = {
    "Endpoint": "https://<cosmosdb-account>.documents.azure.com:443/",
    "Masterkey": "<key>==",
    "Database": "mydatabase",
    "Collection": "mycontainer",
    "Upsert": "true"
}

df = test.coalesce(1)

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()
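One detail worth noting here (an observation, not something stated in the question): with `"Upsert": "true"`, a write only replaces an existing document when the new document carries the same `id`. A common approach is to derive a deterministic `id` from the record's business keys, so re-running the job replaces rather than appends. Below is a minimal sketch; the helper is plain Python with no Spark dependency, and the column names (`pattersonID1`, `myinvoicenr1`) follow the query above. The UDF wiring in the trailing comments is illustrative.

```python
import hashlib

def make_doc_id(*business_keys: str) -> str:
    """Build a stable document id from the record's natural keys.

    The same inputs always hash to the same id, so an upsert with this
    id replaces the earlier document instead of creating a new one.
    """
    joined = "|".join(business_keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# In the notebook, this could be attached as a column before writing, e.g.:
#
#   from pyspark.sql.functions import udf
#   id_udf = udf(make_doc_id)
#   df = test.withColumn("id", id_udf("pattersonID1", "myinvoicenr1"))
#   df.write.format("com.microsoft.azure.cosmosdb.spark") \
#       .mode("overwrite").options(**writeConfig3).save()
```

Note that even with a deterministic `id`, documents produced by a *renamed* query (as in the question below) still differ, so this only helps when the business keys stay the same between runs.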

Using the above code I have successfully written to my Cosmos DB database (mydatabase) and collection (mycontainer).

When I try to overwrite the container by changing the Spark SQL query as follows (just renaming pattersonID1 to pattersonID2, and myinvoicenr1 to myinvoicenr2):

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID2
 ,Sales.InvoiceNumber AS myinvoicenr2
FROM Sales
limit 4""")

Instead of overwriting/updating the collection with the results of the new query, Cosmos DB appends the new documents to the container, and still leaves the documents from the original query in place.

Is there a way to completely overwrite or update the Cosmos DB collection?

Questioner: Carltonp
Viewed: 89
David Makogon 2020-02-02 23:06

Your issue is that documents have a unique id (something you never specified, so it is auto-generated for you as a GUID). When you write your new document, you've just renamed one of the non-id, non-unique properties, pattersonID1, to pattersonID2, so it creates a new document, as expected. There's no way to know that this new document is related to the original, since it's a completely new document with its own set of properties.

You can update existing documents by querying (or reading) them, modifying them, and then replacing them. Or you can query for old documents and delete them (one by one, or transactionally as a batch of deletes within a partition, via a stored procedure). Lastly, you can delete and re-create the container, which removes all documents currently stored in it.