I have the following PySpark code, written in a Databricks notebook, that successfully saves the results of a Spark SQL query to Azure Cosmos DB with this line:
df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()
The complete code is as follows:
test = spark.sql("""SELECT
Sales.CustomerID AS pattersonID1
,Sales.InvoiceNumber AS myinvoicenr1
FROM Sales
limit 4""")
## my personal cosmos DB
writeConfig3 = {
"Endpoint": "https://<cosmosdb-account>.documents.azure.com:443/",
"Masterkey": "<key>==",
"Database": "mydatabase",
"Collection": "mycontainer",
"Upsert": "true"
}
df = test.coalesce(1)
df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()
Using the above code I have successfully written to my Cosmos DB database (mydatabase) and collection (mycontainer).
When I try to overwrite the container by changing the Spark SQL query as follows (just renaming pattersonID1 to pattersonID2, and myinvoicenr1 to myinvoicenr2):
test = spark.sql("""SELECT
Sales.CustomerID AS pattersonID2
,Sales.InvoiceNumber AS myinvoicenr2
FROM Sales
limit 4""")
Instead of overwriting/updating the collection with the new query's results, Cosmos DB appends the new documents to the container and still leaves the documents from the original query in the collection.
Is there a way to completely overwrite or update the Cosmos DB collection?
Your issue is that documents have a unique id (something you never specified, which is therefore auto-generated for you as a GUID). When you write your new documents, you've just renamed one of the non-id, non-unique properties, pattersonID1, to pattersonID2, so a new document is created, as expected. There's no possible way to know that this new document is related to the original, since it's a completely new document with its own set of properties.
You can update existing documents by querying them (or reading them), modifying them, and then replacing them. Or you can choose to query for old documents and delete them (one by one, or transactionally as a batch of deletes within a partition, via stored procedure). Lastly, you can delete and re-create a container, which will remove all documents currently stored in it.
Oh I see, I never thought of id - good catch. Is there a link showing how to update an existing document/collection?