Note: this post is translated from stackoverflow.com. Original title: How to overwrite/update a collection in Azure Cosmos DB from Databrick/PySpark

azure-cosmosdb azure-databricks pyspark pyspark-sql

How to overwrite/update a collection in Azure Cosmos DB from Databricks/PySpark

Posted 2020-04-12 12:04:42

I have the following PySpark code in a Databricks notebook. It successfully saves the results of a Spark SQL query to Azure Cosmos DB with this line:

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()

The full code is as follows:

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID1
 ,Sales.InvoiceNumber AS myinvoicenr1
FROM Sales
limit 4""")


## my personal cosmos DB
writeConfig3 = {
    "Endpoint": "https://<cosmosdb-account>.documents.azure.com:443/",
    "Masterkey": "<key>==",
    "Database": "mydatabase",
    "Collection": "mycontainer",
    "Upsert": "true"
}

df = test.coalesce(1)

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()

With the code above, I successfully wrote to the Cosmos DB database (mydatabase) and collection (mycontainer).

I then tried to overwrite the container by changing the Spark SQL query as follows (only renaming pattersonID1 to pattersonID2 and myinvoicenr1 to myinvoicenr2):

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID2
 ,Sales.InvoiceNumber AS myinvoicenr2
FROM Sales
limit 4""")

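Instead of overwriting/updating the collection with the results of the new query, Cosmos DB appended the new documents to the container.

The documents from the original query are still in the collection.

Is there a way to completely overwrite or update the Cosmos DB collection?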

Asked by

Carltonp

David Makogon 2020-02-02 23:06

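Your problem is that documents have a unique id (something you never specified, so a GUID is auto-generated for you). When you wrote the new documents, you only renamed a non-id, non-unique property from pattersonID1 to pattersonID2, so Cosmos DB created new documents, as expected. There is no way for it to know that a new document is related to the original one: it is an entirely new document with its own set of properties.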
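The behavior described above can be illustrated with a small in-memory sketch. This is not the Cosmos DB client or the Spark connector: the `upsert` helper, the dictionary standing in for the container, and the `1|INV-1` id scheme are all hypothetical, chosen only to show why a write without an `id` adds a document while a write with a stable `id` replaces one.

```python
import uuid


def upsert(container, doc):
    """Insert doc, or replace the stored document that has the same id."""
    doc = dict(doc)
    # Assumption: a missing id is filled with a random GUID, as the answer describes.
    doc.setdefault("id", str(uuid.uuid4()))
    container[doc["id"]] = doc
    return doc["id"]


# No id supplied: each write gets a fresh GUID, so the second write adds a document.
without_id = {}
upsert(without_id, {"pattersonID1": 1, "myinvoicenr1": "INV-1"})
upsert(without_id, {"pattersonID2": 1, "myinvoicenr2": "INV-1"})
count_without_id = len(without_id)

# Stable id built from the row's key values (hypothetical scheme):
# the second write replaces the first instead of adding to it.
with_id = {}
upsert(with_id, {"id": "1|INV-1", "pattersonID1": 1, "myinvoicenr1": "INV-1"})
upsert(with_id, {"id": "1|INV-1", "pattersonID2": 1, "myinvoicenr2": "INV-1"})
count_with_id = len(with_id)
```

In the notebook from the question, the same idea would probably mean adding a string `id` column derived from stable key values (for example CustomerID and InvoiceNumber) to the DataFrame before calling `df.write`, so that the `"Upsert": "true"` setting has an existing document to match against.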

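You can update existing documents by querying (or reading) them, modifying them, and then replacing them. Alternatively, you can query for the old documents and delete them (either one by one, or transactionally as a batch of deletes within a partition via a stored procedure). Lastly, you can delete and re-create the container, which removes all documents currently stored in it.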
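The first two options above can be sketched with the `azure-cosmos` Python SDK. This is a minimal sketch under assumptions, not the answerer's code: the helper names `connect`, `rename_fields`, and `delete_all` are hypothetical, and `delete_all` assumes the container is partitioned on `/id`, which may not match your container. Deleting one document at a time is also not transactional.

```python
def connect(endpoint, key, database, collection):
    """Return a container client for an existing database and container."""
    # Imported lazily so the helpers below can be exercised without the SDK installed.
    from azure.cosmos import CosmosClient

    client = CosmosClient(endpoint, credential=key)
    return client.get_database_client(database).get_container_client(collection)


def rename_fields(container, renames):
    """Query every document, rename the given fields, and replace it in place."""
    docs = list(container.query_items(
        query="SELECT * FROM c", enable_cross_partition_query=True))
    updated = 0
    for doc in docs:
        changed = False
        for old, new in renames.items():
            if old in doc:
                doc[new] = doc.pop(old)
                changed = True
        if changed:
            # The id is unchanged, so this replaces the stored document.
            container.replace_item(item=doc["id"], body=doc)
            updated += 1
    return updated


def delete_all(container, partition_key_field="id"):
    """Delete every document one by one (assumes partitioning on /id by default)."""
    docs = list(container.query_items(
        query="SELECT * FROM c", enable_cross_partition_query=True))
    for doc in docs:
        container.delete_item(item=doc["id"], partition_key=doc[partition_key_field])
    return len(docs)
```

A call such as `rename_fields(container, {"pattersonID1": "pattersonID2", "myinvoicenr1": "myinvoicenr2"})` would update the documents written by the first query in place. For the third option, the SDK's database client exposes `delete_container` and `create_container`, which drop and re-create the container along with all of its documents.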

Carltonp 2020-02-02 23:41:13

I see. I never thought about the id. Good catch. Is there a link that shows how to update existing documents/collections?
