This article is reproduced from serverfault.com.

Setting a data lake connection in the cluster Spark Config for Azure Databricks

Published on 2020-12-08 00:04:47

I am trying to simplify notebook creation for developers/data scientists in an Azure Databricks workspace that connects to an Azure Data Lake Gen2 account. Right now, every notebook has this at the top:

    %scala
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type.<datalake>.dfs.core.windows.net", "OAuth")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net", <SP client id>)
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net", dbutils.secrets.get(<SP client secret>))
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant>/oauth2/token")
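Since the five properties differ only in the account-qualified key suffix, the per-notebook boilerplate can be collapsed into a small helper. This is a sketch, not an existing API: `adlsOAuthConf` and its parameter names are assumptions for illustration.

```scala
// Build the five ADLS Gen2 OAuth Hadoop properties for one storage account.
// All parameter values (account, clientId, clientSecret, tenantId) are
// placeholders to be supplied by the caller.
def adlsOAuthConf(account: String,
                  clientId: String,
                  clientSecret: String,
                  tenantId: String): Map[String, String] = {
  val suffix = s"$account.dfs.core.windows.net"
  Map(
    s"fs.azure.account.auth.type.$suffix" -> "OAuth",
    s"fs.azure.account.oauth.provider.type.$suffix" ->
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    s"fs.azure.account.oauth2.client.id.$suffix" -> clientId,
    s"fs.azure.account.oauth2.client.secret.$suffix" -> clientSecret,
    s"fs.azure.account.oauth2.client.endpoint.$suffix" ->
      s"https://login.microsoftonline.com/$tenantId/oauth2/token"
  )
}

// In a notebook, the map would then be applied in one pass:
// adlsOAuthConf("<datalake>", "<SP client id>",
//               dbutils.secrets.get("<scope>", "<key>"), "<tenant>")
//   .foreach { case (k, v) =>
//     spark.sparkContext.hadoopConfiguration.set(k, v) }
```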

Our implementation wants to avoid mounting in DBFS, so I have been trying to see whether I can use the Spark Config on the cluster to define these values (each cluster may access a different data lake).

However, I have not been able to get it working. When I try various flavors of:

org.apache.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
org.apache.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
org.apache.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
org.apache.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
org.apache.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net "https://login.microsoftonline.com/<tenant>"

I get "Failure to initialize configuration" with the versions above, which looks like it is falling back to trying the storage account access key rather than the service principal credentials (tested by running a simple ls command to confirm access):

ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: Failure to initialize configuration
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:412)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1016)

I am hoping there is a way to solve this, although if the only answer is "you can't do that," that is of course an acceptable answer.

Questioner
Jason Whitish
Jim Xu 2020-12-10 10:59:23

If you want to add the Azure Data Lake Gen2 configuration in the Azure Databricks cluster Spark Config, please refer to the following configuration:

spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
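The key difference from the attempts in the question is the `spark.hadoop.` prefix (not `org.apache.hadoop.`): Spark copies properties with that prefix, prefix stripped, into the Hadoop configuration at startup. A minimal sketch of the mapping from a notebook-level Hadoop key to a cluster Spark Config line, using a placeholder account name:

```scala
// Turn a Hadoop configuration key/value pair into the line format the
// Databricks cluster "Spark Config" text box expects: the key gains the
// "spark.hadoop." prefix, and key and value are separated by a space.
// "mylake" below is a placeholder storage account name.
def toClusterConfLine(hadoopKey: String, value: String): String =
  s"spark.hadoop.$hadoopKey $value"

val line = toClusterConfLine(
  "fs.azure.account.auth.type.mylake.dfs.core.windows.net", "OAuth")
// line: "spark.hadoop.fs.azure.account.auth.type.mylake.dfs.core.windows.net OAuth"
```

Note also that the `{{secrets/<scope>/<key>}}` form shown above is how a Databricks secret is referenced from the cluster Spark Config, so the client secret never appears in plain text.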

For more details, please refer to the official documentation.