Note: This article is reproduced from serverfault.com.

apache-spark azure-databricks filesystems hadoop

How to list file keys in Databricks dbfs without dbutils

Posted on 2020-11-09 18:13:40

Apparently dbutils cannot be used in command-line spark-submit jobs; you must use Jar Jobs for that. However, I MUST use spark-submit style jobs due to other requirements, and I still need to list and iterate over file keys in dbfs to decide which files to use as input to a process.

Using scala, what lib in spark or hadoop can I use to retrieve a list of dbfs:/filekeys of a particular pattern?

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

def ls(sparkSession: SparkSession, inputDir: String): Seq[String] = {
  println(s"FileUtils.ls path: $inputDir")
  val path = new Path(inputDir)
  val fs = path.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
  val fileStatuses = fs.listStatus(path)
  fileStatuses.filter(_.isFile).map(_.getPath).map(_.getName).toSeq
}

Using the above, if I pass in a partial key prefix like dbfs:/mnt/path/to/folder while the following keys are present in said "folder":

/mnt/path/to/folder/file1.csv
/mnt/path/to/folder/file2.csv

I get "dbfs:/mnt/path/to/folder is not a directory" when it hits val path = new Path(inputDir).

Questioner

Rimer

Rimer 2020-11-30 23:51:19

You need to go through the SparkSession to do it.

Here's how we did it:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Get the default FileSystem from the session's Hadoop configuration.
def getFileSystem(sparkSession: SparkSession): FileSystem =
  FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)

// List the names of all entries (files and directories) directly under dir.
def listContents(sparkSession: SparkSession, dir: String): Seq[String] =
  getFileSystem(sparkSession).listStatus(new Path(dir)).toSeq.map(_.getPath.getName)
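
Since the question asks for keys matching a particular pattern, one possible variation is to use Hadoop's FileSystem.globStatus instead of listStatus. The sketch below is not part of the original answer: listMatching is a hypothetical helper name, and it takes a Hadoop Configuration directly (in a Spark job you would pass sparkSession.sparkContext.hadoopConfiguration). It assumes the dbfs:/ scheme is resolvable from that configuration, as it is for the code above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper (not from the original answer): returns the full paths
// of all files that match a Hadoop glob pattern such as "dbfs:/mnt/dir/*.csv".
def listMatching(conf: Configuration, pattern: String): Seq[String] = {
  val globPath = new Path(pattern)
  // Resolve the filesystem from the path's own scheme rather than the default.
  val fs = globPath.getFileSystem(conf)
  // globStatus can return null when a non-glob path does not exist,
  // so wrap the result in Option before using it.
  Option(fs.globStatus(globPath))
    .map(_.toSeq)
    .getOrElse(Seq.empty)
    .filter(_.isFile)
    .map(_.getPath.toString)
}
```

With the folder from the question, a call might look like listMatching(sparkSession.sparkContext.hadoopConfiguration, "dbfs:/mnt/path/to/folder/*.csv"). Returning full paths rather than bare names means the results can be passed straight to a reader.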
