This article is reproduced from stackoverflow.com.
apache-spark azure-databricks dataframe pyspark schema

How to get the schema definition from a dataframe in PySpark?

Published on 2020-04-23 13:19:21

In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([ StructField("temperature", DoubleType(), True),
                      StructField("temperature_unit", StringType(), True),
                      StructField("humidity", DoubleType(), True),
                      StructField("humidity_unit", StringType(), True),
                      StructField("pressure", DoubleType(), True),
                      StructField("pressure_unit", StringType(), True)
                    ])

For some data sources it is possible to infer the schema from the data source and get a dataframe with that inferred schema.

Is it possible to get the schema definition (in the form described above) from a dataframe whose schema has been inferred?

df.printSchema() prints the schema as a tree, but I need to reuse the schema, defined as above, so that I can read a data source with a schema that was previously inferred from another data source.

Questioner: Hauke Mallow
Viewed: 19
community wiki 2019-02-03 21:06

Yes, it is possible. Use the DataFrame.schema property:

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

New in version 1.3.

The schema can also be exported to JSON and imported back if needed.