Question: Reading old data in Spark with backward compatible schema

Question

I already have some older data stored in Parquet with a schema represented by

case class A(name: String)

I'd like to add a new, non-mandatory field, represented by

case class B(name: String, age: Option[Int])

and read both the old and the new data into the same DataFrame. Each time I try to read the data with spark.read.parquet("test_path").as[B].collect(), I get the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`age`' given input columns: [name];

Is there a way to specify a backward compatible schema for all of my data?
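For context, a minimal, self-contained sketch of the failure described above (the local-mode SparkSession setup here is an assumption for the sketch; test_path is the path from the question):

import org.apache.spark.sql.SparkSession

case class A(name: String)
case class B(name: String, age: Option[Int])

object Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Write "old" data that only has the name column.
    Seq(A("alice"), A("bob")).toDS().write.mode("overwrite").parquet("test_path")

    // Spark infers the schema from the Parquet files, which contain no age
    // column, so the as[B] conversion fails with the AnalysisException above.
    spark.read.parquet("test_path").as[B].collect()
  }
}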

Answer #1, added 2017-11-07 16:11

In order to read older data with a backward compatible schema, it's not enough to specify the new Encoder: you have to pass an explicit StructType to the reader rather than letting Spark infer the schema from the Parquet files. That way no field is missing during the conversion into a DataFrame; columns absent from the old files simply come back as null:

import org.apache.spark.sql.Encoders

spark.read.schema(Encoders.product[B].schema).parquet("test_path").as[B].collect()
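Expanded into a runnable sketch under the same assumptions as the reproduction above, the rows written with the old schema come back with age = None:

import org.apache.spark.sql.{Encoders, SparkSession}

case class B(name: String, age: Option[Int])

object ReadWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Derive the StructType from the case class and hand it to the reader,
    // so Spark does not infer the schema from the Parquet files themselves.
    val loaded = spark.read
      .schema(Encoders.product[B].schema)
      .parquet("test_path")
      .as[B]
      .collect()

    // Columns missing from the files are filled with null, which the
    // Option[Int] field deserializes as None, e.g. B(alice,None).
    loaded.foreach(println)
  }
}

Note that the new field has to be nullable (hence Option[Int]) for the missing column to deserialize cleanly.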
