【Apache Spark】spark-avro:A library for reading and writing Avro data from Spar

Lisrelchen

2017-04-18

Avro Data Source for Apache Spark

A library for reading and writing Avro data from Spark SQL.

Requirements

This documentation is for version 3.2.0 of this library, which supports Spark 2.0+. For documentation on earlier versions of this library, see the links below.

This library has different versions for Spark 1.2, 1.3, 1.4 through 1.6, and 2.0+:

Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4-1.6          2.0.1
2.0+             3.2.0 (this version)

Linking

This library is cross-published for Scala 2.11, so 2.11 users should replace 2.10 with 2.11 in the commands listed below.

You can link against this library in your program at the following coordinates:

Using SBT:

libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"

Using Maven:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>3.2.0</version>
</dependency>

With spark-shell or spark-submit

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.

Features

Avro Data Source for Spark supports reading and writing of Avro data from Spark SQL.
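
As a rough illustration of reading and writing, the sketch below writes a small DataFrame as Avro and loads it back. It assumes Spark 2.0+ with the library on the classpath (for example through the --packages option described above). The object name, the sample data, and the output path /tmp/avro-round-trip are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object AvroRoundTrip {
  // Data source name under which the library is used with format(...)
  val AvroFormat = "com.databricks.spark.avro"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-round-trip")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location; change it to suit your environment
    val outputPath = "/tmp/avro-round-trip"

    // Build a tiny DataFrame and write it out as Avro files
    val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")
    people.write.mode("overwrite").format(AvroFormat).save(outputPath)

    // Read the Avro files back; the schema is converted to Spark SQL types
    val loaded = spark.read.format(AvroFormat).load(outputPath)
    loaded.printSchema()
    loaded.show()

    spark.stop()
  }
}
```

Using format(...) with the full data source name avoids relying on any implicit helpers, so the same calls look the same from Scala, Java, or Python.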

Automatic schema conversion: It supports most conversions between Spark SQL and Avro records, making Avro a first-class citizen in Spark.
Partitioning: This library lets you read and write partitioned data without any extra configuration. Pass the columns you want to partition on, as you would for Parquet.
Compression: You can specify the type of compression to use when writing Avro out to disk. The supported types are uncompressed, snappy, and deflate. You can also specify the deflate level.
Specifying record names: You can specify the record name and namespace to use by passing a map of parameters with recordName and recordNamespace.
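
The partitioning, compression, and record-name features above can be combined in a single write. The following is a minimal sketch, not a definitive recipe. It assumes the compression settings are exposed through the Spark configuration keys spark.sql.avro.compression.codec and spark.sql.avro.deflate.level, so check these against the documentation for your version of the library. The object name, sample data, record name, namespace, and output path are placeholders invented for this example.

```scala
import org.apache.spark.sql.SparkSession

object AvroFeatures {
  val AvroFormat = "com.databricks.spark.avro"

  // Record name and namespace passed as writer options (placeholder values)
  val recordOptions: Map[String, String] = Map(
    "recordName" -> "Event",
    "recordNamespace" -> "com.example.events"
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-features")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // Compression: choose the codec and, for deflate, the level
    // (configuration keys assumed; verify for your library version)
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")
    spark.conf.set("spark.sql.avro.deflate.level", "5")

    // Hypothetical sample data and output location
    val outputPath = "/tmp/avro-features"
    val events = Seq(
      (2017, 4, "login"),
      (2017, 5, "logout")
    ).toDF("year", "month", "action")

    // Partitioning: one directory per (year, month) pair, as with Parquet
    events.write
      .mode("overwrite")
      .partitionBy("year", "month")
      .options(recordOptions)
      .format(AvroFormat)
      .save(outputPath)

    // Reading the root directory recovers the partition columns
    val loaded = spark.read.format(AvroFormat).load(outputPath)
    loaded.show()

    spark.stop()
  }
}
```

After the write, the output directory should contain subdirectories such as year=2017/month=4, which is the same layout Spark uses for partitioned Parquet data.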