Apache Spark 2.2.0 Officially Released
After half a year of development, from RC1 through RC6, Apache Spark 2.2.0 was officially released today. This is the third release on the 2.x line. In this release the experimental tag has been removed from Structured Streaming, which means that from 2.2.x onward it can be used in production with confidence. Beyond that, this release focuses on usability, stability, and polish, with no other major changes. Some of the new features in this release:
· Support for LATERAL VIEW OUTER explode();
· Support for adding columns to a table (ALTER TABLE table_name ADD COLUMNS);
· Support for reading data from Hive metastore 2.0/2.1;
· Support for parsing multi-line JSON or CSV files;
· Structured Streaming APIs for R;
· Full Catalog API support in R;
· DataFrame checkpointing support in R.
In addition, this release drops support for Java 7 and for Hadoop 2.5 and earlier. The detailed changes are as follows:
Core and Spark SQL
· API updates
· SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
· SPARK-13721: Add support for LATERAL VIEW OUTER explode()
· SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
· SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries
· SPARK-18350: Support session local timezone
· SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
· SPARK-20420: Add events to the external catalog
· SPARK-18127: Add hooks and extension points to Spark
· SPARK-20576: Support generic hint function in Dataset/DataFrame
· SPARK-17203: Data source options should always be case insensitive
· SPARK-19139: AES-based authentication mechanism for Spark
· Performance and stability
· Cost-Based Optimizer
· SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
· SPARK-17080: Cost-based join re-ordering
· SPARK-17626: TPC-DS performance improvements using star-schema heuristics
· SPARK-17949: Introduce a JVM object based aggregate operator
· SPARK-18186: Partial aggregation support of HiveUDAFFunction
· SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON
· SPARK-18775: Limit the max number of records written per file
· SPARK-18761: Uncancellable / unkillable tasks shouldn't starve jobs of resources
· SPARK-15352: Topology aware block replication
· Other notable changes
· SPARK-18352: Support for parsing multi-line JSON files
· SPARK-19610: Support for parsing multi-line CSV files
· SPARK-21079: Analyze Table Command on partitioned tables
· SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
· SPARK-18209: More robust view canonicalization without full SQL expansion
· SPARK-13446 [SPARK-18112]: Support reading data from Hive metastore 2.0/2.1
· SPARK-18191: Port RDD API to use commit protocol
· SPARK-8425: Add blacklist mechanism for task scheduling
· SPARK-19464: Remove support for Hadoop 2.5 and earlier
· SPARK-19493: Remove Java 7 support
Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Structured Streaming
· General Availability
· SPARK-20844: The Structured Streaming APIs are now GA and are no longer labeled experimental
· Kafka Improvements
· SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka
· SPARK-19968: Cached producer for lower latency Kafka-to-Kafka streams.
· API updates
· SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
· SPARK-19876: Support for one-time triggers
· Other notable changes
· SPARK-20979: Rate source for testing and benchmarks
Programming guide: Structured Streaming Programming Guide.
MLlib
· New algorithms in DataFrame-based API
· SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
· SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
· SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
· SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
· SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
· SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
· Existing algorithms added to Python and R APIs
· SPARK-18239: Gradient Boosted Trees (R)
· SPARK-18821: Bisecting K-Means (R)
· SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
· SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
· Major bug fixes
· SPARK-19110: DistributedLDAModel.logPrior correctness fix
· SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
· SPARK-18715: Fix wrong AIC calculation in Binomial GLM
· SPARK-16473: BisectingKMeans failing during training with "java.util.NoSuchElementException: key not found" for certain inputs
· SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
· SPARK-20047: Box-constrained Logistic Regression
Programming guide: Machine Learning Library (MLlib) Guide.
SparkR
The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:
· Major features
· SPARK-19654: Structured Streaming API for R
· SPARK-20159: Support complete Catalog API in R
· SPARK-19795: Column functions to_json, from_json
· SPARK-19399: Coalesce on DataFrame and coalesce on column
· SPARK-20020: Support DataFrame checkpointing
· SPARK-18285: Multi-column approxQuantile in R
Programming guide: SparkR (R on Spark).
GraphX
· Bug fixes
· SPARK-18847: PageRank gives incorrect results for graphs with sinks
· SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results in ClassCastException
· Optimizations
· SPARK-18845: PageRank initial value improvement for faster convergence
· SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError
Programming guide: GraphX Programming Guide.
Deprecations
· MLlib
· SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.
· SparkR
· SPARK-20195: Deprecate createExternalTable
Changes of behavior
· MLlib
· SPARK-19787: The DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API's default regParam setting.
· SparkR
· SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.