Apache Spark 2.2.0 Officially Released
After half a year of development, from RC1 through RC6, Apache Spark 2.2.0 was officially released today. This is the third release on the 2.x line. In this release the experimental tag has been removed from Structured Streaming, which means that from 2.2.x onward it can be used in production with confidence. Beyond that, this release focuses on usability, stability, and polish, with no other major changes. Some of the new features in this release:
· Support for LATERAL VIEW OUTER explode();
· Support for adding columns to a table (ALTER TABLE table_name ADD COLUMNS);
· Support for reading data from Hive metastore 2.0/2.1;
· Support for parsing multi-line JSON or CSV files;
· Structured Streaming APIs for R;
· Full Catalog API support in R;
· DataFrame checkpointing support in R.
In addition, this release drops support for Java 7 and for Hadoop 2.5 and earlier. The detailed changes are as follows:
Core and Spark SQL
· API updates
· SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
· SPARK-13721: Add support for LATERAL VIEW OUTER explode()
· SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
· SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries
· SPARK-18350: Support session local timezone
· SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
· SPARK-20420: Add events to the external catalog
· SPARK-18127: Add hooks and extension points to Spark
· SPARK-20576: Support generic hint function in Dataset/DataFrame
· SPARK-17203: Data source options should always be case insensitive
· SPARK-19139: AES-based authentication mechanism for Spark
· Performance and stability
· Cost-Based Optimizer
· SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
· SPARK-17080: Cost-based join re-ordering
· SPARK-17626: TPC-DS performance improvements using star-schema heuristics
· SPARK-17949: Introduce a JVM object based aggregate operator
· SPARK-18186: Partial aggregation support of HiveUDAFFunction
· SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON
· SPARK-18775: Limit the max number of records written per file
· SPARK-18761: Uncancellable / unkillable tasks shouldn't starve jobs of resources
· SPARK-15352: Topology aware block replication
· Other notable changes
· SPARK-18352: Support for parsing multi-line JSON files
· SPARK-19610: Support for parsing multi-line CSV files
· SPARK-21079: Analyze Table Command on partitioned tables
· SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
· SPARK-18209: More robust view canonicalization without full SQL expansion
· SPARK-13446 [SPARK-18112]: Support reading data from Hive metastore 2.0/2.1
· SPARK-18191: Port RDD API to use commit protocol
· SPARK-8425: Add blacklist mechanism for task scheduling
· SPARK-19464: Remove support for Hadoop 2.5 and earlier
· SPARK-19493: Remove Java 7 support
Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Structured Streaming
· General Availability
· SPARK-20844: The Structured Streaming APIs are now GA and are no longer labeled experimental
· Kafka Improvements
· SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka
· SPARK-19968: Cached producer for lower latency Kafka-to-Kafka streams.
· API updates
· SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
· SPARK-19876: Support for one-time triggers
· Other notable changes
· SPARK-20979: Rate source for testing and benchmarks
Programming guide: Structured Streaming Programming Guide.
MLlib
· New algorithms in DataFrame-based API
· SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
· SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
· SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
· SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
· SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
· SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
· Existing algorithms added to Python and R APIs
· SPARK-18239: Gradient Boosted Trees (R)
· SPARK-18821: Bisecting K-Means (R)
· SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
· SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
· Major bug fixes
· SPARK-19110: DistributedLDAModel.logPrior correctness fix
· SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
· SPARK-18715: Fix wrong AIC calculation in Binomial GLM
· SPARK-16473: BisectingKMeans failing during training with "java.util.NoSuchElementException: key not found" for certain inputs
· SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
· SPARK-20047: Box-constrained Logistic Regression
Programming guide: Machine Learning Library (MLlib) Guide.
SparkR
The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:
· Major features
· SPARK-19654: Structured Streaming API for R
· SPARK-20159: Support complete Catalog API in R
· SPARK-19795: Column functions to_json, from_json
· SPARK-19399: Coalesce on DataFrame and coalesce on column
· SPARK-20020: Support DataFrame checkpointing
· SPARK-18285: Multi-column approxQuantile in R
Programming guide: SparkR (R on Spark).
GraphX
· Bug fixes
· SPARK-18847: PageRank gives incorrect results for graphs with sinks
· SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results in ClassCastException
· Optimizations
· SPARK-18845: PageRank initial value improvement for faster convergence
· SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError
Programming guide: GraphX Programming Guide.
Deprecations
· MLlib
· SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.
· SparkR
· SPARK-20195: Deprecate createExternalTable
Changes of behavior
· MLlib
· SPARK-19787: The DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API's default regParam setting.
· SparkR
· SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.