Stratosphere
Developer(s) HPI, HU Berlin, TU Berlin
Stable release
0.4 / January 13, 2014 (2014-01-13)
Preview release
0.5-SNAPSHOT
Written inJava, Scala
Operating systemCross-platform
TypeDistributed Computing
LicenseApache License 2.0
Websitestratosphere.eu

Stratosphere is an open-source software framework for processing large data sets on clusters. It is developed by the Hasso Plattner Institute, HU Berlin, TU Berlin[1] and individual contributors. It is licensed under the Apache License 2.0.

Architecture

edit

Front ends

edit

Stratosphere provides a Java and a Scala API to write programs.[2]. The internals of Stratosphere are implemented in Java.

Runtime

edit

Operators

edit

Similar to MapReduce or Apache Hadoop the Stratosphere runtime provides second-order functions called Operators (in publications called PACTs (PArallelization ContracTs)) for processing records with UDFs.[3]

Function name Image Description
MAP
 
Stratosphere MAP
Receives one record stream and applies UDF to each record.
REDUCE
 
Stratosphere REDUCE
Receives one record stream and applies UDF to all records with the same key.
CO-GROUP
 
Stratosphere CO-GROUP
Receives two record streams and applies UDF to all records with the same key.
CROSS
 
Stratosphere CROSS
Receives two record streams and applies UDF to each element of their Cartesian product.
JOIN
 
Stratosphere JOIN
Receives two record streams and applies UDF to each record of an equi-join on the keys.

The runtime contains different primitives to execute these operators. Currently, there are two join algorithms implemented in Stratosphere: Sort-Merge join and Hybrid Hash Join. The REDUCE operator uses an external sort.[3]

Iterations

edit

In addition to the five second-order functions Stratosphere also provides two types of iterations, bulk and incremental.[4][5]

Stratosphere Compiler

edit

In contrast to other frameworks the pipeline is not fixed to a Map step followed by an optional Reduce step. A Stratosphere job can be an arbitrary DAG composed of multiple operators, bulk iterations, incremental iterations, data sources and data sinks. This enables optimizations like choosing a join strategy at runtime based on metadata, available system resources and compiler hints provided by the user.

The result of the optimized Stratosphere program is a job graph that can be executed on the cluster.

Nephele execution engine

edit

Nephele is the low-level parallel execution engine of Stratosphere. It handles cluster resource allocation and in-memory and network communication.[6].

File systems

edit

Stratosphere supports HDFS, HBase, Avro, Amazon S3, JDBC and local file systems as data sources and data sinks for Stratopshere jobs.

See also

edit

References

edit
  1. ^ "Stratosphere.eu Website". Retrieved 27 November 2013.
  2. ^ "Stratosphere Intro (Java and Scala Interface)".
  3. ^ a b Battré, Dominic; Ewen, Stephan; Hueske, Fabian; Kao, Odej; Markl, Volker; Warneke, Daniel (2010). "Nephele/PACTs". Proceedings of the 1st ACM symposium on Cloud computing. pp. 119–130. doi:10.1145/1807128.1807148. ISBN 9781450300360.
  4. ^ Ewen, Stephan; Schelter, Sebastian; Tzoumas, Kostas; Warneke, Daniel; Markl, Volker (2013). ""Iterative parallel data processing with stratosphere: an inside look"": 1053–1056. doi:10.1145/2463676.2463693. {{cite journal}}: Cite journal requires |journal= (help)
  5. ^ Ewen, Stephan; Tzoumas, Kostas; Kaufmann, Moritz; Markl, Volker (July 2012). "Spinning fast iterative data flows". Proceedings of the VLDB Endowment. 5 (11): 1268–1279. doi:10.14778/2350229.2350245. Retrieved 3 December 2013.{{cite journal}}: CS1 maint: date and year (link)
  6. ^ Warneke, Daniel; Kao, Odej (2009). "Nephele". Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers. pp. 1–10. doi:10.1145/1646468.1646476. ISBN 9781605587141.
edit

Category:Free software programmed in Java Category:Cloud computing Category:Cloud infrastructure Category:Free software for cloud computing