This repository has been archived by the owner on Jan 27, 2020. It is now read-only.
Utz Westermann edited this page May 8, 2018 · 149 revisions

Introduction

Schedoscope is a scheduling framework for agile development, testing, (re)loading, and monitoring of your Hadoop data warehouse.

Schedoscope spares you the headache you are bound to get when you frequently have to roll out changes to computation logic and data structures in your data warehouse, and apply them retroactively, with traditional ETL job schedulers such as Oozie.

Scheduling with Schedoscope is based on three principles:

  1. Goal orientation: with Schedoscope, you specify the views you want and the scheduler takes care that the corresponding data are loaded.

  2. Self-sufficiency: Schedoscope has all information about views available: structure, dependencies, and transformation logic. The scheduler can thus start out from an empty metastore and create all tables and partitions as data are loaded. Metadata management and lineage tracing also become trivial, as data structures and dependencies are explicitly specified.

  3. Reloading is loading: Schedoscope implements measures to automatically detect changes to view structure and computation logic; as it is self-sufficient, it can then automatically recompute potentially outdated views.
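
The three principles can be illustrated with a toy model. The sketch below is not Schedoscope's API: `View`, `ToyScheduler`, and the checksum scheme are hypothetical names and assumptions made for illustration only. It assumes that a view's checksum covers its own logic plus the checksums of its dependencies, so that a change anywhere upstream marks the view as outdated.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class View:
    """Hypothetical stand-in for a view: name, logic, and dependencies."""
    name: str
    logic: str  # transformation logic, e.g. a query string
    dependencies: list = field(default_factory=list)

    def checksum(self):
        # Assumption: the checksum covers this view's logic and, transitively,
        # the logic of everything it depends on.
        digest = hashlib.sha256(self.logic.encode("utf-8"))
        for dependency in self.dependencies:
            digest.update(dependency.checksum().encode("utf-8"))
        return digest.hexdigest()


class ToyScheduler:
    def __init__(self):
        self.materialized = {}  # view name -> checksum at load time
        self.log = []           # order in which views were (re)computed

    def materialize(self, view):
        """Goal orientation: name the view you want; dependencies come first."""
        for dependency in view.dependencies:
            self.materialize(dependency)
        current = view.checksum()
        # Reloading is loading: a missing or stale checksum triggers a load.
        if self.materialized.get(view.name) != current:
            self.materialized[view.name] = current
            self.log.append(view.name)
```

Starting from an empty `ToyScheduler`, asking for a view loads its dependencies first. Asking again does nothing, and changing the `logic` of an upstream view causes that view and everything depending on it to be recomputed on the next request.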

Getting Started

Get a glimpse of what Schedoscope does for you:

Build it:

 [~]$ git clone https://github.com/ottogroup/schedoscope.git
 [~]$ cd schedoscope
 [~/schedoscope]$ MAVEN_OPTS='-Xmx1G' mvn clean install

Follow the Open Street Map tutorial to install and run Schedoscope in a standard Hadoop distribution image:

Read the View DSL Primer for more information about the capabilities of the Schedoscope DSL:

Read more about how Schedoscope actually performs its scheduling work:

Check out Metascope! It's an add-on to Schedoscope for collaborative metadata management, data discovery and exploration, and data lineage tracing:

News

05/08/2018 - Release 0.10.2

We have released Version 0.10.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

We have changed the materialization logic of materializeOnce views such that they no longer ask their child views to materialize if the materializeOnce views have been materialized already. This improves performance.
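
The effect of this change can be sketched with a toy model. The code is not Schedoscope's implementation: `ToyView` and `materialize` are hypothetical, and the sketch assumes that "child views" refers to the views a view depends on.

```python
class ToyView:
    """Hypothetical view node; not a Schedoscope class."""

    def __init__(self, name, dependencies=(), materialize_once=False):
        self.name = name
        self.dependencies = list(dependencies)
        self.materialize_once = materialize_once


def materialize(view, materialized, requests):
    """Materialize a view, recording every request it causes in `requests`."""
    requests.append(view.name)
    if view.materialize_once and view.name in materialized:
        # Already loaded once: return without asking the dependencies.
        return
    for dependency in view.dependencies:
        materialize(dependency, materialized, requests)
    materialized.add(view.name)
```

In this model, the second request for an already materialized `materialize_once` view produces a single entry in `requests` instead of one entry per view in its dependency tree, which is where the saving comes from.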

04/24/2018 - Release 0.10.1

We have released Version 0.10.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This is a bugfix release correcting the order of the TBLPROPERTIES and LOCATION clauses in the Hive DDL generated for views. Please do note that if you use the tblProperties clause in some views, this change affects the DDL checksum making Schedoscope drop and recreate the respective tables. Hence the version bump to 0.10.1.
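
Why only views with a tblProperties clause are affected can be seen in a small sketch. It assumes that the DDL checksum is a digest over the generated DDL text; `render_ddl`, `ddl_checksum`, and the use of SHA-256 are illustrative assumptions, not Schedoscope's actual code. Hive's `CREATE TABLE` grammar places `LOCATION` before `TBLPROPERTIES`.

```python
import hashlib


def render_ddl(table, location, tbl_properties=None, location_first=True):
    """Render a simplified CREATE TABLE statement (hypothetical helper)."""
    location_clause = f"LOCATION '{location}'"
    properties_clause = ""
    if tbl_properties:
        pairs = ", ".join(
            f"'{key}' = '{value}'" for key, value in sorted(tbl_properties.items())
        )
        properties_clause = f"TBLPROPERTIES ({pairs})"
    if location_first:
        clauses = [location_clause, properties_clause]
    else:
        clauses = [properties_clause, location_clause]
    head = f"CREATE EXTERNAL TABLE {table} (id STRING)"
    # Empty clauses are dropped, so tables without properties render identically.
    return " ".join([head] + [clause for clause in clauses if clause])


def ddl_checksum(ddl):
    # Assumption: any stable digest over the DDL text behaves the same way.
    return hashlib.sha256(ddl.encode("utf-8")).hexdigest()
```

For a table without properties, both clause orders yield the same text and therefore the same checksum. As soon as properties are present, swapping the clause order changes the text and with it the checksum.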

Thanks to Julian Keppel for reporting the issue and providing the fix.

04/06/2018 - Release 0.9.13

We have released Version 0.9.13 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Removed derelict indirect CDH5.12.0 dependencies incurred by Cloudera's Spark 2.2.0-Cloudera2 dependency.

03/15/2018 - Release 0.9.11

We have released Version 0.9.11 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Added configuration parameter schedoscope.export.disableAll to globally disable all view exports. Useful in test environments.

03/09/2018 - Release 0.9.10

We have released Version 0.9.10 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Upgraded Cloudera dependencies to CDH 5.14.0.

01/25/2018 - Release 0.9.9

We have released Version 0.9.9 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Export your views to Google Cloud Platform's BigQuery via a simple exportAs() statement.

BigQuery export now compresses view data before sending it off to Google Cloud Storage.

01/24/2018 - Release 0.9.7

We have released Version 0.9.7 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Optimized performance of BigQuery export by moving more work to the map phase of the export job.

01/23/2018 - Release 0.9.6

We have released Version 0.9.6 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Corrected a problem with command line argument construction within BigQuery exportAs() clauses in a Kerberized cluster.

01/22/2018 - Release 0.9.5

We have released Version 0.9.5 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Export your views to Google Cloud Platform's BigQuery via a simple exportAs() statement.

10/12/2017 - Release 0.9.4

We have released Version 0.9.4 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Emergency bug fix for Schedoscope crashing upon exports. Do not use 0.9.3!

10/11/2017 - Release 0.9.3

We have released Version 0.9.3 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Minor bug fix. Show view name in resource manager also for transformations of views that have exportAs statements.

09/21/2017 - Release 0.9.2

We have released Version 0.9.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Minor bug fixes. Improved Metascope performance by optionally circumventing the Hive Metastore API and accessing the Metastore DB directly.

08/17/2017 - Release 0.9.1

We have released Version 0.9.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

We fixed a bug in the Spark driver that could lead to incomplete consumption of the error stream of the Spark submit subprocess resulting in transformation freezes.
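
The general failure mode can be reproduced outside of Spark: a child process that writes more to its error stream than the pipe buffer holds blocks until the parent reads, so a parent that does not drain the stream completely can freeze. The sketch below is a Python analogy under that assumption, not the fix applied in the Spark driver. `run_and_drain` and `NOISY_CHILD` are hypothetical names.

```python
import subprocess
import sys


def run_and_drain(command):
    """Run a command and consume stdout and stderr completely."""
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    # communicate() reads both pipes until end-of-file and then waits for
    # the child, so neither stream can fill up and stall the child.
    out, err = process.communicate()
    return process.returncode, out, err


# Hypothetical noisy child: writes far more than a typical pipe buffer
# to stderr before printing a final line to stdout.
NOISY_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('x' * 200000); print('done')",
]
```

Calling `run_and_drain(NOISY_CHILD)` returns once the child has exited, with the complete error output available for logging.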

08/11/2017 - Release 0.9.0

We have released Version 0.9.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release upgrades Spark transformations from Spark version 1.6.0 to Spark version 2.2.0 based on Cloudera's CDH 5.12 Spark 2.2 beta parcel. As a consequence, Schedoscope has been lifted to Scala 2.11 and JDK8 as well.

This is an incompatible change likely requiring adaptation of Spark jobs, dependencies, and build pipelines of existing Schedoscope projects - hence the increment of the minor release number.

08/04/2017 - Release 0.8.9

We have released Version 0.8.9 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release contains the following enhancements and changes:

  • Cloudera client libraries updated to CDH-5.12.0;
  • a DistCp transformation for view materialization by parallel, cross-cluster file copying;
  • a new development mode setup that helps developers to easily copy data from a production environment to the direct dependencies of the view they are developing;
  • shell transformations had to be moved back into schedoscope-core to facilitate development mode;
  • a versioning issue with the Scala Maven compiler plugin with regard to Scala 2.10 was fixed, so that Schedoscope finally compiles and runs under JDK8 as well.

07/04/2017 - Release 0.8.7

We have released Version 0.8.7 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version fixes a critical Metascope bug, introduced with the last version, that prevented startup. Also, Metascope field lineage documentation has finally been provided in the View DSL Primer and the Metascope Primer.

06/23/2017 - Release 0.8.6

We have released Version 0.8.6 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version includes support for field level data lineage - automatically inferred from Hive transformations, declaratively specifiable for other transformations - in Metascope. Also, Metascope lineage graph rendering has been reworked. Extensive documentation to come.

Schedoscope now fails immediately if a driver specified in schedoscope.conf cannot be found on the classpath.

05/26/2017 - Release 0.8.5

We have released Version 0.8.5 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version adds support for float view fields to JDBC exports.

05/24/2017 - Release 0.8.4

We have released Version 0.8.4 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version removes a race condition in the file system driver initialization that seems to have been introduced with CDH-5.10. Also, we have changed the way we delete and recreate output folders for Map/Reduce transformations to avoid Hive partitions pointing to temporarily non-existing folders.

04/24/2017 - Release 0.8.3

We have released Version 0.8.3 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version has been built against Cloudera's CDH 5.10.1 client libraries. The test framework no longer artificially sets the storage formats of views under test to text, making testing of Spark jobs writing Parquet files simpler. The robustness of the Schedoscope HTTP service has been improved in the face of invalid view parameters.

03/24/2017 - Release 0.8.2

We have released Version 0.8.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This version provides significant performance improvements when initializing the scheduling state for a large number of views.

03/18/2017 - Release 0.8.1

We have released Version 0.8.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This fixes a critical bug that could result in applying commands to all views in a table and not just the ones addressed. Do not use Release 0.8.0.

03/17/2017 - Release 0.8.0

We have released Version 0.8.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Schedoscope 0.8.0 includes, among other things:

  • significant rework of Schedoscope's actor system that supports testing and uses significantly fewer actors reducing stress for poor Akka;
  • support for a lot more Hive storage formats;
  • definition of arbitrary Hive table properties / SerDes;
  • stability, performance, and UI improvements to Metascope;
  • the names of views being transformed appear as the job name in the Hadoop resource manager.

Please note that Metascope's database schema has changed with this release, so back up your database before deploying.

(more)
