Oozie Datasets

Thu 10 March 2016

Datasets

Lecture notes, Apache Oozie Essentials

A Dataset is a collection of data, which is identified by some logical name. For example, the press release can be defined as follows:

<dataset name="pressrelease" 
    frequency="${coord:days(1)}" 
    initial-instance="2015-08-15T00:00Z" 
    timezone="Australia/Sydney">
        <uri-template>
            {nameNode}/learn_oozie/ch04/input/pressrelease/${YEAR}/${MONTH}/${DAY}
        </uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>

The logical name is resolved by filling the values of the YEAR, MONTH, and DAY variables. The data in Dataset is immutable and produced at regular intervals with defined frequency; in the preceding example, it is one day ( coord:days(1) ). We will see it in detail in the Frequency and time section.

Datasets (note plural) is a group of individual datasets. They are defined once in the Hadoop platform and those definitions can be reused among any number of coordinator jobs. For example, we define the Dataset pressrelease , specify that it comes once a day, and tell that it can be found with a path represented by . In the same way, we can define other Datasets like job postings, public interactions, and so on.

Since the Datasets is a collection of datasets, they can be represented by a collection of individual Datasets and include file representation.

As best practice for organizing data inside a cluster, you will represent them as Dataset and store all the logical datasets.xml files in one place on HDFS so that all Coordinator jobs can reuse them.

Dataset definition needs the following details:

Field Definition Example
name Dataset name pressrelease
frequency Indicates time in minutes at which data is created and frequency uses EL expressions to represent time ${coord:days(1)
initial-instance The UTC date and time of the initial instance of the Dataset 2015-08-15T00:00Z
timezone The timezone of the Dataset Australia/Sydney
url-template The URI template that identifies the Dataset is made up of constants ( YEAR ) and variables namenode/learn_oozie/ch04/input/pressrelase/
done-flag instance is ready to be consumed _SUCCESS file

The tag

The <done-flag> tag indicates when a dataset instance is ready to be processed. Here are the various options for <done-flag>:

  • If the tag is not present, the Coordinator will not start till the file named _SUCCESS is present in the directory. In our file-based load use case the process that is copying the files to HDFS will also create the _SUCCESS flag using the hdfs fs -touchz command.
  • If the tag is there but the tag is empty as , then the directory (if present) gives a signal to the Coordinator that the Dataset is ready.
  • If the tag is there but has some filename, Oozie will check for the presence of the file with given name in the tag in the directory, for example, if the filename is _DATA_LOAD_DONE, Oozie will check the existence of the file _DATA_LOAD_DONE.