Datasets
Lecture notes, Apache Oozie Essentials
A Dataset is a collection of data, which is identified by some logical name. For example, the press release can be defined as follows:
<dataset name="pressrelease"
frequency="${coord:days(1)}"
initial-instance="2015-08-15T00:00Z"
timezone="Australia/Sydney">
<uri-template>
{nameNode}/learn_oozie/ch04/input/pressrelease/${YEAR}/${MONTH}/${DAY}
</uri-template>
<done-flag>_SUCCESS</done-flag>
</dataset>
The logical name is resolved by filling the values of the YEAR, MONTH, and DAY variables. The data in Dataset is immutable and produced at regular intervals with defined frequency; in the preceding example, it is one day ( coord:days(1) ). We will see it in detail in the Frequency and time section.
Datasets (note plural) is a group of individual datasets. They are defined once in the Hadoop platform and those definitions can be reused among any number of coordinator jobs. For example, we define the Dataset pressrelease , specify that it comes once a day, and tell that it can be found with a path represented by
Since the Datasets is a collection of datasets, they can be represented by a collection of individual Datasets and include file representation.
As best practice for organizing data inside a cluster, you will represent them as Dataset and store all the logical datasets.xml files in one place on HDFS so that all Coordinator jobs can reuse them.
Dataset definition needs the following details:
Field | Definition | Example |
---|---|---|
name | Dataset name | pressrelease |
frequency | Indicates time in minutes at which data is created and frequency uses EL expressions to represent time | ${coord:days(1) |
initial-instance | The UTC date and time of the initial instance of the Dataset | 2015-08-15T00:00Z |
timezone | The timezone of the Dataset | Australia/Sydney |
url-template | The URI template that identifies the Dataset is made up of constants ( YEAR ) and variables | namenode/learn_oozie/ch04/input/pressrelase/ |
done-flag | instance is ready to be consumed | _SUCCESS file |
The tag
The <done-flag>
tag indicates when a dataset instance is ready to be processed. Here are the various options for <done-flag>
:
- If the
tag is not present, the Coordinator will not start till the file named _SUCCESS is present in the directory. In our file-based load use case the process that is copying the files to HDFS will also create the _SUCCESS flag using the hdfs fs -touchz
command. - If the
tag is there but the tag is empty as , then the directory (if present) gives a signal to the Coordinator that the Dataset is ready. - If the
tag is there but has some filename, Oozie will check for the presence of the file with given name in the tag in the directory, for example, if the filename is _DATA_LOAD_DONE , Oozie will check the existence of the file _DATA_LOAD_DONE.