Defining the Spark connection in a Job script
addElementParameters {} properties
Properties relevant to selecting the Spark cluster to be used:
Function/parameter | Description | Mandatory? |
---|---|---|
SPARK_LOCAL_MODE |
Enter "true" to run your Spark Job in the local mode. By default, the value is "false", which means to use a remote cluster. In the local mode, the Studio builds the Spark environment in itself on the fly in order to run the Job in. Each processor of the local machine is used as a Spark worker to perform the computations. In this mode, your local file system is used; therefore, deactivate the configuration components such as tS3Configuration or tHDFSConfiguration that provides connection information to a remote file system, if you have placed these components in your Job. You can launch your Job without any further configuration. |
Yes |
SPARK_LOCAL_VERSION |
Enter the Spark version to be used in the local mode. This property is relevant only when you have entered "true" for SPARK_LOCAL_MODE. The Studio does not support Spark versions below 2.0 in the local mode. For example, enter the value "SPARK_2_1_0". |
Yes when Spark local mode is used. |
DISTRIBUTION |
Enter the name of the provider of your distribution. Depending on your distribution, enter one of the following values:
|
Yes when you are using neither the Spark local mode nor the Amazon EMR distribution. |
SPARK_VERSION |
Enter the version of your distribution. The following list provides example formats for each available distribution:
For more information about the distribution versions supported by Talend, see the section called Supported Big Data platform distribution versions for Talend Job in Talend Installation Guide. |
Yes when you are not using Spark local mode. |
SUPPORTED_SPARK_VERSION |
Enter the Spark version used by your distribution. For example, "SPARK_2_1_0". |
Yes when you are not using Spark local mode. |
SPARK_API_VERSION |
Enter "SPARK_200", the Spark API version used by Talend. |
Yes. |
SET_HDP_VERSION |
Enter "true" if your Hortonworks cluster is using the hdp.version variable to store its version; otherwise, enter "false". Contact the administrator of your cluster if you are not sure about this information. |
Yes when you are using Hortonworks. |
HDP_VERSION |
Enter the Hortonworks version to be used, for example, "\"2.6.0.3-8\"". Contact the administrator of your cluster if you are not sure about this information. You must also add the version number to the yarn-site.xml file of your cluster. In this example, add hdp.version=2.6.0.3-8. |
Yes when you have entered "true" for SET_HDP_VERSION. |
SPARK_MODE |
Enter the mode in which your Spark cluster has been implemented. Depending on your situation, enter one of the following values:
|
Yes when you are not using the Spark local mode. |
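As an illustration, the following sketch shows how these cluster-selection properties might be written in a Job script, assuming they are given as NAME : "value" pairs, separated by commas, inside the addElementParameters {} block; the surrounding script structure and the // comments are illustrative only, not a definitive syntax. It configures the Spark local mode with the example values from the table above:

```
addElementParameters {
  // Run the Spark Job in the local mode with the example versions from the table above
  SPARK_LOCAL_MODE : "true",
  SPARK_LOCAL_VERSION : "SPARK_2_1_0",
  SPARK_API_VERSION : "SPARK_200"
}
```

To target a remote cluster instead, set SPARK_LOCAL_MODE to "false" and provide DISTRIBUTION, SPARK_VERSION, SUPPORTED_SPARK_VERSION, SPARK_API_VERSION and SPARK_MODE as described above.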
Properties relevant to configuring the connection to Spark:
Function/parameter | Description | Mandatory? |
---|---|---|
RESOURCE_MANAGER |
Enter the address of the ResourceManager service of the Hadoop cluster to be used. |
Yes when you are using the Yarn client mode. |
SET_SCHEDULER_ADDRESS |
Enter "true" if your cluster possesses a ResourceManager scheduler; otherwise, enter "false". |
Yes when you are using the Yarn client mode. |
RESOURCEMANAGER_SCHEDULER_ADDRESS |
Enter the Scheduler address. |
Yes when you have entered "true" for SET_SCHEDULER_ADDRESS. |
SET_JOBHISTORY_ADDRESS |
Enter "true" if your cluster possesses a JobHistory service; otherwise, enter "false". |
Yes when you are using the Yarn client mode. |
JOBHISTORY_ADDRESS |
Enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server. |
Yes when you have entered "true" for SET_JOBHISTORY_ADDRESS. |
SET_STAGING_DIRECTORY |
Enter "true" if your cluster possesses a staging directory to store the temporary files created by running programs; otherwise, enter "false". |
Yes when you are using the Yarn client mode. |
STAGING_DIRECTORY |
Enter this directory, for example, "\"/user\"". Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution. |
Yes when you have entered "true" for SET_STAGING_DIRECTORY. |
HDINSIGHT_ENDPOINT |
Enter the endpoint of your HDInsight cluster. For example, "\"https://mycluster.azurehdinsight.net\"". |
Yes when you are using the related distribution. |
HDINSIGHT_USERNAME and HDINSIGHT_PASSWORD |
For example, "\"talendstorage\"" as username and "my_password" as password. |
Yes when you are using the related distribution. |
LIVY_HOST |
Enter the host name of the Livy service of your HDInsight cluster. |
Yes when you are using the related distribution, HDInsight. |
LIVY_PORT |
Enter the port number of your Livy service. By default, the port number is "\"443\"". |
Yes when you are using the related distribution, HDInsight. |
LIVY_USERNAME |
Enter your HDInsight username, for example, "\"my_hdinsight_account\"". |
Yes when you are using the related distribution, HDInsight. |
HDINSIGHT_POLLING_INTERVAL_DURATION |
Enter the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job. By default, the time interval is 30000, that is, 30 seconds. |
No. If you don't specify this parameter, the default value is used with the related distribution, HDInsight. |
HDINSIGHT_MAX_MISSING_STATUS |
Enter the maximum number of times the Studio retries to get a status when no status response is received. By default, the number of retries is 10. |
No. If you don't specify this parameter, the default value is used with the related distribution, HDInsight. |
WASB_HOST |
Enter the address of your Windows Azure Storage blob, for example, "\"https://my_storage_account_name.blob.core.windows.net\"". |
Yes when you are using the related distribution, HDInsight. |
WASB_CONTAINER |
Enter the name of the container to be used, for example, "\"talend_container\"". |
Yes when you are using the related distribution, HDInsight. |
REMOTE_FOLDER |
Enter the location in which you want to store the current Job and its dependent libraries in this Azure Storage account, for example, "\"/user/ychen/deployment_blob\"". |
Yes when you are using the related distribution, HDInsight. |
SPARK_HOST |
Enter the URI of the Spark Master of the Hadoop cluster to be used, for example, "\"spark://localhost:7077\"". |
Yes when you are using the Spark Standalone mode. |
SPARK_HOME |
Enter the location of the Spark executable installed in the Hadoop cluster to be used, for example, "\"/usr/lib/spark\"". |
Yes when you are using the Spark Standalone mode. |
DEFINE_HADOOP_HOME_DIR |
If you launch your Job from Windows, it is recommended to specify where the winutils.exe program to be used is stored. If you know the location of your winutils.exe file and want to use it, enter "true"; otherwise, enter "false". |
Yes when you are using a distribution that is not running on cloud. |
HADOOP_HOME_DIR |
Enter the directory where your winutils.exe is stored, for example, "\"C:/Talend/winutils\"". |
Yes when you have entered "true" for DEFINE_HADOOP_HOME_DIR. |
DEFINE_SPARK_DRIVER_HOST |
In the Yarn client mode of Spark, if the Spark cluster cannot automatically recognize the machine from which the Job is launched, enter "true"; otherwise, enter "false". |
Yes when you are using a distribution that is not running on cloud and the Spark mode is Yarn client. |
SPARK_DRIVER_HOST |
Enter the host name or the IP address of this machine, for example, "\"127.0.0.1\"". This allows the Spark master and its workers to recognize this machine and thus find the Job and its driver. Note that in this situation, you also need to add the name and the IP address of this machine to its hosts file. |
Yes when you have entered "true" for DEFINE_SPARK_DRIVER_HOST. |
GOOGLE_PROJECT_ID |
Enter the ID of your Google Cloud Platform project. For example, "\"my-google-project\"". |
Yes when you are using the related distribution. |
GOOGLE_CLUSTER_ID |
Enter the ID of your Dataproc cluster to be used. For example, "\"my-cluster-id\"". |
Yes when you are using the related distribution. |
GOOGLE_REGION |
Enter the geographic region in which the computing resources are used and your data is stored and processed. If you do not need to specify a particular region, enter "\"global\"". |
Yes when you are using the related distribution. |
GOOGLE_JARS_BUCKET |
A Talend Job requires its dependent jar files for execution. Specify the Google Storage directory to which these jar files are transferred so that your Job can access them at execution. The directory to be entered must end with a slash (/). If it does not exist, it is created on the fly, but the bucket to be used must already exist. For example, "\"gs://my-bucket/talend/jars/\"". |
Yes when you are using the related distribution. |
DEFINE_PATH_TO_GOOGLE_CREDENTIALS |
When you launch your Job from a given machine in which Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform, enter "false". In this situation, this machine is often your local machine. When you launch your Job from a remote machine, such as a Jobserver, enter "true". |
Yes when you are using the related distribution. |
PATH_TO_GOOGLE_CREDENTIALS |
Enter the path to the Google credentials JSON file on the remote machine, very often the Jobserver. For example, "\"/user/ychen/my_credentials.json\"". |
Yes when you have entered "true" for DEFINE_PATH_TO_GOOGLE_CREDENTIALS. |
ALTUS_SET_CREDENTIALS |
If you want to provide the Altus credentials with your Job, enter "true". If you want to provide the Altus credentials separately, for example manually using the command altus configure in your terminal, enter "false". |
Yes when you are using the related distribution. |
ALTUS_ACCESS_KEY and ALTUS_SECRET_KEY |
Enter your Altus access key and the directory pointing to your Altus secret key file. For example, "\"my_access_key\"" and "\"/user/ychen/my_secret_key_file\"". |
Yes when you have entered "true" for ALTUS_SET_CREDENTIALS. |
ALTUS_CLI_PATH |
Enter the path to the Cloudera Altus client, which must have been installed and activated in the machine in which your Job is executed. In production environments, this machine is typically a Talend Jobserver. For example, "\"/opt/altuscli/altusclienv/bin/altus\"". |
Yes when you are using the related distribution. |
ALTUS_REUSE_CLUSTER |
Enter "true" to use a Cloudera Altus cluster already existing in your Cloud service. Otherwise, enter "false" to allow the Job to create a cluster on the fly. |
Yes when you are using the related distribution. |
ALTUS_CLUSTER_NAME |
Enter the name of the cluster to be used. For example, "\"talend-altus-cluster\"". |
Yes when you are using the related distribution. |
ALTUS_ENVIRONMENT_NAME |
Enter the name of the Cloudera Altus environment to be used to describe the resources allocated to the given cluster. For example, "\"talend-altus-cluster\"". |
Yes when you are using the related distribution. |
ALTUS_CLOUD_PROVIDER |
Enter the Cloud service that runs your Cloudera Altus cluster. Currently, only AWS is supported, so enter "\"AWS\"". |
Yes when you are using the related distribution. |
ALTUS_DELETE_AFTER_EXECUTION |
Enter "true" if you want to remove the given cluster after the execution of your Job. Otherwise, enter "false". |
Yes when you are using the related distribution. |
ALTUS_S3_ACCESS_KEY and ALTUS_S3_SECRET_KEY |
Enter the authentication information required to connect to the Amazon S3 bucket to be used. |
Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER. |
ALTUS_S3_REGION |
Enter the AWS region to be used. For example "\"us-east-1\"". |
Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER. |
ALTUS_BUCKET_NAME |
Enter the name of the bucket to be used to store the dependencies of your Job. This bucket must already exist. For example "\"my-bucket\"". |
Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER. |
ALTUS_JARS_BUCKET |
Enter the directory in which you want to store the dependencies of your Job in this given bucket, for example, "\"altus/jobjar\"". This directory is created if it does not exist at runtime. |
Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER. |
ALTUS_USE_CUSTOM_JSON |
Enter "true if you need to manually edit JSON code to configure your Altus cluster. Otherwise, enter "false". |
Yes when you are using the related distribution. |
ALTUS_CUSTOM_JSON |
Enter your custom JSON code, for example, "{my_json_code}". |
Yes when you have entered "true for ALTUS_USE_CUSTOM_JSON. |
ALTUS_INSTANCE_TYPE |
Enter the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type. For example, "\"c4.2xlarge\"". |
Yes when you are using the related distribution. |
ALTUS_WORKER_NODE |
Enter the number of worker nodes to be created for the cluster. For example, "\"10\"". |
Yes when you are using the related distribution. |
ALTUS_CLOUDERA_MANAGER_USERNAME |
Enter the username used to connect to your Cloudera Manager service. For example, "\"altus\"". |
Yes when you are using the related distribution. |
SPARK_SCRATCH_DIR |
Enter the directory in which the temporary files, such as the Job dependencies to be transferred, are stored on the local system, for example, "\"/tmp\"". |
Yes. |
STREAMING_BATCH_SIZE |
Enter the time interval (ms) at the end of which the Job reviews the source data to identify changes and processes the new micro batches, for example, "1000". |
Yes when you are developing a Spark Streaming Job. |
DEFINE_DURATION |
If you need to define a streaming timeout (ms), enter "true". Otherwise, enter "false". |
Yes when you are developing a Spark Streaming Job. |
STREAMING_DURATION |
Enter the time frame (ms) at the end of which the streaming Job automatically stops running, for example, "10000". |
Yes when you have entered "true for DEFINE_DURATION. |
SPARK_ADVANCED_PROPERTIES |
Enter the code to use other Hadoop or Spark related properties. For example:
|
No. |
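As an example of connection properties, the sketch below uses the same assumed NAME : "value" pair syntax inside addElementParameters {} to point a Job to an HDInsight cluster, reusing the example values from the table above; the LIVY_HOST value is a hypothetical placeholder:

```
addElementParameters {
  // HDInsight connection, using the example values given in the table above
  HDINSIGHT_ENDPOINT : "\"https://mycluster.azurehdinsight.net\"",
  HDINSIGHT_USERNAME : "\"talendstorage\"",
  HDINSIGHT_PASSWORD : "my_password",
  LIVY_HOST : "\"https://mycluster.azurehdinsight.net\"",   // hypothetical value
  LIVY_PORT : "\"443\"",
  LIVY_USERNAME : "\"my_hdinsight_account\"",
  WASB_HOST : "\"https://my_storage_account_name.blob.core.windows.net\"",
  WASB_CONTAINER : "\"talend_container\"",
  REMOTE_FOLDER : "\"/user/ychen/deployment_blob\""
}
```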
Properties relevant to defining the security configuration:
Function/parameter | Description | Mandatory? |
---|---|---|
USE_KRB |
Enter "true" if the cluster to be used is secured with Kerberos. Otherwise, enter "false". |
Yes |
RESOURCEMANAGER_PRINCIPAL |
Enter the Kerberos principal names for the ResourceManager service, for example, "\"yarn/_HOST@EXAMPLE.COM\"". |
Yes when you are using Kerberos and the Yarn client mode. |
JOBHISTORY_PRINCIPAL |
Enter the Kerberos principal names for the JobHistory service, for example, "\"mapred/_HOST@EXAMPLE.COM\"". |
Yes when you are using Kerberos and the Yarn client mode. |
USE_KEYTAB |
If you need to use a Kerberos keytab file to log in, enter "true". Otherwise, enter "false". |
Yes when you are using Kerberos. |
PRINCIPAL |
Enter the principal to be used, for example, "\"hdfs\"". |
Yes when you are using a Kerberos keytab file. |
KEYTAB_PATH |
Enter the access path to the keytab file itself. This keytab file must be stored in the machine in which your Job actually runs, for example, on a Talend Jobserver. For example, "\"/tmp/hdfs.headless.keytab\"". |
Yes when you are using a Kerberos keytab file. |
USERNAME |
Enter the login user name for your distribution. If you leave it empty, that is to say "\"\"", the user name of the machine in which your Job actually runs will be used. |
Yes when you are not using Kerberos. |
USE_MAPRTICKET |
If the MapR cluster to be used is secured with the MapR ticket authentication mechanism, enter "true". Otherwise, enter "false". |
Yes when you are using a MapR cluster. |
MAPRTICKET_PASSWORD |
Enter the password to be used to log into MapR, for example, "my_password". |
Yes when you are not using Kerberos but are using the MapR ticket authentication mechanism. |
MAPRTICKET_CLUSTER |
Enter the name of the MapR cluster you want to connect to, for example, "\"demo.mapr.com\"". |
Yes when you are using the MapR ticket authentication mechanism. |
MAPRTICKET_DURATION |
Enter the length of time (in seconds) during which the ticket is valid, for example, "86400L". |
Yes when you are using the MapR ticket authentication mechanism. |
SET_MAPR_HOME_DIR |
If the location of the MapR configuration files has been changed to somewhere else in the cluster, that is to say, the MapR Home directory has been changed, enter "true". Otherwise, enter "false". |
Yes when you are using the MapR ticket authentication mechanism. |
MAPR_HOME_DIR |
Enter the new Home directory, for example, "\"/opt/mapr/custom/\"". |
Yes when you have entered "true for SET_MAPR_HOME_DIR. |
SET_HADOOP_LOGIN |
If the login module to be used has been changed in the MapR security configuration file, mapr.login.conf, enter "true". Otherwise, enter "false". |
Yes when you are using the MapR ticket authentication mechanism. |
HADOOP_LOGIN |
Enter the module to be called from the mapr.login.conf file, for example, "\"kerberos\"" means to call the hadoop_kerberos module. |
Yes when you have entered "true for SET_HADOOP_LOGIN. |
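A sketch of the security properties, again assuming the NAME : "value" pair syntax inside addElementParameters {}, for a cluster secured with Kerberos and a keytab file in the Yarn client mode, using the example values from the table above:

```
addElementParameters {
  // Kerberos authentication with a keytab file, Yarn client mode
  USE_KRB : "true",
  RESOURCEMANAGER_PRINCIPAL : "\"yarn/_HOST@EXAMPLE.COM\"",
  JOBHISTORY_PRINCIPAL : "\"mapred/_HOST@EXAMPLE.COM\"",
  USE_KEYTAB : "true",
  PRINCIPAL : "\"hdfs\"",
  KEYTAB_PATH : "\"/tmp/hdfs.headless.keytab\""
}
```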
Properties relevant to tuning Spark:
Function/parameter | Description | Mandatory? |
---|---|---|
ADVANCED_SETTINGS_CHECK |
Enter "true" if you need to optimize the allocation of the resources to be used to run your Jobs. Otherwise, enter "false". |
Yes. |
SPARK_DRIVER_MEM and SPARK_DRIVER_CORES |
Enter the allocation size of memory and the number of cores to be used by the driver of the current Job, for example, "\"512m\"" for memory and "\"1\"" for the number of cores. |
Yes when you are tuning Spark in the Standalone mode. |
SPARK_YARN_AM_SETTINGS_CHECK |
Enter "true" to define the ApplicationMaster tuning properties of your Yarn cluster. Otherwise, enter "false". |
Yes when you are tuning Spark in the Yarn client mode. |
SPARK_YARN_AM_MEM and SPARK_YARN_AM_CORES |
Enter the allocation size of memory and the number of cores to be used by the ApplicationMaster, for example, "\"512m\"" for memory and "\"1\"" for the number of cores. |
Yes when you have entered "true" for SPARK_YARN_AM_SETTINGS_CHECK. |
SPARK_EXECUTOR_MEM |
Enter the allocation size of memory to be used by each Spark executor, for example, "\"512m\"". |
Yes when you are tuning Spark. |
SET_SPARK_EXECUTOR_MEM_OVERHEAD |
Enter "true" if you need to allocate the amount of off-heap memory (in MB) per executor. Otherwise, enter "false". |
Yes when you are tuning Spark in the Yarn client mode. |
SPARK_EXECUTOR_MEM_OVERHEAD |
Enter the amount of off-heap memory (in MB) to be allocated per executor. |
Yes when you have entered "true" for SET_SPARK_EXECUTOR_MEM_OVERHEAD. |
SPARK_EXECUTOR_CORES_CHECK |
If you need to define the number of cores to be used by each executor, enter "true". Otherwise, enter "false". |
Yes when you are tuning Spark. |
SPARK_EXECUTOR_CORES |
Enter the number of cores to be used by each executor, for example, "\"1\"". |
Yes when you have entered "true" for SPARK_EXECUTOR_CORES_CHECK. |
SPARK_YARN_ALLOC_TYPE |
Enter how you want Yarn to allocate resources among executors. Enter one of the following values:
|
Yes when you are tuning Spark in the Yarn client mode. |
SPARK_EXECUTOR_INSTANCES |
Enter the number of executors to be used by Yarn, for example, "\"2\"". |
Yes when you have entered "FIXED" for SPARK_YARN_ALLOC_TYPE. |
SPARK_YARN_DYN_INIT, SPARK_YARN_DYN_MIN and SPARK_YARN_DYN_MAX |
Define the scale of the dynamic allocation by setting these three properties. For example, "\"1\"" as the initial number of executors, "\"0\"" as the minimum number and "\"MAX\"" as the maximum number. |
Yes when you have entered "DYNAMIC" for SPARK_YARN_ALLOC_TYPE. |
WEB_UI_PORT_CHECK |
If you need to change the default port of the Spark Web UI, enter "true". Otherwise, enter "false". |
Yes when you are tuning Spark. |
WEB_UI_PORT |
Enter the port number you want to use for the Spark Web UI, for example, "\"4040\"". |
Yes when you have entered "true" for WEB_UI_PORT_CHECK. |
SPARK_BROADCAST_FACTORY |
Enter the broadcast implementation to be used to cache variables on each worker machine. Enter one of the following values:
|
Yes when you are tuning Spark. |
CUSTOMIZE_SPARK_SERIALIZER |
If you need to import an external Spark serializer, enter "true". Otherwise, enter "false". |
Yes when you are tuning Spark. |
SPARK_SERIALIZER |
Enter the fully qualified class name of the serializer to be used, for example, "\"org.apache.spark.serializer.KryoSerializer\"". |
Yes when you have entered "true" for CUSTOMIZE_SPARK_SERIALIZER. |
ENABLE_BACKPRESSURE |
If you need to enable the backpressure feature of Spark, enter "true". Otherwise, enter "false". The backpressure feature is available in Spark version 1.5 and onwards. With backpressure enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on the current batch scheduling delays and processing times, so that data is received only as fast as it can be processed. |
Yes when you are tuning Spark for a Spark Streaming Job. |
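The tuning properties can be combined in the same assumed way. The sketch below allocates resources for the Yarn client mode with a fixed number of executors, using the example values from the table above:

```
addElementParameters {
  // Resource tuning for the Yarn client mode, fixed executor allocation
  ADVANCED_SETTINGS_CHECK : "true",
  SPARK_YARN_AM_SETTINGS_CHECK : "true",
  SPARK_YARN_AM_MEM : "\"512m\"",
  SPARK_YARN_AM_CORES : "\"1\"",
  SPARK_EXECUTOR_MEM : "\"512m\"",
  SPARK_EXECUTOR_CORES_CHECK : "true",
  SPARK_EXECUTOR_CORES : "\"1\"",
  SPARK_YARN_ALLOC_TYPE : "FIXED",
  SPARK_EXECUTOR_INSTANCES : "\"2\""
}
```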
Properties relevant to logging the execution of your Jobs:
Function/parameter | Description | Mandatory? |
---|---|---|
ENABLE_SPARK_EVENT_LOGGING |
Enter "true" if you need to enable the Spark application logs of this Job to be persistent in the file system of your Yarn cluster. Otherwise, enter "false". |
Yes when you are using Spark in the Yarn client mode. |
COMPRESS_SPARK_EVENT_LOGS |
If you need to compress the logs, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING. |
SPARK_EVENT_LOG_DIR |
Enter the directory in which Spark events are logged, for example, "\"hdfs://namenode:8020/user/spark/applicationHistory\"". |
Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING. |
SPARKHISTORY_ADDRESS |
Enter the location of the history server, for example, "\"sparkHistoryServer:18080\"". |
Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING. |
USE_CHECKPOINT |
If you need the Job to be resilient to failure, enter "true" to enable the Spark checkpointing operation. Otherwise, enter "false". |
Yes. |
CHECKPOINT_DIR |
Enter the directory in which Spark stores, in the file system of the cluster, the context data of the computations such as the metadata and the generated RDDs of this computation. For example, "\"file:///tmp/mycheckpoint\"". |
Yes when you have entered "true" for SET_SPARK_EXECUTOR_MEM_OVERHEAD. |
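For logging and fault tolerance, a sketch in the same assumed syntax that persists the Spark event logs and enables checkpointing, using the example values from the table above:

```
addElementParameters {
  // Persist the Spark event logs and enable checkpointing
  ENABLE_SPARK_EVENT_LOGGING : "true",
  COMPRESS_SPARK_EVENT_LOGS : "false",
  SPARK_EVENT_LOG_DIR : "\"hdfs://namenode:8020/user/spark/applicationHistory\"",
  SPARKHISTORY_ADDRESS : "\"sparkHistoryServer:18080\"",
  USE_CHECKPOINT : "true",
  CHECKPOINT_DIR : "\"file:///tmp/mycheckpoint\""
}
```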
Properties relevant to configuring Cloudera Navigator:
If you are using Cloudera V5.5+ to run your Apache Spark Batch Jobs, you can make use of Cloudera Navigator to trace the lineage of a given data flow and discover how this data flow was generated by a Job.
Function/parameter | Description | Mandatory? |
---|---|---|
USE_CLOUDERA_NAVIGATOR |
Enter "true" if you want to use Cloudera Navigator. Otherwise, enter "false". |
Yes when you are using Spark on Cloudera. |
CLOUDERA_NAVIGATOR_USERNAME and CLOUDERA_NAVIGATOR_PASSWORD |
Enter the credentials you use to connect to your Cloudera Navigator. For example, "\"username\"" as username and "password" as password. |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_URL |
Enter the location of the Cloudera Navigator to connect to, for example, "\"http://localhost:7187/api/v8/\"". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_METADATA_URL |
Enter the location of the Navigator Metadata, for example, "\"http://localhost:7187/api/v8/metadata/plugin\"". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_CLIENT_URL |
Enter the location of the Navigator client, for example, "\"http://localhost\"". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_AUTOCOMMIT |
If you want to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of your Job, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_DISABLE_SSL_VALIDATION |
If you do not want to use the SSL validation process when your Job connects to Cloudera Navigator, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
CLOUDERA_NAVIGATOR_DIE_ON_ERROR |
If you want to stop the execution of the Job when the connection to your Cloudera Navigator fails, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR. |
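A sketch in the same assumed syntax that activates Cloudera Navigator, using the example values from the table above:

```
addElementParameters {
  // Cloudera Navigator lineage settings
  USE_CLOUDERA_NAVIGATOR : "true",
  CLOUDERA_NAVIGATOR_USERNAME : "\"username\"",
  CLOUDERA_NAVIGATOR_PASSWORD : "password",
  CLOUDERA_NAVIGATOR_URL : "\"http://localhost:7187/api/v8/\"",
  CLOUDERA_NAVIGATOR_METADATA_URL : "\"http://localhost:7187/api/v8/metadata/plugin\"",
  CLOUDERA_NAVIGATOR_CLIENT_URL : "\"http://localhost\"",
  CLOUDERA_NAVIGATOR_AUTOCOMMIT : "false",
  CLOUDERA_NAVIGATOR_DISABLE_SSL_VALIDATION : "false",
  CLOUDERA_NAVIGATOR_DIE_ON_ERROR : "false"
}
```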
Properties relevant to configuring Hortonworks Atlas:
If you are using Hortonworks Data Platform V2.4 onwards to run your Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks cluster, you can make use of Atlas to trace the lineage of a given data flow and discover how this data flow was generated by a Job.
Function/parameter | Description | Mandatory? |
---|---|---|
USE_ATLAS |
Enter "true" if you want to use Atlas. Otherwise, enter "false". |
Yes when you are using Spark on Hortonworks. |
ATLAS_USERNAME and ATLAS_PASSWORD |
Enter the credentials you use to connect to your Atlas. For example, "\"username\"" as username and "password" as password. |
Yes when you have entered "true" for USE_ATLAS. |
ATLAS_URL |
Enter the location of the Atlas to connect to, for example, "\"http://localhost:21000\"". |
Yes when you have entered "true" for USE_ATLAS. |
SET_ATLAS_APPLICATION_PROPERTIES |
If your Atlas cluster contains custom properties such as SSL or read timeout, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for USE_ATLAS. |
ATLAS_APPLICATION_PROPERTIES |
Enter a directory on your local machine, then place the atlas-application.properties file of your Atlas in this directory, for example, "\"/user/atlas/atlas-application.properties\"". This enables your Job to use these custom properties. |
Yes when you have entered "true" for SET_ATLAS_APPLICATION_PROPERTIES. |
ATLAS_DIE_ON_ERROR |
If you want to stop the Job execution when Atlas-related issues occur, enter "true". Otherwise, enter "false". |
Yes when you have entered "true" for USE_ATLAS. |
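A sketch in the same assumed syntax that activates Apache Atlas, using the example values from the table above:

```
addElementParameters {
  // Apache Atlas lineage settings
  USE_ATLAS : "true",
  ATLAS_USERNAME : "\"username\"",
  ATLAS_PASSWORD : "password",
  ATLAS_URL : "\"http://localhost:21000\"",
  SET_ATLAS_APPLICATION_PROPERTIES : "true",
  ATLAS_APPLICATION_PROPERTIES : "\"/user/atlas/atlas-application.properties\"",
  ATLAS_DIE_ON_ERROR : "false"
}
```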