tHiveWarehouseInput properties for Apache Spark Batch
These properties are used to configure tHiveWarehouseInput running in the Spark Batch Job framework.
The Spark Batch tHiveWarehouseInput component belongs to the Storage family.
The component in this framework is available in all Talend products with Big Data and Talend Data Fabric.
Basic settings
Property Type |
Select the way the connection details will be set.
|
Hive Storage Configuration | Select the tHiveWarehouseConfiguration component from which you want Spark to use the configuration details to connect to Hive. |
HDFS Storage Configuration |
Select the tHDFSConfiguration component from which you want Spark to use the configuration details to connect to a given HDFS system and transfer the dependent jar files to this HDFS system. This field is relevant only when you are using an on-premises distribution. |
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word "line" when naming the fields. Always use lowercase field names, because the underlying processing may force field names to lowercase. Select the type of schema you want to use from the Schema drop-down list:
Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
|
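The field-naming guidance above can be checked before a schema is defined. The following Python snippet is a minimal sketch, not part of Talend Studio: the helper check_field_names and the sample field names are hypothetical, and the reserved-word set contains only the word named in this section.

```python
# Hypothetical helper, not a Talend API: flags field names that break the
# naming guidance for Spark Job schemas described in this section.
RESERVED_WORDS = {"line"}  # only the reserved word named in this section


def check_field_names(names):
    """Return (name, problem) pairs for names that break the guidance."""
    problems = []
    for name in names:
        if name.lower() in RESERVED_WORDS:
            problems.append((name, "reserved word"))
        elif name != name.lower():
            problems.append((name, "not lowercase"))
    return problems


# Sample field names are illustrative only.
issues = check_field_names(["id", "Line", "customerName", "amount"])
for name, problem in issues:
    print(f"{name}: {problem}")
```

Names that pass both checks, such as id and amount in this sample, are not reported.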
Input source |
Select the type of input data you want tHiveWarehouseInput to read:
For further information about the Hive query language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
Note: Compressed data in Gzip or Bzip2 format can be processed through the query statements. For details, see https://cwiki.apache.org/confluence/display/Hive/CompressedStorage.
Hadoop provides different compression formats that help reduce the space needed for storing files and speed up data transfer. When reading a compressed file, the Studio needs to uncompress it before being able to feed it to the input flow. |
Advanced settings
Register Hive UDF jars |
Add the Hive user-defined function (UDF) jars you want tHiveWarehouseInput to use. You must also define a function alias for each UDF in the Temporary UDF functions table. After you add a row to this table, click it to display the [...] button, then click this button to open the jar import wizard and import the UDF jar files you want to use. A registered function is typically called in the Hive query that you edit in the Hive Query field of the Basic settings view. This Hive Query field is displayed only when you select Hive query from the Input source list. |
Temporary UDF functions |
Complete this table to give each imported UDF class a temporary function name to be used in the Hive query of the current tHiveWarehouseInput component. |
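To illustrate how a temporary function name relates to a UDF class, the following Python snippet builds the standard HiveQL CREATE TEMPORARY FUNCTION statement for each alias and class pair, and shows a query that calls the alias. This is a sketch under stated assumptions: the helper temporary_function_statements, the alias normalize_phone, the class com.example.udf.NormalizePhone, and the table customers are all hypothetical, and this section does not state which statements the component itself issues.

```python
# Hypothetical helper, not a Talend API: shows how each row of the
# Temporary UDF functions table (alias + class) maps to HiveQL.
def temporary_function_statements(udfs):
    """Build one CREATE TEMPORARY FUNCTION statement per (alias, class) pair."""
    return [
        f"CREATE TEMPORARY FUNCTION {alias} AS '{class_name}'"
        for alias, class_name in udfs
    ]


# Alias and class name are illustrative only.
udfs = [("normalize_phone", "com.example.udf.NormalizePhone")]
statements = temporary_function_statements(udfs)

# A Hive query can then call the UDF by its alias (hypothetical table).
query = "SELECT normalize_phone(phone) AS phone FROM customers"

for statement in statements:
    print(statement)
print(query)
```

The alias is what the Hive query refers to. The class name only needs to match a class contained in one of the registered jars.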
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is used as a start component and requires an output link. This component should use a tHiveWarehouseConfiguration component present in the same Job to connect to Hive. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. |