tSqlRow properties for Apache Spark Batch
These properties are used to configure tSqlRow running in the Spark Batch Job framework.
The Spark Batch tSqlRow component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.
When the schema to be reused has default values that are integers or functions, ensure that these default values are not enclosed within quotation marks. If they are, you must remove the quotation marks manually. For more information, see Retrieving table schemas. Click Edit schema to make changes to the schema. If you make changes, the schema automatically becomes built-in.
This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Dynamic schema. The dynamic schema feature is designed to retrieve unknown columns of a table and should be used for this purpose only; it is not recommended for creating tables. With dynamic schema, you can use the tSqlRow component with Spark SQL to read and query complex schemas in Parquet files (containing struct and map types, for example). |
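As an illustration of querying a complex schema, the following Spark SQL sketch reads one field from a struct column and one entry from a map column. The column names (id, address, attributes), the map key, and the input link label row1 are hypothetical and are not taken from this documentation; adapt them to your own Parquet schema.

```sql
-- Hypothetical columns: address is a struct with a city field,
-- attributes is a map keyed by string. row1 is an assumed input link label.
SELECT id,
       address.city        AS city,   -- dot notation reads a struct field
       attributes['color'] AS color   -- bracket notation reads a map entry
FROM row1
```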
SQL context |
Select the query language you want tSqlRow to use.
|
Query |
Enter your query, paying particular attention to the order of the fields so that it matches the schema definition. The tSqlRow component uses the label of its input link to name the registered table that stores the datasets from that input link. For example, if an input link is labeled row1, then row1 is automatically the name of the table on which you can run queries. |
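A minimal sketch of such a query, assuming an input link labeled row1 and a tSqlRow output schema that defines the columns id, name, and total in that order. All column names here are hypothetical; what matters is that the SELECT list follows the same order as the schema.

```sql
-- Assumed output schema order: id, name, total.
-- row1 is the label of the input link, used as the table name.
SELECT id,
       name,
       amount * quantity AS total
FROM row1
WHERE amount > 0
```

If the selected fields were listed in a different order than the schema, values could end up in the wrong output columns, which is why the field sequence deserves attention.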
Advanced settings
Register UDF jars |
Add the Spark SQL or Hive SQL UDF (user-defined function) jars you want tSqlRow to use. If you do not want to call your UDF using its FQCN (Fully-Qualified Class Name), you must define a function alias for this UDF in the Temporary UDF functions table and use this alias. Using an alias is recommended, because it is usually more practical for calling a UDF from the query. Once you add a row to this table, click it to display the [...] button, and then click this button to open the jar import wizard. Use this wizard to import the UDF jar files you want to use. |
Temporary UDF functions |
Complete this table to give each imported UDF class a temporary function name to be used in the query in tSqlRow. If you have selected SQL Spark Context from the SQL context list, the UDF output type column is displayed. In this column, you need to select the data type of the output of the Spark SQL UDF to be used. |
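Once a temporary function name is defined, the query can call it like any built-in function. The sketch below assumes a hypothetical alias normalize_name mapped to an imported UDF class, and hypothetical columns id and name on an input link labeled row1; none of these names come from this documentation.

```sql
-- normalize_name is a hypothetical alias declared in the
-- Temporary UDF functions table; row1 is an assumed input link label.
SELECT id,
       normalize_name(name) AS name
FROM row1
```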
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis. |