Kafka and AVRO in a Job
- The regular Kafka components read and write data in the JSON and AVRO formats. If your Kafka cluster produces or consumes AVRO data, you can use tKafkaInput and tKafkaOutput with Producer and Consumer records, along with a schema registry, in your Standard Job (see the producer sketch at the end of this section).
- The Kafka components in the Spark framework handle data directly in the AVRO format. If your Kafka cluster produces and consumes AVRO data, you can use tKafkaInputAvro to read data directly from Kafka and tWriteAvroFields to send AVRO data to tKafkaOutput.
However, these components cannot handle AVRO data created with the avro-tools library, because avro-tools and the AVRO components use two different approaches that AVRO provides (both illustrated in the sketch after the following list):
- AVRO files are generated with the AVRO schema embedded in each file (via org.apache.avro.file.{DataFileWriter/DataFileReader}). The avro-tools libraries use this approach.
- AVRO records are generated without the schema embedded in each record (via org.apache.avro.io.{BinaryEncoder/BinaryDecoder}). The Kafka components for AVRO use this approach.
This approach is highly recommended when AVRO-encoded messages are written continuously to a Kafka topic, because it avoids the overhead of re-embedding the AVRO schema in every single message. This is a significant advantage when using Spark Streaming to read data from or write data to Kafka: records (messages) are usually small while the AVRO schema is relatively large, so embedding the schema in each message is not cost-effective.
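For illustration, the following minimal sketch contrasts the two approaches using the Apache Avro Java API. The Customer schema, field values, and the customers.avro file name are assumptions made for the example only; they are not part of any component.

```java
// Minimal sketch contrasting the two AVRO serialization approaches.
// The schema, file name, and field values below are illustrative assumptions.
import java.io.ByteArrayOutputStream;
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroApproaches {

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Alice");

        // Approach 1: file container with the schema embedded once per file
        // (org.apache.avro.file.DataFileWriter). This is what avro-tools produces.
        try (DataFileWriter<GenericRecord> fileWriter =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            fileWriter.create(schema, new File("customers.avro"));
            fileWriter.append(record);
        }

        // Approach 2: schema-less binary records (org.apache.avro.io.BinaryEncoder).
        // Only the field values are written; the reader must already know the schema.
        // This is the approach used by the AVRO-specific Kafka components.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        byte[] message = out.toByteArray(); // payload suitable for a Kafka record value
        System.out.println("Encoded record is " + message.length + " bytes");
    }
}
```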
The outputs of the two approaches cannot be mixed in the same read-write process.
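For the schema registry pattern mentioned in the first bullet, the sketch below shows one possible producer written directly against the Kafka client API. It assumes a Confluent-compatible schema registry and the io.confluent.kafka.serializers.KafkaAvroSerializer; the broker address, registry URL, topic name, and schema are placeholders, not values taken from this documentation.

```java
// Minimal producer sketch for the schema-registry pattern, assuming a
// Confluent-compatible registry; broker, registry URL, and topic are placeholders.
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroRegistryProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The serializer registers the schema and prepends only a short schema id,
        // so the full schema is not embedded in every message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customers", "key-1", record));
        }
    }
}
```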