Kafka and AVRO in a Job
- The regular Kafka components read and write the JSON format only. Therefore, if your Kafka cluster produces or consumes AVRO data and, for some reason, the Kafka components for AVRO are not available, you must use an avro-tools library to convert your data between AVRO and JSON outside your Job. For example, you can download the avro-tools-1.8.2.jar library used in this example from the MVN Repository. This command converts the out.avro file to JSON:
java -jar C:\2_Prod\Avro\avro-tools-1.8.2.jar tojson out.avro
This command converts the twitter.json file to twitter.avro using the schema from twitter.avsc:
java -jar avro-tools-1.8.2.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
- The Kafka components for AVRO are available in the Spark
framework only; they handle data directly in the AVRO format. If your Kafka cluster produces
and consumes AVRO data, use tKafkaInputAvro to read data directly from
Kafka and tWriteAvroFields to send AVRO data to
tKafkaOutput.
However, these components do not handle AVRO data created by an avro-tools library, because the avro-tools libraries and the components for AVRO rely on two different approaches provided by AVRO:
- AVRO files are generated with the AVRO schema embedded in each file (via org.apache.avro.file.{DataFileWriter/DataFileReader}). The avro-tools libraries use this approach.
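As an illustration of this file-based approach, the sketch below (assuming the Apache Avro Java library is on the classpath; the single-field User schema is hypothetical) writes a container file whose header carries the schema, then reads it back without supplying the schema again:

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class AvroFileDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical one-field schema, for illustration only
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        // DataFileWriter embeds the schema in the file header
        File out = new File("out.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, out);
            writer.append(user);
        }

        // DataFileReader recovers the schema from the file itself,
        // so no schema is passed to the reader here
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(out, new GenericDatumReader<>())) {
            System.out.println(reader.next().get("name")); // prints "alice"
        }
    }
}
```

Because every file carries its own schema, such files are self-describing, which is convenient for standalone files but adds a fixed per-file overhead.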
- AVRO records are generated without embedding the schema in each record (via
org.apache.avro.io.{BinaryEncoder/BinaryDecoder}). The Kafka
components for AVRO use this approach.
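The record-based approach can be sketched as follows (again assuming the Apache Avro Java library and the same illustrative one-field schema). The encoded bytes contain only the field values, so the reader must be given the writer's schema out of band:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class AvroRecordDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical one-field schema, for illustration only
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        // Encode: only the field values are written, never the schema
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bos, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        byte[] payload = bos.toByteArray();
        // payload is 1 length byte + 5 UTF-8 bytes ("alice"):
        // far smaller than the schema text itself

        // Decode: the reader must already know the writer's schema
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("name")); // prints "alice"
    }
}
```

This is why the payload of each Kafka message stays minimal: the schema is agreed on once by producer and consumer instead of traveling with every record.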
The second approach is highly recommended when AVRO-encoded messages are written continuously to a Kafka topic, because it avoids the overhead of re-embedding the AVRO schema in every single message. This is a significant advantage over the first approach when using Spark Streaming to read data from or write data to Kafka: records (messages) are usually small, while the AVRO schema is relatively large, so embedding the schema in each message is not cost-effective.
The outputs of the two approaches cannot be mixed in the same read-write process.