
Tuesday, September 3, 2019

Spark Kafka integration - instantiating code at the worker side


Best tutorial for Spark and Kafka integration using Scala


Design Patterns for using foreachRDD
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
Writing data to an external system often requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send the data. A developer may inadvertently create the connection object at the Spark driver and then try to use it in a Spark worker to save the records in the RDDs. This does not work, because the connection object would have to be serialized with the closure and shipped to the workers, and connection objects are rarely serializable (and even when they are, they are not valid on another machine).
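A minimal sketch of the mistake (in Scala), assuming a hypothetical createNewConnection() helper:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver; hypothetical helper
  rdd.foreach { record =>
    connection.send(record)  // executed at the worker; fails, the connection is not serializable
  }
}

The fix is to create the connection at the worker side. Opening and closing a connection for every single record is wasteful, so a better pattern is rdd.foreachPartition: create (or borrow) one connection per partition and reuse it for all records in that partition. Reusing connections across batches through a static, lazily initialized connection pool is even better. For example (in Scala),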
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
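Since this post is about the Kafka integration, the same pattern applies when producing results back to Kafka: create the KafkaProducer lazily on the worker and reuse it across partitions and batches instead of building one per record. Below is a minimal sketch, assuming the DStream holds String records and using placeholder values for the broker address (broker1:9092) and the output topic (output-topic); none of these names come from the post.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Lazily created, per-executor producer, shared by all tasks in the same JVM
object KafkaSink {
  private var producer: KafkaProducer[String, String] = _

  def getProducer(brokers: String): KafkaProducer[String, String] = synchronized {
    if (producer == null) {
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      producer = new KafkaProducer[String, String](props)
      sys.addShutdownHook { producer.close() }  // flush and close when the executor JVM exits
    }
    producer
  }
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Runs on the worker: the producer is created once per executor and then reused
    val producer = KafkaSink.getProducer("broker1:9092")  // placeholder broker address
    partitionOfRecords.foreach { record =>
      producer.send(new ProducerRecord[String, String]("output-topic", record))  // placeholder topic
    }
  }
}

Note that the producer buffers and sends records asynchronously, so rely on its internal batching rather than flushing after every record.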


Very important links

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html

http://allaboutscala.com/big-data/spark/

http://allaboutscala.com/









