
Tuesday, September 3, 2019

Spark Kafka integration - instantiating code at the worker side


Best tutorial for Spark and Kafka integration using Scala


Design Patterns for using foreachRDD
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
Writing data to an external system often requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send the data. A developer may inadvertently create the connection object at the Spark driver and then try to use it in a Spark worker to save the records in the RDDs. This does not work, because the connection object would have to be serialized with the closure and shipped to the workers, and connection objects are rarely serializable (and even when they are, they are not valid on another machine).
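A minimal sketch of the mistake (in Scala), assuming a hypothetical createNewConnection() helper:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver; hypothetical helper
  rdd.foreach { record =>
    connection.send(record)  // executed at the worker; fails, the connection is not serializable
  }
}

The fix is to create the connection at the worker side. Opening and closing a connection for every single record is wasteful, so a better pattern is rdd.foreachPartition: create (or borrow) one connection per partition and reuse it for all records in that partition. Reusing connections across batches through a static, lazily initialized connection pool is even better. For example (in Scala),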
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
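Since this post is about the Kafka integration, the same pattern applies when producing results back to Kafka: create the KafkaProducer lazily on the worker and reuse it across partitions and batches instead of building one per record. Below is a minimal sketch, assuming the DStream holds String records and using placeholder values for the broker address (broker1:9092) and the output topic (output-topic); none of these names come from the post.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Lazily created, per-executor producer, shared by all tasks in the same JVM
object KafkaSink {
  private var producer: KafkaProducer[String, String] = _

  def getProducer(brokers: String): KafkaProducer[String, String] = synchronized {
    if (producer == null) {
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      producer = new KafkaProducer[String, String](props)
      sys.addShutdownHook { producer.close() }  // flush and close when the executor JVM exits
    }
    producer
  }
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Runs on the worker: the producer is created once per executor and then reused
    val producer = KafkaSink.getProducer("broker1:9092")  // placeholder broker address
    partitionOfRecords.foreach { record =>
      producer.send(new ProducerRecord[String, String]("output-topic", record))  // placeholder topic
    }
  }
}

Note that the producer buffers and sends records asynchronously, so rely on its internal batching rather than flushing after every record.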


Very important links

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html

http://allaboutscala.com/big-data/spark/

http://allaboutscala.com/









