Best tutorial for Spark and Kafka integration using Scala
Design Patterns for using foreachRDD
dstream.foreachRDD is a
powerful primitive that allows data to be sent out to external systems.
However, it is important to understand how to use this primitive correctly and
efficiently. Some of the common mistakes to avoid are as follows.
Often, writing data to an external system requires creating a connection object
(e.g. a TCP connection to a remote server) and using it to send data to the
remote system. For this purpose, a developer may inadvertently try to create
the connection object at the Spark driver and then use it in a Spark worker to
save the records in the RDDs. This does not work: the connection object would
have to be serialized and shipped from the driver to the workers, and
connection objects are rarely serializable across machines.
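As a rough illustration of the anti-pattern to avoid, here is a minimal sketch;
createNewConnection() is a hypothetical helper standing in for whatever client
setup your sink needs:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver; cannot be shipped to workers
  rdd.foreach { record =>
    connection.send(record)               // executed at the worker
  }
}

Instead, create the connection inside rdd.foreachPartition, so that one
connection is opened per partition on the executor rather than one per record,
and reuse connections across batches through a static connection pool. For
example (in Scala),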
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
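The ConnectionPool object above is only referenced, not defined. Since this
post is about Spark and Kafka integration, here is one possible sketch backed
by a Kafka producer; the broker address (localhost:9092) and the topic name
(output-topic) are placeholder assumptions, and a single KafkaProducer is
shared per executor JVM because the producer is thread-safe:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal sketch of a static, lazily initialized "pool": one shared producer per executor,
// handed out as thin Connection wrappers that write each record to a fixed topic.
object ConnectionPool {

  final class Connection(producer: KafkaProducer[String, String], topic: String) {
    def send(record: String): Unit =
      producer.send(new ProducerRecord[String, String](topic, record))
  }

  private lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder: your Kafka brokers
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  def getConnection(): Connection = new Connection(producer, "output-topic")  // placeholder topic

  def returnConnection(connection: Connection): Unit = ()  // shared producer, nothing to release
}

With a definition along these lines, the foreachRDD snippet above compiles as
written for String records. In a real deployment you would typically also flush
or close the producer on executor shutdown and read the broker list from
configuration rather than hard-coding it.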
Very important links
http://allaboutscala.com/big-data/spark/
http://allaboutscala.com/