File formats
Avro: row-based storage. Good when reads and writes touch whole records (most or all columns at once). Commonly used with Kafka and Druid. Writes are fast; reads for column-oriented analysis are slower.
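A minimal round-trip sketch using the fastavro library (a third-party package, assumed installed with "pip install fastavro"; Apache's own avro package works similarly):

    from fastavro import parse_schema, reader, writer

    # Avro needs an explicit schema; each record is stored as a complete row.
    schema = parse_schema({
        "name": "Event",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "action", "type": "string"},
        ],
    })

    records = [{"user_id": 1, "action": "click"},
               {"user_id": 2, "action": "view"}]

    with open("events.avro", "wb") as out:
        writer(out, schema, records)  # fast: rows are appended sequentially

    with open("events.avro", "rb") as inp:
        for record in reader(inp):    # reads whole rows; no column pruning
            print(record)

Because every record lands on disk as one complete row, appends are cheap, but a query that needs only a single field still has to deserialize each full record.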
Parquet: columnar storage. Reads are fast, writes can be slow. Good for querying; Spark uses it as its default format.
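A minimal sketch with the pyarrow library (an assumption; in Spark the equivalent round trip is df.write.parquet(...) and spark.read.parquet(...)), showing the columnar payoff of reading only the columns a query needs:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table and write it as Parquet (columnar layout on disk).
    table = pa.table({
        "user_id": [1, 2, 3],
        "action": ["click", "view", "click"],
        "latency_ms": [12, 48, 7],
    })
    pq.write_table(table, "events.parquet")

    # The columnar win: load just the needed columns and skip the rest.
    subset = pq.read_table("events.parquet", columns=["user_id", "latency_ms"])
    print(subset)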
ORC: columnar storage. Reads are fast, writes can be slow. Especially used in Hive.
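The same round trip for ORC, again sketched with pyarrow's orc module (an assumption; writing ORC needs a reasonably recent pyarrow, and in practice ORC tables are more often created from Hive itself, e.g. with CREATE TABLE ... STORED AS ORC):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"user_id": [1, 2], "action": ["click", "view"]})
    orc.write_table(table, "events.orc")

    # Like Parquet, ORC supports column pruning on read.
    subset = orc.ORCFile("events.orc").read(columns=["action"])
    print(subset)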
Source: https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Different Hadoop applications also have different affinities for the three file formats. ORC is commonly used with Apache Hive, and since Hive is essentially managed by engineers working for Hortonworks, ORC data tends to congregate in companies that run Hortonworks Data Platform (HDP). Presto is also affiliated with ORC files.
Similarly, Parquet is commonly used with Impala, and since Impala is a Cloudera project, it’s commonly found in companies that use Cloudera’s Distribution of Hadoop (CDH). Parquet is also used in Apache Drill, which is MapR’s favored SQL-on-Hadoop solution; Arrow, the file-format championed by Dremio; and Apache Spark, everybody’s favorite big data engine that does a little of everything.
Avro, by comparison, is the file format often found in Apache Kafka clusters, according to Nexla. Avro is also the favored big data file format used by Druid, the high performance big data storage and compute platform that came out of Metamarkets and was eventually picked up by Yahoo, the Nexla folks say.
What file format you use to architect your big data solution is important, but it’s just one consideration among many. For more information on the differences among ORC, Avro, and Parquet, check out Nexla’s white paper.