Alfresco: Hive custom UDF and the built in functions

Complex data types
Arrays, map , struct

Arrays : similar set of data
map : key value pair.
struct : similar to class. similar to struct in c.
more than one field with various data types. ex : Employee class
logical grouping of data. struct can have another struct, map

Built in functions :
functions given by the framework
UDF : User defined functions
takes one input row and one output row. ex: length , trim, concat , round, floor
UDAF : User defined aggregate functions
works on multiple row outputs a single row
sum(), avg() , count(*)

UDTF : User defined table generating functions
works on single row and outputs multiple rows

explode()
posexplode()
Works only on complex data types.

We can explode array, map

posexplode - gives the position of the element after explode.

select explode(subordinatelist) from table

to have the first column

select manager , explode(subordinatelist) from table will not work

virtual table we get from explode is called as lateral view. We should join the lateral view with the actual table to get all the columns.

--------------------------------------------------
--------------------------------------------------
Load table from the path and create custom UDF

create table if not exists table2(word string,count int) row format delimited
fields terminated by ',' lines terminated by '\n' stored as textfile;

load data local inpath '/home/cloudera/Downloads/rankfunctions.txt' overwrite into table table2;

rankfunctions.txt
John,1500
Albert,1500
Mark,1000
Frank,1150
Loopa,1100
Lui,1300
John,1300
John,900
Lesa,1500
Lesa,900
Pars,800
leo,700
leo,1500
lock,650
Bhut,800
Lio,500

[cloudera@quickstart Downloads]$ vi rankfunctions.txt
[cloudera@quickstart Downloads]$ pwd
/home/cloudera/Downloads
[cloudera@quickstart Downloads]$ ls /home/cloudera/Downloads/rankfunctions.txt
/home/cloudera/Downloads/rankfunctions.txt
[cloudera@quickstart Downloads]$

hive> [cloudera@quickstart Desktop]$
[cloudera@quickstart Desktop]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show databases;
OK
basan
default
Time taken: 0.532 seconds, Fetched: 2 row(s)
hive> use basan
> ;
OK
Time taken: 0.027 seconds
hive> create table if not exists table2(word string,count int) row format delimited
> fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.162 seconds
hive>

Time taken: 0.162 seconds
hive> load data local inpath '/home/cloudera/Downloads/rankfunctions.txt' overwrite into table table2;
Loading data to table basan.table2
Table basan.table2 stats: [numFiles=1, numRows=0, totalSize=153, rawDataSize=0]
OK
Time taken: 0.713 seconds
hive>

Time taken: 0.713 seconds
hive> select * from table2
> ;
OK
John 1500
Albert 1500

Time taken: 0.266 seconds, Fetched: 16 row(s)
hive> select word from table2
> ;
OK
John
Albert
Mark
Frank

Time taken: 0.09 seconds, Fetched: 16 row(s)
hive>

Now download the jar hive-exec-1.2.2.jar
https://mvnrepository.com/artifact/org.apache.hive/hive-exec/1.2.2

set up the project with below code

package udf_example;

import org.apache.hadoop.hive.ql.exec.UDF;

public class DataStandardization extends UDF {

public String evaluate(String input){

if(input==null)
{
return null;
}
return (input.toString().toUpperCase());

}

}

step 5: add the jar in hive
add jar /home/cloudera/Desktop/my_udf.jar;

Register the function with hive

step 6: create a function in hive
create temporary function f1 as 'hiveUDF.DataStandardization';
select f1(word) from table2;

hive> add jar /home/cloudera/Desktop/my_udf.jar;
Added [/home/cloudera/Desktop/my_udf.jar] to class path
Added resources: [/home/cloudera/Desktop/my_udf.jar]

hive> create temporary function f1 as 'hiveUDF.DataStandardization';
OK
Time taken: 0.007 seconds

hive> use basan
> ;
OK
Time taken: 0.034 seconds
hive> select f1(word) from table2;
OK
JOHN
ALBERT
MARK
FRANK
LOOPA
LUI

------------------------------
Creating mountpoint in cloudera
------------------------------
Create the folder in mac/windows which you want to share
put a dummy file

In the cloudera settings , add the mount point.
This mount point can be mapped as below
mount -t vboxsf

[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ cd Desktop/
[cloudera@quickstart Desktop]$ mkdir Mount-virtaul
[cloudera@quickstart Desktop]$ su
Password:
[root@quickstart Desktop]# mount -t vboxsf Mount-virtaul Mount-virtaul
[root@quickstart Desktop]#

Alfresco

Friday, November 22, 2019

Hive custom UDF and the built in functions

No comments:

Post a Comment