Alfresco: HDFS Loading files and creating TEXTFORMAT and ORC files

Saturday, October 19, 2019

HDFS Loading files and creating TEXTFORMAT and ORC files

Create the Text Format table

       

CREATE TABLE IF NOT EXISTS country_input(
    longitide float,
    lattitude float,
    number int,
    street string,
    unit string,
    city string,
    district string,
    region string,
    postcode int,
    id string,
    hash string,
    country string  
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

//loading the table from the file located in the path
load data local inpath '/Users/basan/datasetnew' into table country_input


select * from country_input where country='belgium' and street='Rue Ketels';

       

//Create table in ORC FORMAT

CREATE TABLE IF NOT EXISTS country_input_orc(
    longitide float,
    lattitude float,
    number int,
    street string,
    unit string,
    city string,
    district string,
    region string,
    postcode int,
    id string,
    hash string,
    country string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS orc


ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' is not needed in ORC file format.

For populating into ORC table from the text file
insert into table country_input_orc select * from country_input

 /user/hive/warehouse - location for ORC tables

ORC tables will occupy less disk space and selection will be much faster compared to TEXTFORMAT file. If we want to still make the queries faster then we have to use partitioning and bucketing. For each partition there will be a folder created, and for each bucket there will be a file created under the partitioned folder. Once we define the size of bucket we cannot change, if needs to be changed then we have to create fresh table and import data from old table

       


CREATE TABLE IF NOT EXISTS country_bucket(
   longitide float,
    lattitude float,
    number int,
    street string,
    unit string,
    city string,
    district string,
    region string,
    postcode int,
    id string,
    hash string,
    country string
 )PARTITIONED BY (country string) CLUSTERED BY (street)
 into 32 buckets
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE


We have created table with country partitioning and 32 buckets

//Getting size of the underlaying folders 
 hadoop fs -du -h /user/hive/warehouse/trendytech.db
 hadoop fs -ls /user/hive/warehouse/trendytech.db/country_bucket

For each partition there will be one folder created

 hadoop fs -ls /user/hive/warehouse/trendytech.db/country_bucket/country=estonia
 will list all the files under the partitioned folder

//Getting size of the underlaying folders
hadoop fs -du -h /user/hive/warehouse/trendytech.db
hadoop fs -ls /user/hive/warehouse/trendytech.db/country_bucket

For each partition there will be one folder created

hadoop fs -ls /user/hive/warehouse/trendytech.db/country_bucket/country=estonia
will list all the files under the partitioned folder

Alfresco

Saturday, October 19, 2019

HDFS Loading files and creating TEXTFORMAT and ORC files

No comments:

Post a Comment