Loading and Storing Data : How to Load and Store Data in Apache Pig
Author: BigDataElearning
Uploaded: 2016-09-28
Views: 3490
Description:
Official Website: http://bigdataelearning.com
This video explains how to use the LOAD, STORE and DUMP operators.
Loading data: how to load data from a file into an alias
Storing data: how to write the contents of an alias back into a file
DUMP operator: how to use the DUMP operator for debugging
By the end of this video, you will be able to start working with files by loading their data into a Pig relation.
You will also be able to store the data back into a file, and so persist the processed data.
Loading Data : How to load the data from a file into an alias
---------------------
The LOAD operator loads data from input files.
For example, suppose the input file input.txt holds the following tab-delimited records (shown here as tuples):
(Chris 32 7000)
(peter 30 6000)
(John 34 6500)
We can then load the data with the statement shown below. Here 'A' is the relation.
A relation is like an alias: a name that refers to the loaded data.
A = LOAD 'input.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:int);
Here LOAD is the operator that loads the data from the input file.
input.txt is the input file that the data is loaded from. The file name is specified within single quotes.
When loading data from HDFS, the HDFS path is specified within single quotes in the same way.
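As a sketch of the HDFS case, the same statement can point at an absolute HDFS path or at a full URI. The directory /user/hadoop and the NameNode address namenode:8020 are hypothetical placeholders, not values from the video.

```pig
-- Absolute path, resolved against the default file system
-- (HDFS when Pig runs in MapReduce mode). '/user/hadoop' is a hypothetical directory.
A = LOAD '/user/hadoop/input.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:int);

-- Fully qualified URI. The host name and port are placeholders.
A2 = LOAD 'hdfs://namenode:8020/user/hadoop/input.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:int);
```

A relative name such as 'input.txt' is resolved against the user's HDFS home directory in MapReduce mode, and against the current working directory when Pig is started in local mode (pig -x local).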
USING is the keyword that specifies the load function.
PigStorage('\t') is the load function used when the fields in the input file are delimited by a tab character, written as backslash-t.
Other load functions are also available, such as JsonLoader to load JSON data, TextLoader to load unstructured text files,
and HBaseStorage to load data from an HBase table.
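The following sketch shows how two of these load functions might be used. The file names notes.txt and people.json and the JSON field names are hypothetical, chosen only for illustration.

```pig
-- TextLoader reads each input line as a single chararray field.
-- 'notes.txt' is a hypothetical file name.
lines = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);

-- JsonLoader takes the schema as a string argument.
-- 'people.json' and its field names are hypothetical.
people = LOAD 'people.json' USING JsonLoader('name:chararray, age:int, salary:int');
```

HBaseStorage is left out of this sketch because it depends on a running HBase table; it is referenced by its fully qualified class name, org.apache.pig.backend.hadoop.hbase.HBaseStorage.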
The AS clause specifies the schema for the data. In this statement,
the first field is assigned the chararray data type, and the second and third fields are assigned the int data type.
f1, f2 and f3 are user-defined field names; we can give the fields any names.
The USING clause is optional. When it is omitted, Pig uses PigStorage with a tab delimiter.
Similarly, the AS schema is optional. Without it, the fields have no names, are referenced by position ($0, $1, ...) and default to the bytearray type.
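The optional clauses can be illustrated with a short sketch against the same input.txt. The alias names A1, A2 and names are arbitrary, and FOREACH ... GENERATE is used here only to show positional references.

```pig
-- No USING clause: Pig falls back to PigStorage with a tab delimiter.
A1 = LOAD 'input.txt' AS (f1:chararray, f2:int, f3:int);

-- No AS clause: the fields have no names and default to bytearray,
-- so they are referenced by position ($0, $1, $2).
A2 = LOAD 'input.txt';
names = FOREACH A2 GENERATE $0;
```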
Storing Data : How to store contents of an alias back into a file?
--------------------
The STORE operator saves results to output files.
For example, suppose that after some processing in Pig, the salary field from our previous example has been incremented by 1000 for each tuple and the result is held in alias B:
(Chris 32 8000)
(peter 30 7000)
(John 34 7500)
The contents of alias 'B' can be written out using the STORE operator as shown below.
STORE B INTO 'output.txt' USING PigStorage('|');
This persists the results in the file system. Note that Pig treats 'output.txt' as an output directory and writes the data into part files inside it.
Remember that the contents of an alias are available only during that Pig session. Once you store them in a reliable file system such as HDFS using the STORE command, they persist permanently.
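The example does not show how B is produced. One way, assuming A was loaded with the schema shown earlier (f1, f2, f3), is a FOREACH ... GENERATE that adds 1000 to the third field. This is a sketch, not necessarily the statement used in the video.

```pig
A = LOAD 'input.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:int);

-- Add 1000 to the salary field (f3) of every tuple.
B = FOREACH A GENERATE f1, f2, f3 + 1000 AS f3;

-- Write B using '|' as the field delimiter.
STORE B INTO 'output.txt' USING PigStorage('|');
```

With this delimiter, each stored record is a line of the form Chris|32|8000. The statement fails if the output location already exists, so remove or rename an earlier output before rerunning it.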
DUMP operator
-------------------------
The DUMP operator prints results to the screen, for example when you are working in the Grunt console.
This is especially useful for debugging. Note that the DUMP operator does not persist the results.
The displayed results are available only during that Pig session.
DUMP is very useful for checking that each alias contains the data we expect, especially when running Pig commands line by line in the console.
In our example, alias 'B' holds the values below. We can verify this with the DUMP B; statement, which prints each tuple with its fields separated by commas.
(Chris,32,8000)
(peter,30,7000)
(John,34,7500)
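A line-by-line check in the Grunt console might look like the sketch below. The FOREACH statement that builds B and the alias name preview are assumptions for illustration. LIMIT is added because dumping a whole relation can flood the screen when the input is large.

```pig
A = LOAD 'input.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:int);
DUMP A;  -- check that the file was parsed as expected

B = FOREACH A GENERATE f1, f2, f3 + 1000 AS f3;

-- Print only the first few tuples of B.
preview = LIMIT B 10;
DUMP preview;
```

Pig evaluates statements lazily, so each DUMP triggers execution of the statements needed to build the alias being printed.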
In this video we saw how to:
use the LOAD operator to load the contents of a file into a Pig relation, or alias;
use the STORE operator to store the contents of a Pig relation in HDFS;
use the DUMP operator to debug Pig commands in the Grunt console.