Friday, October 25, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.4

Prerequisites:

  • Copy the hadoop-20 folder to a new hadoop-104 folder (create it manually) in the /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/ directory.
  • Replace the following JARs in the client subfolder with the versions from the Apache Hadoop 1.0.4 distribution:
    • commons-codec-1.4.jar
    • hadoop-core-1.0.4.jar
  • Add the following JAR from the Hadoop 1.0.4 distribution to the client subfolder as well:
    • commons-configuration-1.6.jar
  • Then change the active configuration property in plugin.properties to point to the new folder (see the shell sketch after this list):
    • active.hadoop.configuration=hadoop-104
  • Start Hadoop as the user created during the Hadoop installation. Note: the Hadoop credentials are provided on page 4, step 12.
  • Start PDI
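
For reference, the prerequisite setup can also be scripted from a shell. This is a minimal sketch, not an official procedure: the Hadoop install location (/usr/local/hadoop), the lib/client path inside the shim, and the hduser account are assumptions, so adjust them to match your environment.

    # Clone the existing hadoop-20 shim as the basis for Hadoop 1.0.4
    cd /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin
    cp -r hadoop-configurations/hadoop-20 hadoop-configurations/hadoop-104
    # Swap in the client JARs from the Hadoop 1.0.4 distribution
    # (the client subfolder referenced above; the exact path may differ)
    cd hadoop-configurations/hadoop-104/lib/client
    rm -f commons-codec-*.jar hadoop-core-*.jar
    cp /usr/local/hadoop/lib/commons-codec-1.4.jar .
    cp /usr/local/hadoop/lib/commons-configuration-1.6.jar .
    cp /usr/local/hadoop/hadoop-core-1.0.4.jar .
    # Point the plugin at the new configuration
    cd /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin
    sed -i 's/^active.hadoop.configuration=.*/active.hadoop.configuration=hadoop-104/' plugin.properties
    # Start Hadoop as the user created during installation, then start PDI
    su - hduser -c '/usr/local/hadoop/bin/start-all.sh'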

Transformation [CSV → Hadoop]:

Follow the instructions below to begin creating your transformation.
  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Input node; then, select and drag a CSV file input step onto the canvas on the right.
  • Expand the Big Data node; click and drag a Hadoop File Output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops describe the flow of data between steps in your transformation. To create the hop, click the CSV file input step, press and hold the <SHIFT> key, and draw a line to the Hadoop File Output step.
  • Double click the CSV file input step to open its edit properties dialog box.
  • In the Filename field, click the Browse button and navigate to the input file location.
  • Select the desired input file (e.g., sample.csv).
  • Click the Get fields button to load the columns of the input file, then click OK.
  • Double click the Hadoop File Output step to open its edit properties dialog box.
  • In the Filename field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect to HDFS:
    • Look in – make sure HDFS is selected
    • In Connection,
      • Server – localhost
      • Port – 54310
      • User ID – hduser
      • Password – password
  • Click the Connect button to connect to HDFS; the Open File dialog box appears.
  • Click OK.
  • Append the desired output file name to the path selected in the Filename field.
  • Navigate to the Fields tab, click the Get Fields button to load the columns of the input file, then click OK.
  • Click the Save icon and save the transformation you have created.
  • Click the Run icon in the right panel to execute the transformation.
  • The Execute a Transformation dialog box appears.
  • Note: Local Execution is enabled by default. Select Detailed logging.
  • Click Launch. Once the transformation completes, you can verify the output file in HDFS (see the sketch after this list).
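
If the run succeeds, the rows from sample.csv should now be in the HDFS output file. Here is a minimal command-line check, assuming the hduser account, the NameNode at localhost:54310, and an output file named output.txt under /user/hduser (the output path and file name are assumptions; substitute whatever you entered in the Hadoop File Output step).

    # Run as the Hadoop user
    su - hduser
    # List the output directory and inspect the file written by PDI
    hadoop fs -ls hdfs://localhost:54310/user/hduser/
    hadoop fs -cat hdfs://localhost:54310/user/hduser/output.txt | head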

Transformation [Hadoop → Text File]:

Follow the instructions below to begin creating your transformation.

  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Big Data node; then, select and drag a Hadoop File Input step onto the canvas on the right.
  • Expand the Output node; click and drag a Text file output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops describe the flow of data between steps in your transformation. To create the hop, click the Hadoop File Input step, press and hold the <SHIFT> key, and draw a line to the Text file output step.
  • Double click the Hadoop File Input step to open its edit properties dialog box.
  • In the File or directory field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect to HDFS:
    • Look in – make sure HDFS is selected
    • In Connection,
      • Server – localhost
      • Port – 54310
      • User ID – hduser
      • Password – password
  • Click the Connect button to connect to HDFS; the Open File dialog box appears.
  • Select the desired input file from HDFS and click OK.
  • Click the Add button next to the File or directory field to add the file to the selected files list.
  • Navigate to the Fields tab, click the Get Fields button to load the columns of the input file, then click OK.
  • Double click the Text file output step to open its edit properties dialog box.
  • In the Filename field, click the Browse button and navigate to the location where the output file should be placed.
  • Append the desired output file name to the path selected in the Filename field.
  • Navigate to the Fields tab, click the Get Fields button to load the columns of the input file, then click OK.
  • Click the Save icon and save the transformation you have created.
  • Click the Run icon in the right panel to execute the transformation.
  • Click Launch. When the transformation finishes, check the local output file (see the sketch after this list).
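
As a quick sanity check, the local text file should contain the same rows that were read from HDFS. A minimal sketch, assuming the output was written to /tmp/output.txt and the HDFS source is /user/hduser/sample.csv (both paths are assumptions; use the names you chose in the steps above):

    # Inspect the file written by the Text file output step
    head /tmp/output.txt
    # Compare row counts against the source file in HDFS
    hadoop fs -cat hdfs://localhost:54310/user/hduser/sample.csv | wc -l
    wc -l /tmp/output.txt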
