Channel: Teradata Developer Exchange - Hadoop

Hadoop DFS to Teradata

Short teaser: 
This article discusses how to use a Table UDF to load and use Hadoop DFS data.
Attachment: HDFS_UDF.zip (4.89 KB)

Hadoop systems [1], sometimes called MapReduce, can coexist with the Teradata Data Warehouse, allowing each subsystem to be used for its core strength when solving business problems. Integrating the Teradata Database with Hadoop turns out to be straightforward using existing Teradata utilities and SQL capabilities. There are a few options for directly integrating data from a Hadoop Distributed File System (HDFS) with a Teradata Enterprise Data Warehouse (EDW), including using SQL and FastLoad. This document focuses on using a Table Function UDF to both access and load HDFS data into the Teradata EDW. In our examples, there is historical data already in the Teradata EDW, presumably derived from HDFS, used for trend analysis. We will show examples where the Table Function UDF approach is used to perform inserts or joins from HDFS against the data warehouse.

Hadoop DFS is an open source distributed file system implementation from the Apache Software Foundation [1]. HDFS is designed to run on clusters of nodes built from low-cost hardware. It is not uncommon for the clustered nodes to number in the dozens or hundreds, and sometimes thousands. An HDFS file is chopped into blocks (usually 64 MB), each of which is replicated multiple times across the nodes in the HDFS system for fault tolerance and performance. HDFS is increasingly being used by companies to store large amounts of data, especially by dot-com companies with enormous server farms.

The Table Function UDF approach

In this Table UDF approach, the Table UDF pulls data from HDFS into the EDW. Each Table UDF instance running on an AMP is responsible for retrieving a portion of the HDFS file. Data filtering and transformation can be done by the UDF as the rows are delivered to the SQL processing step. The HDFS file is accessed in parallel by the UDF instances running on the AMPs of the Teradata EDW.

As an example, the following SQL query calls the table UDF named HDFSUDF to load data from an HDFS file named mydfsfile.txt into a table Tab1 in Teradata. In this example, imagine that mydfsfile.txt is a 1 terabyte file spread across fifty Hadoop nodes. We can then use a SQL statement invoking the Table UDF instances to move data from Hadoop into the data warehouse. The UDF sample code is included in the Appendix.

insert into Tab1 SELECT * FROM TABLE(HDFSUDF ('mydfsfile.txt')) AS T1;

 

How it works
Notice that once the Table UDF is written, it is called just like any other UDF. How the data flows from HDFS to Teradata is transparent to the users of this Table UDF. Typically the Table UDF is written to be run by every AMP in a Teradata system when the Table UDF is called in a SQL query. However, we have the choice of writing the Table UDF to run on a single AMP or a group of AMPs when it is called in a SQL query.

When the UDF instance is invoked on an AMP, the Table UDF instance communicates with the NameNode in HDFS, which manages the metadata about mydfsfile.txt. The Hadoop NameNode metadata includes information such as which blocks of the HDFS file are stored and replicated on which nodes. In this example, each UDF instance talks to the NameNode and finds that the total size of mydfsfile.txt is 1 TB. The Table UDF then queries the Teradata Database to discover its own numeric AMP identity and that there are 100 AMPs in total. With these facts, a simple calculation is done by each UDF instance to identify the offset into mydfsfile.txt at which it will start reading data from HDFS.

For any request from the UDF instances to the Hadoop API, the HDFS NameNode identifies which DataNodes in the HDFS system are responsible for returning the requested data. The Table UDF instance running on an AMP receives data directly from those DataNodes in HDFS which hold the requested data blocks. Note that no data from the HDFS file is ever routed through the NameNode; it is all transferred directly node to node for better performance. In the sample program we provide in the Appendix, we simply make the N-th AMP in the system load the N-th portion of the HDFS file. Other types of UDF-to-AMP mapping to HDFS can be used depending on an application's needs.
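The offset calculation itself is small. The following is a minimal sketch (an illustration, not the Appendix code itself), under the fixed-row-size assumption used in the Appendix and with a 0-based AMP index; the InitializeGenCtx method in the Appendix additionally balances the leftover rows across AMPs.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetSketch
{
    public static long startOffset(FileSystem fs, String filename,
                                   int ampIndex, int numAmps, int rowSize) throws IOException
    {
        long fileSize = fs.getFileStatus(new Path(filename)).getLen(); // total bytes in the DFS file
        long totalRows = fileSize / rowSize;                           // fixed-size lines
        long rowsPerAmp = totalRows / numAmps;                         // rows each AMP is responsible for
        return ampIndex * rowsPerAmp * rowSize;                        // first byte this AMP reads
    }
}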

The following figure illustrates this approach.

Issues

The default JVM shipped with Teradata 13 is version 1.5 from IBM. The current Hadoop version is 0.20.0, which requires Java 1.6. Depending on your needs, either JVM version can work. In the first test example, we installed an earlier version of Hadoop (0.18.0), which requires Java 1.5. In the second solution, we downloaded and installed IBM JVM 1.6 on every node in the Teradata system and then used the cufconfig tool to make the Teradata DBMS use the JVM 1.6 version. The following shows the detailed steps:

  • a) Download the IBM ibm-java-x86_64-sdk.rpm package on every node in the Teradata system.
  • b) Install IBM JVM 1.6 on every Teradata node:

       rpm -ivh ibm-java-x86_64-sdk.rpm (which installs under /opt/ibm/java-x86_64-60/jre/)

  • c) Use the following command on any Teradata node (it does not have to be run on all nodes):

            cufconfig -f 1.txt

       1.txt specifies the path of the JVM that the Teradata DBMS should use. It contains the single line shown below:

                JREPath: /opt/ibm/java-x86_64-60/jre/

When deciding what portion of the HDFS file each AMP should load via the Table UDF approach, we must make sure that every byte in the DFS file is read exactly once across all UDF instances. Since each AMP requests data from HDFS by sending the byte offset it should load, we also need to make sure that the last row read by every AMP is a complete line, not a partial line, if the UDF processes the input file line by line. In the example UDF in the Appendix, the input HDFS file to be loaded has a fixed row size; therefore we can easily compute the starting and ending offsets of the bytes each AMP should read. Depending on the input file's format and an application's needs, extra care should be taken in assigning which portion of the HDFS file is loaded by which AMP.
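For files with variable-length lines, one common convention (assumed in the sketch below, and not used by the fixed-row-size Appendix code) is for each reader to skip the partial line at its start offset and to keep reading past its end offset until it finishes the line it started, so that every line is read exactly once across all UDF instances. Single-byte characters and '\n' line terminators are assumed.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FSDataInputStream;

public class LineAlignedReaderSketch
{
    // The reader for the byte range [start, end) owns every line whose first byte is at or
    // before 'end', except the (possibly partial) line at 'start', which the previous reader owns.
    public static void readMyPortion(FSDataInputStream in, long start, long end) throws IOException
    {
        in.seek(start);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "US-ASCII"));
        long pos = start;
        if (start != 0)
        {
            String skipped = reader.readLine();   // the previous reader finishes this line
            if (skipped == null)
                return;
            pos += skipped.length() + 1;          // +1 for the assumed '\n' terminator
        }
        String line;
        while (pos <= end && (line = reader.readLine()) != null)
        {
            pos += line.length() + 1;
            // process 'line' here, e.g. parse its columns and return a row to the database
        }
    }
}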


Joining relational data and HDFS data

Once HDFS data is loaded into Teradata, we can analyze it just like any other data stored in the EDW. More interestingly, however, we can perform integrated BI over relational data stored in Teradata and external data originally stored in HDFS without first creating a table and loading the HDFS data into it, as shown in the following example.

Assume a telecommunications company owns an HDFS file called packets.txt which stores information about network packets and has rows in the format <source_id, dest_id, timestamp>. The source and destination ID fields are TCP/IP addresses used to find spammers and hackers; they tell us who sent a request to which destination. Now assume there is a watchlist table stored in Teradata which stores a list of source IDs to be monitored and used in trend analysis. The following standard SQL query joins the packets.txt file and the watchlist table to find the source IDs in the watchlist table that have sent packets to more than 1 million unique destination IDs.

Select watchlist.source_id, count(distinct (T.dest_id)) as Total
From watchlist, TABLE(packets('packets.txt')) AS T
Where watchlist.source_id = T.source_id
Group by watchlist.source_id
Having Total > 1000000

Note: in the Appendix UDF code example, the TCP/IP addresses were simplified to integers to make the example easier to understand.

Conclusion

Teradata table function UDFs can be used to directly access data in a Hadoop distributed file system. Whether the goal is to load new data into the data warehouse or simply to join the Hadoop data to existing tables to produce a report is up to the programmer. These examples show that we can use the table UDF approach to apply the complex BI available through the SQL engine to both HDFS data and relational data.

References

[1] Hadoop DFS http://hadoop.apache.org/hdfs/


Appendix

This section contains a sample Table UDF Java program and associated BTEQ/SQL statements to illustrate the Table UDF approach discussed in this document. The Java program has been intentionally simplified to focus on HDFS. It reads a DFS file of fixed-size lines containing two integers. The UDF asks Teradata for the number of AMPs, asks the DFS for the size of the DFS file, and makes the N-th AMP read the N-th portion of the DFS file. Detailed comments are provided in the Java program. The sample Java program is also attached as HDFS_UDF.zip.

import com.teradata.fnc.AMPInfo;
import com.teradata.fnc.NodeInfo;
import com.teradata.fnc.Tbl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;

import java.io.*;
import java.sql.SQLException;

/* Demonstration steps:
 * 1. Install and configure Hadoop DFS.
 * 		1.1 This demo has been tested with Hadoop 0.20.0 and Teradata 13.0
 * 		1.2 Get Hadoop from http://hadoop.apache.org/core/releases.html
 * 		1.3 Follow the instructions and tutorials on Hadoop's website (http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html) 
 * 
 * 2. Load data into HDFS. An example command copying a local file to Hadoop is:  $hadoop-0.20.0/bin/hadoop dfs -put mylocal.txt mydfstbl.txt
 *     If you upload the file as root, then the file's DFS path is likely '/user/root/mydfstbl.txt'
 * 3. Hadoop 0.20.0 or higher requires Java 1.6. The JVM included with TD 13.0 is 1.5. If Hadoop 0.20.0 or higher is used,
 *    IBM JVM 1.6 should first be downloaded and installed on every Teradata node. In our testing, the following commands were used to install
 *    IBM JVM 1.6 and make the Teradata DBMS use the newly installed IBM JVM 1.6:
 *      a) Download the IBM ibm-java-x86_64-sdk.rpm package on every node in the Teradata system.
 *      b) Install IBM JVM 1.6 on every Teradata node:
 *             rpm -ivh ibm-java-x86_64-sdk.rpm  (which installs under /opt/ibm/java-x86_64-60/jre/)
 *      c) Use the following command on any Teradata node (it does not have to be run on all nodes):
 *             cufconfig -f 1.txt
 *         1.txt specifies the path of the JVM that the Teradata DBMS should use.
 *         1.txt contains the single line shown below:
 *             JREPath: /opt/ibm/java-x86_64-60/jre/
 *         Use cufconfig -o to display the configuration and check that "JREPath: /opt/ibm/java-x86_64-60/jre/" is in the output.

 * 4. Prepare the .jar file.
 *       $/opt/ibm/java-x86_64-60/bin/javac HDFS_UDF.java     (make sure the Teradata javFnc.jar and hadoop-0.20.0-core.jar
 *       can be found by javac, or explicitly include the two jar files in the javac command)
 *       $/opt/ibm/java-x86_64-60/bin/jar -cf hdfsudf.jar HDFS_UDF.class GenCtx.class
 *
 * 5. Use the following bteq script to set up a test database and install the jar files in the Teradata DBMS (change the directories if your Hadoop installation and Java UDF directories are different).
 */

/* bteq scripts */

/*

 .logon  NodeId/dbc

 CREATE USER testdb AS PERM = 600000000000 PASSWORD = testdb; 
 GRANT CREATE	PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT DROP	PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT EXECUTE	PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT CREATE    PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT CREATE	EXTERNAL PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT ALTER     EXTERNAL PROCEDURE ON testdb TO testdb WITH GRANT OPTION;
 GRANT ALTER     PROCEDURE on testdb TO testdb;
 grant all on testdb to testdb with grant option;
 grant all on SQLJ to testdb with grant option;
 grant all on testdb to dbc with grant option;
 
 
 * 
 .logoff

 
 *
 //For debugging, the following diagnostics should be set. The output files are under /tmp.
 .logon NodeId/testdb;
database testdb;
 diagnostic JAVALANGUAGE on for session;
 diagnostic JAVA32 on for session;
 diagnostic javalogging on for session; 
 diagnostic nocache on for session;
 
call sqlj.install_jar('CJ!/home2/tableudf/hdfsudf.jar', 'hdfsudf', 0); 
Call sqlj.replace_jar('CJ!/home2/tableudf/hdfsudf.jar', 'hdfsudf'); 
call sqlj.install_jar('cj!/hadoop-0.20.0/hadoop-0.20.0-core.jar','newhadoop',0);
call sqlj.install_jar('cj!/hadoop-0.20.0/lib/commons-logging-1.0.4.jar','hadooploggingjar',0); 
  
call SQLJ.ALTER_JAVA_PATH('hdfsudf','(*,newhadoop) (*,hadooploggingjar)');

REPLACE FUNCTION hdfs_udf( filename VARCHAR(250), hdfsname VARCHAR(250) )
RETURNS TABLE (c1 integer, c2 integer)
LANGUAGE JAVA
NO SQL
PARAMETER STYLE JAVA
EXTERNAL NAME 'hdfsudf:HDFS_UDF.GetDFSFileData';
 

CREATE  TABLE testdb.mytab,NO FALLBACK ,
	 NO BEFORE JOURNAL,
	 NO AFTER JOURNAL,
	 CHECKSUM = DEFAULT
	 (
	  c1 integer,
	  c2 INTEGER)
NO PRIMARY INDEX ;

 

insert into mytab SELECT * FROM TABLE (hdfs_udf('/user/root/mydfstbl.txt','hdfs://HDFS_server.mycompany.com:19000')) AS t1; 
 */

/**
 * This class contains a Table UDF function to get data from Hadoop DFS.  
 * 
 * A static cache of context objects is maintained. Only the indices are stored
 * in the TD scratchpad for each running UDF instance.
 */

/**GenCtx stores information necessary to build the next row by HDFS_UDF.GetDFSFileData 
 */
class GenCtx implements Serializable
{

	public int id;   //the index to the cache element which contains the FSDataInputStream used by this AMP.
	public long startpos; // the first byte in the DFS file to be read by this AMP
	public long currentpos;  //the next byte in the DFS file to be read by this AMP in the next round of building a database row
	private long DfsRowsCnt;   //the number of rows that should be read from DFS by this AMP
	private long DfsRowsRead = 0; //the number of rows that have been read from DFS by this AMP
	private int rowsize;     // size of each line in DFS including the last newline character('\n')


	public GenCtx()
	{
	}

	/**
	 * 
	 * @param id
	 *            the index to the cache element which contains the FSDataInputStream used by this AMP.
	 * @param startpos
	 *            the first byte in the DFS file to be read by this AMP
	 * @param DfsRowsCnt
	 *            the number of rows to be retrieved by this AMP
	 * @rowsize   the size of the row 
	 * @throws  
	 */

	public GenCtx(int id, long startpos, long DfsRowsCnt, int rowsize)
	{
		this.id = id;
		this.startpos = startpos;
		this.DfsRowsCnt = DfsRowsCnt;
		currentpos = startpos;
		this.rowsize = rowsize;


	}

	/**Create a database row from reading a line in the DFS file
	 * 
	 * 
	 * @param in
	 *             the FSDataInputStream used by the AMP.
	 * @param c1
	 *            the array containing the first column value to be returned to the DBS
	 * @param c2
	 *            the array containing the second column value to be returned to the DBS
	 * @rowsize    
	 * @throws  IOException
	 */

	public int CreateRow(FSDataInputStream in, int[] c1, int[] c2) throws IOException
	{

		if (DfsRowsRead == DfsRowsCnt) return 0; // No more rows; This AMP has loaded all rows assigned to it.


		in.seek(currentpos);
		BufferedReader bufferIn = new BufferedReader(new InputStreamReader(in));
		String line;

		//read a line from the DFS
		line = bufferIn.readLine();

		//parse the two integers in the line read
		String[] yb = line.split("\\|");
		c1[0] = Integer.parseInt(yb[0].trim());
		c2[0] = Integer.parseInt(yb[1].trim());

		currentpos += rowsize;
		DfsRowsRead++;

		return 1;
	}

}

public class HDFS_UDF
{

	/*We use a static array called cache to store the list of FSDataInputStreams used by AMPs to read from HDFS. We support multiple queries executed in parallel calling the same Table UDF.
	 * Thus we don't want a SQL query calling the UDF on an AMP to overwrite the content of the static cache array used by another SQL query calling the same UDF
	 * on the same AMP.  last_id indicates the last used array element in the cache array. If we reach the end of the array, we come back to the beginning of the array.
	 * Therefore, the total number of supported concurrent queries (all running at the same time) calling the same UDF is max_ids/(#-of-AMPs per node).
	 */
	private static int last_id = 0;   //the last used cell in the cache array
	private static final int max_ids = 1000;

	//The array keeps the list of FSDataInputStream opened by all AMPs to access HDFS. Each AMP uses a FSDataInputStream to access HDFS.
	private static final FSDataInputStream[] cache = new FSDataInputStream[max_ids];

	
	
	private static int ROWSIZE = 25; //size of each line in DFS including the last newline character('\n')

	/**GetDFSFileData is the table UDF that accesses the DFS file and returns rows
	 *
	 * 
	 *@param filename
	 *          the DFS file name
	 *@param hdfsname
	 *          the HDFS system to be accessed (Example, "hdfs://HDFS_server.mycompany.com:19000")
	 *@param c1
	 *            the array containing the first column value to be returned to the DBS
	 * @param c2
	 *            the array containing the second column value to be returned to the DBS            
	 * @throws SQLException
	 * @throws IOException
	 * @throws ClassNotFoundException 
	 */

	public static void GetDFSFileData(String filename, String hdfsname, int[] c1, int[] c2)
			throws SQLException, IOException, ClassNotFoundException
	{
		int status;
		int[] phase = new int[1];
		GenCtx obj;
		Tbl tbl = new Tbl();



		/* make sure the function is called in the supported context */
		switch (tbl.getPhase(phase))
		{
			case Tbl.TBL_MODE_CONST:
				/* depending on the phase decide what to do */
				switch (phase[0])
				{
					case Tbl.TBL_PRE_INIT:

						//HDFS related setup
						Configuration conf = new Configuration();
						conf.setClassLoader(Configuration.class.getClassLoader());

						conf.set("fs.default.name", hdfsname);
						conf.set("user", "root");
						FileSystem fs = FileSystem.get(conf);


						Path inFile = new Path(filename);

						// Check if input is valid
						if (!fs.exists(inFile))
							throw new IOException(filename + "  does not exist");
						if (!fs.isFile(inFile))
							throw new IOException(filename + " is not a file");
						return;

					case Tbl.TBL_INIT:
						/* get scratch memory to keep track of things */

						// Create ID for this particular SQL+AMP instance. 
						int id = getNewId();

						// set up the information needed to build a row by this AMP 
						obj = InitializeGenCtx(filename, hdfsname, id);

						//store the GenCtx object created which will be used to create the first row
						tbl.allocCtx(obj);
						tbl.setCtxObject(obj);

						break;
					case Tbl.TBL_BUILD:

						// Get the GenCtx from the scratch pad from the last time.
						obj = (GenCtx)tbl.getCtxObject();
						int myid = obj.id;
						status = obj.CreateRow(cache[myid], c1, c2);

						if (status == 0)
							throw new SQLException("no more data", "02000");

						tbl.setCtxObject(obj);
						break;

					case Tbl.TBL_END:
						int my_id = ((GenCtx)tbl.getCtxObject()).id;
						cache[my_id].close();
						cache[my_id] = null;

						break;
				}
				return;

			case Tbl.TBL_MODE_VARY:
				throw new SQLException("Table VARY mode is not supported.");

		}

	}


	/**
	 * Given the data path, decide which parts of the DFS file are to be read by this AMP. 
	 *  
	 * 
	 *@param filename
	 *          the DFS file name
	 *@param hdfsname
	 *          the HDFS system to be accessed (Example, "hdfs://HDFS_server.mycompany.com:19000")
	 *@param id
	 *          cache[id] is the FSDataInputStream to be used by this AMP.
	 * @throws SQLException
	 * @throws IOException
	 */
	private static GenCtx InitializeGenCtx(String filename, String hdfsname, int id) throws IOException, SQLException
	{


		//HDFS setup
		Configuration conf = new Configuration();
		conf.setClassLoader(Configuration.class.getClassLoader());
		conf.set("fs.default.name", hdfsname);
		conf.set("user", "root");

		FileSystem fs = FileSystem.get(conf);

		Path inFile = new Path(filename);
		FileStatus fstatus = fs.getFileStatus(inFile);

		FSDataInputStream in = fs.open(inFile);


		/* get the number of AMPs, compute the AMP id of this AMP
		 * The N-th AMP reads the N-th portion of the DFS file
		 */

		AMPInfo amp_info = new AMPInfo();
		NodeInfo node_info = new NodeInfo();

		int[] amp_ids = node_info.getAMPIds();
		int ampcnt = node_info.getNumAMPs(); // the number of AMPs in the Teradata system
		int amp_id = amp_info.getAMPId();  //the id of this AMP

		long size = fstatus.getLen();       //the size of the DFS file
		long totalRowsCnt = size / ROWSIZE;  // the total number of lines in the DFS file

		// some "heavy" AMPs will read one more line than other "light" AMPs. avgRowCnt is the number of lines "light" AMPs will read.
		long avgRowCnt = totalRowsCnt / ampcnt;

		int heavyAMPId = (int)(totalRowsCnt % ampcnt); // the id of the first "heavy" AMP 

		long myRowsCnt2Read = avgRowCnt;  //how many rows this AMP should load
		long[] rowCntAMP = new long[ampcnt];  // this array records how many rows each AMP should load

		for (int k = 0; k < heavyAMPId; k++)
		{
			rowCntAMP[k] = avgRowCnt + 1;
			if (amp_id == amp_ids[k]) myRowsCnt2Read++;
		}
		for (int k = heavyAMPId; k < ampcnt; k++)
			rowCntAMP[k] = avgRowCnt;

		long rowCntBeforeMe = 0; //total number of DFS lines (counting from the beginning of the DFS file) that AMPs before this AMP will read, i.e., the number of DFS lines this AMP should skip
		for (int k = 0; k < ampcnt; k++)
		{
			if (amp_id == amp_ids[k])
			{
				break;
			}
			else
				rowCntBeforeMe += rowCntAMP[k];


		}

		long startpos = rowCntBeforeMe * ROWSIZE; //the first byte in the DFS file this AMP should read.		

		cache[id] = in;
		return new GenCtx(id, startpos, myRowsCnt2Read, ROWSIZE);
	}

	private synchronized static int getNewId()
	{
		last_id = (last_id + 1) % max_ids;
		return last_id;
	}

}

//a sample DFS file contains fixed size 2 columns which can be read by the Table UDF
/*
          1|          1|
          2|          2|
          3|          3|
          4|          4|
          5|          5|
          6|          6|
          7|          7|
          8|          8|
          9|          9|
         10|         10|
         11|         11|
         12|         12|
         13|         13|
         14|         14|
         15|         15|
         16|         16|
         17|         17|
         18|         18|
         19|         19|
         20|         20|
         21|         21|
         22|         22|
         23|         23|
         24|         24|
         25|         25|
         26|         26|
         27|         27|
         28|         28|
         29|         29|
         30|         30|
*/

 


Hadoop MapReduce Connector to Teradata EDW

Short teaser: 
Describes how MapReduce programs access the Teradata EDW using the TeradataDBInputFormat class
Attachment: TeradataDBFormatter.zip (58.69 KB)

Hadoop MapReduce programmers often find that it is more convenient and productive to have direct access from their MapReduce programs to data stored in an RDBMS such as the Teradata Enterprise Data Warehouse (EDW) because:

  1. There is no need to export relational data into a flat file.
  2. There is no need to upload the file into the Hadoop Distributed File System (HDFS).
  3. There is no need to change and rerun the scripts/commands in the first two steps when they need to use different tables/columns in their MapReduce programs.

1. Introduction

Connectivity from a Java or other language program via JDBC and ODBC is well understood by most programmers. But when dealing with Hadoop and MPP databases, two factors outside the domain of connectors become crucial: scale and balance. Hadoop and MPP databases often run on dozens or hundreds of server nodes, consuming tens or hundreds of terabytes per execution. For the shared-nothing architecture to perform at maximum throughput, the processing workload and data must be partitioned across execution threads. Otherwise, one server will have an inordinate amount of work compared to the others, causing the total elapsed time to be slower. Consequently, a Hadoop connector is needed to support parallel efficiency.

In this document we first describe how MapReduce programs (a.k.a. mappers) can have parallel access to the Teradata EDW data using the TeradataDBInputFormat approach discussed in Section 2. In Section 3, we provide a complete example with accompanying source code which can be directly used to access the EDW without any changes required in the data warehouse or Hadoop. The TeradataDBInputFormat class can be directly used by programmers without any changes for many applications. Overall, readers can get the following out of this article:

  1. Architecturally how MapReduce programs get direct and parallel access to the Teradata EDW.
  2. How TeradataDBInputFormat class can be used without changes to Hadoop or the EDW. Step-by-step installation and deployment is included in the example.
  3. How programmers can extend the TeradataDBInputFormat approach for specific application needs.

2. The TeradataDBInputFormat approach

A common approach for a MapReduce program to access relational data is to first use the DBMS export utility to pass SQL answer sets to a local file and then load the local file to Hadoop.

However, there are several use cases where the export and load into HDFS is inconvenient for the programmer. Recognizing the need to access relational data in MapReduce programs, the Apache.org open source project for Hadoop provides the DBInputFormat class library. The DBInputFormat and DBOutputFormat Java class libraries allow MapReduce programs to send SQL queries through the standard JDBC interface to the EDW in parallel. Teradata provides a version of DBInputFormat [3] that will be part of the Cloudera Distribution for Hadoop. Note that Cloudera has a good explanation of DBInputFormat on their website. The TeradataDBInputFormat approach is inspired by, but not based on, the Apache DBInputFormat approach.

DBInputFormat and DBOutputFormat, along with their Teradata versions, are good interfaces for ad hoc medium or small volume data transfers. They make it easy to copy tables from the EDW into HDFS and vice versa. One good use of these interfaces is when a mapper program needs to do table look-ups but has no need to persist the data fetched. These interfaces are not efficient for high volume data transfers, where bulk data movement tools like Teradata Parallel Transporter are more appropriate. In many cases queries and bulk data movement are better optimized inside the database itself. While it's an oversimplification, think of the input and output format class libraries as similar to workloads processed by BTEQ. They are very flexible and useful, but do not support every workload.

2.1 DBInputFormat with JDBC

DBInputFormat uses JDBC to connect to relational databases, typically MySQL or Oracle. The basic idea is that a MapReduce programmer provides a SQL query via the DBInputFormat class. The DBInputFormat class associates a modified SQL query with each mapper started by Hadoop.  Then each mapper sends a query through a standard JDBC driver to the DBMS and gets back a portion of the query results and works on the results in parallel. The DBInputFormat approach is correct because the union of all queries sent by all mappers is equivalent to the original SQL query.

The DBInputFormat approach provides two interfaces for a MapReduce program to directly access data from a DBMS. The underlying implementation is the same for the two interfaces. In the first interface, a MapReduce program provides a table name T, a list P of column names to be retrieved, optional filter conditions C on the table and column(s) O to be used in the Order-By clause, in addition to user name, password and DBMS URL values. The DBInputFormat implementation first generates a “count” query:

SELECT count(*) from T where C

and sends it to the DBMS to get the number of rows (R) in the query result. At runtime, the DBInputFormat implementation knows the number of mappers (M) started by Hadoop and associates the following query Q with each mapper. Each mapper will connect to the DBMS and send Q over JDBC connection and get back the results.

SELECT P FROM T WHERE C ORDER BY O
LIMIT  L                                                               (Q)               
OFFSET X

The above query Q asks the DBMS to evaluate the query

SELECT P FROM T WHERE C ORDER BY O,

but only return L rows starting from offset X. In total, M queries are sent to the DBMS by the M mappers, and they are almost identical except that the values of L and X are different. For the ith mapper (where 1 ≤ i ≤ M − 1), which is not the last mapper, L = ⌊R/M⌋ and X = (i − 1) × ⌊R/M⌋. For the last (Mth) mapper, L = R − (M − 1) × ⌊R/M⌋ and X = (M − 1) × ⌊R/M⌋.

Basically all mappers except the last one will receive an average number of rows and the last mapper will get more rows than other mappers when the total number of rows in the result cannot be evenly divided by the number of mappers.
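To make this concrete, the following is an illustrative sketch (not the actual Apache DBInputFormat source) of how the per-mapper LIMIT (L) and OFFSET (X) values described above can be derived from the total row count R and the number of mappers M, with mappers numbered 1..M.

public class SplitSketch
{
    /** Returns {limit, offset} for the i-th of numMappers mappers over totalRows rows. */
    public static long[] limitAndOffset(long totalRows, int numMappers, int i)
    {
        long avg = totalRows / numMappers;              // rows given to each "average" mapper
        long offset = (i - 1) * avg;                    // rows skipped before this mapper's share
        long limit = (i < numMappers)
                ? avg                                   // every mapper but the last
                : totalRows - (numMappers - 1) * avg;   // the last mapper takes the remainder
        return new long[] { limit, offset };
    }
}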

In the second interface of the DBInputFormat class, a mapper program can provide an arbitrary SQL select query SQ (which could involve multiple tables) whose results are the input to the mappers. The mapper has to provide a count query QC which returns an integer which is the number of rows returned by the query SQ. The DBInputFormat class sends the query QC to the DBMS to get the number of rows (R), and the rest of the processing is the same as described for the first interface.

While this DBInputFormat approach clearly streamlines the process of accessing relational data from products like MySQL, its performance cannot scale. There are several performance issues with the DBInputFormat approach. In both interfaces, each mapper sends essentially the same SQL query to the DBMS but with different LIMIT and OFFSET clauses to get a subset of the relational data. Sorting is required on the DBMS side for every query sent by a mapper because of the ORDER BY clause introduced into each query, even if the program itself does not need sorted input. This is how parallel processing of relational data by mappers is achieved in the DBInputFormat class. Furthermore, the DBMS has to execute as many queries as there are mappers in the Hadoop system, which is not efficient, especially when the number of mappers is large. The above performance issues are especially serious for a parallel DBMS such as Teradata EDW, which tends to have a high number of concurrent queries and larger datasets. Also, the required ordering/sorting is an expensive operation in a parallel DBMS because the rows in a table are not stored on a single node and sorting requires row redistribution across nodes.

DBInputFormat cannot be used to access Teradata EDW since the LIMIT and OFFSET clauses are not in ANSI standard SQL and are not supported by Teradata EDW. However, a newer Apache Hadoop class named DataDrivenDBInputFormat, derived from DBInputFormat, can read input data from a Teradata EDW table. DataDrivenDBInputFormat operates like DBInputFormat; the only difference is that instead of using non-standard LIMIT and OFFSET to demarcate splits, DataDrivenDBInputFormat generates WHERE clauses which separate the data into roughly equivalent shards. DataDrivenDBInputFormat has all of the same performance issues of DBInputFormat discussed above.
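As an illustration (a sketch, not the Apache DataDrivenDBInputFormat source), WHERE-clause shards over a numeric split column could be generated from the column's MIN and MAX values, which DataDrivenDBInputFormat obtains with a bounding query, roughly as follows:

public class WhereClauseShardSketch
{
    public static String[] shards(String splitColumn, long min, long max, int numMappers)
    {
        String[] predicates = new String[numMappers];
        long span = (max - min + numMappers) / numMappers;   // ceiling of (max - min + 1) / numMappers
        for (int i = 0; i < numMappers; i++)
        {
            long lo = min + (long) i * span;
            long hi = Math.min(lo + span, max + 1);
            // half-open ranges [lo, hi) partition [min, max] without overlap
            predicates[i] = splitColumn + " >= " + lo + " AND " + splitColumn + " < " + hi;
        }
        return predicates;
    }
}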

Figure 1 - DBInputFormat

2.2 TeradataDBInputFormat

The Teradata connector for Hadoop, TeradataDBInputFormat, sends the SQL query Q provided by a MapReduce program only once to Teradata EDW. Q is executed only once and the results are stored in a Partitioned Primary Index (PPI) table T [4]. Then each mapper from Hadoop sends a new query Qi which just asks for the ith partition on every AMP. Depending on the number of mappers, the complexity of the SQL query provided by a MapReduce program, and the amount of data involved in the SQL query, the performance of the TeradataDBInputFormat approach can be orders of magnitude better than the DBInputFormat approach, as we have seen in our internal testing.

Now we describe the architecture behind TeradataDBInputFormat. First, the TeradataDBInputFormat class sends the statement P below, built from the query Q provided by the mapper program, to the EDW.

CREATE TABLE T AS (Q) WITH DATA
PRIMARY INDEX ( c1 )                                             (P)   
PARTITION BY (c2 MOD M) + 1

The above statement asks the EDW to evaluate Q and store the results (table layout and data) in a new PPI table T. The hash value of the primary index column c1 of each row in the query results determines which AMP should store that row. Then the value of the partition-by expression determines the physical partition (location) of each row on a particular AMP. This is done using modulo M, which means divide by M and take the remainder.

All rows on the same AMP with the same partition-by value are physically stored together and can be directly and efficiently located by the Teradata optimizer. There are different ways to automatically choose the primary index column and partition-by expression.    

After the query Q is evaluated and the table T is created, each AMP has M partitions numbered from 1 to M (M is the number of mappers started in Hadoop). Then each mapper sends the following query Qi (1 ≤ i ≤ M) to the EDW,

SELECT * FROM T WHERE PARTITION = i                              (Qi)

Teradata EDW will directly locate all rows in the ith partition on every AMP in parallel and return them to the mapper. This operation is done in parallel for all mappers. After all mappers retrieve their data, the table T is automatically deleted. Notice that if the original SQL query just selects data from a base table which is a PPI table, then we do not need to create another PPI table since we can directly use the existing partitions to partition the data each mapper should receive.
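The following is a minimal sketch, using hypothetical helper names, of how the staging PPI table DDL (P) and the per-mapper partition queries (Qi) described above could be built as strings. The actual TeradataDBInputFormat implementation is in the attached TeradataDBFormatter.zip (see IntermediateTableQueryGenerator in the Appendix source listing).

public class PartitionQuerySketch
{
    public static String stagingTableDdl(String tableName, String userQuery,
                                         String piColumn, String partitionColumn, int numMappers)
    {
        return "CREATE TABLE " + tableName + " AS (" + userQuery + ") WITH DATA"
             + " PRIMARY INDEX (" + piColumn + ")"
             + " PARTITION BY (" + partitionColumn + " MOD " + numMappers + ") + 1";
    }

    public static String mapperQuery(String tableName, int i)
    {
        // i is 1-based, matching partitions 1..M created by the DDL above
        return "SELECT * FROM " + tableName + " WHERE PARTITION = " + i;
    }
}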

As mentioned at the beginning of Section 2.2, if the number of mappers is large and the complexity of the SQL query provided by a MapReduce program is high (for example, involving a multi-table join and grouping), the performance of the TeradataDBInputFormat approach can be orders of magnitude better than the DBInputFormat approach. This is because the DBMS has to execute the same user SQL query as many times as there are mappers, sort the results, and send back only a portion of the final results to each mapper. In the TeradataDBInputFormat approach, however, the complex SQL query is executed only once and the results are stored in the multiple partitions of a PPI table, each of which is sent to a different mapper. As mentioned before, the TeradataDBInputFormat approach discussed here does not require any change to the Teradata EDW codebase. We have investigated a few areas where we can significantly improve the performance of the TeradataDBInputFormat approach with new enhancements to Teradata EDW, which we will probably discuss in a separate article.

 

Figure 2 - TeradataDBInputFormat

2.3 Known Issues

Notice that the data retrieved by a MapReduce program via the TeradataDBInputFormat approach or the DBInputFormat approach is not stored in Hadoop after the MapReduce program finishes, unless the MapReduce program intentionally does so. Therefore, if some Teradata EDW data is frequently used by many MapReduce programs, it will be more efficient to copy that data and materialize it in HDFS. One approach to store a Teradata table permanently in Hadoop DFS is to use Cloudera's Sqoop [5], into which we have integrated TeradataDBInputFormat.

One potential issue in the current implementation provided in the Appendix is a possible column name conflict. For example, assume the business query in the DDL P in Section 2.2 is "select * from T1, T2 where T1.a=T2.b" and that T1 and T2 have columns with the same names. Currently the Teradata DBMS will report a column name conflict if we simply create a table to store the above query result. Either the EDW or the TeradataDBInputFormat class could be enhanced to resolve the conflict automatically. For now, the workaround is to rewrite the query so that each output column is explicitly and uniquely named, for example "select T1.a AS t1_a, T2.a AS t2_a, ... from T1, T2 where T1.a=T2.b".

2.4 DBOutputFormat

The DBOutputFormat provided by Cloudera writes to the database by generating a set of INSERT statements in each reducer. The current DBOutputFormat approach, in which multiple reducers send batches of INSERT statements to the DBMS, works with the Teradata EDW without modification. For more detail, please refer to [3].

3. Example using TeradataDBInputFormat

In this section, we first describe the requirements for running our TeradataDBInputFormat class, and then use an example to explain how to use the TeradataDBInputFormat approach.

3.1 Requirements

The TeradataDBInputFormat class is implemented in Java. The enclosed package can be compiled and run in an environment with the following features:

  • Sun JDK 1.6, update 8 or later versions
  • Hadoop version 0.20.2 +228 or later versions
  • Teradata version V2R5 or later release

Note that the Hadoop DFS and the Teradata DBMS may or may not be installed on the same hardware platform.   

You should start by downloading the TeradataDBInputFormat connector.  The JAR file (i.e., TeradataDBInputFormat.jar) should be placed into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines.  It is a good idea to also include it on any server you launch Hadoop jobs from.

3.2 Sample code using TeradataDBInputFormat class to access Teradata EDW data

Table Schema

Assume a MapReduce program needs to access some transaction data stored in a table table_1, defined by the following CREATE TABLE statement:

CREATE TABLE table_1 (
   transaction_id int,
   product_id int,
   sale_date date,
   description varchar(64)
) PRIMARY INDEX(transaction_id);

To access the transaction and product information, a MapReduce program can simply provide a SQL query, like "select transaction_id, product_id from table_1 where transaction_id > 1000".

Configuring the MapReduce job

The following code shows how a MapReduce job using TeradataDBInputFormat class is configured and run.

public class TransactionsMapReduceJob extends Configured implements Tool
{
	private String query = "";
	private String output_file_path = "";
 
	/**
	* Constructor
	*/
	public TransactionsMapReduceJob(final String query_, final String output_)
	{
		query = query_;
		output_file_path = output_;
	}
	 
	@Override
	public int run(String[] args) throws Exception
	{
		Configuration conf = getConf();
		Job myJob = new Job(conf, conf.get("mapreduce.job.name"));
		
		// the following statement is very important!!!
		///1. Set the class as the record reader 
		myJob.getConfiguration().setClass("record.reader.class", TransactionTDBWritable.class, TeradataDBWritable.class);
		
		///2. Store the query	
		TeradataDBInputFormat.setInput(myJob.getConfiguration(), query, TransactionTDBWritable.class);
		
		///3. Specify the input format class 
		myJob.setInputFormatClass(TransactionDBInputFormat.class);
			 
		myJob.setJarByClass(TransactionsMapReduceJob.class);
			
		myJob.setOutputKeyClass(LongWritable.class);
		myJob.setOutputValueClass(LongWritable.class);
			
		myJob.setMapperClass(TransactionMapper.class);
		myJob.setReducerClass(TransactionReducer.class);
			
		myJob.setOutputFormatClass(TextOutputFormat.class);
							
		FileOutputFormat.setOutputPath(myJob, new  Path(output_file_path));
			
		int ret = myJob.waitForCompletion(true) ? 0 : 1;
			
		// clean up ...
		 TeradataDBInputFormat.cleanDB(myJob.getConfiguration());
			
		return ret; 
	}
	 
	public static void main(String[] args) throws Exception
	{
		int res = 0;
		try{
			int args_num = args.length;
		 
			// Assumption 1: The second to last parameter: output file path
			String output_file_path = args[args_num-2];

			// Assumption 2: The last parameter: the query
			String query = args[args_num-1]; 
			 
			Tool mapred_tool = new TransactionsMapReduceJob(query, output_file_path);
			res = ToolRunner.run(new Configuration(), mapred_tool, args);
			 
		} catch (Exception e)
		{
			e.printStackTrace();
		} finally
		{
			System.exit(res);
		}
	}
}

Defining Java Class to represent DBMS data to be used in MapReduce program

The following class TransactionTDBWritable is defined to describe the data from Teradata EDW to be used by the MapReduce program.

public class TransactionTDBWritable implements TDBWritable, Writable
{
	private long transaction_id = 0;
	private long product_id = 0;
	 
	// Static code: the programmer should explicitly declare the attributes (name and type) related to the query. 
	static private List<String> attribute_names = new Vector<String>();
	static private List<String> attribute_types = new Vector<String>();
	static
	{
		// The corresponding query: 
		/// SELECT transaction_id, product_id FROM ... WHERE ...
		//1. for the first item
		attribute_names.add("transaction_id");
		attribute_types.add("int");
		//2. for the second item
		attribute_names.add("product_id");
		attribute_types.add("int");
	}

	/**
	* Default constructor
	*/
	public TransactionTDBWritable(){super();}

	@Override
	public void readFields(ResultSet resultSet) throws SQLException
	{
		transaction_id = resultSet.getLong("transaction_id");
		product_id = resultSet.getLong("product_id");
	}
	 
	…
	 
	@Override
	public void readFields(DataInput in) throws IOException
	{
		transaction_id = in.readLong();
		product_id = in.readLong();
	}
	 
	@Override
	public void write(DataOutput out) throws IOException
	{
		out.writeLong(transaction_id);
		out.writeLong(product_id);
	}
	 
	@Override
	public void addAttrbute(String name, String type)
	{
		attribute_names.add(name);
		attribute_types.add(type);
	}
	  
	@Override
	public List<String> getAttributeNameList()
	{
		return attribute_names;
	}
	 
	@Override
	public List<String> getAttributeValueList()
	{
		return attribute_types;
	} 
}

Note that the TeradataDBInputFormat can be enhanced such that the above class does not need to be manually created since it can be automatically generated by looking at the resulting query’s schema.
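As a sketch of that enhancement (with placeholder connection settings; some JDBC drivers may return null metadata before execution), the attribute names and types could be discovered from standard JDBC metadata rather than hard-coded:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;
import java.util.ArrayList;
import java.util.List;

public class SchemaDiscoverySketch
{
    /** Returns {column name, column type name} pairs for the given query's result schema. */
    public static List<String[]> describe(String jdbcUrl, String user, String password, String query)
            throws Exception
    {
        List<String[]> columns = new ArrayList<String[]>();
        Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
        try
        {
            // Prepare (without executing) the user query and inspect its result metadata.
            PreparedStatement stmt = conn.prepareStatement(query);
            ResultSetMetaData md = stmt.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++)
            {
                columns.add(new String[] { md.getColumnName(i), md.getColumnTypeName(i) });
            }
            stmt.close();
        }
        finally
        {
            conn.close();
        }
        return columns;
    }
}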

A dummy class inheriting from TeradataDBInputFormat<T> is needed:

public class TransactionDBInputFormat extends TeradataDBInputFormat<TransactionTDBWritable>
{
	//NEED DO NOTHING!
	//Transfer the type information of “TransactionTDBWritable” down to TransactionDBInputFormat’s constructor. 
}

Using data in a Mapper

The TeradataDBInputFormat will read from the database and populate the retrieved data to the fields in the TransactionTDBWritable class.  A mapper then receives an instance of the TransactionTDBWritable implementation as its input value and can use the retrieved DBMS data as desired.  The following code simply shows how a mapper has direct access to DBMS data passed to it as an instance of the TransactionTDBWritable class.

public class TransactionMapper extends Mapper<LongWritable, TransactionTDBWritable, LongWritable, LongWritable>
{
	public TransactionMapper(){}
	
	protected void map(LongWritable k1, TransactionTDBWritable v1, Context context)
	{
		try
		{
			context.write(new LongWritable(v1.getTransactionID()), new LongWritable(v1.getProductID()));			
		} catch (IOException e)
		{
			...
		} catch (InterruptedException e)
		{
			...
		}
	}
}

Prepare the configuration properties

To enable the access to the Teradata DB, the mapper program also needs to know the information of the database connection, like DB URL, user account, password and so on. This information can be stored in a property XML file as shown in the following file TeraProps.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

	<!-- Parameters about the Teradata RDBMS -->
	<property>
		<name>teradata.db.url</name>
		<value>127.1.2.1</value>
		<description>
			The URL of the Teradata DB the program is going to interact with
		</description>
	</property>

	<property>
		<name>teradata.db.account</name>
		<value>user1</value>
		<description>The account name</description>
	</property>

	<property>
		<name>teradata.db.password</name>
		<value>b123</value>
		<description>Password</description>
	</property>
	…

</configuration>

Deploy and run

The whole source package should be first compiled and compressed as a jar file, for example, MyMapReduceJob.jar, before it can be run on the Hadoop DFS. Assume that we put the property XML file under the same directory as the jar file. Then, we may start the MapReduce job with the following command:

HADOOP_HOME/bin/hadoop jar MyMapReduceJob.jar TransactionsMapReduceJob -Dmapred.map.tasks=32 -conf TeraProps.xml output.tbl "select transaction_id, product_id from table_1 where transaction_id > 1000"

where:

  1. HADOOP_HOME stands for the Hadoop installation directory
  2. -Dmapred.map.tasks=32 sets the number of map tasks to 32
  3. -conf TeraProps.xml tells where to find the parameters about the Teradata DBMS
  4. the file output.tbl contains the job's output, and
  5. "select transaction_id, product_id from table_1 where transaction_id > 1000" is the user query.

4. Conclusion

MapReduce-related research and development continues to be active and to attract interest from both industry and academia. MapReduce is particularly interesting to parallel DBMS vendors since both MapReduce and Teradata Data Warehouses use clusters of nodes and shared-nothing scale-out technology for large-scale data analysis. The TeradataDBInputFormat approach in this article shows how MapReduce programs can efficiently and directly have parallel access to Teradata EDW data without the external steps of exporting and loading data from Teradata EDW into Hadoop.

5. References

[1] http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/

[2] http://www.cloudera.com

[3] DBInputFormat, http://www.cloudera.com/blog/2009/03/databaseaccess-with-hadoop

[4] Teradata Online Documentation http://www.info.teradata.com

[5] Cloudera Sqoop http://www.cloudera.com/blog/2009/06/introducing-sqoop/

6. Appendix

The TeradataDBInputFormat class is implemented in Java, and the related source code is included in the attached zip file. The source code is composed of three parts:

The core implementation of the TeradataDBInputFormat classes is built based on the PPI strategy in Section 2.2. Five classes are defined:

  • TeradataDBInputFormat
  • TeradataDBSplit
  • TeradataDBRecordReader
  • TeradataDBWritable
  • DummyDBWritable

The generation of the internal intermediate table in Teradata DBMS according to the user query.

  • IntermediateTableQueryGenerator
  • IntermediateTableGenerator

An example to show how to use the TeradataDBInputFormat class.

  • TransactionTDBWritable
  • TransactionDBInputFormat
  • TransactionMapper
  • TransactionReducer
  • TransactionsMapReduceJob

Teradata Studio 14.02 now available

Short teaser: 
Teradata Studio 14.02 now available

Teradata Studio 14.02 is now available for download. Teradata Studio is an administration tool for creating and administering database objects. It can be run on multiple operating system platforms, such as Windows, Linux, and Mac OS X. It is built on top of the Eclipse Rich Client Platform (RCP), which allows Teradata Studio to benefit from the many high quality Eclipse features available while focusing on value-add for the Teradata Database.

What's New:

  • Aster Database Support

Support is provided for Teradata’s Aster Database. Users can create connections for an Aster Database using the Aster Connection Profile Wizard. The Data Source Explorer will display Aster Databases, Schemas, Tables, Views, and Map Reduce Functions. SQL Templates are provided to help users create Aster SQL commands. SQL statements can be executed against an Aster Database, with result set data displayed in the Result Set Viewer and history entries stored in the Teradata SQL History. The Teradata View also supports the display of more detailed information about Aster Database objects.

  • Collect Statistics

The Collect Statistics Wizard provides a user interface to help users perform COLLECT STATISTICS operations on tables and columns in Teradata Database 14.0 and higher. The user can select a table or column in the Data Source Explorer and choose the Collect Statistics option. The Statistics View is opened where the user can specify additional information for statistics collection and choose the columns for single or multi stats collection.

  • Data Lab Support (Smart Load, Copy Table)

Teradata Data Lab provides the DBA the flexibility to provision space within the data warehouse. Teradata Studio provides an option for Data Lab customers to easily load data from an Excel or CSV file (Smart Load) as well as copy tables from their Teradata system into their Data Lab. Tables can be dragged from the Data Source Explorer and dropped in the Data Lab View, invoking the Data Lab Copy Table wizard. The Copy Table wizard will copy the table in addition to the table data to the Data Lab. A filtering option is provided to filter out unwanted columns or certain data within the columns. The Smart Load option is provided by selecting the Tables folder within the Data Lab.

  • Enhanced DDL Generator

The DDL Generator has been enhanced to provide better options and DDL generation for Schemas, Tables, Views, Macros, Indexes, Stored Procedures, User Defined Functions, and User Defined Types.

  • Excel 2007 (.xlsx) Support

Support for Excel 2007 (.xlsx) has been added when exporting table data or result set data to Excel. Previously, only Excel 97-2003 was supported. This allows users to export over 65,536 rows of data to an Excel spreadsheet.

  • Hadoop Data Transfer

Teradata Studio 14.02 provides a feature to easily transfer tables and data from Hadoop to Teradata (Import) and from Teradata to Hadoop (Export). The Hadoop Transfer Perspective allows users to drag Teradata tables from the Data Source Explorer and drop them on the Database node in the Hadoop View and, vice versa, to drag tables from the Hadoop View and drop them on the Tables folder in the Data Source Explorer. Users can also right click on a Teradata table and choose the Data menu option Export to Hadoop, or right click on the Tables folder in the Data Source Explorer and choose the Teradata menu option Import from Hadoop. A dialog is opened to help you map the Teradata columns to Hadoop column types or Hadoop columns to Teradata column types. A transfer job is started and, when completed, its status is displayed in the Transfer History View. Users must create a connection profile for the Hadoop database, providing the profile name, HCatalog hostname, port number, system username, and password.

In order to use the Hadoop Data Transfer feature, users must enable the option by setting the Data Transfer Preference 'Enable Hadoop Views'. Refer to the Teradata Studio Help for more information on this feature.

  • Quick Tour Help Option

When bringing up Teradata Studio for the first time, a Quick Tour slide show is presented to help the user familiarize themselves with the basic features of Teradata Studio.

  • Result Set Viewer Performance Improvement

Performance improvements have been made to the Result Set Viewer to significantly reduce the amount of time it takes to display the result set data.

  • Teradata Load and Export

Teradata Load and Export provides an enhanced load and export feature to quickly load or export the data to and from tables within the Data Source Explorer. The Teradata Load feature will use criteria such as number of rows, column types, and existing table data to determine whether the Teradata JDBC FastLoad option will be used, otherwise JDBC batch operations are performed. Teradata Load and Export support both Excel 97-2003 .xls and Excel 2007 .xlsx file formats.

Filtering has also been added to the Teradata Export wizard. The user can choose to filter out certain columns, as well as filter out certain data within a column.

  • Upgrade Eclipse 3.7.2 and DTP 1.10.1

This release of Teradata Studio has upgraded its version of Eclipse to version 3.7.2 and its version of Eclipse Data Tools Platform to 1.10.1.

Need Help?

For more information on using Teradata Studio, refer to the article, Teradata Studio. To ask questions or discuss issues, refer to the Teradata Studio Forum and post your question.

Online Help can be accessed within Teradata Studio:

  • From the main menu: Help > Help Contents > Teradata Studio
  • Context sensitive help: When a user is in a dialog, they can hit the F1 key to retrieve help text sensitive to where they are within the dialog.

Reference Documentation can be found on the download page or at: www.info.teradata.com

  • Title: Teradata Studio Release Definition Publication ID: B035-2040-122C

Smart Loader for Hadoop

Short teaser: 
Smart Loader for Hadoop gives users the ability to load data between Hadoop and Teradata.

Teradata Studio provides a Smart Loader for Hadoop feature that allows users to transfer data from Teradata to Hadoop, Hadoop to Teradata, and Hadoop to Aster. When transferring between Teradata and Hadoop, the Hadoop Smart Loader uses the Teradata Connector for Hadoop MapReduce Java classes as the underlying technology for data movement. It requires the HCatalog metadata layer to browse the Hadoop objects and uses an Oozie workflow to manage the data transfer. Currently, the Smart Loader for Hadoop feature in Teradata Studio is certified to use the Teradata Connector for Hadoop (TDCH) version 1.3.4, and the Hortonworks and Cloudera distributions of Hadoop. The Teradata Connector for Hadoop needs to be installed on the Hadoop system.

                  

NOTE: You must have the Teradata Connector for Hadoop (TDCH) installed on your Hadoop system. You can download the TDCH version 1.3.4  on the Developer Exchange Download site. You must also download the Configure Oozie script and run it on your Hadoop system. Refer to the Readme on the Teradata Studio download page for instructions on running the Configure Oozie script.

For Hadoop to Aster data transfers, the Smart Loader for Hadoop uses the Aster Map Reduce Function, load_from_hcatalog. The data transfer is initiated from the Aster Database to remotely access the Hadoop System, via SQL-H, and pull the data across.

With bi-directional data loading, users can easily perform ad hoc data transfers between their Teradata, Aster, and Hadoop systems. The Hadoop Smart Loader can be invoked by dragging and dropping a table between the Transfer View and the Data Source Explorer, or by selecting a table in the Data Source Explorer and choosing the option Data>Export Data... or Data>Load Data.... This will invoke the Data Transfer Wizard for you to select the Source or Destination Type.

               

Create Hadoop Connection Profile.

You can create connections to your Hadoop System using the Connection Profile Wizard. The wizard is invoked from the Data Source Explorer by right clicking on the Database Connections folder or selecting the 'New Connection Profile' button from the Data Source Explorer toolbar.

          

There are two options for creating Hadoop Connection Profiles:

  • Hadoop Generic System - The Hadoop Generic System profile supports migrating Hadoop connections from Studio releases prior to Studio 15.10. It is also used to support Cloudera Hadoop connections. Hadoop Generic System connections are created using the WebHCat protocol to connect and discover database and table information. Enter the WebHDFS, WebHDFS Port number, and System Username. This connection requires that the ConfigureOozie script is run on the Hadoop System.

             

  • Hadoop Hortonworks - The Hadoop Hortonworks connection profile provides additional options for connecting to Hortonworks Hadoop systems. It is based on the desired functionality between Studio and your Hadoop System: Knox Gateway (Secure connection), TDCH (Teradata data transfers), JDBC (creating and running SQL), or SQL-H (Hadoop to Aster data transfers). Note that the Knox Gateway option also supports JDBC connections. Click next to enter the Host name, Port number, User name, and Password, if required. The TDCH option is equivalent to the Hadoop Generic System connection profile described above and requires the ConfigureOozie script to be run on the Hadoop System.

                        

Once you have your Hadoop connection profile, you can browse the Hadoop database and table objects in the Data Source Explorer.

You can also run HiveQL SQL commands against your Hadoop system if you have configured a Hadoop JDBC or Knox connection profile.
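
Under the covers, HiveQL statements reach Hadoop through a standard HiveServer2 JDBC connection (Studio embeds the HiveServer2 JDBC driver). As a rough illustration only, and not Studio's actual implementation, the following minimal Java sketch submits a HiveQL statement over such a connection; the host name, port, database, and credentials are placeholder assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host, port, database, and credentials -- adjust for your cluster
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-master.example.com:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {   // any HiveQL statement
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}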

Transfer Tables between Teradata and Hadoop

Before invoking the Hadoop Transfer Wizard (aka Smart Loader for Hadoop), switch to the Data Transfer perspective and choose your Hadoop connection profile in the Transfer View.

There are two ways to invoke the Hadoop Transfer Wizard. One way to transfer a table from Teradata to Hadoop is to drag a Teradata table from the Data Source Explorer and drop it on a Hadoop database in the Transfer View. You can also transfer from Hadoop to Teradata by dragging the table from the Hadoop system in the Transfer View and dropping on a Teradata Database in the DSE.

    

The other way is from the Data Transfer Wizard. Choose the Teradata table from the Data Source Explorer, right click and choose Data>Export.... This will invoke the Data Transfer Wizard. Choose Hadoop as the Destination Type and click Launch. NOTE: You can also choose Data>Load... and Source Type as Hadoop to transfer data from Hadoop to Teradata.

    

This will launch the Hadoop Transfer wizard for you to choose the destination Hadoop system for the transfer. You can transfer the Teradata table as a 'New Table' or the data to an existing table in Hadoop.

    

From either drag and drop or from the Data Transfer Wizard, the Hadoop Transfer wizard will next prompt the user to choose the file options and column mappings. The Hadoop Transfer wizard will attempt to choose Hadoop column types based on the source Teradata column data types. The user can override the destination column type by selecting a new column type from the drop-down list. You can also choose to filter out columns you don't want in the destination Hadoop table by unchecking the column. If you are transferring to an existing Hadoop table, you will need to map the source columns to the destination columns. You will also be given an option to append or replace the table data.

    

Press Finish to complete the Hadoop data transfer and submit a Data Transfer job to perform the data copy. As with a data load, the status of the Data Transfer job is displayed in the Transfer Progress View. When the job has completed, an entry is placed in the Transfer History View.

Transfer Tables from Hadoop to Aster

There are two ways to invoke the Hadoop to Aster Transfer Wizard. One way is to drag a Hadoop table from the Data Source Explorer and drop it on an Aster Database in the Transfer View. You can also invoke the wizard by selecting the Aster Tables folder in the Data Source Explorer, right click and choosing Aster>Data Transfer... option. This will invoke the Data Transfer Wizard for you to select Hadoop as the Source Type.

    

Press the Launch button to launch the Hadoop Table to Aster Wizard. Choose the Hadoop Connection Profile to locate the database and table to transfer. Next, the wizard displays the columns and column types of the Hadoop table. You can filter out columns and select whether a column can contain nulls and whether it is unique. The Hadoop Table to Aster Wizard will only create Aster fact tables.

Press Finish to complete the Hadoop data transfer and submit a Data Transfer job to perform the data copy. As with a data load, the status of the Data Transfer job is displayed in the Transfer Progress View. When the job has completed, an entry is placed in the Transfer History View.

Hadoop Data Transfer Job

A transfer job is created to transfer the data to and from Teradata and Hadoop. You can view the progress of the transfer job in the Transfer Progress View of the Data Transfer perspective. NOTE: With the Oozie workflow, the status of the job is not available until the job has finished. Once the job is complete, an entry is placed in the Transfer History and displayed in the Transfer History View.

Select the entry in the Transfer History and click on the Show Job Output toolbar button to view the output from the Hadoop job transfer.

                     

Help

Teradata Studio provides Help information. Click on Help>Help Contents in the main toolbar. You can also get additional information on the Hadoop Transfer Wizard by clicking the question mark, '?' at the lower left hand corner of the wizard page.

     

Conclusion

Teradata Studio Hadoop Smart Loader provides an ad hoc data movement tool to transfer data between Teradata and Hadoop. It provides a point-and-click GUI where no scripting is required. You can download Teradata Studio and the Teradata Connector for Hadoop from the Teradata Download site. For more information about other Teradata Studio features, refer to the article called Teradata Studio.

 

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Teradata Connector for Hadoop now available

$
0
0
Short teaser: 
Teradata Connector for Hadoop: High-performance bi-directional data movement between TD and Hadoop.
Cover Image: 

This forum has moved to: https://community.teradata.com/t5/Connectivity/Teradata-Connector-for-Hadoop-Now-Available/td-p/17138

Please visit the new forum for updates.

The Teradata Connector for Hadoop (TDCH) is a map-reduce application that supports high-performance parallel bi-directional data movement between Teradata systems and various Hadoop ecosystem components.

Overview

The Teradata Connector for Hadoop (Command Line Edition) is freely available and provides the following capabilities:

o    End-user tool with its own CLI (Command Line Interface).

o    Designed and implemented for the Hadoop user audience. 

o    Provides a Java API, enabling integration by 3rd parties as part of an end-user tool. Hadoop vendors such as Hortonworks, Cloudera, IBM and MapR use TDCH's Java API in their respective Sqoop implementations, which are distributed and supported by the Hadoop vendors themselves. A Java API document is available upon request.

o    Includes an installation script that sets up TDCH so that it can be launched remotely by Teradata Studio's Smart Loader for Hadoop and Teradata DataMover. For more information about these products see: 

·         Smart Loader for Hadoop article

·         Teradata Studio 14.02 now available article 

Need Help? 

For more detailed information on the Teradata Connector for Hadoop, please see the attached Tutorial document as well as the README file in the appropriate TDCH download packages. The download packages are for use on commodity hardware; for Teradata Hadoop Appliance hardware, TDCH is distributed with the appliance. TDCH is supported by Teradata CS in certain situations where the user is a Teradata customer.

Teradata Connector for Hadoop 1.5.1 is now available.

For more information about Hadoop Product Management (PM), Teradata employees can go to Teradata Connections Hadoop PM.

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Big Data - Big Changes

$
0
0
Course Number: 
50782
Training Format: 
Recorded webcast

Behind the Big Data hype and buzzword storm are some fundamental additions to the analytic landscape.

What new tools do you need to deal with the new storm of data ripping up your IT infrastructure? What do these new tools do and what are they not good at? How do you choose among these tools to solve the business challenges your company is facing? And how do you tie all the tools together to make them work as an overall analytic ecosystem?

Presenter: Todd Walter, Chief Technologist – Teradata Corporation

Audience: 
Database Administrator, Designer/Architect
Price: 
$195
Credit Hours: 
2
Channel: 

Teradata Connector for Hadoop 1.0.7 now available

$
0
0
Short teaser: 
Teradata Connector for Hadoop: High-performance bi-directional data movement between TD and Hadoop.
Cover Image: 

The Teradata Connector for Hadoop (TDCH) provides scalable, high performance bi-directional data movement between the Teradata database system and Hadoop system.

Some new features included in this release include:

  1. Added an access lock option for importing data from Teradata to improve concurrency.  If one chooses to use lock-for-access, the import job will not be blocked by other concurrent accesses against the same table. 
  2. Added support for importing data into an existing Hive partitioned table.
  3. Allow a Hive configuration file path to be specified by the -hiveconf parameter, so the connector can access it in either HDFS or a local file system. This feature enables users to run Hive import/export jobs on any node of a Hadoop cluster (see section 8.5 of the README file for more information).
  4. With Teradata Database Release 14.10, a new split.by.amp import method is supported (see section 7.1(d) of the README file for more information).

Some problems fixed in this release include:

  1. Inappropriate exceptions reported from a query-based import job. Only the split.by.partition method supports a query as an import source. A proper exception will be thrown if a non split.by.partition import job is issued with the "sourcequery" parameter. 
  2. An error that occurred when the user account used to start Templeton was different from the user account used by Templeton to run a Connector job. Also fixed: a time-out issue for large data import jobs. For a large-size data import, the Teradata Database may need a long time to produce the results in a spool table before the subsequent data transfer; if this exceeded a mapper's time-out before the data transfer started, the mapper would be killed. With this fix, the mapper is kept alive instead.
  3. A time-out issue for export jobs using internal.fastload. The internal.fastload export method requires synchronization of all mappers at the end of their execution. If one mapper finishes its data transfer earlier than the others, it has to wait for the other mappers to complete their work; if the wait exceeds the time-out of an idle task, the mapper would be killed by its task tracker. With this fix, that mapper is kept alive instead.
  4. Fixed the limitation that required the user to have authorization to create a local directory when executing a Hive job on a node without a Hive configuration (hive-site.xml) file. Before this fix, TDCH needed to copy the file from HDFS to the local file system.
  5. Case-sensitivity problems with the following parameters: "-jobtype", "-fileformat", and "-method". With this fix, values of these parameters are no longer case-sensitive.
  6. Incorrect delimiters used by an export job for Hive tables in RCFileFormat. 

Need Help? 

For more detailed information on the Teradata Connector for Hadoop, please see the Tutorial document in the Teradata Connector for Hadoop Now Available article as well as the README file in the appropriate TDCH download packages. The Tutorial document mainly discusses the TDCH (Command Line Edition). The download packages are for use on commodity hardware; for Teradata appliance hardware, TDCH is distributed with the appliance. TDCH is supported by Teradata CS in certain situations where the user is a Teradata customer.

For more information about Hadoop Product Management (PM), Teradata employees can go to Teradata Connections Hadoop PM.

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Teradata Studio Usage Videos

$
0
0
Additional contributors: 
Cover Image: 

Get a quick start on Teradata Studio and Teradata Studio Express with these usage videos. The overview and connection videos apply to both Teradata Studio and Teradata Studio Express. The others apply to Teradata Studio only: Create Database, Create Table, Move Space, Smart Loader, Copy Objects, and Transfer Data. Sritypriya Verma introduces you to the interface and demonstrates each of the tasks.

To view the videos in full screen, click on the full screen button in the lower-right corner of each video.

NOTE: For security reasons, information such as employee ID numbers and system names have been blurred out in post-recording editing.

Teradata Studio and Teradata Studio Express Overview

Here is an introduction to the Teradata Studio and Teradata Studio Express interface.

  • For full screen mode, double-click the image during playback; press Escape to reduce the window.
  • Skip to the section(s) of your choice using the table of contents on the left.
  • To hide the table of contents, click on the Contents button in the lower right panel:

 

Creating a Connection Profile

This recording shows you how to create a connection to the database in Teradata Studio.

 

 

 

 

The following recordings pertain to Teradata Studio only (not Teradata Studio Express).

How to Create a Database

This recording shows you how to create a database using the Create Database Wizard in Teradata Studio.

 

 

 

How to Create a Table

This recording shows you how to create a table using the Create Table Wizard in Teradata Studio.

 

 

 

 

How to Move Space

This recording shows you how to use the Move Space wizard to move space from one database or user to another in Teradata Studio. Note that you are not moving preallocated space, but you are just adjusting the maximum space that is available to the databases and users for storing table data, indexes, stored procedures, and journals.

 

 

How to Compare Objects

This recording shows you how to compare Teradata objects (such as tables, views, macros, stored procedures, etc.) in two different databases or two different systems. This feature was made available starting in Teradata Studio 15.00.

 

 

How to Transfer Data

This recording shows how to transfer data between Teradata, Aster, Hadoop, and external files.  You will see how to load and export data, copy objects (with or without the source data), and use the Smart Loader (which creates a target table if it does not exist yet, using the source data).  The video also identifies which capabilities were already available in Teradata Studio 14.10, and which were added in Teradata Studio 15.00.

  • For full screen mode, double-click the screen image during playback; press Escape to reduce the window.
  • Skip to the section(s) of your choice using the table of contents on the left.
  • To hide the table of contents, click on the Contents button in the lower right panel:

 

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Monitoring ETL applications with Unity Ecosystem Manager

$
0
0
Short teaser: 
Monitoring unstructured data applications with Unity Ecosystem Manager
Cover Image: 

The Analytical Ecosystem can be quite complex. It usually consists of multiple managed servers containing instances of databases, ETL servers, infrastructure services and application servers. Monitoring and managing applications in this environment can be a very challenging task. Knowing at any moment the state of your hardware, how your applications are doing, how many jobs have finished successfully, how many have failed and why they have failed are the types of questions database administrators typically ask themselves. Now, with the addition of Hadoop infrastructure components within an ecosystem, monitoring has become even harder. Unity Ecosystem Manager helps users answer those questions and perform any necessary maintenance tasks.

Environment

Today a job can consist of steps that process both structured and semi-structured data at the same time. Imagine a user application that needs to process constantly arriving web logs, extract critical user data and insert it into a database.  The user needs to move files into HDFS, run a Map-Reduce job to extract the data, convert it into a relational DB format and finally insert the data into a database table.

The 14.10 release of Unity Ecosystem Manager allows monitoring of all aspects of a user application including jobs, tables, hardware (servers) and software (daemons). In its latest version, Unity Ecosystem Manager provides a unified view of the entire Analytical Ecosystem in a single web interface. The new Ecosystem Explorer portlet deployed on Teradata Viewpoint shows user applications, jobs, tables, servers and daemons, with the ability to see all application dependencies at the same time.

A user can configure and monitor a web log processing application from the Ecosystem Manager user interface. Here’s a sample configuration for a web log processing application viewed from the Ecosystem Manager Explorer Application perspective. Using the dependency buttons, a user can view an application and its dependencies within a single view:

The Unity Ecosystem Manager SendEvent API supports the ability to pass information about a job, table, server or application to the Ecosystem Manager repository. This simple mechanism allows users to monitor the operational aspects of these components so that they can perform necessary management tasks. At the same time, Ecosystem Manager can self-discover many pieces of the Analytical Ecosystem infrastructure automatically. For example, installing the Ecosystem Manager client on a Linux box will automatically send information about the server, as well as Teradata ETL jobs such as TPT or Data Mover processes, to the Ecosystem Manager server.

How to track a Hadoop Job

In order to track Hadoop jobs, communication with the JobTracker daemon needs to be established. Normally the configuration for the JobTracker resides in /etc/hadoop/conf/mapred-site.xml. Here’s a sample entry containing the port on which the daemon listens for requests:

<property>
    <name>mapred.job.tracker</name>
    <value>hortonworks-dev-training.localdomain:50300</value>
</property>

A simple Java program can connect to the JobTracker via an interface called JobClient, which permits getting information about all jobs. Here’s the javadoc for the Hadoop JobClient API: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/JobClient.html. The getAllJobs() method returns an array of job statuses, and from there it is possible to get individual task reports, which contain data such as map and reduce task finish times. The JobStatus object contains the job name, start time and run state (RUNNING, SUCCEEDED, FAILED, PREP and KILLED).

    Configuration conf = new Configuration();              // Hadoop job configuration
    conf.set("mapred.job.tracker", "http://153.64.26.135:50300");
    JobClient client = new JobClient(new JobConf(conf));   // connect to the JobTracker
    JobStatus[] jobStatuses = client.getAllJobs();          // statuses of all known jobs
    for (JobStatus jobStatus : jobStatuses)
         System.out.println("JobID: " + jobStatus.getJobID().toString() + ", status: " + getEnumStatus(jobStatus.getRunState()));
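
The snippet above calls a getEnumStatus helper that is not shown, and the text mentions per-task reports; the following is a minimal sketch of both, assuming the same JobClient and jobStatuses as above and the org.apache.hadoop.mapred API (the helper itself is hypothetical, mapping JobStatus run-state codes to names):

    // Hypothetical helper: map JobStatus run-state codes to readable names
    static String getEnumStatus(int runState) {
        switch (runState) {
            case JobStatus.RUNNING:   return "RUNNING";
            case JobStatus.SUCCEEDED: return "SUCCEEDED";
            case JobStatus.FAILED:    return "FAILED";
            case JobStatus.PREP:      return "PREP";
            case JobStatus.KILLED:    return "KILLED";
            default:                  return "UNKNOWN";
        }
    }

    // Per-task reports (org.apache.hadoop.mapred.TaskReport) expose map/reduce finish times
    for (JobStatus jobStatus : jobStatuses) {
        for (TaskReport report : client.getMapTaskReports(jobStatus.getJobID()))
            System.out.println("Map task " + report.getTaskID() + " finished at " + report.getFinishTime());
    }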

Having obtained this critical job metadata enables the user to track the job and make necessary management decisions, such as stopping one of the related jobs or changing an application state. To execute intelligent decisions about jobs, a user needs to send the job metadata to the Ecosystem Manager repository via the SendEvent interface.

    SendEvents se = new SendEvents();
    if (jobStatus.getRunState() == 1)   // RUNNING
        se.execute(getHostName(), jobStatus.getJobID().toString(), "START");

In order to send events to an EM server from a Java program, the user needs to use the Ecosystem Manager SendEvent Java API. The steps include installing the EM client software (agent and publisher packages) and running the service configuration script:

/opt/teradata/client/em/bin/emserviceconfig.sh JAVA_HOME Primary_EM_Repository Secondary_EM_Repository Ecosystem_Name Server_Type

After configuring the Ecosystem Manager services, a user can compile the job tracking Java program together with the SendEvent API calls. To do that, the user needs to add the following JAR files, containing the Hadoop and Ecosystem Manager libraries, to the Java CLASSPATH environment variable:

/usr/lib/hadoop/hadoop-core.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-lang-2.4.jar:/usr/lib/hadoop/lib/jetty-6.1.26.jar:/usr/lib/hadoop/lib/jetty-util-6.1.26.jar:$EM_HOME/lib/messaging-api.jar:$EM_HOME/lib/em.jar

A sample statement to compile the program:

/opt/teradata/jvm64/jdk6/bin/javac -cp $CLASSPATH TestTracker.java

A sample statement to run the program:

/opt/teradata/jvm64/jdk6/bin/java -cp /usr/lib/hadoop/hadoop-core.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-lang-2.4.jar:/usr/lib/hadoop/lib/jetty-6.1.26.jar:/usr/lib/hadoop/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/teradata/client/em/lib/log4j-1.2.9.jar:$CLASSPATH TestTracker

The job tracking Java program runs as a daemon, so there is a while(true) loop in the main method that can be terminated only with a kill command or a system exit call:

public static void main(String[] args) throws Exception {
    while (true) {
        run(args);
        Thread.sleep(4000);
    }
}

Once started, it continues to scan the Hadoop JobTracker and send messages about jobs.

Here’s a sample set of START events that are issued in order to see the application and its related components (jobs, tables and servers). Please note that the START events will have to be followed by END events in order to see the jobs completed:

sendevent --et START --jid CopyLogFilesJob --tds sdll7949.labs.teradata.com  -t MLOAD --sc CONT -w WLAUOWID1 --app WebLogApp

sendevent --et START --jid MRExtractJob --tds sdll7949.labs.teradata.com  -t EXTRACT --sc CONT -w WLAUOWID2 --app WebLogApp

sendevent --et START --jid InsertJob --tds sdll6128  -t MLOAD --sc CONT -w WLAUOWID1 --app WebLogApp --wdb UserDB --wtb UserTB -v 50

User interface

While the jobs are running, and even after they finish, the user can review their statuses using the Ecosystem Manager user interface. This picture shows how the user can see the job execution report and the server metric report sharing the same timeline. This gives the user a unique opportunity to debug jobs and answer questions such as why a job failed or missed its SLA:

From the Jobs view, the user can drill down to see job events.

Finally, the user can view the table into which the obtained data (50 rows in this example) is inserted:

Conclusion

With Unity Ecosystem Manager’s monitoring capabilities it is easy to make sense of a complicated Analytical Ecosystem. Specifically, this article shows how to monitor different ETL jobs, whether Hadoop jobs or other types. It also demonstrates the new Unity Ecosystem Manager user interface, including the ability to see all components of an Analytical Ecosystem (applications, jobs, tables and servers) in a single integrated view.

 

 

 

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Introduction to Teradata Aster Discovery Platform

$
0
0
Course Number: 
51326
Training Format: 
Recorded webcast

The Teradata® Aster® Discovery Platform is the industry’s first next-generation, integrated discovery platform that provides powerful insights from big data.

This course provides an overview of key capabilities in Aster Discovery Platform that enable multiple analytics on all data with speed and minimal effort. The course discusses business use cases and how Aster Discovery Platform is an integral component of the Teradata Unified Data Architecture, creating new value and unmatched competitive advantage for organizations.

Presenter: Ashish Chopra - Teradata Corporation

Price: 
$195
Credit Hours: 
1
Channel: 

When to Use Hadoop, When to use the Data Warehouse

$
0
0
Course Number: 
50373
Training Format: 
Recorded webcast

In some areas, big data tools, such as Hadoop and HBase, overlap with relational databases and data warehouses. Putting the hyperbole aside, how should you choose when both systems provide analytic value?

This session explores the clear differentiators between Hadoop and the data warehouse, helping you decide when to use which software for a task. Then it gets murky because multiple requirements must be considered before making a choice about where to host an analytic task. Digging into those requirements helps shine some light on the best decision. We examine Hadoop as the repository and refinery, the data warehouse for integrated data, and then delve into use cases. Some of the use cases include temporary data, sand boxes, consumer churn and recommendation engines. The session ends with a quick look at Teradata Aster.

Note: This was a 2013 Teradata Partners Conference session.

Presenter: Dan Graham – Teradata Corporation

Audience: 
Database Administrator, Designer/Architect, Application Developer, Data Analyst
Price: 
$195
Credit Hours: 
1
Channel: 

YARN and Tez

$
0
0
Course Number: 
51457
Training Format: 
Recorded webcast

This session dives into the new Hadoop architectural constructs called YARN and Tez.

Use of YARN to allocate memory and CPU, coupled with launching applications, is explained. Comparisons of TASM and Aster SNAP to YARN are explored. Tez reuse of memory buffers for queries and YARN containers is explained.

Presenter: Dan Graham - Teradata Corporation

Audience: 
Database Administrator, Designer/Architect, Application Developer, Data Analyst
Price: 
$195
Credit Hours: 
1
Channel: 

Hadoop Security 101

$
0
0
Course Number: 
51225
Training Format: 
Recorded webcast

Processes for integrating security controls were underdeveloped in early versions of Hadoop because data protection needs in its ecosystem were not well-understood.

Until recently, many considered Hadoop security inadequate. More robust security controls are now available for user identification with Kerberos authentication, LDAP authorization, and end-user audit accountability on Hadoop systems. This session explains the new identification, authentication, authorization and audit security controls in Hortonworks Enterprise Apache Hadoop.

Presenter: Dan Pumphrey - Teradata Corporation

Price: 
$195
Credit Hours: 
1
Channel: 

Unified Data Architecture Monitoring & Management

$
0
0
Course Number: 
51539
Training Format: 
Recorded webcast

The Unified Data Architecture enables Teradata, Aster and Hadoop to deliver unparalleled value.

Teradata Viewpoint, Studio, Table Operators, and the Unity product suite provide enabling technologies for connectivity, monitoring, and management for all these systems. This presentation highlights the features and capabilities of these client enabling solutions.


Presenter: Gary Ryback - Teradata Corporation

Price: 
$195
Credit Hours: 
2
Channel: 

Teradata Enterprise Access for Hadoop

$
0
0
Course Number: 
52581
Training Format: 
Recorded webcast

This webinar provides an overview of the Teradata Enterprise Access for Hadoop solutions.

This includes an overview of the following solutions:

  • Teradata Viewpoint portlets for Hadoop
  • Teradata Unity Ecosystem Manager
  • Teradata QueryGrid: Teradata-Hadoop
  • Teradata QueryGrid: Aster-Hadoop
  • Teradata Connector for Hadoop (TDCH)
  • Teradata Parallel Transporter (TPT)
  • Teradata Studio with Smart Loader for Hadoop
  • Teradata Unity Data Mover

Presenter: Ariff Kassam, Director of Emerging Technologies Strategy - Teradata

Price: 
$195
Credit Hours: 
0
Channel: 

Teradata Query Grid and Machine Learning in Hadoop

$
0
0
Short teaser: 
This article describes how to use Teradata query grid to execute a Mahout machine learning algorithm
Cover Image: 

This article describes how to use Teradata query grid to execute a Mahout machine learning algorithm on a Hadoop cluster based on data sourced from the Teradata Integrated Data Warehouse. Specifically the Mahout K-means cluster analysis algorithm is demonstrated.  K-means is a computationally expensive algorithm that under certain conditions is advantageous to execute on the Hadoop cluster. Query Grid is an enabling technology for the Teradata Unified Data Architecture (UDA). The UDA is a logical and physical architecture that adds a discovery platform and a data platform to complement the Teradata Integrated Data Warehouse. In the Teradata advocated solution, the discovery platform is Teradata Aster, while the data platform can either be Hadoop or a Teradata Integrated Big Data Platform optimized for storage and processing of big data. Query Grid is an orchestration mechanism that supports seamless integration of multiple types of purpose built analytic engines.

The article describes several techniques:

  • How to use query grid to bidirectionally transfer data between Teradata and Hadoop
  • How to convert Teradata exported data to a format consumable by the Mahout K-means algorithm
  • How to use query grid to execute a Mahout analytic on the Hadoop cluster
  • How to convert the Mahout K-means algorithm output to a format consumable by Teradata

System Configuration

Teradata Version: 15.0 installed on a 4 Node 2700

Hadoop Version: Hadoop 2.4.0.2.1.5.0-695 installed on 1 name node and 8 data nodes.

Mahout Version: 0.9.0.2.1.5.0-695. Mahout Installed on the Hadoop cluster using "sudo yum install mahout".

Hive Version: 13.0

Query Grid Foreign Server Definition:

CREATE FOREIGN SERVER TD_SERVER_DB.tdsqlhhive USING 
hosttype  ('hadoop')
port  ('9083')
hiveport  ('10000')
username  ('hive')
server  ('39.16.72.2')
hiveserver  ('39.16.72.2')
DO IMPORT WITH SYSLIB.LOAD_FROM_HCATALOG USING 
transformformatting  ('true')
,DO EXPORT WITH SYSLIB.LOAD_TO_HCATALOG USING 
hbasepath  ('/apps/hive/warehouse/')
export_file_format  ('text')

HDFS Disk Space Usage

/apps/hive/warehouse/temp:   Hive temporary work space for query grid input and output data

/tmp/mahout:  HDFS temporary workspace for Mahout K-means algorithm and Map Reduce conversion Jobs

Step 1:

Export the Teradata table to a Hive table. The assumption is that the input K-means Teradata table row layout is a single BIGINT key value followed by N floating-point variables. For a specific test case the DDL is as follows:

Teradata table definition.

CREATE SET TABLE kmdata     (
      pkey BIGINT GENERATED ALWAYS AS IDENTITY  (START WITH 1 INCREMENT BY 1 MINVALUE -2147483647 MAXVALUE 2147483647  NO CYCLE),
      v1 FLOAT,
      v2 FLOAT,
      v3 FLOAT)
PRIMARY INDEX ( pkey );

Query Grid SQL commands to create a Hive table and export a Teradata table.

CALL SYSLIB.HCTAS('kmdata',null,null,'tdsqlhhive','temp') ;
INSERT INTO temp.kmdata@tdsqlhhive SELECT * FROM kmdata;

Hive table Definition:

describe kmdata;
pkey                    bigint
v1                      double
v2                      double
v3                      double

Note that if the data to be exported to Hive is "small" (by default, Query Grid creates at minimum one HDFS block per AMP), you may want to use the foreign server "merge_hdfs_files('value')" option. This option indicates that files under the same partition should be merged whenever possible. The default is to not merge; a value of TRUE means that files will be merged, FALSE otherwise. The following Teradata DDL command can be used to modify this option: "ALTER FOREIGN SERVER tdsqlhhive ADD merge_hdfs_files('true');"

Step 2:

Convert the Hive data to a format consumable by Mahout K-means. Mahout K-means requires the input data to be in the following format: SequenceFile (Key, VectorWritable), where VectorWritable is a vector containing the variables which define the records to be clustered. In addition, if you want the key values associated with each clustered record, you need to create NamedVector input vectors where the Name contains the Key value. A more seamless approach to this use case would have been to use a Hive UDF, specifically a Generic UDF, to handle the data format conversions. Unfortunately, two issues were encountered with Hive UDFs:

  • Could not associate a Mahout external JAR with Hive 13 UDF.  “ADD JAR” did not resolve this Mahout "classdefnotfounderror" issue.
  • The Hive sequence file serde does not provide the key values to the UDF, or any reader. The key value contains the K-means cluster assignment value which is the main purpose of running K-means.

Because of the Hive UDF issues, the following Java MapReduce Mapper code was used to do the conversions. Mahout classes were used to construct the input vectors. In addition, by default Query Grid exports data to Hive in text file format using the default field delimiter of ‘\u0001’; the defined delimiter value is passed to the mapper in the configuration object. The MapReduce job will create an appropriately formatted HDFS file in the HDFS directory /tmp/mahout/input.

Map Reduce Job code to convert a CSV textfile to a vector sequence file:

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
import org.apache.mahout.math.*;

public class CsvToVec extends Configured implements Tool {

public static class MapCsvVec extends Mapper<LongWritable, Text, Text, VectorWritable> {
@Override
   public void map(LongWritable key, Text value, Context context) throws IOException {
       Configuration conf = context.getConfiguration();
       String delim = conf.get("delim");
       String line = value.toString();
       String[] c = line.split(delim);
       int vlen = c.length - 1;           //first field is key, remaining are variables  
       double[] d = new double[vlen];
       String pkey = new String(c[0]);    //access key in field 0
       Text opkey = new Text(pkey);

       for (int i = 0; i < vlen; i++)     //create array of variables   
          d[i] = Double.parseDouble(c[i+1]);

       NamedVector vec = new NamedVector(new DenseVector(d), pkey );
       VectorWritable writable = new VectorWritable();
       writable.set(vec);
       try {
           context.write(opkey, writable);
       } catch(InterruptedException e) {
       }
   }
}

  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(super.getConf());

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VectorWritable.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(VectorWritable.class);

    job.setMapperClass(MapCsvVec.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setJarByClass(CsvToVec.class);

    job.waitForCompletion(true);
    return 0;
    }

 public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    conf.set("delim",otherArgs[2]);
    int res = ToolRunner.run(conf, new CsvToVec(), args);
    System.exit(res);
 }
}

Compile command for CsvToVec.java:

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_VERSION=2.4.0.2.1.5.0-69
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45/bin
mkdir csvtovec_classes
$JAVA_HOME/javac -Xlint -classpath ${HADOOP_HOME}/hadoop-common.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar:/usr/lib/mahout/mahout-math-0.9.0.2.1.5.0-695.jar:/usr/lib/mahout/mahout-core-0.9.0.2.1.5.0-695-job.jar:/usr/lib/mahout/mahout-integration-0.9.0.2.1.5.0-695.jar -d csvtovec_classes CsvToVec.java
$JAVA_HOME/jar -cvf CsvToVec.jar -C csvtovec_classes/ .

Map Reduce invocation bash shell script: assume run as root

MAHOUT_VERSION=0.9.0.2.1.5.0-695
MAHOUT_HOME=/usr/lib/mahout
HIVEPATH=/apps/hive/warehouse/temp.db
LIBJARS=$MAHOUT_HOME/mahout-math-$MAHOUT_VERSION.jar,$MAHOUT_HOME/mahout-core-$MAHOUT_VERSION-job.jar,$MAHOUT_HOME/mahout-integration-$MAHOUT_VERSION.jar
export HADOOP_CLASSPATH=$MAHOUT_HOME/mahout-math-$MAHOUT_VERSION.jar:$MAHOUT_HOME/mahout-core-$MAHOUT_VERSION-job.jar:$MAHOUT_HOME/mahout-integration-$MAHOUT_VERSION.jar
infile=/tmp/mahout/input
delim='\u0001'
#clean work directory
hadoop fs -rm -r /tmp/mahout
hadoop jar CsvToVec.jar org.myorg.CsvToVec -libjars ${LIBJARS} $HIVEPATH/kmdata $infile $delim
# The Hive transform function will invoke Mahout as user YARN so make accessible by YARN
hadoop fs -chmod 777 /tmp/mahout 

Note on using libjars when invoking the MapReduce job: you have to use the Configuration object passed to the ToolRunner.run method in your MapReduce driver code. Otherwise your job won’t be correctly configured and the Mahout JARs won’t be accessible in the mapper JVMs.

Step 3:

Execute Mahout K-means to cluster the input data based on the N input variables, in this example v1, v2 and v3. The Hive TRANSFORM function is used to execute K-means. The Hive TRANSFORM function can be used within the FOREIGN TABLE Query Grid syntax to execute a “script” on the Hadoop cluster. Hive TRANSFORM allows users to plug their own custom mappers and reducers into the Hive query processing data stream by implementing the mappers and reducers as “scripts”. The specific usage in this case of Query Grid and Mahout K-means is to use TRANSFORM to execute a BASH shell script that invokes Mahout K-means and processes the data stream external to Hive. For more details on TRANSFORM see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Because the FOREIGN TABLE syntax does not support the ADD FILE command, the TRANSFORM script to run K-means must be installed on each node within the Hadoop cluster. Note that Hive ADD FILE is the typical mechanism to distribute TRANSFORM scripts. Also, a one-row, one-integer-column table named temp.onerow is used to satisfy the TRANSFORM input data requirements.

Query Grid SQL used to invoke K-means:


SELECT trim(BOTH FROM t1.x) FROM FOREIGN TABLE (
 SELECT 
  TRANSFORM(c1,'/tmp/mahout/input','/tmp/mahout/kmeans/output','/tmp/mahout/kmeans/clusters-0-final',10,10,0.5) USING 'bash /tmp/runmh' AS (oc1) 
  FROM temp.onerow
)@tdsqlhhive t1 (x) ;

The input parameters to TRANSFORM will be passed to the BASH shell script and are accessible from STDIN. The second parameter to TRANSFORM is a comma-separated list of values defined as: input directory, output directory, initial cluster definitions work directory, maximum number of iterations, K value, and convergence tolerance. See the following for all Mahout K-means command-line input definitions: https://mahout.apache.org/users/clustering/k-means-commandline.html. The USING clause contains the script to execute, in this example the BASH shell script "/tmp/runmh".

BASH shell script runmh: installed in /tmp directory on all Hadoop data nodes. 

#!/bin/bash
#
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45/bin/java

logfile=/tmp/log.txt
# read TRANSFORM inputs from STDIN
read col1value infile outfile workclusters iter K tolerance INPUT

/usr/bin/mahout kmeans --input $infile  --output $outfile  --numClusters $K   --clusters $workclusters --maxIter $iter  --method mapreduce --clustering -ow -cl -cd $tolerance > $logfile

echo "Initiated on cluster node $(uname -n)"
res=$(<$logfile)
echo "$res"
rm $logfile

Step 4

Convert the K-means output to a format consumable by Teradata and Hive. Mahout K-means output is in the format SequenceFile (Key, VectorWritable), where the Key is the cluster assignment and the Vector is a NamedVector of the format (Primary Key, v1, v2, ... vN). Because of the Hive UDF issue, and the fact that the Hive sequence file serde does not return the Key, the following Java MapReduce code was used to do the conversions. This process creates an output external Hive table named kmout in the /tmp/mahout directory.

Map Reduce Job code to convert a sequence file vector to a CSV text file.

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
import org.apache.mahout.math.*;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.utils.vectors.VectorHelper;

public class VecToCsv extends Configured implements Tool {

public static class MapVecCsv extends Mapper<IntWritable, WeightedPropertyVectorWritable, NullWritable, Text> {
@Override
   public void map(IntWritable key, WeightedPropertyVectorWritable value, Context context) throws IOException, InterruptedException {
       String skey = key.toString();
       Text okey = new Text(skey);
       NamedVector vector =  ((NamedVector)value.getVector());
       String resStr = skey + "," + vector.getName() + "," + VectorHelper.vectorToCSVString(vector, false);
       resStr = resStr.replaceAll("(\\r|\\n)", "");
       Text outvalue = new Text(resStr);

       try {
           if (resStr.length() > 0) {
              context.write(NullWritable.get(), outvalue);
           }
       } catch(InterruptedException e) {
          throw e;
       }
   }
}

  public int run(String[] args) throws Exception {

    Job job = Job.getInstance(super.getConf());

    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    job.setMapperClass(MapVecCsv.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setJarByClass(VecToCsv.class);

    job.waitForCompletion(true);
    return 0;
    }

 public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    int res = ToolRunner.run(conf, new VecToCsv(), args);
    System.exit(res);
 }
}

Compile Command for VecToCsv.java

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_VERSION=2.4.0.2.1.5.0-69
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45/bin
mkdir vectocsv_classes
$JAVA_HOME/javac -Xlint -classpath ${HADOOP_HOME}/hadoop-common.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar:/usr/lib/mahout/mahout-math-0.9.0.2.1.5.0-695.jar:/usr/lib/mahout/mahout-core-0.9.0.2.1.5.0-695-job.jar:/usr/lib/mahout/mahout-integration-0.9.0.2.1.5.0-695.jar -d vectocsv_classes VecToCsv.java
$JAVA_HOME/jar -cvf ./VecToCsv.jar -C vectocsv_classes/ .

Map Reduce invocation bash shell Script: assume run as root

MAHOUT_VERSION=0.9.0.2.1.5.0-695
MAHOUT_HOME=/usr/lib/mahout
LIBJARS=$MAHOUT_HOME/mahout-math-$MAHOUT_VERSION.jar,$MAHOUT_HOME/mahout-core-$MAHOUT_VERSION-job.jar,$MAHOUT_HOME/mahout-integration-$MAHOUT_VERSION.jar
export HADOOP_CLASSPATH=$MAHOUT_HOME/mahout-math-$MAHOUT_VERSION.jar:$MAHOUT_HOME/mahout-core-$MAHOUT_VERSION-job.jar:$MAHOUT_HOME/mahout-integration-$MAHOUT_VERSION.jar
HIVEPATH=/apps/hive/warehouse/temp.db
TEMPPATH=/tmp/mahout
infile=$TEMPPATH/kmeans/output/clusteredPoints
# convert to csv
hadoop jar VecToCsv.jar org.myorg.VecToCsv -libjars ${LIBJARS} $infile  $TEMPPATH/kmout

Create Hive external table to be used for Query Grid import

use temp;
DROP TABLE kmout;
CREATE EXTERNAL TABLE kmout (clusterid INT, pkey BIGINT, v1 FLOAT, v2 FLOAT, v3 FLOAT) ROW FORMAT DELIMITED  FIELDS TERMINATED BY ',' STORED AS TEXTFILE
              LOCATION '/tmp/mahout/kmout';

Step 5

Import the data to Teradata and clean up the temporary Hive tables

SELECT * FROM temp.kmout@tdsqlhhive;

CALL SYSLIB.HDROP('temp','kmdata','tdsqlhhive');
CALL SYSLIB.HDROP('temp','onerow','tdsqlhhive');

 

Example Output for 1,000,000 input rows and K=10

SELECT clusterid,COUNT(*),MIN(v1),MAX(v1) FROM temp.kmout@tdsqlhhive GROUP BY 1 ORDER BY 3;

 *** Query completed. 10 rows found. 4 columns returned. 
 *** Total elapsed time was 48 seconds.

  clusterid     Count(*)             Minimum(v1)             Maximum(v1)
-----------  -----------  ----------------------  ----------------------
     532949       139506   1.00000000000000E 000   1.39506000000000E 005
     397723       128241   1.39507000000000E 005   2.67747000000000E 005
     495004       109499   2.67748000000000E 005   3.77246000000000E 005
      43426        89500   3.77247000000000E 005   4.66746000000000E 005
     557758        74702   4.66747000000000E 005   5.41448000000000E 005
     235205        69648   5.41449000000000E 005   6.11096000000000E 005
     138421        75580   6.11097000000000E 005   6.86676000000000E 005
     656378        89851   6.86677000000000E 005   7.76527000000000E 005
     150185       106277   7.76528000000000E 005   8.82804000000000E 005
     260781       117196   8.82805000000000E 005   1.00000000000000E 006

 

Summary:

This article has described using several technologies

  • Teradata Query Grid
  • Hive Transform Function
  • Mahout K-means
  • Map Reduce   

for performing advanced analytics within the Teradata UDA. It should be noted that additional Mahout algorithms can be invoked in a similar manner assuming the input data format requirements have been met. Future articles will discuss executing other Mahout algorithms and replacing the conversion MapReduce jobs with a Hive UDF to streamline the process.

 

 

 

 

 

 

Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Teradata Studio 15.10 now available

$
0
0
Short teaser: 
Announcing Teradata Studio 15.10 is now available for download
Cover Image: 

Teradata Studio 15.10 is now available for download. Teradata Studio is an administration tool for creating and administering database objects. It can be run on multiple operating system platforms, such as Windows, Linux, and Mac OSX. It is built on top of the Eclipse Rich Client Platform (RCP) which allows Teradata Studio to benefit from the many high quality Eclipse features available while focusing on value-add for the Teradata Database.

What's New:

  • Teradata 15.10 Support

Added support for Teradata 15.10 Database and embedded the Teradata JDBC 15.10 driver.

  • Teradata Query Band Support

Provides support for Query Bands within the Connection Profile. Also provides support for Query Bands within the Database PROFILE object when created using the Administer Profile dialog.

  • Data Source Explorer Enhancements (Add Database, Set Root, Performance Improvements)

Studio 15.10 provides an option to add one Database/User at a time to the Data Source Explorer (DSE). By setting the Data Source Explorer Load Preference to 'User Choice', the Add Database menu option appears for you to enter the name of the Database or User to add to the DSE tree. Another DSE improvement is the ability to set the root of the DSE tree. By checking the Data Source Explorer Load Preference option 'Show Databases and Users in Hierarchical Display', the Set Root menu option appears for you to enter the name of the Database or User you wish to be the root of the DSE tree.

Studio 15.10 also provides Data Source Explorer load performance improvements. Two new options are available in the Data Source Explorer Load Preferences, 'Load tables space and journal details' and 'Load View Columns Data Types'. Unchecking these options will improve the time it takes to load the Tables and Views folder in the DSE.

  • Result Set Viewer Enhancements (Result Set Reuse, Clear Result Sets)

Studio 15.10 added a new preference for the Result Set Viewer to allow reuse of the Result Set Viewer tab. Checking this option will cause the Result Set Viewer to reuse the current window with the new result set generated by the SQL Editor. Also added to the Result Set Viewer is a new toolbar button to clear all Result Set windows.

  • Result Set Export Mode

This feature allows the user to send the query result set data to a file instead of displaying it in the Result Set Viewer. The user will set the Results Handler value on the SQL Handling Preference page. When the next query is executed, the Export Data Wizard is invoked for the user to specify the export file location and file options.

  • SQL History View Enhancements

Added enhancements to the SQL History View Preferences to allow users to select the columns displayed and control the number of rows loaded in the SQL History View.

  • Smart Loader and Teradata Load Enhancements

Smart Loader and Teradata Load have added an option to allow the user to specify the OS line separator or a fixed-width column delimiter, as well as providing the user better control over the handling of errors in the load input file. When an error occurs while trying to load a row of data, the error row is stored in an error file. An option has been added to stop the load after 'n' number of errors has occurred. The user can also restart the Teradata Load, specifying the row number of the input file to start the load from.

Teradata Load has also added a preview panel to show a sample of the load input file based on the file options (such as column delimiter, string delimiter, encoding) chosen. This will help the user to verify that the file options are correct before starting the load.

  • Aster 6.10 Support and Parser

Added support for Aster 6.10 Database and embedded the Aster JDBC 6.10 driver. Studio 15.10 also includes an Aster 6.10 Parser to detect syntax errors and provide content assist for SQL statements within the SQL Editor, along with SQL formatting and Outline view support.

  • Aster Administration

Provides Aster Administration user interface for creating and dropping Database, Schema, Role, User, Table, and View objects, as well as granting and revoking privileges for all Aster objects.

  • Aster Analytics Templates

Provides support for Templates View of Aster analytics templates. Users can drag and drop analytic functions from the list in the view to the SQL Editor to display a template for that analytic function.

  • Aster Generate DDL Wizard and Display DDL Menu Option

These features will use the new Aster 6.10 DISPLAY command to generate and display the DDL for Aster Database objects. The menu options are provided in the Aster menu when in the Administrator perspective.

  • Aster Compare Objects

Provides support for comparing two Aster database objects. The user selects an Aster object in the Data Source Explorer and then chooses the Compare With option from the Aster menu. This invokes the Aster Compare dialog for the user to select the object to compare with. The two objects' DDL are displayed side-by-side in the Eclipse Compare Editor.

  • Quick INSERT, DELETE, and UPDATE SQL Templates

Provides additional templates for Teradata, Aster, and Hadoop tables using the Generate SQL menu option.

  • Administer Teradata Foreign Server

Added administration support for Teradata FOREIGN SERVER objects. A user interface is provided to create, drop, and alter FOREIGN SERVER objects for Teradata Database version 15.00.02.04 and above.

  • Hadoop Enhancements

Added support for Hortonworks Knox connections, Cloudera support for Hadoop to Teradata Smart Loader, and SQL-H support for Hadoop to Aster Smart Loader. Also added is HiveQL support using the HiveServer2 JDBC driver that comes embedded with Studio. HiveQL support allows users to execute HiveQL statements in the SQL Editor and display the result set data in the Result Set Viewer.

  • Install Configuration File

Studio 15.10 provides an install configuration file that allows distribution centers to create a configuration file, containing connection profile definitions, that is loaded during the install. This feature is provided with the Windows Studio install package.

Need Help?

For more information on using Teradata Studio, refer to the article, Teradata Studio or the Teradata Studio User Guide (available on the Teradata Studio Download Page). To ask questions or discuss issues, refer to the Teradata Studio Forum and post your question.

Online Help can be accessed within Teradata Studio:

  • From the main menu: Help > Help Contents > Teradata Studio
  • Context sensitive help: When a user is in a dialog, they can hit the F1 key to retrieve help text sensitive to where they are within the dialog.
  • Also, the Quick Tour provides a quick overview of Teradata Studio features. Go to Help > Welcome.

Reference Documentation can be found on the download page or at: www.info.teradata.com

  • Title: Teradata Studio, Studio Express, and Plug-in for Eclipse Release Definition Publication ID: B035-2040-045C
  • Title: Teradata Studio, Studio Express, and Plug-in for Eclipse Installation Guide Publication ID: B035-2037-045K
Ignore ancestor settings: 
0
Channel: 
Apply supersede status to children: 
0

Why We Love Presto

$
0
0
Short teaser: 
Teradata's collaboration with Facebook for SQL-on-Hadoop
Cover Image: 

Concurrent with acquiring Hadoop companies Hadapt and Revelytix last year, Teradata opened the Teradata Center for Hadoop in Boston. Teradata recently announced that a major new initiative of this Hadoop development center will include open-source contributions to a distributed SQL query engine called Presto. Presto was originally developed at Facebook, and is designed to run high performance, interactive queries against Big Data wherever it may live --- Hadoop, Cassandra, or traditional relational database systems.

Those who will be part of this initiative and contributing code to Presto include a subset of the Hadapt team that joined Teradata last year. In the following, we will dive deeper into the thinking behind this new initiative from the perspective of the Hadapt team. It is important to note upfront that Teradata’s interest in Presto, and the people contributing to the Presto codebase, extend beyond the Hadapt team that joined Teradata last year. Nonetheless, it is worthwhile to understand the technical reasoning behind Teradata’s embrace of Presto, even if it presents a localized view of the overall initiative.

____________________________________________________________________________________________________________________________________

Around seven years ago, Ashish Thusoo and his team at Facebook built the first SQL layer over Hadoop as part of a project called Hive. At its essence, Hive was a query translation layer over Hadoop: it received queries in a SQL-like language called Hive-QL and transformed them into a set of MapReduce jobs over data stored in HDFS on a Hadoop cluster. Hive was truly the first project of its kind. However, since its focus was on query translation into the existing MapReduce query execution engine of Hadoop, it achieved tremendous scalability but poor efficiency and performance, and it ultimately led to a series of subsequent SQL-on-Hadoop solutions that claimed 100X speed-ups over Hive.

Hadapt was the first such SQL-on-Hadoop solution that claimed a 100X speed-up over Hive on certain types of queries. Hadapt was spun out of the HadoopDB research project from my team at Yale and was founded by a group of Yale graduates. The basic idea was to develop a hybrid system that is able to achieve the fault-tolerant scalability of the Hive MapReduce query execution engine while leveraging techniques from the parallel database system community to achieve high performance query processing.

The intention of HadoopDB/Hadapt was never to build its own query execution layer. The first version of Hadapt used a combination of PostgreSQL and MapReduce for distributed query execution. In particular, the query operators that could be run locally, without reliance on data located on other nodes in the cluster, were run using PostgreSQL’s query operator set (although Hadapt was written such that PostgreSQL could be replaced by any performant single-node database system). Meanwhile, query operators that required data exchange between multiple nodes in the cluster were run using Hadoop’s MapReduce engine.

Although Hadapt was 100X faster than Hive for long, complicated queries that involved hundreds of nodes, its reliance on Hadoop MapReduce for parts of query execution precluded sub-second response time for small, simple queries. Therefore, in 2012, Hadapt started to build a secondary query execution engine called “IQ” which was intended to be used for smaller queries. The idea was that all queries would be fed through a query-analyzer layer before execution. If the query was predicted to be long and complex, it would be fed through Hadapt’s original fault-tolerant MapReduce-based engine. However, if the query would complete in a few seconds or less, it would be fed to the IQ execution engine. 

In 2013, Hadapt integrated IQ with Apache Tez in order to avoid redundant engineering effort, since the primary goals of IQ and Tez were aligned. In particular, Tez was designed as an alternative to MapReduce that can achieve interactive performance for general data processing applications. Indeed, Hadapt was able to achieve interactive performance on a much wider range of queries when leveraging Tez than it had been able to achieve previously.

Unfortunately, Tez was not quite a perfect fit as a query execution engine for Hadapt’s needs. The largest issue was that before shipping data over the network during distributed operators, Tez first writes this data to local disk. The overhead of writing this data to disk (especially when the intermediate result set was large) precluded interactivity for a non-trivial subset of Hadapt’s query workload. A second problem was that the Hive query operators implemented over Tez use, by default, traditional Volcano-style row-by-row iteration. In other words, a single function invocation for a query operator processes just a single database record. This results in a very large number of function calls when processing a large dataset, and in poor instruction cache locality, as the instructions associated with a particular operator are repeatedly reloaded into the instruction cache for each invocation. Although Hive and Tez have started to alleviate this issue with the recent introduction of vectorized operators, Hadapt still found that query plans involving joins or SQL functions would fall back to row-by-row iteration.

The Hadapt team therefore decided to refocus its query execution strategy (for the interactive-query part of Hadapt’s engine) on Presto, which presented several advantages over Tez. First, Presto pipelines data between distributed query operators directly, without writing to local disk, significantly improving performance for network-intensive queries. Second, Presto query operators are vectorized by default, thereby improving CPU efficiency and instruction cache locality. Third, Presto dynamically compiles selective query operators to byte code, which lets the JVM optimize and generate native machine code. Fourth, Presto manages memory directly, avoiding Java object allocation, its heap overhead, and garbage collection pauses. Overall, Presto is a very advanced piece of software, and very much in line with Hadapt’s goal of leveraging as many techniques from modern parallel database system architecture as possible.

The Teradata Center for Hadoop has thus fully embraced Presto as the core part of its technology strategy for the execution of interactive queries over Hadoop. Consequently, it made logical sense for Teradata to take its involvement in the Presto project to the next level. Furthermore, Hadoop is fundamentally an open source project, and in order to become a significant player in the Hadoop ecosystem, Teradata needs to contribute meaningful and important code to the open source community. Teradata’s recent acquisition of Think Big serves as further motivation for such contributions.

Therefore, Teradata has announced that it is committed to making open source contributions to Presto and has allocated substantial resources to doing so. Presto is already used by Silicon Valley stalwarts such as Facebook, Airbnb, Netflix, Dropbox, and Groupon. However, Presto’s enterprise adoption outside of Silicon Valley remains small. Part of the reason for this is that the ease-of-use and enterprise features typically associated with modern commercial database systems are not fully available with Presto: missing are an out-of-the-box, simple-to-use installer, database monitoring and administration tools, and third-party integrations. Therefore, Teradata’s initial contributions will focus on these areas, with the goal of bridging the gap to getting Presto widely deployed in traditional enterprise applications. This will hopefully lead to more contributors and momentum for Presto.

Teradata’s commitment to Presto, and to making meaningful contributions to an open source project, is an exciting development. It will likely have a significant impact on enterprise adoption of Presto. Hopefully, Presto will become a widely used open source parallel query execution engine --- not just within the Hadoop community but, due to the generality of its design and its storage-layer agnosticism, for relational data stored anywhere.

==================================================================================

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high-performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.


Hands-On with Teradata QueryGrid™ - Teradata-to-Hadoop

Course Number: 
54300
Training Format: 
Recorded webcast

Today's analytic environments incorporate multiple technologies and systems. Teradata QueryGrid™ Teradata-to-Hadoop allows you to access data and processing on Hadoop from your Teradata data warehouse.

This course gives you an in-depth understanding of how QueryGrid Teradata-to-Hadoop works, including querying metadata, partition pruning, push-down processing, importing data, and joins. Using live queries, we explain the syntax and functionality of QueryGrid so you can combine data and analytics across your analytic ecosystem.
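
As a flavor of what the course covers, here is a minimal sketch of QueryGrid Teradata-to-Hadoop queries. The foreign-server name (hdp_server) and the tables are hypothetical, and exact grammar varies by QueryGrid release, so treat this as an illustration rather than reference syntax.

-- Join a Teradata table with a Hive table exposed through a (hypothetical)
-- foreign server named hdp_server; matching rows are imported from Hadoop.
SELECT s.store_id,
       SUM(w.clicks) AS total_clicks
FROM   store_dim s
JOIN   web_clicks@hdp_server w
       ON s.store_id = w.store_id
GROUP  BY s.store_id;

-- Push processing down to Hadoop: the inner query runs remotely and only its
-- (much smaller) aggregated result is returned to Teradata.
SELECT *
FROM   FOREIGN TABLE (SELECT store_id, COUNT(*) AS click_cnt
                      FROM web_clicks
                      GROUP BY store_id)@hdp_server AS t;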

Presenter: Andy Sanderson, Product Manager - Teradata Corporation

Prerequisite:  Course #53556, High Performance Multi-System Analytics using Teradata QueryGrid (Webcast)

Price: 
$195
Credit Hours: 
1

Architecture of Presto, an Open Source SQL Engine

Course Number: 
54669
Training Format: 
Recorded webcast

Presto is an open source distributed SQL engine, originally developed by Facebook for their massive Hadoop data warehouse.

Earlier this year, Teradata joined the Presto community and announced a multi-year roadmap to accelerate Presto development to make it ready for enterprise users. This talk covers Presto architecture and discusses the technology behind the project.

Presenter: Kamil Bajda-Pawlikowski, Teradata Corporation

Price: 
$195
Credit Hours: 
2

Studio 15.11 Available for Download

Short teaser: 
Studio, Studio Express, and Teradata Plug-in for Eclipse 15.11 are now available for download.
Cover Image: 

Teradata Studio Express, Teradata Studio, and Teradata Plug-in for Eclipse 15.11 are now available for download. The Studio family of products provides client-based access to Teradata, Aster, and Hadoop.

Teradata Studio Express is an information discovery tool for retrieving and displaying data from your Teradata Database systems. Teradata Studio, in addition to the functionality of Studio Express, includes administration features for creating and administering database objects. Both are built on top of the Eclipse Rich Client Platform (RCP), which allows them to benefit from the many high-quality Eclipse features available while focusing on value-add for the Teradata Database, Aster Database, and Hadoop platforms. Studio Express and Studio run on multiple operating system platforms, such as Windows, Linux, and Mac OS X.

Teradata Plug-in for Eclipse provides an Eclipse plug-in that includes the same functionality as Teradata Studio, along with dialogs and wizards to help build Java Web Services, Java Stored Procedures, and Java User-Defined Functions. Teradata Plug-in for Eclipse is supported on both Windows and Mac OS X platforms.

What's new in Teradata Studio and Studio Express:

  • Connection Management

Provides a preference to limit the number of allowed connections. A connection pool is maintained for each connection profile. The maximum number of connections defaults to 8, with a minimum of 2. The Data Source Explorer always keeps one database connection per profile. Each SQL Editor window holds a connection after the first query is executed; closing the SQL Editor releases the connection back to the connection pool. Other features that obtain a connection release it after completing their task. Refer to the Connection Options preference page to administer the maximum number of connections per profile.

  • Load Teradata Volatile Table

A new 'Import' option is provided on the SQL Editor toolbar to allow you to load data into a Teradata Volatile Table. To use this option, first run the CREATE VOLATILE TABLE SQL statement within the SQL Editor, then press the Import button and choose the volatile table to load. This invokes the Load Data Wizard for importing data.
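
For example (a minimal sketch --- the table name and columns are hypothetical), you would first create the volatile table in the SQL Editor and then select it as the target when the Import button launches the Load Data Wizard:

CREATE VOLATILE TABLE sales_stage (
    sale_id  INTEGER,
    sale_dt  DATE,
    amount   DECIMAL(10,2)
) ON COMMIT PRESERVE ROWS;   -- keep the rows after the creating transaction commits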

  • Cloudera SQL Execution Support (Impala JDBC Driver)

A new connection profile type, Hadoop Cloudera, is provided allowing for JDBC support within Studio and Studio Express. The Impala JDBC driver is bundled with Studio and Studio Express to allow users to run SQL commands against Cloudera within the SQL Editor. Result sets are displayed in the Result Set Viewer and an entry is placed in the Teradata SQL History.

  • SQL Editor Abort Button

Provides an option in the SQL Editor to cancel the running SQL statements.

  • New Query List Preferences

Provides more granularity for accessing the Teradata Data Dictionary views when obtaining the metadata for the query lists. Previously, users had to choose between the V or VX views for all query lists. This new feature allows the user to individually select between these views for each list: databases, tables and views, macros and procedures, functions, etc. For example, a user can choose the VX view (DBC.DatabasesVX) for the databases list but the V view (DBC.TablesV) for the tables list. The preference page also provides an option to supply a custom view. (NOTE: If choosing the custom view option, you must provide a view that contains the exact same view definition as the DBC view.)
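
For context (an illustration, not SQL that Studio itself generates), the difference between the two families of dictionary views is scope: the V views list all objects recorded in the Data Dictionary, while the VX views restrict the result to objects the current user has rights on. A tables query list built on the VX view corresponds roughly to:

-- DBC.TablesV returns every table the dictionary knows about;
-- DBC.TablesVX limits the rows to objects the current user may access.
SELECT DatabaseName,
       TableName
FROM   DBC.TablesVX
WHERE  TableKind = 'T'       -- 'T' = base tables
ORDER  BY DatabaseName, TableName;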

  • Aster Timeout Auto Reconnect Support

Provides automatic reconnection for Aster timeouts. When the Aster session times out, a pop-up message is displayed asking if you want to reconnect to the database. If yes, the user will be prompted for the password and re-connected to the database with a new session.

  • Configurable Preferences at Install time

The Windows installation of Studio and Studio Express supports preconfigured preferences that can be provided during the installation process. The feature focuses on those preferences that pertain to database interactions. The Windows media image contains a preferences template file in the /config directory called TeradataStudioPreferences_template.config. Edit this file prior to installation to specify the preferences you wish to initialize, such as query list views, connection options, query band settings, Data Source Explorer load preferences, and profile connection types. Additional special preferences are also provided, such as disabling the connection profile 'Save Password' option, disabling auto commit in the SQL Editor, or controlling copy, export, and print of result set content. The preferences template also provides an option to lock a preference, preventing the user from changing it at runtime.

Refer to the article Studio Preferences: Initialization and Locking for more information on pre-configuring preferences.

What's new in Teradata Studio only:

  • Studio Express features

All of the features mentioned above for Studio Express are also provided for Studio.

  • New Administration Perspective

The Studio Administration Perspective has an all-new look and feel. The display is divided into three main view panes: Navigator View, Filter View, and Object List Viewer. The Navigator View allows you to create, import, or select an existing connection profile. The connection profile list mirrors the connection profiles contained in the Data Source Explorer. After choosing a connection profile, you are presented with a list of categories to further choose the objects to administer. Double-click on a category to show the list of objects in the Object List Viewer. The Filter View provides a filter service for confining the list of objects displayed in the Object List Viewer. Filters are created per object category and can be combined via AND or OR options. The Object List Viewer displays the lists of objects. Toolbar and menu actions, such as Create, Modify, Drop, Open, and Show options, are provided to administer the objects.

Refer to the article The New Teradata Studio Administration User Experience for more information on using the new administration perspective.

  • New Administration Forms

Administration forms replace the administration dialogs present in the previous release of Studio. The forms are invoked via actions in the Object List Viewer. For example, you can select a Teradata database in the Object List Viewer and press the 'Create Database' action in the toolbar, and a 'Create Database' form is opened below the Object List Viewer. The user enters the database information and presses the 'Commit' button to send the request to the database. Similarly, you can display the details of an object by selecting it in the Object List Viewer and choosing the 'Open' object action; in this case, the detail information is provided in a read-only form.

  • Secure Zone Administration Support

Administration of Secure Zones has been added to Studio. Users can display, create, modify, and drop Secure Zones, as well as administer Zone users and guest users.
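
As a rough sketch of the underlying DDL these Studio actions drive (the zone and user names are hypothetical, and the exact Secure Zone syntax depends on the Teradata Database release --- verify against the Security Administration documentation):

-- Assumed syntax sketch; confirm against your Teradata Database release.
CREATE ZONE finance_zone;                   -- define a new secure zone
GRANT ZONE finance_zone TO finance_admin;   -- add an existing user to the zone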

  • Cloudera Connection Options

A new Hadoop Cloudera connection option is provided that supports TDCH, JDBC, and SQL-H services for Hadoop Cloudera.

Need Help?

For more information on using Teradata Studio Express, refer to the article, Teradata Studio Express.

For more information on using Teradata Studio, refer to the article, Teradata Studio or the Teradata Studio User Guide (available on the Teradata Studio Download Page).

For more information on using Teradata Plug-in for Eclipse, there is an article to help you get up and running. Please refer to Getting Started with Teradata Plug-in for Eclipse.

To ask questions or discuss issues, refer to the Teradata Studio Forum and post your question.

Online Help can be accessed within the plug-in in two ways:

  • From the main menu: Help > Help Contents
  • Context sensitive help: When a user is in a dialog, they can hit the F1 key to retrieve help text sensitive to where they are within the dialog.
  • Also, the Quick Tour provides a quick overview of Teradata Studio Express or Studio features. Go to Help > Welcome.

Reference Documentation can be found on the download page or at: www.info.teradata.com

  • Title: Teradata Studio, Studio Express, and Plug-in for Eclipse Installation Guide Publication ID: B035-2037-056K

Teradata Connector for Hadoop (Command Line Edition)



