Getting Started with SparkSQL (Using HiveContext)

Date: June 12, 2015

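After a full day of wrestling with it, I finally fixed the result3 error from the previous section. I'll hold off on why the error occurs for a moment; first, here is how I tracked the problem down.

To begin with, I found this thread: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html which contains the following passage: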

The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.

In fact, there are more problems than that. The sparkSQL will conserve (15+5=20) columns in the final table, if I remember well. Therefore, when you are doing join on two tables which have the same columns will cause doublecolumn error.

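Two points are raised here: (1) use HiveContext; (2) the second paragraph, which describes the actual cause of this error.

Fine, if HiveContext is what it takes, HiveContext it is (ugh, this part alone ate half a day):

First, let's see what HiveContext requires. I referred to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html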

The article lists three requirements:

1. Check that $SPARK_HOME/lib contains the JARs datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar.

2. Check that $SPARK_HOME/conf contains a hive-site.xml copied over from $HIVE_HOME/conf.

3. When submitting the program, put the database driver JAR on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar. Alternatively, set SPARK_CLASSPATH in spark-env.sh.
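The first two requirements are easy to get subtly wrong, so a small preflight check can save a round trip. The following is only a sketch: the function name check_hive_prereqs is made up for illustration, and it matches the DataNucleus JARs by name prefix rather than by exact version, on the assumption that version numbers vary between Spark builds.

```python
import os
from pathlib import Path


def check_hive_prereqs(spark_home):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    spark_home = Path(spark_home)
    problems = []

    # Requirement 1: the three DataNucleus JARs under $SPARK_HOME/lib.
    # Assumption: versions differ between builds, so only the name prefix is matched.
    lib_dir = spark_home / "lib"
    for prefix in ("datanucleus-api-jdo", "datanucleus-rdbms", "datanucleus-core"):
        if not list(lib_dir.glob(prefix + "-*.jar")):
            problems.append("missing %s-*.jar in %s" % (prefix, lib_dir))

    # Requirement 2: hive-site.xml copied into $SPARK_HOME/conf.
    hive_site = spark_home / "conf" / "hive-site.xml"
    if not hive_site.is_file():
        problems.append("missing %s" % hive_site)

    return problems


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if home is None:
        print("SPARK_HOME is not set")
    else:
        for line in check_hive_prereqs(home) or ["all checks passed"]:
            print(line)
```

Requirement 3 (the driver classpath) is set at submit time, so it is not something this check can see.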

So I configured everything as required, but once that was done I got an error (in interactive mode):

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

My first guess was that Hive could not connect to its metastore, so I added the metastore connection parameter to hive-site.xml:

        <property>
          <name>hive.metastore.uris</name>
          <value>thrift://111.121.21.23:9083</value>
          <description></description>
        </property>
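Before running a query, it can help to confirm that the metastore address in hive-site.xml is the one you expect and that the port is actually open. Below is a rough sketch using only the Python standard library; the helper names (read_hive_properties, metastore_endpoints, is_reachable) are my own, and SAMPLE simply mirrors the property shown above.

```python
import socket
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Sample configuration mirroring the property above.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://111.121.21.23:9083</value>
  </property>
</configuration>"""


def read_hive_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def metastore_endpoints(props):
    """Split hive.metastore.uris (comma-separated thrift:// URIs) into (host, port) pairs."""
    endpoints = []
    for uri in props.get("hive.metastore.uris", "").split(","):
        uri = uri.strip()
        if uri:
            parsed = urlparse(uri)
            endpoints.append((parsed.hostname, parsed.port))
    return endpoints


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_reachable(host, port) on each pair returned by metastore_endpoints(read_hive_properties(SAMPLE)) only tells you whether something is listening on that port, not that it is a healthy metastore, but it quickly rules out a wrong address or a service that is not running.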

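With the parameter in place, I ran a query full of hope, and, wouldn't you know it, got yet another error (this one had me stuck for a long time):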

ERROR ObjectStore: Version information not found in metastore.

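What this error says is that HiveContext needs to access the Hive metastore and read its schema version information; if that information cannot be found, this exception is thrown. There are plenty of suggested fixes online, and they call for adding the following parameter to hive-site.xml: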

        <property>
          <name>hive.metastore.schema.verification</name>
          <value>false</value>
          <description></description>
        </property>
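One thing that may be worth ruling out here (this is a guess on my part, not something the error message says) is that the parameter was added to one copy of hive-site.xml while Spark reads another, since requirement 2 above means there is a copy under $SPARK_HOME/conf as well as the one under $HIVE_HOME/conf. A minimal sketch for comparing a property across several files; property_value and compare_property are hypothetical helper names:

```python
import xml.etree.ElementTree as ET


def property_value(path, name):
    """Return the value of property `name` in a Hadoop-style XML file, or None if absent."""
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def compare_property(paths, name):
    """Map each file path to the value it holds for `name` (None when not set)."""
    return {path: property_value(path, name) for path in paths}
```

For example, passing the two hive-site.xml paths together with "hive.metastore.schema.verification" shows at a glance whether both copies agree.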

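After adding the parameter and restarting the Hive service, I ran the Spark HiveContext code again and still got the same error. So I compiled and packaged the program in the IDE and ran it on the server: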

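#!/bin/bash

cd /opt/huawei/Bigdata/DataSight_FM_BasePlatform_V100R001C00_Spark/spark/

# All spark-submit options must come before the application JAR;
# anything placed after it is passed to the application as arguments.
# --jars ships the DataNucleus JARs (paths are relative to the current directory).
# --driver-class-path puts the Hive libraries on the driver classpath.
./bin/spark-submit \
  --class HiveContextTest \
  --master local \
  --files /opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
  --jars datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
  --driver-class-path "/opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/lib/*" \
  /home/wlb/spark.jar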

Frustratingly, that produced yet another error (I was about to lose it!): java.net.UnknownHostException: hacluster

This is Hadoop's dfs.nameservices