Calling Hive from Python

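Without a client library, the usual approach is to run a Hive script and then post-process its output with a glue language, which is inconvenient. Hive itself provides clients for several languages that can operate on Hive directly, but those official libraries are fairly complex. We recommend the pyhs2 library instead: it wraps the official interface and is easier to use.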

Install pyhs2 with pip:

$ sudo pip install pyhs2

Note: pyhs2 is already installed on the cloud platform.

The pyhs2 workflow is simple: connect > execute HiveQL > fetch the results

Sample code:

# coding=utf-8

import pyhs2

sql = "select * from test"

# 1. Connect (replace the placeholder host, user, and database values)
with pyhs2.connect(host='host address', authMechanism='PLAIN', user='user name', database='database') as conn:
    with conn.cursor() as cursor:
        try:
            # 2. Execute the HiveQL statement
            cursor.execute(sql)
            # 3. Fetch the query result (first row only)
            print(cursor.fetchone())
        except pyhs2.error.Pyhs2Exception as err:
            raise SystemExit('execute error: %s' % err)

There are several ways to fetch query results:

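fetchone() returns a single row as a list (the log record is already split into fields)
fetchall() returns a list whose elements are themselves lists, one per row of the log
If the result set is large enough to strain memory, iterate over the cursor directly with for ... in ...; rows are then produced one at a time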
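
The iteration approach can be sketched as a small helper. This is only an illustration: stream_rows and FakeCursor are hypothetical names, not part of pyhs2, and the helper assumes nothing about the cursor beyond an execute() method and support for for ... in ..., as described above. FakeCursor stands in for a real pyhs2 cursor so the snippet can be tried without a Hive server.

```python
def stream_rows(cursor, sql, handle_row):
    """Execute sql and pass each row to handle_row, one row at a time.

    Returns the number of rows processed. Nothing is accumulated here,
    so memory use does not grow with the size of the result set.
    """
    cursor.execute(sql)
    count = 0
    for row in cursor:  # rows are produced as the loop advances
        handle_row(row)
        count += 1
    return count


class FakeCursor(object):
    """Hypothetical stand-in for a pyhs2 cursor, for offline experiments."""

    def __init__(self, rows):
        self._rows = rows
        self._iter = iter(())

    def execute(self, sql):
        # A real cursor would send the statement to Hive here.
        self._iter = iter(self._rows)

    def __iter__(self):
        return self._iter


collected = []
n = stream_rows(FakeCursor([["a", 1], ["b", 2]]), "select * from test", collected.append)
```

With a real connection, the cursor obtained from conn.cursor() would be passed in place of FakeCursor, and handle_row could write each row to a file or update a running aggregate instead of appending to a list.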

For more methods, see the pyhs2 source: https://github.com/BradRuderman/pyhs2/blob/master/pyhs2/cursor.py