Adding Apache Hive to ODPi Runtime Specification 2.0

September 27, 2016

By Alan Gates, ODPi technical steering committee chair and Apache Software Foundation member, committer and PMC member for several projects

Today, ODPi announced that the ODPi Runtime Specification 2.0 will add Apache Hive and Hadoop Compatible File System (HCFS) support. These components join YARN, MapReduce, and HDFS from ODPi Runtime Specification 1.0.

With the addition of Apache Hive to the Runtime specification, I thought it would be a good time to share why we added Apache Hive and how we are strategically expanding the Runtime specification.

Why Hive?
ODPi adds projects to its specifications based on votes from ODPi’s diverse membership. We have a one member, one vote policy. In discussions about which projects to add to the next Runtime specification, many members indicated that they used Apache Hive, which is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Members indicated that by adding Apache Hive to the ODPi Runtime Specification 2.0, ODPi can reduce SQL query inconsistencies across Hadoop platforms, which are one of the key pain points for ODPi members and big data application vendors in general.

What is the process?
As with everything we do in ODPi, the addition of any project to the ODPi Runtime specification is done collaboratively, with participation from everyone who has an interest. ODPi has established the Runtime Project Management Committee (PMC) to maintain the Runtime Specification.

In order to make sure all voices were heard and use cases considered, the Runtime PMC formed an Apache Hive working group. This group included Runtime PMC members, as well as other ODPi contributors who wanted to be involved. It included representatives from several distributors and application vendors, including: Hortonworks, SAS, IBM, Syncsort, and DataTorrent.

The working group came together over the course of a month, meeting regularly, to determine how to add Apache Hive to the spec.

What are we adding?
The working group decided early on to focus on SQL and API compatibility rather than matching a specific version of Apache Hive. We chose Hive 1.2 as our base version that distributions must be compatible with. This gives distribution providers freedom in what version of Hive they ship, while also guaranteeing compatibility for ISVs and end users.

What has to be compatible?
The working group focused on the interfaces that ISVs and the distributors’ customers use most frequently. We agreed that SQL, JDBC, and beeline (the command line tool that allows users to communicate with the JDBC server) are used by the great majority of Hive users, so we included them in the spec. We also included the classic command line, the metastore Thrift interface, and HCatalog as optional components; that is, a distribution may or may not include them, but if it does, they must be compatible. We chose to make these optional because they are frequently, but not universally, used.
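The required-versus-optional rule above can be made concrete with a small sketch. This is only an illustration of the rule as described in this post, not ODPi's actual test suite: the interface names come from the paragraph above, while the `InterfaceStatus` type, the `compliance_problems` function, and the short string keys are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Interface names follow the post; the short string keys are hypothetical.
REQUIRED = {"sql", "jdbc", "beeline"}
OPTIONAL = {"hive-cli", "metastore-thrift", "hcatalog"}


@dataclass(frozen=True)
class InterfaceStatus:
    included: bool    # does the distribution ship this interface?
    compatible: bool  # does it behave compatibly with the base version?


def compliance_problems(distribution: Dict[str, InterfaceStatus]) -> List[str]:
    """Return human-readable problems; an empty list means compliant."""
    absent = InterfaceStatus(included=False, compatible=False)
    problems = []
    # Required interfaces must be present and compatible.
    for name in sorted(REQUIRED):
        status = distribution.get(name, absent)
        if not status.included:
            problems.append(f"{name}: required but not included")
        elif not status.compatible:
            problems.append(f"{name}: included but not compatible")
    # Optional interfaces may be absent, but must be compatible if shipped.
    for name in sorted(OPTIONAL):
        status = distribution.get(name, absent)
        if status.included and not status.compatible:
            problems.append(f"{name}: included but not compatible")
    return problems


# A hypothetical distribution that ships an incompatible HCatalog.
distro = {
    "sql": InterfaceStatus(True, True),
    "jdbc": InterfaceStatus(True, True),
    "beeline": InterfaceStatus(True, True),
    "hcatalog": InterfaceStatus(True, False),
}
print(compliance_problems(distro))
```

Note that leaving out the classic command line or the metastore Thrift interface raises no problem in this sketch, whereas shipping HCatalog in an incompatible form does. How compatibility itself is established is outside the scope of the example.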

Where can you see our work?
The Runtime PMC’s initial draft is open to the public, and everything is published on GitHub.

How can you be involved?
We are still writing tests that distributions can use to check that they comply with the specification, and we would love your help writing them. You can also give feedback on the spec. Participation in ODPi is open to anyone, with all work being done in public on GitHub. Developers can join the conversation on the mailing lists or the Slack channel.