By: Roman Shaposhnik, VP of Technology at ODPi
The release of ODPi 2.1 marks five months’ worth of diligent work by the ODPi technical community, though on the surface it may appear to be an incremental change to last fall’s 2.0 release. While there aren’t any big, splashy additions to our specification, this release is very noteworthy in its own way. Why? Because it follows in the great tradition of tick-tock releases and invests a lot of energy into the underlying infrastructure that is largely invisible to the consumer. This, of course, makes it a “tick” release, and those are truly foundational to the success of the follow-up “tocks” that get all the excitement. If you still don’t believe tick-tock pairs well with complex systems, ask any Sun Microsystems SPARC engineer how well an alternative release model has worked out for them – I believe they called it humpty-dumpty, but I digress, so back to ODPi 2.1.
One of the biggest underlying changes in ODPi 2.1 is that we have fully transitioned to leveraging Apache Bigtop for our reference implementation and validation test suite needs. This required a lot of upstream backporting. Some of it was pretty straightforward, such as backporting all ODPi-developed tests into Bigtop, while some required us to engage with upstream communities and get their feedback on the best way to accomplish a similar goal. This was the story of our ODPi reference implementation stack for Apache Ambari. It started as a custom stack that was shipped as part of the ODPi reference implementation but, after receiving community feedback, it evolved into a standalone management pack that can now be developed and shipped independently of Ambari. This outcome benefits everybody because now any product based on Ambari can simply point at the management pack and deploy the ODPi reference implementation.
ODPi 2.1 is our first release consisting of just the specifications. All of the software artifacts are also being released as part of the Apache Software Foundation. Such renewed alignment with upstream community efforts allows us to be much more in tune with big data practitioners, regardless of whether they participate in ODPi directly or not. This is a win-win for both ODPi and upstream ASF communities. If Bigtop release 1.2.0 was any indication, ODPi’s focus on enterprise stability and readiness brings to light a lot of issues that would otherwise go unnoticed or would only be fixed in vendor-specific patch releases. ODPi’s Bigtop collaboration brings these issues up closer to the source, creating a feedback loop that results in much faster fixes.
On the flip side, Bigtop’s extensive platform coverage and a vibrant community of ASF developers mean the ODPi specification will bring value far beyond what we believe are our core deployment targets. For example, we’ve never really considered IBM’s POWER as a supported ODPi platform, but since Bigtop runs on this hardware, we get it for free. Starting from ODPi 2.1, all of the engineering work will happen directly in the upstream ASF communities, and we expect this to make our development cycle extremely agile and asynchronous. Of course, we’ll continue releasing the specifications, which brings me to the last part of this release.
Most of our effort on the Operations spec was focused on standardizing Ambari 2.5 and taking care of upgrade and backward compatibility guarantees for future ODPi releases. On the Runtime side, we spent quite a bit of time future-proofing it against Hive 2.0 (and looking at how known incompatibilities with Hive 1.2 can affect ISVs and end users). We also considered Spark 2.0 as the next component on which to standardize.
New Special Interest Groups Spark Exploratory Developments
Our Spark 2.0 work was interesting in its own right. Our take was that while Spark was still considered experimental and not at the level of maturity that is required of ODPi Core components, it was still highly important to enterprise readiness. We’re tackling this through a loose construct of Special Interest Groups (SIGs), rather than a highly rigorous body of a Runtime PMC. Thus, Spark gave birth to our first SIG: the Spark and Fast Data Analytics SIG.
With the increase in the popularity and usage of Hadoop and Spark, the notion of Spark replacing Hadoop is gaining traction. While this is possible in some use cases, Spark is already part of Hadoop and there are several components from the Hadoop stack on which Spark depends. Our Spark and Fast Data Analytics SIG, led by Pradeep Roy, advisory software engineer at IBM, is expected to publish guidelines for Spark deployment and recommend best practices on Spark and Hadoop use, along with providing guidelines for different deployment methods for Spark on YARN, Mesos or Spark standalone; comparisons of different SQL on Hadoop solutions; and more.
The formation of two new SIGs, the Data Security and Governance SIG and the BI and Data Science SIG, quickly followed.
Our Data Security and Governance SIG was formed to provide a place for industry experts to collaborate on a set of best practices aimed at solving the complexities of dealing with multi-tenant Big Data data lakes in a secure fashion and with considerations for control points demanded by enterprise regulatory environments and compliance policies. As the leader of this group, I plan, together with my fellow members, to produce a series of whitepapers and validation test suites addressing both platform considerations and solutions practitioners may need to augment their platform practices. This SIG’s first deliverable will be a Security Guide Handbook, developed on GitHub by members from IBM, Hortonworks and Pivotal, that will bring much needed clarity to securing Hadoop-based data lake infrastructure. We’ve also started working on codifying security-related deployment recommendations as part of the Apache Bigtop deployment capabilities, thus providing baseline functionality around security for the entire Hadoop ecosystem. Stay tuned for our outputs, coming soon!
Our BI and Data Science SIG, according to the group’s champion Cupid Chan, managing partner of 4C Decision, has a two-fold goal. The first goal is to help bridge the gap between Relational Database Management Systems (RDBMS) and Hadoop so that BI tools can sit harmoniously on top of these systems, while also providing the same, or even more, business insights to the BI users who also use Hadoop in the backend. The second goal is to collaboratively explore ways for Data Science to better leverage the underlying Hadoop ecosystem. In order to attain an achievable result, the first deliverable for this SIG is to develop a “Data Science Notebook Guideline.” Stay tuned for the release of this group’s findings!
While these SIGs are still very young, they are pushing forward important exploratory work that, we hope, will form a basis for some of the future PMCs and specification updates within the broader scope of ODPi.
These SIGs also represent our lowest barrier of entry to date – so, if you feel like contributing to ODPi efforts but don’t know where to start, we encourage you to join an existing SIG or propose a new one.
By default, SIGs use the odpi-technical mailing list for all online communication between SIG members. This means that all you have to do to join a SIG is drop an email to the odpi-technical mailing list, introduce yourself and briefly describe why you are interested in the SIG’s activity. Include your GitHub ID in the introductory email so that a SIG Champion can add you to the GitHub group.
Contributing to the ODPi community is that easy!