2017 Predictions: What’s Next for Hadoop

By: John Mertic, Director of Program Management for ODPi

If you follow ODPi insight closely, you might remember these 2017 Big Data Predictions from our VP of Technology, Roman Shaposhnik. After the start of the new year, I started to think about what his predictions and emerging trends like Big Data’s “Push to the Cloud” might mean for our ecosystem – especially as it relates to the Hadoop landscape.

Last year, Apache Hadoop celebrated its tenth birthday. It was a milestone for the diaspora of the early team at Yahoo! that invented the technology, for the worldwide community that grew around it, and for The Apache Software Foundation, which has shepherded the platform since its launch. However, this decade-iversary also showcased something less obvious than Hadoop’s staying power: it brought to light that the canonical state of Hadoop is breaking apart.

Over the last couple of weeks, I’ve spent a lot of time reading through Hadoop and Big Data landscape articles written in the past few years. The most popular conversation was clearly the expansion of the stack – meaning new projects for every possible nook and cranny of the space. Fast data? Check. 12 ways to perform a SQL slice and dice? Done. AI (artificial intelligence) and ML (machine learning) capabilities? Yup. To see what I mean, take a look at this enormous Hadoop Ecosystem Table – summarizing current Hadoop-related projects – here.

Traditionally, the role of Hadoop distribution providers within the ecosystem was to help make sense of a fast-changing and often-confusing landscape for customers. Showcasing their own preferred tools, distros gave the enterprise a stack of components that (more-or-less) worked well together – provided users stayed within confining application architecture walls. While this wasn’t ideal, it worked fairly well if enterprises were happy to stay in the “safe zone” their selected vendors laid out and could blissfully ignore other distros and solutions.

Though this may seem simple, the nature of deploying Big Data is quite varied. AtScale’s recent “Big Data Maturity” report found that 53% of respondents use the cloud in their deployments, yet only 14% have all of their data in the cloud. Not to mention Tony Baer’s recent ZDNet article citing that Hadoop in the cloud is a different product depending on the provider – and not in the traditional sense of how Cloudera CDH differs from Hortonworks HDP. This emergence of the cloud brings into focus a fundamental shift across the entire Big Data landscape.

If there is one overarching lesson the drive to PaaS and IaaS has taught us, it is the benefit of being lean. For example, you can throw more CPU, RAM, and disk drives into your on-premises environment with negligible cost increases; but for cloud instances, each addition quickly counts against you. Knowing this, the best cloud architectures include the ability to compartmentalize, identify focus areas of work, and optimize each resource used – because wasting resources in the cloud has in-your-face cost ramifications.

Now combine Hadoop’s push to the cloud with the fiscal discipline that cloud resources force on you, and it quickly becomes apparent that a traditional one-size-fits-all Hadoop distro is at natural odds with both – especially when that distro comes with a number of projects and tools that you’ve long since outgrown.

My biggest prediction for 2017 is that the Hadoop of 2016 is going to become much more modular, special-purpose, and leaner than what is currently being shipped. We’re already seeing these trends in the following ways:

  • IBM’s Watson Data Platform is centered around Spark – notice anything missing?
  • Cloud vendors are moving away from traditional HDFS and, instead, making their native object stores the data lake (see the sketch after this list)
  • Even traditional Hadoop distro vendors are recognizing this trend and launching offerings leveraging containers as a stopgap solution
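
To make the second point above concrete, here is a minimal, hypothetical sketch of why swapping HDFS for a cloud object store is practical: the Hadoop Compatible File System (HCFS) abstraction lets the same client code list a “data lake” directory whether it lives in HDFS or in an object store reached through a connector such as s3a. The bucket and NameNode addresses below are invented, and the s3a path assumes the hadoop-aws module and its dependencies are on the classpath.

// Hypothetical sketch: the same Hadoop FileSystem (HCFS) API working against
// either HDFS or a cloud object store. Paths and bucket names are made up.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataLakeListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Swap the URI scheme and the rest of the code is unchanged:
        // "hdfs://namenode.example.com:8020/warehouse" for a classic cluster,
        // "s3a://example-bucket/warehouse" for an object-store data lake
        // (the s3a connector requires hadoop-aws on the classpath).
        URI dataLake = URI.create(args.length > 0 ? args[0] : "s3a://example-bucket/warehouse");

        FileSystem fs = FileSystem.get(dataLake, conf);
        for (FileStatus status : fs.listStatus(new Path(dataLake))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}

The point of the sketch is not the listing itself but that application code written against the FileSystem abstraction does not care which storage backend a distro or cloud vendor puts underneath it.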

This slow elimination of the one-size-fits-all ideal leads me to my second prediction: Hadoop and Big Data will no longer be discussed as their own beings – they’ll instead just be referred to as “Data.” I see this acknowledgment as the separation line between vendors who will be successful in 2017 and those who will not. Connecting the entire landscape story together, and speaking to customers about their data strategy vs. shiny new Hadoop or Big Data products, will separate this year’s data winners from its data losers.

My third prediction for Hadoop: ridding the marketplace of the “traditional Hadoop” baggage, and having the important conversations around data strategy, will put the needs of traditional businesses front and center in highlighting the leading technologies in this space. While this may sound pretty obvious, try answering this: how many traditional businesses are bragging about the efficiency of their Hadoop/Big Data/Data solutions and strategies right now? Not many. However, these businesses know that in order to remain competitive they’ll need to become “data-driven.” I think we’ll start seeing organizations drive their needs back to vendors like never before, and their successes will be much more prominently showcased. In other words, less focus on Amazon, Netflix, and Facebook, and more narratives around companies like Progressive Insurance.

It’s a key year for Big Data as it crosses its biggest chasm yet, but as greater focus comes to this industry I think we’ll start seeing a noticeable push forward – setting up some even more impressive leaps in 2018 and beyond.

ODPi Community Lounge @ Apache Big Data Europe

Join the Discussion at the ODPi Community Lounge

Once again ODPi is sponsoring the Community Lounge at Apache Big Data Europe, November 14-16 in Seville, Spain. Apache project members and speakers are welcome to hold their meetings and after-session discussions there. This is a great way to have deeper conversations with fellow attendees and to introduce potential new collaborators to your project.

Please choose a time on the Community Lounge Schedule for your topic or project. We’ll help promote your upcoming meeting – be sure to tell your followers as well. Time slots are 30 minutes each and are scheduled on a first-come, first-served basis.

ODPi Community Lounge – ApacheCon EU 2016

Discussion Schedule

Monday, November 14

Time – Speaker or Project – Topic
10:30 – Open
11:00 – Open
11:30 – Open
12:00 – Open
12:30 – Apache Giraph, Roman Shaposhnik – Discussion session: Practical Graph Processing with Apache Giraph
13:00 – Open
13:30 – 15:30 – Lunch
15:30 – Apache MADlib (incubating), Roman Shaposhnik (Pivotal) – Discussion: Distributed In-Database Machine Learning with Apache MADlib
16:00 – Apache Geode, Greg Chase – Discussion: Meet Apache Geode, recently graduated from the Apache Incubator
16:30 – Open
17:00 – Open

Tuesday, November 15

Time – Speaker or Project – Topic
10:30 – Open
11:00 – Open
11:30 – Open
12:00 – Open
12:30 – Open
13:00 – Open
13:30 – 15:30 – Lunch
15:30 – Apache Bigtop & Greenplum Database, Greg Chase & Roman Shaposhnik – Discussion: Massively Parallel Data Warehousing in the Hadoop Stack
16:00 – Open
16:30 – Open
17:00 – Open

Wednesday, November 16

Time – Speaker or Project – Topic
10:30 – John Mertic (Director, ODPi and Open Mainframe Project, The Linux Foundation) – Discussion of the keynote: Lessons from the Trenches: How Apache Hadoop is Being Used & The Challenges Its Users Face
11:00 – ODPi, John Mertic – Discussion: Standardizing data governance across Hadoop distributions
11:30 – ODPi, Roman Shaposhnik and John Mertic – Discussion: Security in Hadoop
12:00 – ODPi, Roman Shaposhnik and John Mertic – Discussion: Streaming data in Hadoop
12:30 – ODPi, Roman Shaposhnik – Discussion: Hadoop Compatible File Systems across Hadoop Distributions
13:00 – ODPi, Alan Gates – Discussion: Standardizing Hive in Hadoop distributions
End of conference

Is Your Data Clean or Dirty?

Over the weekend I read an incredible post from SAS Big Data evangelist Tamara Dull. I love her down-to-earth, real-life perspectives on Big Data, and her analogy of cleaning the car hit home for me. She is spot on – clean data pays dividends in being able to get better insights.

But, what is clean data? What is that threshold that says your data is clean versus dirty?

Could data even be “too clean”?

(pause to hear gasps from my OCD readers)

Clean data and clean houses

Taking this to a real life example, I can say first hand there are often different definitions of what clean is. For example, my wife is very keen on keeping excess items off our kitchen counters, to the point where she’ll see something that doesn’t belong and put it in the first cabinet or drawer she encounters that has space for it. Me on the other hand, I’m big on finding what I believe is the right place for it. Both of us have the same goal in mind – get the counters clean.

To each of us, there’s value in our approach – efficiency. Hers is optimized at the front end, mine at the back end. However, each of our “cleaning” methods can have negative impacts: with my approach, it’s my wife’s inability to find where I put something; with hers, it’s items falling out of a cabinet on me as I open it.

Is “clean” to one person the same as everyone?

The life lesson above teaches something critical about data – clean isn’t a cut-and-dried threshold. And taking a page from Tamara’s post, it’s also not a static definition.

The trap you can quickly fall into is thinking of this data in the same terms you would apply to structured data. While part of the challenge is understanding what the data is and its relationships, the more crucial challenge is how you intend to consume the data and then use it. This is a shift from the RDBMS mindset of focusing on normalization and structure first and usage second. With the Big Data-esque ways of consuming and processing data (streaming, ML, AI, IoT) combined with velocity, variability, and volume, the use-case mindset is exactly where your focus should be.

“Use case first approach” is how we look at these technologies at ODPi. We look at questions like “Here is the data I have, and this is what I’m trying to find out – what is the right approach/tools/patterns to use?” and how they can be answered. We ensure all of our compliant platforms, interoperable apps, and specifications have the components needed to enable successful business outcomes. This provides companies the peace of mind that they are making a safe investment, and that switching tools doesn’t mean that their clean data becomes less than optimal to leverage the way they want.

This parallels the discussion of cleaning our house: are we trying to clean up quickly because company is coming over, or are we trying to go through an entire room and organize it? Approaching data cleaning requires the same thought process.

ESG Whitepaper: ODPi Simplifies Apache Hadoop Application Development and Portability

Overview

Over the last decade, Apache Hadoop has generated many popular open source software projects, spawned a number of rapid growth startups with commercial distributions and complementary products, and has been a reliable distributed data platform for analytics. As Apache Hadoop adoption continues to grow, the larger Hadoop ecosystem is expanding, too. However, some debate remains about the future direction of the technology.

In this paper, ESG Senior Analyst Nik Rouda discusses Apache Hadoop support from businesses, governments, academia, and technology vendors, and how the specific goals and objectives for harnessing this technology differ across that large and diverse community.

Rouda dives into how ODPi is helping to bring maturity and choice to the Hadoop ecosystem in several ways, offering:

  • More confidence that Hadoop will remain a safe data platform choice for companies.
  • Simplified application and compatibility testing for third-party software developers.
  • Vendor-neutral coordination of efforts between vendors to build synergies across their offerings.

Download this free report to learn more about simplifying Apache Hadoop application development and portability.

Join author Nik Rouda, ESG Senior Analyst, and ODPi Director John Mertic for a complimentary webinar Monday, November 7 from 12-1 PM Eastern. All registrants will get a free copy of this valuable white paper.

Is Open Source Big Data a broken promise?

An article caught my eye this past week, in which Robert Hof of SiliconAngle asserted that the challenges of Apache Hadoop adoption are a byproduct of the open source development approach. Hof argues that the various pieces do not integrate well together and that some projects are not living up to their promises, which has resulted in organizations having to do additional work before they see their true value. This has led to a small pool of available talent and end customers that are uncertain about where to direct their investments.

On the heels of this article, I watched the video below from Rakesh Kant of US Bank, which I found just as insightful.


His sentiment rings loud and clear:

  • “I’m not seeing any signal, only noise.”
  • “The landscape is evolving into more experiments”
  • “A standard is required to help businesses”
  • “I’d like to focus time on delivering business value”

The Hadoop ecosystem has always been a technology-focused one, and it’s clear this technology has been groundbreaking and impactful. However, I do think that, over time, this technology has evolved to solve the needs of technologists. Enterprises have largely been left without a voice, struggling to embrace it with confidence.

In my view, open source as a development model is not the problem. Rather, it’s the lack of feedback from end-users like US Bank into the process. ODPi would like to solve this problem and help end-users share their feedback.

If you are an end-user of Hadoop, we’d love to have you as part of our End User Advisory Board to discuss these issues and help us focus on making adopting these technologies less risky for you.

ODPi Interoperable at Strata NYC 2016

By Susan Malaika, IBM ODPi member

ODPi Interoperable Solutions: Tested against ODPi Compliant Distributions

On September 29, 2016 I shared a session with my excellent colleague and IBM Fellow Berni Schiefer at the Strata + Hadoop World Big Data conference. I was substituting for John Mertic from ODPi, a non-profit organization for the simplification and standardization of the big data ecosystem, as he was unable to participate. Strata occupied a large portion of the Jacob Javits Convention Center in NYC, with many thousands of attendees and a massive expo.

In the ODPi session we described the new concept of ODPi interoperable, announced on September 27, 2016, whereby big data solution and application providers can self-certify as ODPi interoperable if their tests run successfully against ODPi runtime compliant distributions. The benefit of ODPi interoperable is that when an application runs against one ODPi compliant distribution, it will run against all ODPi compliant distributions, thereby simplifying and reducing testing. A number of Hadoop applications were announced as ODPi interoperable, from DataTorrent, IBM, Pivotal, SAS, Syncsort, and WANdisco.

Berni talked about Big SQL, which is one of the ODPi interoperable applications from IBM. Other ODPi interoperable applications from IBM include Analytic Server (SPSS), Big Replicate, and IDR for Apache Hadoop.

The audience in the session, which included representatives from banks and hardware companies, asked questions about ODPi, including how projects are added to the ODPi Runtime Specification 2.0. There is a nice description from Alan Gates of Hortonworks on how Apache Hive was selected by ODPi members to be added to the ODPi Runtime Specification 2.0. The audience also asked about participating in ODPi and was interested in seeing a roadmap of candidate projects for the Runtime Specification beyond HDFS, YARN, MapReduce, Hive, and HCFS.

Call to Action: One of the ways that enterprises can participate in ODPi is to join the ODPi User Advisory Board, which will provide technical guidance and feedback on planned initiatives such as future projects to be incorporated into ODPi. Ways individuals and companies can engage include:

The following is a recording of Berni Schiefer talking about ODPi and Big SQL. It was a wonderful experience to present alongside Berni.

Adding Apache Hive to ODPi Runtime Specification 2.0

By Alan Gates, ODPi technical steering committee chair and Apache Software Foundation member, committer and PMC member for several projects

Today, ODPi announced that the ODPi Runtime Specification 2.0 will add Apache Hive and Hadoop Compatible File System (HCFS) support. These components join YARN, MapReduce, and HDFS from ODPi Runtime Specification 1.0.

With the addition of Apache Hive to the Runtime specification, I thought it would be a good time to share why we added Apache Hive and how we are strategically expanding the Runtime specification.

Why Hive?
ODPi adds projects to its specifications based on votes from ODPi’s diverse membership. We have a one member, one vote policy. In discussions regarding what projects to add to the next Runtime specification, many members indicated that they used Apache Hive, which is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Members indicated that by adding Apache Hive to the ODPi Runtime Specification 2.0, ODPi can reduce SQL query inconsistencies across Hadoop Platforms, which is one of the key pain points for ODPi members and Big Data Application vendors in general.

What is the process?
As with everything we do in ODPi, the addition of any project to the ODPi Runtime specification is done collaboratively, with participation from everyone who has interest. ODPi has established the Runtime Project Management Committee (PMC) to maintain the Runtime Specification.

In order to make sure all voices were heard and use cases considered, the Runtime PMC formed an Apache Hive working group. This group included Runtime PMC members, as well as other ODPi contributors who wanted to be involved. It included representatives from several distributors and application vendors, including: Hortonworks, SAS, IBM, Syncsort, and DataTorrent.

The working group came together over the course of a month, meeting regularly, to determine how to add Apache Hive to the spec.

What are we adding?
The working group decided early on to focus on SQL and API compatibility rather than matching a specific version of Apache Hive. We chose Hive 1.2 as our base version that distributions must be compatible with. This gives distribution providers freedom in what version of Hive they ship, while also guaranteeing compatibility for ISVs and end users.

What has to be compatible?
The working group focused on the interfaces that ISVs and the distributors’ customers use most frequently. We agreed that SQL, JDBC, and beeline (the command line tool that allows users to communicate with the JDBC server) are used by the great majority of Hive users, and so we included them in the spec. We also included the classic command line, the metastore thrift interface, and HCatalog as optional components; that is, a distribution may or may not include them, but if it does, they must be compatible. We chose to make these optional because they are frequently, but not universally, used.
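
To make the JDBC piece of that list concrete, here is a minimal sketch of a client talking to HiveServer2. The host name, database, table, query, and credentials are assumptions for illustration; the only real requirements are the Hive JDBC driver (hive-jdbc) on the classpath and the standard jdbc:hive2:// URL format that the compatibility guarantee covers.

// Minimal, hypothetical sketch of the JDBC interface: connect to HiveServer2
// and run a query. Host, database, and table names are invented.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Registers the driver explicitly; recent hive-jdbc versions also
        // auto-register through the JDBC ServiceLoader mechanism.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 conventionally listens on port 10000; "default" is the database.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

The same endpoint is what beeline talks to (for example, beeline -u jdbc:hive2://hiveserver.example.com:10000/default), which is why the working group treated SQL, JDBC, and beeline as one compatibility surface.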

Where can you see our work?
The Runtime PMC’s initial draft is open to the public, and everything is published on GitHub.

How Can You Be Involved?
We are still writing tests for distributions to check that they comply with the specification. We would love to have your help writing tests. You can also give feedback on the spec. Participation in ODPi is open to anyone, with all work being done in public on GitHub. Developers can join the conversation on the mailing lists or Slack channel.

My Experience at Global Big Data Summit: Discussing the Importance of Standards

I had a good day last week presenting to the audience at the Global Big Data Summit in Santa Clara. The tail end of the last day of any conference is a bit slow, but I was thrilled when many attendees came barreling in right as I was ready to start working through my slide deck, which made the case for the importance of standards, like ODPi, in driving future investment in Big Data and Apache Hadoop.

I had one critical question after the talk that I thoroughly enjoyed answering. A gentleman pushed back on my point that standards need to be the focus. In his experience, staff training and education were the biggest concerns, and it didn’t make sense to focus on standards until a critical mass of developers and practitioners were properly trained first. It was a fair argument – and one Gartner has identified as a key blocker to Apache Hadoop growth as well – but to me it treats the symptom more than the core issue, and I pushed back, saying that standards enable better education and enablement. My point made sense to him, but I walked away wanting to discuss this more in a blog post with better data points behind it. After all, we are in the data industry and should be data driven!

If there is one industry where standards are at the forefront, it’s education. Education standards are a very touchy subject (disclaimer here – I’m a parent of four school-aged children and good friends with several educators), and while I’ll attempt to steer clear of the execution debate in this article, the concept they are trying to drive makes perfect sense. Do the skills a first grader has in one state equate to those in another state? What are reasonable benchmarks for defining competency? Can trends in learning/teaching methods and outcomes be better correlated?

I came across an interview with a leader in educational standards entitled “How and Why Standards Can Improve Student Achievement: A Conversation with Robert J. Marzano”. The interviewee offered some interesting insights that drew parallels to the critical question I received at the talk. Here are a few quotes from the interview and their relation to Apache Hadoop standards:

“Standards hold the greatest hope for significantly improving student achievement. Every other policy mandate we’ve tried hasn’t done so. For example, right after A Nation at Risk (Washington, DC: U.S. Department of Education, 1983) was published, we tried to increase academic achievement by making graduation requirements more rigorous. That was the first wave of reform, but it didn’t have much of an effect.”

This makes a great point – creating a measuring stick for competency without some sort of standard to base education on hurts more than it helps.

The interviewer goes on to ask about what conditions are needed to implement standards.

“Cut the number of standards and the content within standards dramatically. If you look at all the national and state documents that McREL has organized on its Web site (www.mcrel.org), you’ll find approximately 130 across some 14 different subject areas. The knowledge and skills that these documents describe represent about 3,500 benchmarks. To cover all this content, you would have to change schooling from K–12 to K–22. Even if you look at a specific state document and start calculating how much time it would take to cover all the content it contains, there’s just not enough time to do it. So step one toward implementing standards is to cut the amount of content addressed within standards. By my reckoning, we would have to cut content by about two-thirds. The sheer number of standards is the biggest impediment to implementing standards.”

Lots and lots of content identified to learn across a diverse set of subject areas, with finite time to turn out individuals competent in the space – sound similar to the situation in the Apache Hadoop ecosystem?

The interviewer then follows up by asking how this can be done while knowledge continues to expand.

“It is a hard task, but not impossible. So far the people we’ve asked to articulate standards have been subject matter specialists. If I teach music and my life is devoted to that, of course I’m going to believe that all of what’s identified in the national documents is important. Subject matter experts were certainly the ones to answer the question, What’s important in your content area? To answer the question, What’s absolutely essential? you have to broaden that population dramatically to include all constituents—those with and without college degrees.”

This response aligns very well with the ODPi approach to creating Apache Hadoop standards. We aren’t in the business of creating full, end-to-end, comprehensive standards for what an Apache Hadoop platform should offer, or what an Apache Hadoop-native Big Data application should adhere to; instead we focus on what’s truly important to provide that base level – the essential pieces a platform should offer. And I particularly like the last point – expanding the scope of the conversation around standards to get diverse opinions and experiences – which is something ODPi is uniquely positioned to drive.

One last quote, which I think shapes the “Why?” on this effort.

“Whether we focus on standards or not, we’re entering an era of accountability that has been created by technology and the information explosion.”

The enterprise has the same expectations – it wants to lower the risks of Big Data investments, risks that are largely a byproduct of not having the staff to manage them. Fortune 500 executives need this in place to have any confidence in this technology, and the abysmal adoption rates show that confidence is currently lacking. In short, Apache Hadoop needs to be accountable for its enterprise growth.

ODPi Meetup Recap: “War Stories of Making Software Work with Hadoop”

Hadoop Summit is notorious for bringing together everyone who’s anyone in the Big Data world – and this year’s event, welcoming more than 4,000 attendees, was no different.

 

Not only was ODPi able to announce that five Apache™ Hadoop® distributions are officially ODPi Runtime Compliant, but we also hosted a meetup that centered on “War Stories of Making Software Work with Hadoop.”

 

Successfully migrating big data software to interoperate with one or more Apache™ Hadoop® releases requires unique engineering approaches and streamlined innovation. Our meetup discussed the importance and benefits of certifying compatibility between multiple Hadoop distributions. Those who have navigated this space for years without any true standardization shared their war stories.  

Attendees also heard from ODPi members hailing from big data software vendors and ISVs. The War Stories panel featured insights from Scott Gray, chief architect of IBM’s Open Platform for Apache Hadoop; Vineet Goel, principal product manager of Pivotal HDB & Hadoop at Pivotal; Paul Kent, VP of big data initiatives at SAS; and Smiti Sharma, principal engineer of big data and emerging technologies for EMC. These members have each ported their software to work with one or more Hadoop distributions.

They discussed technical challenges they overcame and why they believe ODPi will help simplify this for both end users and ISVs in the future.

After explaining to the room how their companies are committed to big data innovation and how their technologies aid end users, Gray, Goel, Kent, and Sharma covered cross-organizational compatibility within the Hadoop space.

 

John Mertic’s first question to the panel was, “Before the concept of what ODPi is meant to deliver, what were the chief challenges you were running into?” (found at the 28:50 mark).

Diving into this question – with answers that centered on their experiences and the difficulties of supporting multiple, disjointed distributions – the panelists made some insightful statements.

Gray of IBM set the stage for these pain points, noting, “Hadoop evolves at an incredible pace and there’s this never-ending tension between what the customers want… and distros [being] pressed to keep up with this evolution, and we have all these products trying to chase the distribution… It makes it incredibly, insanely expensive… It really was in our best interest to try to put a little sanity into the landscape.”

 

Goel applauded ODPi’s baseline specifications and explained Pivotal’s arduous journey of taking on a new distribution (around the 34:00 mark). Mertic commented: “I like how you said, ‘If we had the money back from supporting all these distros, imagine the innovation we could have…’ I think that’s a really powerful statement.”

The panel then kicked off an interactive Q&A with the engaged audience, during which an audience member asked for examples of the value proposition for end users of engaging with companies that are part of ODPi (starting after the 42:00 mark).

Sharma addressed this question, drawing on her experience in pre-sales: “You could benefit from being on an ODPi-compliant platform… if you want to have your application portable from a Hadoop as an OS, it’s possible through being part of ODPi.”

 

“In the early days of Hadoop, you really did have to grow your own in-house talent,” said Kent, “but we’re entering the mature part of the lifecycle curve where there’s lots of customers that just want to pick it up and use it. They don't really want to get into all these nuances. So the value of something like ODPi… will inevitably make a standardized path, where people can say ‘If you don't go out of these lines, you’re pretty safe.’”

Catch a full recording of our meetup, centered on how ODPi fits into the Hadoop and Big Data ecosystem, here – and don’t forget to subscribe to our YouTube channel!

Hadoop Summit San Jose 2016 Wrap-up

We’re Making Good on our Pledge to Open the Big Data Ecosystem

As part of the industry convergence on San Jose, ODPi members and Linux Foundation staffers used Hadoop Summit to share our common commitment to grow Apache Hadoop and Big Data through a set of Specifications.

.@vVineet @ScottCGrayIBM @hornpolish & @smiti_sharma sharing “War Stories: Making Software Work w/ Hadoop”

@ODPiOrg booth at Hadoop Summit – those rocket footballs were a hit!

@IBMBigData booth before the show opened – Can you find the ODPi Rocket?

@CaskData captured plenty of attention with their focus on Applications and Insights, not Infrastructure and Integration

@Altiscale ready for the rush of attendees looking for Big Data as a Service

It was terrific seeing ODPi members and sharing ideas at the conference. And the conference sessions couldn’t have been more on point. In the words of Ben Markham from ODPi member Xiilab:

I particularly loved the session about Apache Nifi and how to build a smart home, as this is related to Xiilab and also something I’d personally love to do. The sheer amount of data that needs to be processed in order to make an efficient smart home is amazing, and it speaks to why we’re all so passionate about this industry!

Before describing the significant milestone achieved at Hadoop Summit, first let me provide a short recap on ODPi’s progress to date.

ODPi published its first Runtime Specification in March to specify how HDFS, YARN, and MapReduce components should be installed and configured. The Runtime specification also provides a set of tests for validation to make it easier to create big data solutions and data-driven applications.

  • The purpose?
    Increases consistency for ISVs and End Users when building on top of, integrating with, and running Hadoop.

  • Why?
    Because consistency around things like how APIs are exposed and where .jar files are located reduces engineering effort on low-value activities like maintaining compatibility matrices, so that more effort can go into building the features that customers care about.
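
As a concrete, purely illustrative example of the low-value work this consistency removes, the sketch below shows the kind of preflight check an ISV installer might run today on each distribution it supports: print the Hadoop build reported by the client libraries and verify that the core HDFS, MapReduce, and YARN client classes are on the classpath. Nothing here is mandated by the ODPi specification; the class names are standard Hadoop classes, and the “expected” list is simply an assumption for illustration.

// Hypothetical preflight check an ISV might ship: report the Hadoop build and
// confirm the core client classes are reachable on the classpath.
import org.apache.hadoop.util.VersionInfo;

public class CompatPreflight {
    public static void main(String[] args) {
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from revision: " + VersionInfo.getRevision());

        String[] expected = {
            "org.apache.hadoop.fs.FileSystem",              // HDFS / HCFS client
            "org.apache.hadoop.mapreduce.Job",              // MapReduce client
            "org.apache.hadoop.yarn.conf.YarnConfiguration" // YARN client
        };
        for (String className : expected) {
            try {
                Class.forName(className);
                System.out.println("OK      " + className);
            } catch (ClassNotFoundException e) {
                System.out.println("MISSING " + className);
            }
        }
    }
}

With a common runtime baseline, checks and compatibility matrices like this shrink to a single target instead of one per distribution.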

That’s the promise and commitment ODPi and its members made to the industry when we published the Runtime Spec.  

At Hadoop Summit, ALL FIVE ODPi members that ship Apache Hadoop distributions announced that they achieved ODPi Runtime Compliance.

Cool – so how exactly does that Open the Big Data Ecosystem?

Two of the distros that achieved Runtime Compliance, Hortonworks and IBM BigInsights, collectively partner with several hundred of the biggest Big Data ISVs and IHVs.

Altiscale, a cloud Big Data as a Service company; Infosys, which supports many government clients around the world with its Hadoop distro and custom Big Data apps on top of it; and ArenaData, which is making a name for itself bringing Hadoop and Big Data to more Russian and Eastern European businesses, also achieved Runtime Compliance.

Thanks to ODPi, today ANY of the applications that run on Hortonworks or IBM BigInsights can, WITH SIGNIFICANTLY LESS UP-FRONT AND ONGOING engineering cost, support Altiscale, ArenaData, and Infosys.

Pivotal lit the way by describing on their blog how Pivotal HDB was installed on the ODPi reference implementation and on one of the ODPi Runtime Compliant distributions with no modifications to standard installation steps.

That’s called Opening the Big Data Ecosystem!

Now it’s your turn to show your support for an Open Big Data Ecosystem

Tweet why YOU think Hadoop and Big Data need standards.

Share a challenge you’ve faced, maybe an engineering effort that just took way longer than it should have, or a customer support ticket that by rights should have taken minutes but instead took hours.

Be sure to tag @odpiorg and include the hashtag #ODPi4Standards in your tweet and you’ll be entered to win one of TEN $25.00 Visa Gift cards. Read contest rules here.*

*Eligibility Criteria: 10 people, tweeting 7/14/2016 – 7/18/2016, with constructive #ODPi4Standards feedback + @ODPiOrg tag or RT will win a $25 Visa gift card.