ODPi Community Lounge @ Apache Big Data Europe

By | Blog | No Comments

Join the Discussion at the ODPi Community Lounge

Once again ODPi is sponsoring the Community Lounge at Apache Big Data Europe, November 14-16 in Seville, Spain. Apache project members and speakers are welcome to hold their meetings and after-session discussions there. This is a great way to have a deeper, more intimate conversation with fellow attendees, and to introduce potential new collaborators to your project.

Please choose a time on the Community Lounge Schedule for your topic or project. We’ll help promote your upcoming meeting. Be sure to tell your followers as well. Time slots are 30 minutes each and are scheduled on a first-come, first-served basis.

ODPi Community Lounge – ApacheCon EU 2016

Discussion Schedule

Monday, November 14

Time | Speaker or Project Name | Topic
12:30 | Apache Giraph – Roman Shaposhnik | Discussion session: Practical Graph Processing with Apache Giraph
13:30 – 15:30 | Lunch
15:30 | Apache MADlib – Roman Shaposhnik | Distributed In-Database Machine Learning with Apache MADlib (incubating) – Roman Shaposhnik, Pivotal
16:00 | Apache Geode – Greg Chase | Meet Apache Geode – graduated from the Apache Incubator

Tuesday, November 15

Time | Speaker or Project Name | Topic
13:30 – 15:30 | Lunch
15:30 | Apache Bigtop & Greenplum Database – Greg Chase & Roman Shaposhnik | Discussion: Massively Parallel Data Warehousing in the Hadoop Stack

Wednesday, November 16

Time | Speaker or Project Name | Topic
10:30 | John Mertic, Director, ODPi and Open Mainframe Project, Linux Foundation | Discussion: Keynote: Lessons from the Trenches: How Apache Hadoop is Being Used & The Challenges Its Users Face
11:00 | ODPi – John Mertic | Discussion: Standardizing data governance across Hadoop distributions
11:30 | ODPi – Roman Shaposhnik and John Mertic | Discussion: Security in Hadoop
12:00 | ODPi – Roman Shaposhnik and John Mertic | Discussion: Streaming data in Hadoop
12:30 | ODPi Discussion – Roman Shaposhnik | Discussion: Hadoop Compatible File Systems across Hadoop Distributions
13:00 | ODPi – Alan Gates | Discussion: Standardizing Hive in Hadoop distributions
End of conference

Is Your Data Clean or Dirty?

Over the weekend I read an incredible post from SAS Big Data evangelist Tamara Dull. I love her down-to-earth, real-life perspectives on Big Data, and her analogy of cleaning the car hit home for me. She is spot on – clean data pays dividends in being able to get better insights.

But, what is clean data? What is that threshold that says your data is clean versus dirty?

Could data even be “too clean”?

(pause to hear gasps from my OCD readers)

Clean data and clean houses

Taking this to a real-life example, I can say firsthand that there are often different definitions of what clean is. For example, my wife is very keen on keeping excess items off our kitchen counters, to the point where she’ll see something that doesn’t belong and put it in the first cabinet or drawer she encounters that has space for it. I, on the other hand, am big on finding what I believe is the right place for it. Both of us have the same goal in mind – get the counters clean.

To each of us, there’s value in our approaches – which is efficiency. Hers is optimized at the front end, mine at the back end. However, the end result of each of our “cleaning” could have negative impacts (with my approach, it’s my wife’s inability to find where I put something – with my wife’s method, it’s having items fall out of a cabinet on me as I open it).

Is “clean” to one person the same as “clean” to everyone?

The life lesson above teaches something critical about data – clean isn’t a cut-and-dried threshold. And taking a page from Tamara’s post, it’s also not a static definition.

The trap you can quickly fall into is thinking about all data in the same terms you would use for structured data. While yes, part of the challenge is to understand what the data is and its relationships, the more crucial challenge is how you intend to consume the data and then use it. This is a shift from the RDBMS thinking of focusing on normalization and structure first and usage second. With the Big Data-esque ways of consuming and processing data (streaming, ML, AI, IoT) combined with velocity, variability, and volume, the use-case mindset is exactly where your focus should be.

A “use case first” approach is how we look at these technologies at ODPi. We look at questions like “Here is the data I have, and this is what I’m trying to find out – what is the right approach/tools/patterns to use?” and how they can be answered. We ensure all of our compliant platforms, interoperable apps, and specifications have the components needed to enable successful business outcomes. This gives companies the peace of mind that they are making a safe investment, and that switching tools doesn’t mean their clean data becomes harder to leverage the way they want.

This parallels the discussion of cleaning in our house – are we trying to clean up quickly because company is coming over, or are we trying to go through an entire room and organize it? Approaching data cleaning involves the same thought process.

ESG Whitepaper: ODPi Simplifies Apache Hadoop Application Development and Portability

Over the last decade, Apache Hadoop has generated many popular open source software projects, spawned a number of rapid growth startups with commercial distributions and complementary products, and has been a reliable distributed data platform for analytics. As Apache Hadoop adoption continues to grow, the larger Hadoop ecosystem is expanding, too. However, some debate remains about the future direction of the technology.

In this paper, ESG Senior Analyst Nik Rouda discusses Apache Hadoop support from businesses, governments, academia, and technology vendors, and how this large and diverse community differs in its specific goals and objectives for harnessing this technology.

Rouda dives into how ODPi is helping to bring maturity and choice to the Hadoop ecosystem in several ways, offering:

  • More confidence that Hadoop will remain a safe data platform choice for companies.
  • Simplified application and compatibility testing for third-party software developers.
  • Vendor-neutral coordination of efforts between vendors to build synergies across their offerings.

Download this free report to learn more about simplifying Apache Hadoop application development and portability.

Join author Nik Rouda, ESG Senior Analyst, and ODPi Director John Mertic for a complimentary webinar on Monday, November 7 from 12-1 PM Eastern. All registrants will get a free copy of this valuable white paper.

Is Open Source Big Data a broken promise?

An article caught my eye this past week, in which Robert Hof of SiliconAngle asserted that the challenges of Apache Hadoop adoption are a byproduct of the open source development approach. Hof argues that the various pieces do not integrate well together and that some projects are not living up to their promises, which has resulted in additional work for organizations before they see the technology’s true value. This has led to a small pool of available talent and to end-customers who are uncertain about where to direct their investments.

On the heels of this article, I watched the below video from Rakesh Kant of US Bank that I found just as insightful.

His sentiment rings loud and clear:

  • “I’m not seeing any signal, only noise.”
  • “The landscape is evolving into more experiments”
  • “A standard is required to help businesses”
  • “I’d like to focus time on delivering business value”

The Hadoop ecosystem has always been a technology-focused one, and it’s clear this technology has been groundbreaking and impactful. However, I do think that, over time, this technology has evolved to solve the needs of technologists. Enterprises have largely been left without a voice and left to struggle to embrace it with confidence.

In my view, open source as a development model is not the problem. Rather, it’s the lack of feedback from end-users like US Bank into the process. ODPi would like to solve this problem and help end-users share their feedback.

If you are an end-user of Hadoop, we’d love to have you as part of our End User Advisory Board to discuss these issues and help us focus on making adopting these technologies less risky for you.

ODPi Interoperable at Strata NYC 2016

By Susan Malaika, IBM ODPi member

ODPi Interoperable Solutions: Tested against ODPi Compliant Distributions

On September 29, 2016 I shared a session with my excellent colleague and IBM Fellow Berni Schiefer at the Strata and Hadoop World Big Data Conference. I was substituting for John Mertic from the ODPi, a non-profit organization for the simplification & standardization of the big data ecosystem, as he was unable to participate. Strata occupied a large portion of the Jacob K. Javits Convention Center in NYC, with many thousands of attendees and a massive expo.

In the ODPi session we described the new concept of ODPi interoperable, announced on September 27, 2016, where big data solution and application providers can self-certify to be ODPi interoperable if they run their tests successfully against ODPi run-time compliant distributions. The benefit of ODPi interoperable is that when an application runs against one ODPi compliant distribution, it will run against all ODPi compliant distributions, thereby simplifying and reducing testing. A number of Hadoop applications were announced as being ODPi interoperable, from DataTorrent, IBM, Pivotal, SAS, Syncsort, and WANdisco.

Berni talked about Big SQL, which is one of the ODPi interoperable applications from IBM. Other ODPi interoperable applications from IBM include Analytic Server (SPSS), Big Replicate, and IDR for Apache Hadoop.

The audience in the session, which included representatives from banks and hardware companies, asked questions about the ODPi, including how projects are added to the ODPi Runtime Specification 2.0. There is a nice description from Alan Gates of Hortonworks on how Apache Hive was selected by ODPi members to be added to the ODPi Runtime Specification 2.0. The audience also asked about participating in the ODPi, and was interested in seeing a roadmap for candidate projects for the Runtime Specification beyond HDFS, YARN, MapReduce, Hive & HCFS.

Call to Action: One of the ways that enterprises can participate in the ODPi is to join the ODPi User Advisory Board, which will provide technical guidance and feedback on planned initiatives such as future projects to be incorporated into the ODPi. Ways individuals and companies can engage include:

The following is a recording of Berni Schiefer talking about ODPi and Big SQL. It was a wonderful experience to present alongside Berni.
