All Posts By John Mertic

Project Frontier: Shaping the Next Generation Hadoop Build Framework of Apache Bigtop


By Evans Ye, Yahoo Taiwan

As a mature Apache top-level project, Apache Bigtop has now been around for 6 years, serving as a critical component for building the Hadoop distributions that run in production. From on-premises deployments, to big data solution vendors, to cloud providers, Bigtop has been widely leveraged across the big data world.

Yet today that world is growing even more complex. Bigtop started with only a handful of components (HBase, Hive, Pig, Oozie, etc.); the latest release now includes more than 30. To handle such complexity, developers need to make sure a patch won’t break components that are integrated together, and release engineers need to ensure features are fully functional. This is why we initiated Project Frontier, funded by ODPi.

Project Frontier focuses on extending and hardening the feature Bigtop was originally designed for: building Hadoop distributions. Bigtop can only produce high-quality distributions by working closely with upstream projects to solve integration problems across the Hadoop ecosystem.

Based on our observations of the existing Bigtop build framework, we set the following goals for Project Frontier:

  1.  Provide a one-stop seamlessly integrated build pipeline
  2.  Document examples as reference implementations
  3.  Create better documentation for iTest, the smoke tests, and the other test frameworks

These goals all serve one core mission of Project Frontier: make Bigtop extremely friendly to use. The industry needs a simplified integration test framework for Apache Bigtop, and a better way for Bigtop to work with other Hadoop ecosystem projects, with release and integration tests that ensure different versions of those projects work properly with one another.

For example, one of the scenarios we’d like to support is that a developer simply submits a commit SHA1 containing a newly developed feature, and the framework handles all the rest to produce an integration test report. That’s how simple it should be.
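To make that concrete, here is a purely hypothetical sketch of what such a submission could look like; the task and property names below are illustrative only and are not existing Bigtop interfaces:

$ # hypothetical example: the task and property names here are illustrative, not real Bigtop targets
$ ./gradlew integration-report -Pgit_sha1=1a2b3c4d
$ # the framework would then build the patched packages, provision a test cluster,
$ # run the integration and smoke tests, and emit a report for that SHA1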

Project Frontier Feature Preview

To tackle these ambitious goals, we will develop the features and functionality of Project Frontier in phases. The initial phase focuses on improvements to building components in Bigtop. Let’s preview a feature that will be available in the upcoming Bigtop 1.3 release. In Bigtop’s master branch, users can now run the following commands from the Bigtop repository to build a component.

Let’s say Hadoop:

$ git clone https://github.com/apache/bigtop.git

$ cd bigtop

$ ./gradlew hadoop-pkg-ind
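The same pattern should apply to other components. Assuming the per-component -pkg-ind naming shown above holds (an assumption on my part, not verified for every component), you could also do the following:

$ ./gradlew tasks            # list all available Gradle targets, including the per-component ones
$ ./gradlew hbase-pkg-ind    # build HBase packages the same way, inside the managed build environment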

That’s it. Bigtop takes care of the full build environment and dependencies for you. The advantages of this new feature are:

  1.  It abstracts the tedious work that requires direct user attention
  2.  Gradle targets can now be streamlined like this:

$ ./gradlew hadoop-pkg-ind docker-provisioner

This single invocation builds Hadoop and deploys it as a testing cluster.
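Once the test cluster is up, Bigtop’s Docker provisioner (under provisioner/docker in the repository) can be used to inspect it or tear it down. The flags below are from memory and may differ between releases, so treat this as a sketch rather than a reference:

$ cd provisioner/docker
$ ./docker-hadoop.sh --exec 1 hadoop fs -ls /    # run a command inside the first container of the cluster
$ ./docker-hadoop.sh --destroy                   # tear the test cluster down when you are done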

We’re still polishing the feature to support more customizations, for example building packages with Nexus server support. Many more features are under development, so share your input and get involved. The Bigtop community welcomes all kinds of contributions, from code to documentation, tests, and discussion. Learn more by visiting our page on GitHub. Join us now to shape the way we are building and integrating the big data ecosystem!

 

Evans Ye is a PMC member and former Chair of Apache Bigtop, and leads the Project Frontier initiative for ODPi. He works at Yahoo Taiwan developing e-commerce data solutions. He loves to code, automate things, and develop big data applications.

Managing Privacy in the GDPR-era


 

Now that the EU General Data Protection Regulation (GDPR) is in full effect, businesses large and small have made changes to be fully compliant, regardless of where they are located. The changes include more regulation around how companies collect data, how they store it, how they keep it safe from hackers, and how they use it in their day-to-day activities. Some people think of GDPR as ‘giving the power over data back to the user’. GDPR replaced data privacy laws that were set up in 1995 and had been obsolete for some time.

But what does this mean for the consumer?

According to this Marketing Week article, consumers don’t understand how brands use their data. In fact, 48% of consumers still don’t understand where and how organizations use their personal data. This is up from 31% when the research was last conducted two years ago.

Only 7% feel they have a good understanding of how companies use their data, 45% say they “somewhat understand,” and just 18% believe businesses treat people’s personal data in an honest and transparent way.

This is where ODPi comes in. ODPi’s Data Governance initiative aims to create an open data governance ecosystem through collaboration with data governance subject matter experts and data platform and tools vendors. On Thursday, July 12, ODPi is hosting a webinar focused on managing privacy.

Mandy Chessell, distinguished engineer and master inventor at IBM, will share best practices for how IBM manages data that keeps individuals’ privacy respected and is compliant with new regulations on data privacy such as the EU GDPR.

Attendees will learn:

  • The life cycle of a digital service as it is developed, sold, enhanced and used. This life cycle breaks the work into six stages. Each stage describes the roles and the activities involved to ensure data privacy.
  • The types of artifacts that need to be collected about a digital service and the methods used to develop it.
  • How these artifacts link together in an open metadata repository (data catalog).

Click to learn more or to register for the webinar.

The state of open source and big data – three years later


Originally posted on DataWorks Summit blog

ODPi turns 3 this year, having first been announced at the spring Strata+Hadoop World and brought under the auspices of the Linux Foundation later that year at the fall Strata+Hadoop World. Hadoop then turned 10 the following year, and was seemingly proclaimed dead, then alive, and then scrubbed from the world. One might think this meant the nail in the coffin for an organization centered on Hadoop standardization.

The Linux Foundation looks at open source projects in a life cycle, driven by the market needs. A common chart used to describe this is shown below.

In essence, open source foundations such as ODPi invest in developer communities, whose work enables accelerated delivery of new products to the marketplace and R&D cost savings for organizations. As this produces profits for those organizations, they push investment back into the projects and foundations that support the work. In present-day open source parlance, this practice is known as “managing your software supply chain.” An active cycle here is able to react and adapt to market demands, as well as take input from all stakeholders: developers, implementers, administrators, and end users.

So, as ODPi started to hit its stride in 2016, we talked with people across the data landscape. From these conversations, we quickly saw that enterprise production adoption numbers for big data technology were skewed, mostly because of the lack of a solid definition. To better baseline the discussion, we came up with a maturity model for how big data technologies are adopted in the enterprise.

Using this model showed that in 2017, nearly three-quarters of organizations were still not deploying big data enterprise-wide. What’s blocking this? Data governance: a broad and under-invested area, but one growing more critical by the day as new regulations come into play and breakdowns in managing data privacy mount.

ODPi’s belief is that tackling an issue as broad as data governance can only be done with all members of the data ecosystem participating: platform vendors, ISVs, end users, and data governance and privacy experts. This collaboration can only happen in a vendor-neutral space, which is why ODPi has launched a PMC to focus solely on this area.

During DataWorks Summit Berlin, there will be numerous sessions and meetups around this effort to help you learn more.

We will also be active in the community showcase, where you can chat directly with the experts in this area and learn how to participate in this effort.

Bringing it back to the original question: we are three years into this journey of creating sustainability in big data. We’ve had successes in reducing the number of disparate platforms and in bringing market awareness to the issues enterprises face in adopting these tools. Now the community is poised to take the lessons learned and build a strong community around governance to solidify this practice. Are the challenges different than they were three years ago? Absolutely. However, the goal of enterprise adoption remains the same, and with that, we see that big data is becoming more mature, more inclusive, and is building a more collaborative community.

ODPi Webinar on How BI and Data Science Gets Results


By John Mertic, Director of ODPi at The Linux Foundation

ODPi recently hosted a webinar on getting results from BI and Data Science with Cupid Chan, managing partner at 4C Decision, Moon soo Lee, CTO and co-founder of ZEPL and creator of Apache Zeppelin, and Frank McQuillan, director of product management at Pivotal.

During the webinar, we discussed the convergence of traditional BI and data science disciplines (machine learning, artificial intelligence, etc.), and why statistical and data science models can now run on Hadoop in a much more cost-effective manner than a few years ago.

The second part of the webinar focused on demos of Jupyter notebooks and Apache Zeppelin. These were important and relevant demos: data scientists use Jupyter notebooks the most, and Apache Zeppelin supports multiple technologies, languages, and environments, making it a great tool for BI.

The inspiration for the webinar was the new Data Science Notebook Guidelines. Created by the ODPi BI and Data Science SIG, the guidelines help bridge the gap so that BI tools can sit harmoniously on top of both Hadoop and RDBMS, while providing the same, or even more, business insight to BI users who also have Hadoop in the backend. Download Now »

Additionally, webinar listeners asked detailed questions, including:

  • How can one transition from being a bioinformatics developer to a data scientist in biostatistics?
  • Where do you see the future of both Jupyter and Zeppelin going? Are there other key data science challenges that need to be solved by these tools?
  • When do you choose to use one notebook over the other?
  • Can the two notebooks be used together? For example, can you create a Jupyter notebook and save it, then upload it into Zeppelin (or vice versa)?

Overall, the webinar was an insightful discussion on how we can achieve big data ecosystem integration in a collaborative way.

If you missed the webinar, Watch the Replay and Download the Slides.

Looking at the latest Gartner Magic Quadrant for Business Intelligence and Analytics Platforms


By John Mertic

I spent some time reviewing the latest Gartner Magic Quadrant for Business Intelligence and Analytics Platforms in preparation for my time at the Gartner Data and Analytics Summit last week. Overall, I’m really excited to see vendors scoring higher in ‘Ability to Execute’; Gartner judges this toughly, so the general shift upwards is great to see.

While the piece is clearly targeted toward buyers of these tools, I wanted to take a critical eye to the positioning of vendors in relation to their interoperability with big data and Hadoop tools. After all, it was a mere decade ago that all of data was covered by a single Gartner analyst. Enter the age of big data; with its variability, velocity, and volume has come a cornucopia of products, strategies, and opportunities for answering the data question.

In the same way, BI and analytics has moved from being purely the realm of “data at rest” to becoming cohesive with “data in motion.” It’s no surprise, then, to see two “pure play big data” BI vendors, Datameer and ZoomData, joining ClearStory, which entered the MQ last year, cementing the enterprise production need for valuable data insights. And with a tip of the hat to the new breed of open source trailblazers such as Hortonworks, these vendors heavily leverage Hadoop and Spark not just as another data source but as a tool to better process data, letting them focus on their core competency of delivering business insights.

However, what really struck me was the positioning of data governance as a whole in this report – let’s dig into that more.

Data governance and discovery is being pushed farther out

If you compare the 2016 report to the 2017 report, you’ll immediately notice that this line from 2016…

By 2018, smart, governed, Hadoop-based, search-based and visual-based data discovery will converge in a single form of next-generation data discovery that will include self-service data preparation and natural-language generation.

…became…

By 2020, smart, governed, Hadoop/Spark-, search- and visual-based data discovery capabilities will converge into a single set of next-generation data discovery capabilities as components of modern BI and analytics platforms.

A two-year slip in just one year is something of note; clearly there is a continuing gap in converging these technologies. This aligns with what our members and the end users in our UAB mention as well: the lack of a unified standard here is hurting adoption and investment.

Governance no longer considered a critical capability for a BI vendor

This really stood out to me in light of the point above: it sounds like Gartner believes that governance will need to happen at the data source rather than at the access point. It’s a clear message that better data management needs to happen in the data lake; we can’t secure at the endpoints for true enterprise production deployment. This again supports the need for driving standards in the data security and governance space.

I recently sat down with IBM Analytics’ WW Analytics Client Architect Neil Stokes on our ODPi Member Conversations podcast series, and data lakes were a very present topic in the discussion. To listen to this podcast, visit the ODPi YouTube channel.

I’m reminded of the H.L. Mencken quote, “For every complex problem there is an answer that is clear, simple, and wrong.” Data governance is hard, and it is never going to be something one vendor solves in a vacuum. That’s why I’m really excited to see the output of both our BI and Data Science SIG and our Data Security and Governance SIG in the coming month. Starting the conversation in the context of real-world usage, looking at both the challenges and the opportunities, is the key to building any successful product. Perhaps this work can be the catalyst for smarter investment and value-adds as these platforms continue to grow and become more mature.

Is Your Data Clean or Dirty?


Over the weekend I read an incredible post from SAS Big Data evangelist Tamara Dull. I love her down-to-earth, real-life perspectives on big data, and her analogy of cleaning the car hit home for me. She is spot on: clean data pays dividends in being able to get better insights.

But, what is clean data? What is that threshold that says your data is clean versus dirty?

Could data even be “too clean”?

(pause to hear gasps from my OCD readers)

Clean data and clean houses

Taking this to a real-life example, I can say first hand that there are often different definitions of what clean is. For example, my wife is very keen on keeping excess items off our kitchen counters, to the point where she’ll see something that doesn’t belong and put it in the first cabinet or drawer she encounters that has space for it. I, on the other hand, am big on finding what I believe is the right place for it. Both of us have the same goal in mind: get the counters clean.

To each of us, there’s value in our approach, and that value is efficiency. Hers is optimized at the front end, mine at the back end. However, the end result of each of our “cleaning” efforts can have negative impacts (with my approach, it’s my wife’s inability to find where I put something; with my wife’s method, it’s having items fall out of a cabinet onto me as I open it).

Is “clean” to one person the same as it is to everyone else?

The life lesson above teaches something critical about data: clean isn’t a cut-and-dried threshold. And taking a page from Tamara’s post, it’s also not a static definition.

The trap you can quickly fall into is thinking of data in the same terms you would use for structured data. While, yes, part of the challenge is to understand what the data is and its relationships, the more crucial challenge is how you intend to consume the data and then use it. This is a shift from the RDBMS thinking of focusing on normalization and structure first and usage second. With the big data ways of consuming and processing data (streaming, ML, AI, IoT) combined with velocity, variability, and volume, the use-case mindset is exactly where your focus should be.

A “use case first” approach is how we look at these technologies at ODPi. We look at questions like “Here is the data I have, and this is what I’m trying to find out – what is the right approach/tools/patterns to use?” and how they can be answered. We ensure all of our compliant platforms, interoperable apps, and specifications have the components needed to enable successful business outcomes. This gives companies the peace of mind that they are making a safe investment, and that switching tools doesn’t mean their clean data becomes less than optimal to leverage the way they want.

This parallels the discussion of cleaning our house: are we trying to clean up quickly because company is coming over, or are we trying to go through an entire room and organize it? Approaching data cleaning involves the same thought process.