Today is an exciting day for me. After months of hard work, IBM, the University of Pennsylvania, and the Linux Foundation are announcing an innovative, first-of-a-kind open source project that will enable universities around the world to build Data Science programs faster.
With IBM’s investment and industry expertise, the University of Pennsylvania’s long-standing academic leadership, and the Linux Foundation as a premier open source consortium, we are creating a curriculum kit composed of open source building blocks for teaching the core concepts of data science in undergraduate and graduate programs. These building blocks are based on Python and open source tools and frameworks, and include slides, documentation, code, and data sets that can be adopted or updated by anyone.
This idea of open source Data Science education is personal to me. Access to education changed my life. Coming from a small town in Colombia, South America, education gave me the opportunity to work with cutting-edge Data Science and AI technologies at one of the best companies in the world (IBM). I believe this project will provide a foundation of building blocks for schools to supplement, strengthen and start up their data science programs. And most importantly, because this is open source, any institution on earth can use it, giving more learners the opportunity to participate in the AI Economy like I did.
When I first started this project, I met with universities in different regions of the world and a common theme emerged: starting a Data Science program from scratch is incredibly difficult, and universities need educational materials to accelerate their efforts. This was encouraging, and it also validated the need: there is worldwide demand, and this concept of open source education could reach across oceans as well as to our local community colleges.
By making a “starter set” of training materials available and providing guidance on how to build a Data Science program, IBM and cross-industry partners and educators working together can help accelerate the availability of skills building programs around the world.
It is the beginning of a new era for Data Science Education.
The project is in incubation currently as IBM and UPenn create the initial set of materials to contribute. The project will officially launch in early 2020. To get early insights and stay up to date with this project please register here.
Do you like understanding a new technology hands-on yet also want to understand the concepts? Concerned it will take too long to get started?
Wait no longer! You can now experiment with Egeria by making use of our new Jupyter notebooks installed via Docker. Within minutes (plus download time) you’ll be happily running REST API calls against a live Egeria environment, and gaining an understanding of Egeria’s concepts.
In this first blog post I’ll take you through getting set up with a lab environment and running your first notebook.
Before we get started on setting up Egeria, you’ll need access to a few things:
docker – the environment in which to run Egeria
git – the source code control tool to get files needed
Setting up docker
Docker makes it easy to run pre-created environments in ‘containers’ which are isolated from the host machine such as your laptop. The instructions here were tested with ‘Docker for Mac’, but you can also use ‘Docker for Windows’, or docker installed on linux.
Note: The containers are linux containers built for Intel 64 bit architecture, so they won’t work on ARM, nor will they work in Windows containers …
Once you’ve installed docker, make sure it’s running as covered in the docs above. If using windows or mac, you should see a docker icon (a whale) on the toolbar.
Setting up git
git is the tool we use to manage our code. If you don’t have it installed, install it from the git website (easiest), or else from your linux distribution or homebrew. No special configuration is needed.
Retrieving the Egeria code
You’re now ready to retrieve the Egeria code. Whilst we only need a few files for the docker work this will be useful for further exercises and following along with other blog posts.
Open up a command window (mac, windows or linux), switch to a suitable directory and type:
git clone https://github.com/odpi/egeria
This will pull down the egeria code locally to your machine.
Running the notebooks
We’re now ready to run the notebook. To do this we will use a feature of docker called ‘docker-compose’. This is a simple approach to running multiple containers (think of these as applications or services) together.
For this example we are running
multiple Egeria servers (which we call a platform)
To get started with the docker compose environment (all one line – and replace / with \ for Windows):
docker-compose -f egeria-tutorial.yaml up
At this point you’ll notice a lot of activity. Once it has settled down, open a web browser and go to http://localhost:18888. You should see a Jupyter notebook environment open, with a list of our current labs shown in the left-hand folder tree.
If you don’t see the UI appear, press CTRL-C, and retry the docker compose command. Sometimes a slower network download can cause things not to start properly first time.
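If you would rather script this check than keep refreshing the browser, a minimal sketch along the following lines polls the notebook address given above. The function name `jupyter_is_up` and the timeout value are assumptions made for illustration; they are not part of Egeria or of the tutorial files.

```python
import urllib.error
import urllib.request


def jupyter_is_up(url: str = "http://localhost:18888", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` without a server error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx code (for example 403).
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Nothing is listening yet, or the connection was refused or timed out.
        return False
```

Calling `jupyter_is_up()` in a loop with a short sleep gives a simple "wait until ready" step before opening the labs.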
Running the notebooks
In the Jupyter UI navigate to ‘administration’ and open up the `read-me-first` notebook. This introduces you to how to set up an Egeria environment in a fictional company, ‘Coco Pharmaceuticals’.
The large blue bar is effectively a cursor. It shows where you are in the notebook. Read each paragraph in turn and then hit the ‘play’ button to progress through the notebook. You can also press SHIFT-ENTER to run the current step and move to the next one. As well as text, some paragraphs contain code, which is executed live against a real Egeria server in your docker environment.
Once you’ve worked through this notebook try ‘managing-servers’ which goes into more specifics of how to start and stop servers. Other tutorials get into topics such as accessing assets.
Shutting down the environment
docker-compose -f egeria-tutorial.yaml down
Updating the environment
Each time the environment is started the same code will be run, since the container is downloaded the first time it’s used.
In order to refresh the containers and run the latest code (recommended) run:
Many organizations today wish to become Data Driven. This means that data is easy for decision makers and applications to locate and use. This sounds like it should be simple, but it takes a good understanding of your data, backed by an effective data management program, to ensure accurate and reliable insights.
Throughout your organization’s IT systems, data is copied, enriched, manipulated and duplicated. This causes complexity and exponential growth of the data. The end result is:
uncontrolled duplication of data
confusion over data heritage and lineage
misunderstanding and confusion when different teams derive conflicting results
inability to deal with legislation, such as GDPR
inconsistent access, security issues and governance over multiple data sets
Without enterprise metadata management, data management becomes a challenge and the business will struggle to become data driven.
The issue with metadata is that it is ‘siloed’ in each application or data store. To overcome this the metadata needs to be stitched together. Some metadata management tools provide this capability in part. Unfortunately, no one tool covers the entire enterprise’s data landscape.
ODPi Egeria is an Open Source metadata management capability that provides a new Open Standard for metadata exchange and consumption. The Egeria platform uses a peer-to-peer protocol that connects disparate metadata repositories, enabling interoperability and governance across the entire metadata landscape!
Here are my top Seven Savvy Skills that we are continuing to deliver via the ODPi Egeria project, that enable an organization to become data driven:
1. Ultimate Source and Destination for data assets, or Full Lineage
This is the ability to stitch together the possible journeys of data as it flows from inception to all end points. This requires capturing each step of the data’s journey and identifying where the data has been manipulated.
With Egeria’s ability to integrate the different tools involved in the data journey, it is possible to follow what happened to the data. This makes it a simple task to discover the possible sources and destinations of a data value, or to perform impact analysis when changes occur in the IT landscape. Imagine how easily you could identify all locations that need to be addressed when dealing with GDPR, if you only had an enterprise-wide map of all your data assets.
There are many Data Lineage patterns we consider in Egeria. The popular ones are Design, Operational, Vertical, Horizontal, Historical and Glossary Lineage. I will be covering these patterns in a future blog.
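To make the idea of "ultimate source and destination" concrete, here is a small sketch that treats lineage as a directed graph and walks it in both directions. It is only an illustration of the concept: the asset names, the edge list and the function names are all hypothetical, and none of this is Egeria's API.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "data flows from the first asset to the second".
LINEAGE_EDGES = [
    ("crm_db.customers", "etl.clean_customers"),
    ("web_forms.signups", "etl.clean_customers"),
    ("etl.clean_customers", "warehouse.dim_customer"),
    ("warehouse.dim_customer", "reports.churn_dashboard"),
    ("warehouse.dim_customer", "ml.churn_features"),
]


def reachable(start, edges, downstream=True):
    """Breadth-first walk over lineage edges; returns every asset reachable from `start`."""
    neighbours = {}
    for src, dst in edges:
        key, value = (src, dst) if downstream else (dst, src)
        neighbours.setdefault(key, []).append(value)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def ultimate_sources(asset, edges):
    """Upstream assets that have no inputs of their own."""
    has_input = {dst for _, dst in edges}
    return {a for a in reachable(asset, edges, downstream=False) if a not in has_input}


def ultimate_destinations(asset, edges):
    """Downstream assets that feed nothing further."""
    has_output = {src for src, _ in edges}
    return {a for a in reachable(asset, edges, downstream=True) if a not in has_output}
```

Impact analysis is the downstream walk: `reachable("crm_db.customers", LINEAGE_EDGES)` lists every asset that could be affected if that table changes.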
2. Provenance of Metadata
Just as it is key in the art world to understand the provenance of a painting to determine if it is fake or worth a fortune, we also need to understand the origin of metadata as it is gathered from different tools. Understanding the provenance of your metadata verifies that it has been captured and managed by an authoritative source.
3. Time Travel Through the Data Landscape
Imagine being able to go back to a point in time and verify how the data was processed, or being able to identify if and when something changed and altered the asset. Whether this involves looking at the metadata as it was last week or a year ago, the historical capture of lineage by Egeria makes it possible to follow the data journey and review all touch points.
Egeria typically uses a dedicated graph repository to hold the lineage history. This is a fully searchable repository and provides a change audit log for all changes in data flows.
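The "as it was last week or a year ago" question can be pictured with a tiny versioned store. The sketch below is a simplified, in-memory illustration and not how Egeria's graph repository works; the class `MetadataHistory` and its methods are names invented for this example.

```python
import bisect
from datetime import date


class MetadataHistory:
    """Keeps every version of an asset's metadata with the date it took effect."""

    def __init__(self):
        self._dates = []     # effective dates, kept sorted
        self._versions = []  # metadata snapshot for each date

    def record(self, effective, metadata):
        index = bisect.bisect_right(self._dates, effective)
        self._dates.insert(index, effective)
        self._versions.insert(index, dict(metadata))

    def as_of(self, when):
        """Return the metadata in force on `when`, or None if nothing was recorded yet."""
        index = bisect.bisect_right(self._dates, when)
        return self._versions[index - 1] if index else None

    def changes_between(self, start, end):
        """List (date, metadata) pairs recorded in the closed interval [start, end]."""
        lo = bisect.bisect_left(self._dates, start)
        hi = bisect.bisect_right(self._dates, end)
        return list(zip(self._dates[lo:hi], self._versions[lo:hi]))


history = MetadataHistory()
history.record(date(2019, 1, 10), {"columns": ["id", "name"]})
history.record(date(2019, 6, 1), {"columns": ["id", "name", "email"]})
```

Here `history.as_of(date(2019, 3, 1))` returns the two-column version, while `changes_between` plays the role of a very small change audit log.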
4. Data Awareness and Notifications
With a real-time distributed metadata management system you gain a new awareness of all data assets in the organisation. Each time a change is made to a data asset, such as when a new column is added to a table or a new transform is applied, the event is logged and distributed in real time so subscribers (typically other tools) are immediately alerted. No matter which tools are used, data consumers will be able to follow the evolution of the data landscape.
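The pattern described here is publish and subscribe. As a rough sketch only, the following in-process version shows the shape of it; a real deployment would use a distributed event bus rather than Python callbacks, and the class name, event type and payload fields are assumptions made for this example, not Egeria event definitions.

```python
from collections import defaultdict


class MetadataEventBus:
    """Toy, single-process stand-in for a distributed metadata event topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        """Register a callable to be invoked for every event of `event_type`."""
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        """Deliver the event to every subscriber and return how many were notified."""
        callbacks = self._subscribers.get(event_type, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


bus = MetadataEventBus()
received = []
bus.subscribe("column_added", received.append)
notified = bus.publish("column_added", {"table": "customers", "column": "email"})
```

After the `publish` call, `received` holds the event, which is the point at which a subscribing tool would refresh its own view of the table.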
5. Governance, Access and Security
Data Stores and Applications each have their own governance, access and security, which is fine when they operate in isolation but when they are part of a wider data ecosystem then consistency is required.
There are many patterns for creating governance, access and security in Egeria. These are based on the type of data, its location, its origin or the purpose for which it is being used. The policies are driven by Egeria’s metadata so they are always in sync with the data that exists in the data landscape. I will cover this in a future blog about “The Data Lake and Asset Access Maturity Model”.
Audit logging of access requests is also key. It is all about knowing who accessed a data asset and the breadth of data access requested by an individual. These are two useful insights to have when investigating data leaks or fraud.
6. Understanding for All, through Business Glossaries
There are many naming conventions for data assets, and the names often look nonsensical. Using a Business Glossary enables a ‘Business Definition’ such as “customer name” to be created and linked to multiple data items to show that they contain the customer name. Having a Glossary for metadata makes the data simple to understand from both the technical and the business perspective.
Once a Glossary is in place then classifications can be created, such as “PII” or “Company Sensitive” and associated with glossary terms to show data of that type has the attached classification. This is another powerful capability if you are looking to add additional security based on Egeria’s metadata.
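One way to picture the link between glossary terms, data items and classifications is as a small lookup structure. This is a sketch under stated assumptions: the terms, column names and the two helper functions are invented for illustration and do not reflect Egeria's type model.

```python
# Hypothetical glossary: each term carries classifications and the data items linked to it.
GLOSSARY = {
    "customer name": {
        "classifications": {"PII"},
        "linked_items": {"crm_db.customers.CUST_NM", "billing.accounts.cname"},
    },
    "quarterly forecast": {
        "classifications": {"Company Sensitive"},
        "linked_items": {"finance.fcst_q.amt"},
    },
    "product code": {
        "classifications": set(),
        "linked_items": {"catalog.items.prod_cd"},
    },
}


def items_with_classification(glossary, classification):
    """Every data item that inherits `classification` through a linked glossary term."""
    return sorted(
        item
        for term in glossary.values()
        if classification in term["classifications"]
        for item in term["linked_items"]
    )


def terms_for_item(glossary, item):
    """Business terms that explain a cryptically named data item."""
    return sorted(name for name, term in glossary.items() if item in term["linked_items"])
```

A security tool could call `items_with_classification(GLOSSARY, "PII")` to find the columns that need masking, without anyone tagging each column by hand.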
A single piece of data can be used in many contexts by different groups and individuals. Consumers and creators of data know and understand the data they work with in a unique way. With the Egeria enabled metadata interchange, the organization’s use of data is collected and any feedback from the data consumers is passed to the data owners to collectively improve the quality and understanding for that data!
Egeria provides the mechanism to govern and understand an organization’s complete data landscape by linking the metadata distributed across many tools. For me, it is the metadata equivalent of stitching together all the isolated islands, continents, oceans and seas to make an interactive map of the world. With the map in place, you have complete understanding of the landscape and can build out new capabilities with confidence.
Next week I will be blogging about “User Profiles and Personas – It’s a granularity thing”.
Working with data can enable startling insights or cause confusion within an organization depending on its accuracy, availability or quality. Being able to harness the power of data empowers organizations to operate from a knowledgeable standpoint, putting them ahead of the game. However, when data is not accurate or governed it becomes an inhibitor to success.
Today’s organizations have many repositories, applications, and tools that create islands of data and capability. Data is frequently manipulated as it moves between these islands, causing each data island to have a slightly different view. Such inconsistencies can lead to low use of data because of concerns over its accuracy and heritage.
Without accurate, honed, complete and governed data, an organisation is working blind or, worse, from a basis of untruth!
Introducing ODPi Egeria…
Egeria is an ODPi project (ODPi is part of The Linux Foundation) that is enabling an open standard for metadata management across all technologies that touch data. This benefits organizations by enabling them to manage their data’s heritage, lineage, provenance, semantics, relationships, evolution, change and so on.
Many technologies in the data landscape have their own proprietary metadata store containing its own island of metadata. Egeria enables these different tools to exchange metadata, supplementing their metadata content and increasing its consistency. This is done using Egeria’s distributed real-time metadata highway. The highway connects selected metadata repositories, tools and applications together, providing a bi-directional unified view of the metadata across the data landscape.
In the situation where a tool’s metadata repository is advanced, Egeria utilizes those capabilities directly. Conversely, when the metadata repository is less capable, Egeria can fill in the gaps. This enables comprehensive management and visibility for all metadata in the data landscape.
Egeria’s “Savvy Skills”
Egeria is an active Open Source project which continues to evolve new capabilities at pace. Here are my favorite “Savvy Skills” in Egeria, most of which are available today:
Ultimate Source and Destination for data assets, or Full Lineage
Provenance of Metadata
Time Travel Through our Data Landscape
Data Awareness, Notifications, Metadata Discovery
Governance, Access and Security
Understanding for All, through Business Glossaries
I will be detailing all these in my next blog – “My favorite Seven ‘Savvy Skills’ of Egeria”.
Why Open Source?
Egeria seeks to create an open implementation for organizations, data tool vendors and open source projects to utilize for their metadata management and governance. This implementation will enable data repositories, tools and applications to keep their metadata synchronized – which in turn promotes consistency and visibility of valuable data across the whole data landscape within an organization. An open source implementation provides access for all and most importantly breaks down the barriers created by competitive pressures between vendors.
For a single organization, taking on the challenge of integrating all the silos of metadata itself is a colossal project. In a community of organisations and individuals collaborating together, each can choose the areas that matter most to them to work on. With focus, such a community is able to contribute a wide set of capabilities back to the project, which benefits all. If you or your organisation would like to become an ODPi Egeria contributor, please visit the ODPi Egeria website for more information. Of course you can always just use ODPi Egeria; contributing to the project is completely optional.
Egeria is breaking new ground, and it is able to continue doing so because the wider community provides a continuous stream of new ideas and skills. Only an open source project enables such a diverse collection of people to work together to create such a comprehensive metadata solution.
The Latin goddess Egeria is considered a water nymph who created laws that pertain to rituals and religious practices. With ODPi Egeria we see her as managing and governing the metadata of an organization’s data lakes and beyond. Egeria has the ability to govern all data assets as they ebb and flow around the islands of data stores and applications in an organization.
Watch out for my future blogs as I delve deeper into ODPi Egeria.
I recently took my kids to Hersheypark in Pennsylvania. In case you haven’t heard of it, it’s just a normal amusement park with rides and long lines. As we were waiting in line, my son asked, “Dad, what are you doing at work?”
I said, “I help my clients to define KPIs, and then try to apply Naive Bayes to predict the outcome. If the result is not good, we may need to build a neural network, and test it again.”
Do you really think that’s the answer I gave my son?
OF COURSE NOT!
Not because what I said is wrong, but he is simply not the right audience for that type of response. More importantly, I don’t want him to think “My dad is crazy and I’d better not ask him anything again.” So, I need to come up with an answer in a language that he can understand.
“If a computer can do work but no one knows whether it’s you doing the work or the computer, that’s AI.” – a basic principle of AI proposed by Alan Turing.
“Great! I can then use AI to do my homework and my teacher would not know that it’s not me doing that!”
“Hmm… Do you remember how you taught your younger sister the difference between a pen and an apple? You hold up a pen in front of her so she can see it and say, ‘pen.’ And you hold up an apple so she can see it and say, ‘apple.’ And you repeat this. Sooner or later, you expect her to understand the long pointy thing is a pen. And the red, round thing is an apple.”
Long, pointed, round, red. These are Features in Machine Learning. And “Pen” or “Apple” are Labels. Teaching with examples that pair Features with Labels is Supervised Learning. This is one way a computer can learn that different Features are associated with different Labels.
“Dad, I remember I saw a guy teaching people this on YouTube, too!”
Well, the song is funny but it is not related to Supervised Learning. But if it helps plant the concept of Supervised Learning in a child’s mind, why not let it be?
In the real world, Supervised Learning can help in many different ways. One of them is distinguishing a cancer cell from a normal cell. In this case, the computer is the “child” and the doctor is the “parent.” By showing examples repeatedly, the doctor trains the computer to recognize the patterns that separate a normal cell from a cancer cell.
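For readers who like to see it in code, here is a minimal sketch of the pen-and-apple lesson. It uses a 1-nearest-neighbour rule, which is just one simple Supervised Learning technique chosen for illustration; the feature scores (how long, how round, how red, each between 0 and 1) are made-up numbers.

```python
import math

# Each training example pairs Features (elongation, roundness, redness) with a Label.
TRAINING_DATA = [
    ((0.90, 0.10, 0.10), "pen"),
    ((0.80, 0.20, 0.00), "pen"),
    ((0.95, 0.05, 0.30), "pen"),
    ((0.10, 0.90, 0.90), "apple"),
    ((0.20, 0.80, 0.80), "apple"),
    ((0.15, 0.95, 0.70), "apple"),
]


def predict(features, training_data=TRAINING_DATA):
    """1-nearest-neighbour: return the Label of the closest training example."""
    _, label = min(training_data, key=lambda example: math.dist(example[0], features))
    return label
```

Showing the "child" a new long, thin, slightly red object with `predict((0.85, 0.10, 0.20))` gives back the label of the most similar example it has already seen.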
You may have heard about the Law of Entropy, or the Second Law of Thermodynamics. Put simply, unless you put in energy to maintain the current state, things just become messier over time.
You can apply the very same law to a kid’s playground. Unless you really put in effort to keep toys tidy, the toys will not automatically go back to their original positions. At my home, my mother-in-law helps out the kids to keep the play areas organized. Once, when she went to Hong Kong for a vacation, the play areas became more disorganized day after day. Finally, my wife had to step in and demand that the kids clean up before grandmother returned. She did not give exact instructions. She just demanded they clean up!
Guess what happened in the next few hours? The kids put all the boxy things with four wheels in one area, and we called it “Cars.” All the fluffy stuff was put together in another area, and we called it “Stuffed Animals.” And then they put all the blocks that can be stacked together into some boxes and named them “Legos.”
They did not get any specific instructions or rules to decide what should go where. But somehow they figured out the similarities and differences. In Machine Learning, this is called Unsupervised Learning.
This is when the computer is given a lot of data points and figures out the pattern by itself. In the real world, Unsupervised Learning can be used in customer segmentation. There is a lot of information and data about a lot of customers. You don’t tell the computer who should be grouped with whom; this is figured out by Unsupervised Learning. Traditionally, this was done by an expert who observed attributes such as age, spending pattern, location and salary, and then tried to group similar customers together. Now the machine can play the role of the expert, scanning through millions of records in a few seconds, which is impossible for any human being.
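Here is a small sketch of the customer segmentation idea using k-means, one common Unsupervised Learning algorithm picked here for illustration. The customer records (age, monthly spend) are invented, and the starting centroids are simply the first `k` points so that the example stays deterministic.

```python
import math

# Hypothetical customers as (age, monthly spend); no labels are provided.
CUSTOMERS = [(22, 150), (58, 900), (25, 180), (61, 950), (24, 120), (55, 870)]


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (cluster index per point, centroids)."""
    centroids = list(points[:k])  # naive, deterministic initialisation
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


segments, centres = k_means(CUSTOMERS, k=2)
```

Nobody told the code which customers belong together, yet `segments` separates the younger, lower-spending customers from the older, higher-spending ones, much like the kids sorting toys.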
When dealing with kids, it’s not always the best way to just keep telling them and keep showing them the proper examples. At the same time, it’s not very effective to give no instructions and let them figure out everything by themselves.
It’s a common practice in teaching kids to reward them when they do something good. And when they do something bad, you punish them. This is intended to reinforce certain behaviors. In Machine Learning, this is known as Reinforcement Learning.
When a computer performs the way that you want, you add a point. When it fails to do what you want, you deduct a point. Over time, the computer learns what to do to gain points.
In the real world, Reinforcement Learning is applied heavily in Robotics. For example, a robot is trying to walk in a straight line. It may make it or it may fall down. Whenever the robot falls down, you deduct a point. And whenever the robot successfully takes one step, you add a point. There are many motors and sensors on a robot, and all of them are collecting data for the system. The robot learns what motor speed and what angle are needed in order to keep walking in a straight line and avoid falling.
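The add-a-point, deduct-a-point loop can be sketched in a few lines. This is a deliberately tiny version (an epsilon-greedy bandit, one of the simplest Reinforcement Learning setups) with a pretend robot that has only three possible postures; the action names and the reward rule are assumptions made for this example.

```python
import random

ACTIONS = ["lean_left", "upright", "lean_right"]  # hypothetical robot postures


def step_reward(action):
    """Toy environment: +1 point for a successful step, -1 point for a fall."""
    return 1 if action == "upright" else -1


def train(episodes=200, epsilon=0.1, seed=0):
    """Learn the average points earned by each action through trial and error."""
    rng = random.Random(seed)
    score = {action: 0.0 for action in ACTIONS}
    tries = {action: 0 for action in ACTIONS}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore: try something at random
        else:
            action = max(ACTIONS, key=score.get)  # exploit: use the best known action
        reward = step_reward(action)
        tries[action] += 1
        # Running average of the points this action has earned so far.
        score[action] += (reward - score[action]) / tries[action]
    return score


learned = train()
best_action = max(learned, key=learned.get)
```

The robot is never told that staying upright is correct; it discovers that from the points alone.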
2 Types of Measurement
2 Popular Questions by Kids – Key Approaches in Machine Learning
Kids like to ask strangers, “How old are you?” and “Are you a boy or a girl?”
“How old are you?” is asking for a number. It’s Regression.
“Are you a boy or a girl?” is asking for one of a set of pre-defined categories. It’s Classification. Regression and Classification are two important concepts in Machine Learning.
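The difference shows up clearly in what the code returns: a number for Regression, a category for Classification. The sketch below uses invented height data, a least-squares line for the "how old" question, and a nearest-class-mean rule with "child" and "adult" categories for the "which group" question; both are simple stand-ins chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def classify(height_cm, labelled_heights):
    """Return the category whose average height is closest to `height_cm`."""
    groups = {}
    for height, label in labelled_heights:
        groups.setdefault(label, []).append(height)
    means = {label: sum(values) / len(values) for label, values in groups.items()}
    return min(means, key=lambda label: abs(means[label] - height_cm))


# Regression: made-up heights (cm) and ages (years); the answer is a number.
slope, intercept = fit_line([100, 110, 120, 130], [4, 6, 8, 10])
predicted_age = slope * 115 + intercept

# Classification: made-up labelled heights; the answer is a category.
LABELLED = [(100, "child"), (120, "child"), (170, "adult"), (180, "adult")]
predicted_group = classify(165, LABELLED)
```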
3 Ways to Learn
Kids observe the world around them. They come up with certain rules. They propose a result, and they are corrected by adults, which makes their rules get better and better.
Compare this with the old way of programming: developers observe the world. They code rules using rule-based algorithms, and the rules produce some results. Based on those results, the developers change or modify the rules.
In AI, it’s a little bit different. The developer creates the AI algorithm and has it create the rule. The algorithm comes up with a model and continues to train it. The model then tries to predict the result, and the prediction is checked for accuracy. The key here is that the algorithm keeps modifying the model using more data without the developer being involved.
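A side-by-side sketch may help. In the first function the developer writes the rule; in the class below it, the rule (a threshold) is derived from labelled examples and shifts as more data arrives. The temperature scenario, the numbers and the names are all invented for illustration, and the "model" is intentionally as simple as possible.

```python
def rule_based(temperature_c):
    """Old way: the developer hard-codes the rule and edits it by hand."""
    return "hot" if temperature_c > 30 else "not hot"


class LearnedThreshold:
    """The rule is derived from examples: the midpoint between the two class averages."""

    def __init__(self):
        self.sums = {"hot": 0.0, "not hot": 0.0}
        self.counts = {"hot": 0, "not hot": 0}

    def train(self, examples):
        for value, label in examples:
            self.sums[label] += value
            self.counts[label] += 1

    @property
    def threshold(self):
        # Assumes at least one example of each label has been seen.
        means = [self.sums[label] / self.counts[label] for label in self.sums]
        return sum(means) / 2

    def predict(self, value):
        return "hot" if value > self.threshold else "not hot"


model = LearnedThreshold()
model.train([(35, "hot"), (15, "not hot")])
first_threshold = model.threshold
model.train([(25, "hot"), (5, "not hot")])  # more data arrives; nobody edits the code
second_threshold = model.threshold
```

The threshold moves between the two training calls even though no developer touched the rule.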
That’s the beauty of AI!
No Right or Wrong. Just Right or Left!
Final question: What are the similarities and differences between Tesla and Uber? Both are in the automobile industry. But Tesla creates new technology to help revolutionize the whole car industry, while Uber uses existing technology (such as mapping and mobile apps) to create a new business model.
So the power of AI is not just in making algorithms. It can be using existing algorithms to build new ways of doing business. One builds the technology, one utilizes it.
Remember my son who was thinking about ways to get his homework done? Ultimately, I would be equally proud if he came up with an algorithm that could do his homework and successfully fool his teacher or if he utilized existing algorithms to do the same thing. Both are important new ways of adopting AI to solve problems.
There is no Right or Wrong, only Right or Left. But no matter which direction you pick, be persistent and you will cross the finish line of success via either route – Cupid Chan tweet on Nov 28, 2018
The content of this blog has been presented at a few national and international conferences, such as Open Source Summit in Shanghai, China and MicroStrategy Federal Summit in Washington, DC. I also captured it in the very first video on my YouTube channel, which you can find here: https://www.youtube.com/watch?v=dh9xz4SBukE&t=13s