Everything you need to know about Kedro

Explore Kedro, the open-source data pipeline framework that can help streamline your machine learning workflows. In this blog post, discover how it can help you organize, version, and share your code, data, and models with ease.

Published on:

April 29, 2024

Kedro is an open-source Python framework for creating data pipelines that are scalable, maintainable, and reproducible. It was developed to make it easier for data scientists and data engineers to build, deploy, and maintain complex data pipelines.

Kedro was developed by QuantumBlack, the data science consultancy that is now part of McKinsey & Company. It was released as an open-source project in 2019 and has since been donated to the LF AI & Data Foundation, with QuantumBlack engineers among its maintainers. Rather than being built on top of any single library, Kedro is designed to work with the broader Python data stack, including pandas.

Kedro is designed to be used alongside popular machine learning and deep learning libraries such as scikit-learn and TensorFlow. If you are working on a data project and looking for a tool to help you build and manage your data pipelines, Kedro may be a good option.

On Data Science Pipelines

To understand the impact of Kedro on the data science pipeline, we must first understand what a data science pipeline is. A data science pipeline refers to the series of steps a data scientist takes to go from raw data to a finished product, such as a machine learning model or a data visualization. The specific steps can vary from project to project, but common ones include:

  • Data collection: This step involves collecting the data used in the project. This can include web scraping, API calls, or accessing a database.
  • Data cleaning and preprocessing: This step involves cleaning the data and preparing it for analysis. This can include missing value imputation, outlier detection, and feature engineering.
  • Data exploration and visualization: In this step, the data scientist will explore the data and create visualizations to understand the characteristics of the data and identify trends or patterns.
  • Modeling: In this step, the data scientist will build and train machine learning models on the data. This can include selecting an appropriate model, tuning hyperparameters, and evaluating the model's performance.
  • Evaluation: In this step, the data scientist will evaluate the model's performance and determine if it meets the project's goals.
  • Deployment: If the model performs well, it can be deployed in a production environment, such as on a website or mobile app.
  • Monitoring: To maintain a robust and continuously operating data science pipeline, data scientists must monitor how well it performs after deployment.

A data science pipeline helps data scientists organize their work and ensures that the different steps of the process are completed in the correct order. It also makes it easier to reproduce results and collaborate with others on the project. More often than not, building a production-grade data science pipeline requires real expertise; Attri’s AI Blueprints and AI Engine make high-quality, production-grade data pipelines accessible to everyone, including non-domain experts.
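To make these stages concrete, here is a bare-bones sketch of such a pipeline in plain Python, outside of any framework. The file path and column names are hypothetical, and the model is deliberately simple:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Data collection: here, simply reading a CSV file (hypothetical path).
    raw = pd.read_csv("data/sales.csv")

    # Cleaning and preprocessing: drop rows with missing values.
    clean = raw.dropna()

    # Modeling: fit a simple regression on illustrative feature columns.
    X = clean[["ad_spend", "store_visits"]]
    y = clean["sales"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    # Evaluation: measure error on held-out data before any deployment decision.
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

Tools like Kedro exist precisely to give scripts like this a structure that scales: each stage becomes a named, testable node rather than a block in one long file.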

Why Kedro

Kedro can significantly impact the data science pipeline by providing a framework for organizing and managing the different steps of the process. Some features that differentiate Kedro from other data engineering platforms include:

  • A clear separation of concerns between the data pipelines and the data itself: Kedro separates pipeline logic from the data it operates on, which makes pipelines easier to reason about and maintain. This separation lets you swap out datasets or data sources without modifying the pipeline code (see the sketch after this list).
  • A modular and reusable design: Kedro's modular design makes it easy to build, test, and maintain data pipelines. You can reuse components of a pipeline across projects, saving time and reducing the complexity of your code.
  • A strong focus on reproducibility and collaboration: Kedro includes features that make it easier to collaborate with others on data projects and ensure that the results are reproducible. For example, you can use Kedro to version control your pipelines and data, and you can use it to track the provenance of your data.
  • Integration with various tools and technologies: Kedro integrates with multiple tools and technologies, including Jupyter notebooks, Git, and cloud platforms. This makes it easy to use Kedro in multiple environments and workflows.
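As a minimal sketch of the first two points (function and dataset names are illustrative, not from a real project; the API shown is Kedro 0.18+), nodes refer to datasets only by name, and the Data Catalog, declared separately in conf/base/catalog.yml, decides where each dataset actually lives:

    from kedro.pipeline import node, pipeline

    def preprocess_companies(companies):
        # Cleaning step: drop rows with missing values (illustrative only).
        return companies.dropna()

    def train_model(companies):
        # Stand-in "model": column means; a real project would fit a model here.
        return companies.mean(numeric_only=True)

    # Nodes name their inputs and outputs. Where "companies" lives (a CSV file,
    # a database table, a cloud bucket) is declared in the Data Catalog, not
    # here, so datasets can be swapped without touching the pipeline code.
    data_pipeline = pipeline(
        [
            node(preprocess_companies, inputs="companies", outputs="preprocessed_companies"),
            node(train_model, inputs="preprocessed_companies", outputs="model"),
        ]
    )

Because each node is a plain Python function, the same functions can be reused and unit-tested independently of the pipeline they appear in.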

How to Get Started with Kedro

To start with Kedro, you will need to install it and create a new Kedro project. Here are the steps you can follow:

Install Kedro: You can install Kedro using pip. Open a terminal and run the following command:
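    pip install kedro

Installing into a fresh virtual environment is recommended.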

Create a new Kedro project: To create a new Kedro project, run the following command:
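    kedro new

The command walks you through a short interactive prompt, including the project name. If you would rather start from a project that already ships with a small example pipeline, you can use one of the official starters, for example (starter names can differ between Kedro versions):

    kedro new --starter=spaceflights-pandas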

This will create a new directory with the structure of a Kedro project.

Explore the project structure: The Kedro project has a standard directory structure that organizes the project's code, data, and configuration. You can find more information about the project structure in the Kedro documentation.
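The exact layout varies a little between Kedro versions, but a newly generated project typically looks roughly like this:

    my-kedro-project/
    ├── conf/            # configuration: Data Catalog, parameters, credentials
    │   ├── base/
    │   └── local/
    ├── data/            # layered data folders (01_raw ... 08_reporting)
    ├── notebooks/       # Jupyter notebooks for exploration
    ├── src/             # Python package containing your nodes and pipelines
    └── pyproject.toml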

Run the example pipeline: If you created your project from a starter, it ships with an example pipeline that you can run to check your installation (a project created from the bare template has no pipeline until you add one). To run it, navigate to the project's root directory and run the following command:
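    kedro run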

This will run the example pipeline and print the results to the terminal.

You can find more information about getting started with Kedro in the Kedro documentation.

Kedro in Action

Many data projects have been built using Kedro. Here are a few examples:

  • Kedro-Airflow: This project uses Kedro to build data pipelines that are run using Apache Airflow. The project includes a set of custom operators and hooks for interacting with Kedro pipelines from within Airflow.
  • Kedro-Viz: This project provides tools for visualizing Kedro pipelines and their dependencies. The visualizations can be used to understand a pipeline's structure and behavior and to identify potential issues or bottlenecks (see the example commands after this list).
  • Kedro-Forecast: This project provides a set of tools for building and evaluating machine learning models for forecasting time series data with Kedro. It includes custom nodes for preprocessing, modeling, and evaluation, as well as several pre-built pipelines for common forecasting tasks.
  • Kedro-GCP: This project provides tools for deploying Kedro pipelines on the Google Cloud Platform (GCP). The project includes a set of custom nodes and hooks for interacting with GCP services such as BigQuery and Cloud Storage and a set of templates for deploying Kedro pipelines on GCP.
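For instance, Kedro-Airflow and Kedro-Viz are shipped as plugins that extend the Kedro CLI; a typical session might look like this (exact commands and flags depend on the plugin versions installed):

    pip install kedro-airflow kedro-viz

    # Generate Airflow DAG files from the project's pipelines
    kedro airflow create

    # Launch the interactive pipeline visualization in the browser
    kedro viz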

One of the primary advantages of Kedro is that it brings software and data engineering practices into the data science world. With Kedro, everything works together: the data engineering and data science parts of a project live in the same repository. Here are some instances of Kedro applied in real-world scenarios:

  • NASA utilized Kedro as part of a cloud-based predictive engine that predicts impeded and unimpeded taxi-out duration within the airspace.
  • JungleScout used Kedro to accelerate the training and review of its sales estimation models by 18 times while supporting more marketplaces. 
  • Telkomsel uses Kedro to run various feature engineering tasks and serve dozens of machine learning models in their production environment. Kedro's data pipeline visualization helped them to debug and explain the pipeline to the business user.
  • ElementAI enhanced their work efficiency and collaboration by using Kedro in their scheduling software to estimate historical performance and create replay scenarios.

These are just a few examples of projects built using Kedro. You can find more information about these and other projects on the Kedro community page.

Conclusion

Here is a summary of some of the pros and cons of using Kedro for data engineering projects:

Pros:

  • Kedro has a clear separation of concerns between the data pipelines and the data itself, making it easier to reason about and maintain them.
  • Kedro's modular and reusable design makes it easy to build, test, and maintain data pipelines.
  • Kedro strongly focuses on reproducibility and collaboration, making it easier to work with others on data projects and ensure consistent results.
  • Kedro integrates with various tools and technologies, including Jupyter notebooks, Git, and cloud platforms.
  • Kedro works well with popular Python libraries such as pandas and is designed to be used alongside other data science libraries.
  • Kedro helps data engineering and data science teams collaborate efficiently.
  • Kedro acts as a single source of truth for data sources and sinks, feature logic, and configuration.

Cons:

  • Kedro is a relatively young project, so it may have less community support and documentation than more established data engineering platforms.
  • Kedro focuses on data pipelines, so it offers fewer features for other aspects of data engineering, such as data storage or large-scale distributed processing.

Overall, Kedro is a promising tool for building data pipelines that are scalable, maintainable, and reproducible. It is particularly well suited to data projects that require a clear separation of concerns between the pipelines and the data itself, and that need to be easily reusable and maintainable. That said, Kedro may not be the best fit for every data engineering project; whether it is depends on the project's specific needs and requirements.