A data pipeline is a method by which raw data is ingested from various sources and moved to another data store, such as a relational database, data lake, or data warehouse, where it is eventually used for analysis. Before the data is used, it typically undergoes some form of processing or transformation, including filtering, masking, and aggregation. The transformation process ensures data integration and standardization, which is particularly important when the destination for the raw data is a relational database.
As the name suggests, data pipelines act as the “piping” or “plumbing” for many different projects in modern data platforms, data science projects, or business intelligence dashboards. Data can be, and often is, sourced from a wide variety of places – APIs, SQL and NoSQL databases, flat files, etc. – but it is rarely ready for immediate use. Preparing the data usually falls on the shoulders of data engineers or data scientists, who structure it to meet the needs of the business and its use cases. The type of processing a data pipeline requires is usually determined through a mix of exploratory data analysis and defined business requirements. Once the data has been appropriately filtered, merged, and summarized, it can be stored and put to use. Well-organized data pipelines provide the foundation for a range of data projects, including exploratory data analyses, data visualizations, and machine learning tasks.
Types of Pipelines
There are a few different types of data pipelines, but two primary types stand out: batch processing and stream processing.
The development of batch processing was a critical step in building reliable and scalable data infrastructures in the early days of data engineering. This type of processing enables organizations to move and process large amounts of data into repositories at set time intervals, typically during off-peak hours. That way, other workloads are not impacted, as batch jobs tend to work with large volumes of data and can tax the overall system. Batch processing is the optimal data pipeline when there isn’t an immediate need to analyze a specific dataset, and it is most often associated with the Extract, Transform, and Load (ETL) data ingestion process.
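The batch pattern described above can be illustrated with a minimal sketch in Python. The record fields, staging list, and in-memory "warehouse" here are all hypothetical stand-ins for a real staging area and target repository; the point is simply that a batch job transforms and loads everything accumulated since the last run in one pass:

```python
def run_batch_job(staged_records, warehouse):
    """Process all records accumulated since the last run in one pass."""
    # Transform the full batch at once -- typical of an off-peak ETL window.
    cleaned = [
        {"sku": r["sku"], "amount": round(r["amount"], 2)}
        for r in staged_records
        if r.get("amount") is not None  # filter out incomplete records
    ]
    # Load the transformed batch into the target repository in bulk.
    warehouse.extend(cleaned)
    return len(cleaned)

# Example: records staged throughout the day, processed in one nightly run.
staged = [
    {"sku": "A1", "amount": 19.999},
    {"sku": "B2", "amount": None},   # dropped by the filter step
    {"sku": "C3", "amount": 5.5},
]
warehouse = []
loaded = run_batch_job(staged, warehouse)
print(loaded)  # 2
```

In a production pipeline the same shape appears with a scheduler (cron, Airflow, etc.) triggering the job and a real database or object store as the destination.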
Streaming data is leveraged when data needs to be continuously updated. For example, apps or point-of-sale (POS) systems need real-time data to update the inventory and sales history of their products; that way, sellers can inform consumers whether a product is in stock. A single action, like a product sale, is considered an “event,” and related events, such as adding an item to checkout, are typically grouped together as a “topic” or “stream.” These events are then transported via messaging systems or message brokers, such as the open-source offering Apache Kafka.
Because data events are processed shortly after they occur, stream processing systems have lower latency than batch systems, but they aren’t considered as reliable, since messages can be unintentionally dropped or spend a long time in the queue. Message brokers help address this concern through acknowledgements: a consumer confirms to the broker that it has processed a message, allowing the broker to remove it from the queue.
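The acknowledgement mechanic can be sketched with a toy in-memory broker in Python. This is not Kafka’s actual API – the class and method names here are invented for illustration – but it shows the core idea: a delivered message stays “in flight” until the consumer acks it, and unacked messages can be redelivered:

```python
from collections import deque

class SimpleBroker:
    """Toy message broker: messages persist until a consumer acknowledges them."""
    def __init__(self):
        self.queue = deque()   # messages awaiting delivery
        self.unacked = {}      # delivered but not yet acknowledged ("in flight")
        self._next_id = 0

    def publish(self, message):
        self.queue.append((self._next_id, message))
        self._next_id += 1

    def deliver(self):
        """Hand the next message to a consumer, but keep it as in-flight."""
        msg_id, message = self.queue.popleft()
        self.unacked[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        """Consumer confirms processing; the broker can now drop the message."""
        del self.unacked[msg_id]

    def requeue_unacked(self):
        """Redeliver anything a failed consumer never acknowledged."""
        for msg_id, message in self.unacked.items():
            self.queue.append((msg_id, message))
        self.unacked.clear()

broker = SimpleBroker()
broker.publish({"event": "product_sale", "sku": "A1"})
msg_id, msg = broker.deliver()
# Consumer crashes before acking -> the broker redelivers the event...
broker.requeue_unacked()
msg_id, msg = broker.deliver()
broker.ack(msg_id)  # ...and this time it is processed and removed.
print(len(broker.queue), len(broker.unacked))  # 0 0
```

Real brokers such as Kafka implement this with consumer offset commits rather than per-message deletes, but the reliability guarantee being illustrated is the same.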
Architecture of Data Pipelines
There are three phases that make up a data pipeline.
- Data Ingestion
- Data Transformation
- Data Storage
Within these three phases, data is moved and transformed as needed to ensure data can be used by an organization.
Data Ingestion: Data is collected from various data sources and in various data structures (i.e., structured and unstructured data). Businesses can choose to extract data only when they are ready to process it; however, it is best practice to land the raw data with a cloud provider first (in a data warehouse or data lake). That way, businesses can reprocess historical data if they need to adjust their data processing routines.
Data Transformation: A series of jobs is executed to process the data and transform it into the format required by the destination data repository. Transformation jobs embed automation and governance into the process flow, ensuring that the data is cleaned and transformed accordingly.
Data Storage: After data is transformed, it is stored in a data repository (commonly a relational database), where it can be exposed to business stakeholders.
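The three phases above can be wired together in a short Python sketch. The sources, field names, and masking rule are hypothetical; in practice each function would talk to real APIs, files, and databases, but the ingest → transform → store shape is the same:

```python
def ingest(sources):
    """Ingestion: collect raw records from heterogeneous sources."""
    raw = []
    for source in sources:
        raw.extend(source)
    return raw

def transform(raw_records):
    """Transformation: filter, standardize, and mask before loading."""
    out = []
    for r in raw_records:
        if "email" not in r:
            continue                            # filter incomplete records
        out.append({
            "email": r["email"].lower(),        # standardize casing
            "masked": r["email"][0] + "***",    # mask PII for downstream use
        })
    return out

def store(records, repository):
    """Storage: land the cleaned records in the destination repository."""
    repository.extend(records)

# Wire the three phases together.
api_data = [{"email": "Ana@Example.com"}, {"name": "no email field"}]
file_data = [{"email": "bob@example.com"}]
repository = []
store(transform(ingest([api_data, file_data])), repository)
print(repository)
```

Keeping the phases as separate functions mirrors how real pipelines keep ingestion, transformation, and storage as independent, individually testable jobs.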
Data Pipelines vs ETL Pipelines
In many circles, the terms “data pipeline” and “ETL pipeline” are used interchangeably; however, an ETL pipeline should be considered a sub-category of data pipelines. There are distinguishing points between the two that need to be understood.
ETL Pipelines: follow a specific sequence. As the name implies, the pipeline extracts data, transforms the data, and then loads the data into a data repository. Not all data pipelines follow this sequence of events. In fact, reordering the steps of an ETL pipeline yields an ELT (Extract, Load, Transform) pipeline. ELT pipelines have become popular with cloud-native approaches since they defer the transformation until later in the process. ETL pipelines also tend to imply the use of batch processing, but, as noted earlier, they can also include stream processing.
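The ETL/ELT distinction is purely one of ordering, which a short sketch makes concrete. The function names and the uppercase "transformation" are hypothetical placeholders; what matters is where the transform step runs relative to the load:

```python
def etl(extracted, transform, warehouse):
    """ETL: transform in flight, then load the finished result."""
    warehouse.extend(transform(extracted))

def elt(extracted, transform, warehouse):
    """ELT: load the raw data first; transform later, at the destination."""
    warehouse.extend(extracted)           # raw landing zone
    warehouse[:] = transform(warehouse)   # transformation runs in the warehouse

uppercase = lambda rows: [r.upper() for r in rows]

etl_wh, elt_wh = [], []
etl(["a", "b"], uppercase, etl_wh)
elt(["a", "b"], uppercase, elt_wh)
print(etl_wh, elt_wh)  # ['A', 'B'] ['A', 'B']
```

Both end in the same state; ELT simply lets the (typically more scalable) destination platform do the heavy lifting, which is why it pairs well with cloud data warehouses.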
Data Pipelines: A data pipeline in the broader sense does not necessarily perform data transformations the way an ETL pipeline does. Data pipelines tend to be more focused on delivering data to the target platform (relational database, data lake, or data warehouse), where additional processes handle the data transformation.
Data Pipeline Use Cases
Since the term “big data” was coined in the 1990s, the volume of data has continued to grow and is projected to reach 180 zettabytes by 2025. As this growth continues, data management and data cleaning become ever-increasing priorities, putting more pressure on data pipelines. While data pipelines can serve many different functions, the following broad applications are the ones most often seen in business:
Exploratory Data Analysis (EDA): EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods – making it easier for data scientists to discover patterns, spot anomalies, test hypotheses, or check assumptions.
Data Visualizations: Representations of data via common graphics (charts, plots, infographics, etc.). Data visualizations communicate complex data relationships and data-driven insights in a way that is easy to understand.
Machine Learning (ML/AI): Machine learning is a sub-branch of artificial intelligence (AI) and computer science that focuses on using data and models to imitate the way humans learn, gradually improving its accuracy. Through the use of statistical models, models are trained to make classifications or predictions and uncover key insights within an organization’s data.
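As a small taste of the EDA use case above, here is a minimal first-pass summary in Python using only the standard library. The data set and the two-standard-deviation outlier rule are hypothetical choices for illustration; real EDA would typically use pandas and plotting libraries on pipeline output:

```python
import statistics

def summarize(values):
    """A first EDA pass: central tendency, spread, and obvious anomalies."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    # Flag points more than 2 standard deviations from the mean.
    outliers = [v for v in values if abs(v - mean) > 2 * stdev]
    return {"mean": round(mean, 2), "stdev": round(stdev, 2), "outliers": outliers}

daily_sales = [21, 19, 22, 20, 23, 18, 95]  # 95 is a suspicious spike
print(summarize(daily_sales))
```

Even this simple pass surfaces the anomaly, showing why a well-fed pipeline matters: the quality of any EDA, visualization, or model downstream depends on clean, complete data arriving at the repository.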
The discussion above gives organizations a lot to think about, including whether and how such pipelines are already in place. Organizations often focus on the bigger picture or the end goal, but not on how to transform their existing “pipelines” into more modern approaches to getting data where it is needed. For these reasons, RheoData recommends the following Oracle products for establishing or refreshing data pipelines and building analytical or machine learning/artificial intelligence (ML/AI) processes today.
- Oracle GoldenGate / Oracle GoldenGate Service
- Oracle GoldenGate Stream Processing
- Oracle Autonomous Data Warehouse / Oracle Autonomous Transaction Processing
- MySQL HeatWave
These Oracle products can help organizations build robust data pipelines and scalable data lake or data warehouse platforms, and ensure timely data processing.
Data Pipelines and RheoData
RheoData has helped many customers, private and public sectors, gain understanding of their data pipelines and how various Oracle products can be used to enable organizational transformation. Below are a few examples:
- Shoe Carnival improves data pipeline by upgrading Oracle GoldenGate (here)
- Altec uses a hyper-volume data pipeline to ingest to Oracle Autonomous Data Warehouse (ADW) (here)
- American Tire Distributor uses Oracle GoldenGate for Big Data to populate Google Cloud Storage (here)
- Zero-ETL – What is it? (here)
Give us a call today to schedule a review or build your data pipelines!