Designing Data Pipeline Architecture
Data pipeline architecture is an intrinsic part of software engineering. As data grows every day, companies are finding innovative ways to mine it. Every type of data is important, from user data to server logs, and data analytics provides insights into the behaviour of systems. As the appetite for data has grown, so has the volume of data collected. Machine learning and AI are data hungry; no amount of data is too much for these models. But data does not always arrive in a structured form that can be readily consumed, and a huge amount of time and resources goes into cleaning it. All unstructured data needs to be processed and stored properly, and many systems are involved in this process. Different types of data require different types of processing and storage. That is why data pipeline architecture differs between companies, projects, and data types.
Basic components of a Data Pipeline
Every data pipeline requires at least four components: a data source, processing systems, data storage, and a query engine.
A data pipeline can have one or many data sources. Typically, data sources are transactional databases like MySQL or Postgres, or NoSQL databases like MongoDB. If the data consists of media files like photos or documents, the source could be S3 or any other object store. There can be real-time sources too, like stream data from mobile devices, logs from servers, or changes from databases (Change Data Capture). If the data source is a stream, a message queue is usually required. The message queue acts as a buffer for the processing unit. Popular message queues like Kafka and RabbitMQ can be used.
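The buffering role of a message queue can be sketched in a few lines. This is only an in-process illustration using Python's standard `queue.Queue` as a stand-in for Kafka or RabbitMQ; the function names and the upper-casing "processing" step are assumptions made for the example, not part of any real broker API.

```python
import queue
import threading

def produce(events, buffer):
    """Push each event onto the buffer, then a sentinel marking the end."""
    for event in events:
        buffer.put(event)  # blocks while the buffer is full (back-pressure)
    buffer.put(None)

def consume(buffer, sink):
    """Pull events off the buffer until the sentinel arrives."""
    while True:
        event = buffer.get()
        if event is None:
            break
        sink.append(event.upper())  # placeholder for real processing

def run_buffered_pipeline(events, max_buffered=2):
    """Run a producer and a consumer decoupled by a bounded buffer."""
    buffer = queue.Queue(maxsize=max_buffered)
    sink = []
    consumer = threading.Thread(target=consume, args=(buffer, sink))
    consumer.start()
    produce(events, buffer)
    consumer.join()
    return sink
```

Because the buffer is bounded, a fast source is slowed down rather than overwhelming the processing unit, which is the same idea a real message queue provides across machines.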
Processing systems are the heart of the data pipeline. Their data structures and algorithms need to be highly efficient. Various map-reduce techniques and AI models are used for processing. Data can be processed in two ways: batch processing and stream processing.
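As a rough illustration of the map-reduce idea mentioned above, here is a single-machine word count split into map, shuffle, and reduce steps. The function names are hypothetical, and a real framework would distribute each phase across many workers.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))
```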
Batch processing does not require a queue and can read data directly from the data source. It is not real-time; jobs might be scheduled every minute, hourly, or daily. Batch processing is sometimes also referred to as ETL (Extract, Transform, Load): long-running batch jobs process data files to filter, aggregate, and prepare data for analysis. Usually, these jobs involve reading source files, processing them, and writing the output to new files. Batch processing is popular with AI models because running them in real time requires a large amount of resources, and even with those resources they can still be slow.
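A batch job of this kind can be sketched as a small function that reads a source file, filters and aggregates it, and writes a new file. The CSV layout (`user`, `amount`, `status` columns) and the function name are assumptions made for this example.

```python
import csv
from collections import defaultdict

def run_batch_job(source_path, output_path):
    """Read a CSV of orders, keep successful ones, and write per-user totals."""
    totals = defaultdict(float)
    with open(source_path, newline="") as src:
        for row in csv.DictReader(src):                  # extract
            if row["status"] != "ok":                    # transform: filter
                continue
            totals[row["user"]] += float(row["amount"])  # transform: aggregate
    with open(output_path, "w", newline="") as out:      # load
        writer = csv.writer(out)
        writer.writerow(["user", "total"])
        for user in sorted(totals):
            writer.writerow([user, f"{totals[user]:.2f}"])
    return dict(totals)
```

In practice a scheduler would invoke a function like this once per interval, pointing it at that interval's source file.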
Stream processing works on real-time messages. After capturing them, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Apache Storm and Spark are popular stream processing engines.
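The filter-then-aggregate pattern can be illustrated with a generator that counts log levels in fixed (tumbling) time windows. This is a simplified sketch that assumes events arrive in timestamp order as `(timestamp_seconds, level)` tuples; it does not use the Storm or Spark APIs.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=60):
    """Yield (window_start, counts) for each window of a time-ordered stream."""
    current_start = None
    counts = Counter()
    for timestamp, level in events:
        if level == "DEBUG":  # filter: drop noisy messages
            continue
        window_start = timestamp - timestamp % window_seconds
        if current_start is not None and window_start != current_start:
            yield current_start, dict(counts)  # emit the finished window
            counts = Counter()
        current_start = window_start
        counts[level] += 1  # aggregate within the window
    if current_start is not None:
        yield current_start, dict(counts)  # flush the final window
```

Each yielded pair is what would be written to the output sink; a real engine would also handle late and out-of-order events, which this sketch ignores.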
The analytical data store serves the processed data in a structured format that can be queried using analytical tools. Queries can be served from a relational data warehouse like Amazon Redshift or a SQL engine like Spark SQL, or the data can be presented through a low-latency NoSQL technology like HBase or Cassandra. Another option is a metadata abstraction over data files in the distributed data store, such as the one Hive provides.
A CLI (command line) based or UI-based query engine pulls data from the data store and presents it to reporting tools. Query engines like PrestoDB can aggregate data from different data stores. They can present data in a proper structure that can be used for data mining or for training AI models. With tools like Power BI, Tableau, or Kibana, this data can be explored and visualized.
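To make the idea of aggregating data from different data stores concrete, the sketch below joins two separate SQLite databases in application code. SQLite, the table layouts, and the function name are all stand-ins chosen for the example; a real query engine does this through connectors and a distributed planner.

```python
import sqlite3

def federated_totals(users_db, orders_db):
    """Join user names from one store with order totals from another."""
    names = dict(users_db.execute("SELECT id, name FROM users"))
    totals = orders_db.execute(
        "SELECT user_id, SUM(amount) FROM orders "
        "GROUP BY user_id ORDER BY user_id"
    )
    return [(names[user_id], total) for user_id, total in totals if user_id in names]

# Two independent stores, each holding part of the picture.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])

orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.5), (2, 7.0)]
)

report = federated_totals(users_db, orders_db)
```

The resulting `report` is a flat, structured list of rows that a reporting or visualization tool could consume directly.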
The orchestrator is an optional component. A data pipeline consists of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. Orchestration tools like Apache Airflow help automate these workflows.
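At its core, an orchestrator runs tasks in dependency order. The following is a minimal sketch of that idea using Python's standard `graphlib` module (Python 3.9+); it is not the Airflow API, and the task names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, dependencies):
    """Run each task once, after all the tasks it depends on.

    tasks maps a task name to a callable; dependencies maps a task name
    to the list of upstream task names whose results it receives.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return order, results

tasks = {
    "extract": lambda: [3, 1, 2],           # pull rows from a source
    "transform": lambda rows: sorted(rows),  # clean and reshape them
    "load": lambda rows: len(rows),          # store them, report a row count
}
dependencies = {"transform": ["extract"], "load": ["transform"]}

order, results = run_workflow(tasks, dependencies)
```

A production orchestrator adds scheduling, retries, and monitoring on top of this ordering step.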