Developing pipelines for Big Data can be complex, but it is the only way to give customers access to data in the way they prefer, maintains developer tools company JetBrains in its Big Data World blog series.
“Pipelines are generally built with the help of orchestrators, which call other Extract, Transform, Load (ETL) tools, but sometimes the whole pipeline may be built with a single tool like Apache NiFi,” explains JetBrains’ Pasha Finkelshteyn, in the third post of the series.
The best way to build a data pipeline is not always obvious, however. In data engineering, multiple sources, multiple sinks (places to put data), complex transformations, and the sheer volume of data all tend to promote complexity.
Customer organisations can have myriad data sources, including “dozens” of operational databases, clickstreams from their websites arriving through Apache Kafka, multiple reports, OLAP cubes, and A/B tests.
“Imagine, as well, having to store all the data in several ways, starting with raw data and ending with a layer of aggregated, cleaned, verified data suitable for building reports,” Finkelshteyn writes.
“All these processes need to be orchestrated by data engineers.”
Orchestrators and ETL tools can be understood as two different levels of a pipeline. Orchestrators launch tools in the required order and perform retries if something goes wrong. ETL tools sit at the lower, more localised level, typically using batch or stream processing.
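To make the orchestrator level concrete, here is a minimal sketch assuming a recent Apache Airflow 2.x; the DAG name, task names, and placeholder callables are illustrative and not taken from the JetBrains posts.

```python
# A minimal sketch of the orchestrator level, assuming Apache Airflow 2.x.
# The DAG name, task names, and callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_clickstream():
    ...  # placeholder: pull raw events from Kafka into raw storage


def build_aggregates():
    ...  # placeholder: clean and aggregate raw data for reporting


with DAG(
    dag_id="daily_reporting_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,  # re-run a failed step up to three times
        "retry_delay": timedelta(minutes=5),  # wait between attempts
    },
) as dag:
    extract = PythonOperator(
        task_id="extract_clickstream", python_callable=extract_clickstream
    )
    aggregate = PythonOperator(
        task_id="build_aggregates", python_callable=build_aggregates
    )

    extract >> aggregate  # enforce the required order
```

The orchestrator only sequences and retries the steps; the heavy lifting inside each step is delegated to ETL tools.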
For example, Apache Spark is an ETL tool as well as a general-purpose distributed computation engine; it can move data from one place to another, from sources to sinks, transforming the data along the way, Finkelshteyn writes.
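As an illustration of that ETL level, the following PySpark sketch reads from a source, transforms the data, and writes it to a sink; the paths and column names are hypothetical.

```python
# A minimal source-to-sink ETL sketch using PySpark.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-etl").getOrCreate()

# Source: raw clickstream events landed as JSON.
events = spark.read.json("s3://example-bucket/raw/clickstream/")

# Transform: keep page views and count them per page per day.
daily_views = (
    events.filter(F.col("event_type") == "page_view")
    .groupBy(F.to_date("timestamp").alias("day"), "page_url")
    .count()
)

# Sink: an aggregated, report-ready layer.
daily_views.write.mode("overwrite").parquet(
    "s3://example-bucket/aggregated/daily_page_views/"
)
```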
Part 2 of JetBrains’ Big Data World blog explains the difference between data scientists, data engineers, and machine learning (ML) engineers.
Data engineers build pipelines from sources to destinations — they might be software engineers, database administrators (DBAs), or ops specialists, Finkelshteyn says.
Data scientists usually apply statistics to understand data, sometimes writing production-grade, supportable code as well as doing research. ML engineers, on the other hand, are all about productising ML, he says.
“Code should be deployed, monitored, work reliably, and be available. Before code even starts working, data should be collected and prepared,” explains Finkelshteyn.
“ML engineers should configure every ML application in a known and predictable way. Data should be versioned (and this is a huge difference from regular software engineering).”
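To illustrate the data-versioning point, a tool such as DVC lets a pipeline read an exact, pinned snapshot of a dataset; the repository URL, file path, and revision below are hypothetical.

```python
# A sketch of loading a pinned data version, assuming DVC's Python API.
# The repo URL, path, and revision are hypothetical.
import dvc.api

with dvc.api.open(
    "data/training_set.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2.0",  # Git tag pinning the exact data snapshot
) as f:
    header = f.readline()  # the dataset is read as of that revision
```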
JetBrains defines Big Data as data that won’t fit in a single node’s memory, that is characterised by high volume, variety, and velocity, or that is sufficient to make reliable business decisions, according to the first post in the blog series. That post moves on to a discussion of typical consumers of Big Data and key JetBrains Big Data projects such as DataGrip.
The JetBrains Big Data Tools plugin is available from JetBrains.
(Photo by Joshua Sortino on Unsplash)