How to structure data ingestion and aggregation pipelines
When preparing a product for large-scale IoT data ingestion, it is essential to get the architecture right. The architecture needs to stay flexible both in how it processes data and in its ability to scale. When building it, it's best to divide the architecture into smaller independent pieces, so each can mature separately without the others being affected every time an update occurs.
The high-level conceptual diagram below shows how you could structure your data ingestion and aggregation pipelines to support scaling as well as an independent multi-team delivery model.
The process
The architecture is built around data collection, data ingestion, the event stream, persistence, data perspective/warehousing, and perspectives. Each is critical to the structure and to how the pipeline functions.
The process works as follows:
Data collection: Data is collected from devices. Devices with direct access to the gateway send data to it directly; otherwise, an IoT edge approach can be used, where data is gathered at an intermediate point and then forwarded to the gateway.
Data ingestion: Device data reaches the gateway, where security and usage policies are applied and verified; messages are then forwarded to the event stream, with transformations applied if needed (see the first sketch after this list).
Event stream: Device data is streamed into a platform where various consumers can pick it up and process it for their own purposes. Each consumer processes at its own pace, independently of the others, and new consumers can be added at any time (the second sketch after this list shows consumers reading the same stream at independent offsets). Staged data streaming can be added to transform/filter events and funnel them into different event streams where other processors pick them up (e.g., preparing events for public consumption by removing sensitive data).
Persistence: Raw event data is persisted and stored long term to support audit and replay capabilities.
Data perspective/warehousing: A warehouse view can be built by aggregating data from the event stream. Multiple warehouses with different structures and dimensions can be set up for different uses, and they can be kept independent so that other teams do not need to fit a single model. Analytics streams can also be hooked up and leveraged.
Perspectives: Applications build their own perspectives of the incoming data. Because these perspectives are independent, updating them or adding new ones does not require a canonical data model, and each can be tailored to a specific use. Later on, APIs can be built on the aggregated/transformed data and consumed by third parties or applications. If APIs are not required, visualization tools can tap directly into the stored data.
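To make the collection and ingestion steps more concrete, here is a minimal sketch of the gateway's role, assuming a hypothetical JSON device message, an illustrative check_policy helper, and a plain in-memory list standing in for the event stream (a real deployment would use a streaming platform or an IoT gateway product instead):

```python
import json
import time
from typing import Any

# In-memory stand-in for the event stream; a real system would use a broker.
event_stream: list[dict[str, Any]] = []

# Hypothetical allow-list standing in for the gateway's security/usage policies.
REGISTERED_DEVICES = {"sensor-001", "sensor-002"}

def check_policy(message: dict[str, Any]) -> bool:
    """Verify the device is registered and the payload has the expected fields."""
    return (
        message.get("device_id") in REGISTERED_DEVICES
        and isinstance(message.get("temperature"), (int, float))
    )

def ingest(raw: str) -> bool:
    """Gateway step: validate a raw device message and forward it to the stream."""
    message = json.loads(raw)
    if not check_policy(message):
        return False  # rejected by security/usage policy
    # Optional transformation before forwarding, e.g. adding an ingestion timestamp.
    message["ingested_at"] = time.time()
    event_stream.append(message)
    return True

if __name__ == "__main__":
    ingest('{"device_id": "sensor-001", "temperature": 21.5}')   # accepted
    ingest('{"device_id": "unknown-99", "temperature": 19.0}')   # rejected by policy
    print(event_stream)
```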
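Continuing with the same in-memory stand-in, the second sketch shows how consumers can read the same stream independently and at their own pace: one persists raw events for audit or replay, while another aggregates them into a simple per-device warehouse view. The explicit offset bookkeeping is what a real streaming platform would normally handle for you; the event fields and class names are illustrative only.

```python
from collections import defaultdict
from typing import Any

# The same in-memory stand-in for the event stream as in the previous sketch.
event_stream: list[dict[str, Any]] = [
    {"device_id": "sensor-001", "temperature": 21.5},
    {"device_id": "sensor-001", "temperature": 22.1},
    {"device_id": "sensor-002", "temperature": 19.8},
]

class Consumer:
    """Each consumer keeps its own offset, so it reads the stream at its own pace."""

    def __init__(self) -> None:
        self.offset = 0

    def poll(self) -> list[dict[str, Any]]:
        events = event_stream[self.offset:]
        self.offset = len(event_stream)
        return events

class RawPersistence(Consumer):
    """Persists raw events for audit or replay (here: a simple append-only list)."""

    def __init__(self) -> None:
        super().__init__()
        self.archive: list[dict[str, Any]] = []

    def run_once(self) -> None:
        self.archive.extend(self.poll())

class TemperatureWarehouse(Consumer):
    """Aggregates events into a per-device view, one possible warehouse structure."""

    def __init__(self) -> None:
        super().__init__()
        self.readings: dict[str, list[float]] = defaultdict(list)

    def run_once(self) -> None:
        for event in self.poll():
            self.readings[event["device_id"]].append(event["temperature"])

    def averages(self) -> dict[str, float]:
        return {device: sum(v) / len(v) for device, v in self.readings.items()}

if __name__ == "__main__":
    persistence, warehouse = RawPersistence(), TemperatureWarehouse()
    persistence.run_once()
    warehouse.run_once()
    print(len(persistence.archive))   # 3 raw events archived
    print(warehouse.averages())       # per-device averages
```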
The benefits
Using this architecture allows you to have a data processing pipeline that is:
Easily scalable: Each component is an independent piece that can be scaled to the desired number of instances.
Easily extensible: Components are not coupled to each other; everything goes via streams. New producers and consumers can be added at any time without touching existing ones.
Technology agnostic: Because data flows via streams, the pipeline is agnostic to how individual producers or consumers are written.
Data model agnostic: Data is shaped into events on the stream, and it is each consumer's responsibility to shape those events into a specific data model. Changing that model therefore does not require changing events that are already available or the way those events are ingested. Each perspective is independent of the others and can be developed and extended in isolation, as the sketch below illustrates.
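As a rough illustration of the extensibility and data-model points, the sketch below adds a new perspective alongside the consumers sketched earlier without modifying producers, the stream, or existing consumers; it simply shapes the same hypothetical events into its own model, here a "latest reading per device" view (the minimal Consumer base is repeated so the snippet stays self-contained):

```python
from typing import Any

# Reuses the in-memory stream and offset idea from the earlier sketches.
event_stream: list[dict[str, Any]] = [
    {"device_id": "sensor-001", "temperature": 21.5},
    {"device_id": "sensor-002", "temperature": 19.8},
    {"device_id": "sensor-001", "temperature": 22.1},
]

class Consumer:
    def __init__(self) -> None:
        self.offset = 0

    def poll(self) -> list[dict[str, Any]]:
        events = event_stream[self.offset:]
        self.offset = len(event_stream)
        return events

class LatestStatePerspective(Consumer):
    """A new perspective with its own data model: the latest reading per device.
    Adding it requires no change to producers, the stream, or existing consumers."""

    def __init__(self) -> None:
        super().__init__()
        self.latest: dict[str, float] = {}

    def run_once(self) -> None:
        for event in self.poll():
            self.latest[event["device_id"]] = event["temperature"]

if __name__ == "__main__":
    perspective = LatestStatePerspective()
    perspective.run_once()
    print(perspective.latest)  # {'sensor-001': 22.1, 'sensor-002': 19.8}
```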