Architecture and construction of platforms that ingest, store and make data available for analysis.
Data ingestion is the process of capturing data from its source and transmitting it to a place of storage. It sounds deceptively simple, but there are a number of factors complicating the process:
- the sensitivity of the data determines a number of security and compliance considerations
- the volume of data might require filtering before it is sent, so the choice of what to filter is important
- the location of the data source and place of storage could be in different domains
There are workflow tools built specifically for ingestion, which make use of queues and queue-processing services. This is mostly because of the unpredictable nature of data sources: external factors can produce far more data during a short period of time, and if that exceeds the maximum capacity of a sequential process, data loss can occur. Queues and microservices allow scaling to be built into the solution.
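The buffering idea can be sketched in a few lines. This is a minimal illustration, not any specific ingestion tool: producers enqueue raw events into a bounded queue, and a pool of workers drains it, so a burst is absorbed by the queue rather than lost. The names (`ingest_queue`, `store`) are illustrative.

```python
import queue
import threading

ingest_queue = queue.Queue(maxsize=10_000)  # bounded: back-pressure instead of unbounded memory growth
stored = []
lock = threading.Lock()

def store(event):
    # Stand-in for writing to durable storage.
    with lock:
        stored.append(event)

def worker():
    while True:
        event = ingest_queue.get()
        if event is None:           # sentinel value shuts the worker down
            ingest_queue.task_done()
            break
        store(event)
        ingest_queue.task_done()

# Scale out by starting more workers when queue depth grows.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(100):                # a burst of incoming events
    ingest_queue.put({"id": i})

for _ in workers:                   # one sentinel per worker stops the pool
    ingest_queue.put(None)
for w in workers:
    w.join()

print(len(stored))  # 100
```

In a production system the in-memory queue would be replaced by a durable, distributed queue service, but the shape of the solution is the same.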
Feature engineering is the process of unpacking, uncovering and deriving hidden value from data. For example, in some cases a complex data type contains features that are very useful for frequency analysis, but because they are all hidden in a single value they complicate the data model. Unpacking each of these features on its own will expose it for use by analysis.
You could consider feature engineering as part of Analyse rather than Manage, but there is a good reason why it is dealt with here. In some cases the feature engineering process itself is computationally heavy, which means a lot of time is wasted recalculating it before every analysis. If it is calculated once and stored as part of ingestion or storage, that time is invested only once.
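As a small sketch of this idea, assume (purely for illustration) that the complex value is a raw timestamp string. Unpacking it once at ingestion time exposes features that frequency analysis can use directly, so the cost is not paid again on every query:

```python
from datetime import datetime

def enrich(record):
    # Unpack one complex value into several analysis-ready features.
    ts = datetime.fromisoformat(record["timestamp"])
    record["hour"] = ts.hour           # useful for time-of-day frequency analysis
    record["weekday"] = ts.weekday()   # 0 = Monday
    record["month"] = ts.month
    return record

event = enrich({"timestamp": "2021-06-04T14:30:00"})
print(event["hour"], event["weekday"])  # 14 4
```

The same pattern applies to any composite value: derive the features once, store them alongside the raw data, and analysis never has to repeat the work.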
When large volumes of data need to be effectively ingested, processed and stored, very specific designs and services are required. There are good open source tools available for this purpose, but they require specialised skills to run and maintain. Alternatively, there are similar tools available as cloud services, where the maintenance and complexity is handled by the cloud service provider.
Once a platform has been selected, the data architecture and format come next. A data warehouse is used when the data is mostly structured, and a data lake when it is mostly unstructured. It is best to remain as flexible as possible, because data sources and formats might change over time. If the storage is built too rigidly, it will require a major redesign to cope with the change.
A good set of principles to keep in mind is the Four Vs of Big Data:
- Volume - system capability to deal with large volumes of structured and unstructured data
- Velocity - scalable capabilities so that more data can be handled over shorter time periods if required
- Variety - a sourcing principle to ensure data is from as many different sources as possible
- Veracity - keeping data clean before committing to feature processing and storage
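Veracity in particular lends itself to a small sketch: validate and clean records before they are committed to feature processing and storage, rather than after. The schema here (`id` and `value` fields) is assumed for illustration only:

```python
REQUIRED = {"id", "value"}

def clean(records):
    # Keep only records that pass basic veracity checks, and
    # normalise types before anything is committed to storage.
    kept = []
    for r in records:
        if not REQUIRED <= r.keys():   # drop records missing required fields
            continue
        if r["value"] is None:         # drop empty measurements
            continue
        r["value"] = float(r["value"]) # normalise the type before storage
        kept.append(r)
    return kept

raw = [{"id": 1, "value": "3.5"}, {"id": 2}, {"id": 3, "value": None}]
print(clean(raw))  # [{'id': 1, 'value': 3.5}]
```

Real pipelines would also quarantine the rejected records for inspection instead of silently dropping them, but the principle is the same: dirty data should not reach storage.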
In summary, data engineering is a much more complex set of skills than any single one of its components suggests, and it exists to ensure the most important objective of Manage is met - store the data in such a way that it can be made available for analysis in an effective way.
Edge of the Cloud
It is important to recognise the cloud as the primary enabler of Data Science as we know it today. The ability to store large volumes of data cost effectively, and to provide agile access to large clusters of processing and memory, is mostly unique to cloud platforms. Without these capabilities machine learning and deep learning would look very different.
The success of most of the large cloud platform providers has also allowed expansion into developing regions. This is fuelled in part by a much greater focus on data governance, data sovereignty, security and privacy regulations. The principles of 'secure by design' and 'private by design' are possible with cloud - if correctly implemented.
As most digital transformations have shown, it is no longer a question of whether cloud should be adopted, but rather of the expected compromise of hybrid cloud. Public cloud can be used where it makes sense, but most large organisations will still have their own infrastructure and systems that run their business on premise. As part of the hybrid approach, trained machine and deep learning models can be deployed on 'the edge of the cloud'. Systems like this allow training of models to be done in the cloud, while deployment and remote configuration happen inside the company network. The integration of these ML services is more resilient and faster because they run locally, yet they can still be remotely configured and updated.
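The pattern above can be sketched in miniature. This is an illustrative sketch, not a specific product's API: the model weights are trained in the cloud and shipped to the edge, inference runs locally with no network round-trip, and a separate update step applies configuration pulled from the cloud whenever connectivity allows. The class and method names are assumptions.

```python
class EdgeModel:
    def __init__(self, weights, threshold):
        self.weights = weights        # trained in the cloud, shipped to the edge
        self.threshold = threshold

    def predict(self, features):
        # Local inference: fast and resilient, no network call on the hot path.
        score = sum(w * x for w, x in zip(self.weights, features))
        return score >= self.threshold

    def apply_remote_config(self, config):
        # Called periodically with configuration fetched from the cloud;
        # if the cloud is unreachable, the model simply keeps its settings.
        self.threshold = config.get("threshold", self.threshold)

model = EdgeModel(weights=[0.5, 1.5], threshold=2.0)
print(model.predict([1.0, 1.0]))            # True  (0.5 + 1.5 = 2.0)
model.apply_remote_config({"threshold": 3.0})
print(model.predict([1.0, 1.0]))            # False (2.0 < 3.0)
```

The key design choice is the separation of concerns: the hot path depends on nothing outside the local network, while the slow path (retraining, reconfiguration) remains in the cloud.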
Ready for Analysis
As mentioned above, the main objective of data engineering is not the ingestion or storage of data, but the enablement of the Analysis step. This is important because most big data projects in the past got it wrong. The data formats used by many data warehouses and data lakes were good for storage but very poor for analysis - so much so that data scientists spent 80-90% of their time fixing data issues instead of doing analysis work.