SAP data ingestion with Spark and Kafka

Building data pipelines that connect traditional back-end systems and create additional business value.

An important component of any data platform is the part that manages data ingestion. In many of today’s big data environments, data arrives at such a scale, in throughput or volume, that the approach and tooling must be carefully considered.

There is a clear trend in the market towards processing more data in real-time, so that organizations can extract as much value from it as possible. Insights can be created faster, and fresh data points can change a decision or business path, for example by reducing risk, improving customer service, or supporting new developments such as IoT. The underlying reality is that the value of data decreases over time.

A complete real-time data pipeline typically consists of data ingestion (whatever the volume or variety), processing of the data the moment it arrives, and data serving for users and applications.

In the last few years, Apache Spark and Apache Kafka have become popular tools in the data architect’s toolset, as they are equipped to handle a wide variety of data ingestion scenarios and have been used successfully in mission-critical environments where demands are high.

Besides data ingestion, Apache Spark offers the ability to process data in real-time and to apply in-memory data transformations, keeping data latency as low as possible.

Proof of Concept

Scalar Data recently completed a successful proof of concept in which data ingestion was an important part of the overall solution. For this we introduced an architecture based on Apache Spark and Apache Kafka, so that the solution is scalable and ready for ever-increasing volumes of data.

In this particular proof of concept, data coming from SAP clients needed to be ingested and processed into the customer’s storage layer.

The data to be transferred from a single SAP client could amount to several tens of gigabytes, and as part of the proof this had to be ingested within a reasonable amount of time.

There were several restrictions we had to deal with. We won’t explain them in detail here, but they influenced the choices we made in the overall architecture of the solution. Security, of course, was among the top priorities.


We decided to transfer the data from the SAP client over secure HTTPS to a data ingestion API on the customer side. There were several things to take care of, such as splitting the dataset into smaller pieces and compressing them before transferring them to the ingestion API. On the server side, we had to put these parts back together in the right order and verify that all parts had been received before uncompressing them on an NFS share and processing them. The reason for splitting large datasets into smaller pieces is resilience against transfer failures: you don’t want to resend the complete dataset after a failure. Instead, by detecting where the transfer failed, you can retry from the point just before the failure, saving valuable transfer time.
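The split-compress-reassemble steps described above can be sketched in Scala using only the JDK. This is a minimal illustration, not the actual PoC code; the chunk size, part format, and names are all assumptions.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object ChunkedTransfer {
  // Each transferred part carries its sequence number and the total
  // part count, so the server can restore order and detect gaps.
  final case class Part(seqNo: Int, total: Int, gzipped: Array[Byte])

  // Client side: split the payload into fixed-size chunks and gzip each one.
  def split(payload: Array[Byte], chunkSize: Int): Seq[Part] = {
    val chunks = payload.grouped(chunkSize).toVector
    chunks.zipWithIndex.map { case (chunk, i) =>
      val bos = new ByteArrayOutputStream()
      val gz  = new GZIPOutputStream(bos)
      gz.write(chunk)
      gz.close()
      Part(i, chunks.size, bos.toByteArray)
    }
  }

  // Server side: check that every part arrived, restore the original
  // order, and uncompress. A Left signals which parts must be resent.
  def reassemble(parts: Seq[Part]): Either[String, Array[Byte]] = {
    val total   = parts.head.total
    val bySeq   = parts.map(p => p.seqNo -> p).toMap
    val missing = (0 until total).filterNot(bySeq.contains)
    if (missing.nonEmpty) Left(s"missing parts: ${missing.mkString(",")}")
    else Right((0 until total).toArray.flatMap { i =>
      val in  = new GZIPInputStream(new ByteArrayInputStream(bySeq(i).gzipped))
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      var n   = in.read(buf)
      while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
      out.toByteArray
    })
  }
}
```

In a real transfer, only the missing sequence numbers reported by the server need to be resent, which is what makes the retry-from-failure behavior cheap.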

After putting the received data back in the right order and uncompressing the content, the data is sent to a Kafka topic by a Spark producer process that listens for incoming data.
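The producer side of this step can be sketched with the standard Kafka client for the JVM. This is a hedged illustration only: the broker address, topic name, and record format are placeholders, not the PoC’s actual configuration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestProducer {
  // Publish each uncompressed record to a Kafka topic. All names below
  // ("kafka-broker:9092", "sap-ingest") are illustrative placeholders.
  def send(records: Iterator[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try records.foreach(r => producer.send(new ProducerRecord("sap-ingest", r)))
    finally producer.close() // flushes outstanding sends before closing
  }
}
```

Decoupling the producer from the downstream consumer via a topic is what lets the ingestion side absorb bursts without overloading the storage layer.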

A Spark consumer process consumes the data from this Kafka topic and transforms it into the correct format to store it in the customer’s back-end database.
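A consumer of this shape could be sketched with Spark Structured Streaming, reading from the Kafka topic and appending each micro-batch to the existing relational database over JDBC. Again, this is an assumption-laden sketch rather than the PoC code: the connection URL, topic, and table names are placeholders, and the real transformation logic is customer-specific.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IngestConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sap-ingest-consumer").getOrCreate()

    // Read the Kafka topic as a stream; broker and topic are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "sap-ingest")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // For every micro-batch: transform into the target schema (omitted
    // here) and append to the customer's existing database via JDBC.
    raw.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host/customer") // placeholder
          .option("dbtable", "ingested_sap_data")              // placeholder
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```

Because the sink is reached through a generic JDBC write, swapping the storage layer later, as discussed under Future Improvements, only changes the connection options, not the pipeline.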

Both the Spark producer and the consumer were written in the programming language Scala.

Although the scenario sketched above is a simplified version of the real one, it gives you the big picture.

The charm of this solution is that all data is processed in real-time as soon as it is received at the customer’s back-end, so no manual interventions are needed, something we still see often in practice. As a result, information is available sooner and decisions can be made much faster, giving the organization a competitive advantage by reducing unnecessary risk.

The figure below shows a simplified representation of the overall architecture.

Since development is just one part of the overall solution, we also decided to add monitoring on parts of the back-end as part of the delivery of this proof of concept. Via the dashboards we created, the customer can manage the solution and detect possible bottlenecks and defects before they become problematic.

Obviously there is much more work to do after the delivery of this proof of concept to make the solution production-ready, something we are currently working on.


In the customer’s own words: “Most proofs of concept fail, so we are happily surprised by the result of this one in such a short amount of time, thanks to the dedication and professionalism of the Scalar Data team.”

Future Improvements

One might say that the customer’s storage layer in this solution is not very scalable and may become a bottleneck as volumes increase, and that is true. However, it is straightforward to replace the storage layer with a more scalable one in the near future. In this proof of concept we had to work with the existing storage system and prove we could connect to it and deliver the results into the existing database schemas, which we did.

The overall solution is future-ready because of the underlying open-source technology we used in this proof of concept. We can easily start building models on top of the incoming streaming data and use a fast storage layer to create insights or make predictions, something the customer thought would not be easy with the architecture they had in place.

Another improvement we can think of is raising the level of self-service, by providing the customer and their clients with the right reporting tools to connect to the data storage layer.

Added Business Value

It is not too hard to see where the additional value of this solution for the customer comes from:

  1. Semi-automatic to fully automatic processing of client data saves valuable time for scarce IT resources
  2. Fewer failures, since no manual interventions are needed
  3. Insights are available earlier
  4. Better monitoring capabilities: bottlenecks can be detected before they become real problems
  5. Scalability of the solution allows the customer to connect more clients
  6. The possibility to build models and apply them to the data to make predictions


If you would like to know more about what Scalar Data can do for your organization, or you have a similar use case for which you would like to do a proof of concept, please don’t hesitate to contact us at [email protected] or via the telephone number on our website.

Ronald Span

Founder of Scalar Data, with over 20 years of experience in a variety of national and international IT projects in different roles: development, consultancy, pre-sales, management, and business development. Scalar Data helps organizations implement their big data strategy.