Let’s Get The Data Ready!
Every company strives to improve its business and decision-making processes, with the ultimate goal of reducing operational costs, increasing customer satisfaction, and, consequently, profit. Decision support systems, advanced data analytics, and machine learning algorithms are tools for achieving these goals, but all of them share an important prerequisite: clean and valid data!
This is where a data processing workflow called Extract, Transform, and Load (ETL) comes into play, and we will try to give you a short overview of it.
During regular operations, companies collect a vast amount of transactional data, such as inventory data, incoming and outgoing prices, stock data, purchase records, etc. These data are stored permanently in On-line Transaction Processing (OLTP) systems. In most cases, OLTP systems are actually good old relational databases, but they can also be NoSQL databases, cloud storage, or even magnetic tapes in some traditional setups.
In any case, OLTP systems are designed and optimized for quickly accepting and storing large amounts of transactional data. However, because of the sheer data volume and the many errors it typically contains, analyzing OLTP data directly is practically useless. Therefore, the data is:
1. extracted from the OLTP systems,
2. then cleaned, enhanced, extrapolated, etc. (or, more simply, transformed), and
3. finally, loaded into an analytics system, e.g. JourneyTree 😊.
The extraction step must be flexible and support many different source systems. By relying on the vast ecosystem of Python-based technologies, we at JourneyTree currently support relational databases as well as CSV and text files, and we will support many more source types in the future.
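To make the extraction step concrete, here is a minimal sketch using pandas and the standard library's sqlite3 module. SQLite stands in for any OLTP database, an in-memory buffer stands in for a real CSV file, and the table and column names (`sales`, `sku`, `qty`, `price`) are hypothetical, not JourneyTree's actual schema.

```python
import io
import sqlite3

import pandas as pd

# --- Extract from a relational database (SQLite stands in for the OLTP source) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sku TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A1", 3, 9.99), ("B2", 1, 24.50)],
)
db_frame = pd.read_sql_query("SELECT sku, qty, price FROM sales", conn)

# --- Extract from a CSV source (an in-memory buffer stands in for a real file) ---
csv_frame = pd.read_csv(io.StringIO("sku,qty,price\nC3,5,4.75\n"))

# Combine both sources into one raw data set, ready for the transform step.
raw = pd.concat([db_frame, csv_frame], ignore_index=True)
```

The point is that both sources end up as the same tabular structure, so the downstream transformation code does not need to care where a record came from.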
The transformation step is arguably the most interesting and the most demanding. During this process, we may identify duplicate records, interpolate missing time-series data, expand abbreviations into standard terms, and much more, all with the goal of obtaining a clean data set ready for advanced analytics algorithms. For more information on the algorithms, please check our other blogs.
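The three transformations just mentioned can be sketched in a few lines of pandas. The raw data below is made up for illustration: it contains an injected duplicate row, a gap in the time series, and abbreviated category names.

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract: one missing value and inconsistent abbreviations.
raw = pd.DataFrame({
    "day": pd.date_range("2023-01-01", periods=5, freq="D"),
    "category": ["qty", "qty", "amt", "qty", "amt"],
    "value": [10.0, np.nan, 30.0, 40.0, 50.0],
})
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)  # inject a duplicate row

clean = (
    raw.drop_duplicates()                                   # 1. remove duplicate records
       .assign(value=lambda d: d["value"].interpolate())    # 2. fill the time-series gap
       .replace({"category": {"qty": "quantity",            # 3. expand abbreviations
                              "amt": "amount"}})
)
```

After this pipeline the duplicate is gone, the missing value is linearly interpolated from its neighbours, and every category name uses the standard spelling.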
Finally, the data is loaded both into proprietary in-memory data structures, to allow fast execution of our algorithms, and into a database, to ensure the persistence of the data and the results.
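As a rough illustration of this dual load, the sketch below uses a plain dict as a stand-in for a specialised in-memory structure and SQLite as a stand-in for the persistent database; the `inventory` table and the column names are hypothetical.

```python
import sqlite3

import pandas as pd

# A tiny cleaned data set produced by the transform step (made-up values).
clean = pd.DataFrame({"sku": ["A1", "B2"], "quantity": [3, 1]})

# 1. In-memory structure for fast lookups by the analytics algorithms.
by_sku = {row.sku: row.quantity for row in clean.itertuples()}

# 2. Persistent copy in a database, so data and results survive restarts.
conn = sqlite3.connect(":memory:")
clean.to_sql("inventory", conn, index=False)
stored = conn.execute("SELECT COUNT(*) FROM inventory").fetchone()[0]
```

The in-memory copy answers lookups without touching the database, while the database copy guarantees nothing is lost between runs.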
Of course, the whole ETL process must be fast and reliable, so that JourneyTree has up-to-date and correct results every morning. But that’s our job to make happen… 😉