An introduction to EvoFlow
EvoFlow is a versatile platform, developed by Evo’s data engineering and DevOps teams, that creates, schedules, and monitors workflows.
Codifying workflows allows the development teams to create and share versions of their products. This collaborative process can all happen while maintaining the structure of those products or, even better, while improving those structures.
EvoFlow was created as an enhanced flow of checks for ETL, scraper, and other data pipelines, but its functionality extends far beyond these uses.
The EVO way
So why did we create our own platform? After all, the widely adopted Airflow (https://airflow.apache.org/) can do everything that EvoFlow does; it has long been the standard. Although Airflow’s usefulness is not in dispute, the tool can be somewhat unwieldy: with so much functionality, it can actually slow down simple checks. We needed a tool to simplify ETL and scraper flows, not to complicate the process further.
The EVO data engineering and DevOps teams decided to build a simplified platform that would accomplish our goals more quickly. We named it EvoFlow, as it was born as a flow of flow checks.
Creating EvoFlow gave the team some significant advantages over Airflow. With EvoFlow, we benefit from:
- a simpler structure, thanks to the platform’s reduced set of functionalities;
- greater control, since we know the entire platform;
- higher adaptability, because EvoFlow’s code is built on our existing infrastructure;
- lower infrastructure costs, due to the platform’s reduced requirements;
- more flexible alert management, through integration with the widely adopted Alertmanager (https://prometheus.io/docs/alerting/latest/alertmanager/).
EvoFlow has six key components, all of which you can see in Figure 1:
- Sources include JSON, R, SQL, txt, Python, and CSV files that load data used in the desired task;
- Rules specify all of the information required by the platform to perform a task and come in .yml files;
- Functions are defined in Python and detail the actions to carry out;
- Logs are printed at execution time and stored in the SQL database;
- Alerts are managed by Alertmanager;
- Reports are built as Metabase (https://www.metabase.com/) dashboards.
Source files come in several formats and are generally used to load data into the EvoFlow platform. The only requirement is that sources must not alter the data during the load. As an example, SQL queries should only include SELECT statements:
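A minimal sketch of what such a read-only source might look like; the table and column names here are hypothetical, not taken from the actual platform:

```sql
-- Hypothetical source query: read-only, no INSERT/UPDATE/DELETE.
-- Loads the most recent ETL output rows for downstream checks.
SELECT order_id,
       client_name,
       total_amount,
       loaded_at
FROM   etl_output.orders
WHERE  loaded_at >= CURRENT_DATE - INTERVAL '1 day';
```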
Rules are .yml files that define EvoFlow requirements (data), paths to the data sources (sources), check requirements (acceptances), error messages (error_message), descriptions (check_descriptions), and severities (severity):
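The keys below follow the structure just described (data, sources, acceptances, error_message, check_descriptions, severity), but the names and values are purely illustrative:

```yaml
# Hypothetical EvoFlow rule file; key names follow the description
# above, while project names and values are illustrative only.
data: client_orders
sources:
  - sources/client_orders.sql
acceptances:
  - no_null_ids
  - positive_amounts
error_message: "Client orders failed a quality check."
check_descriptions:
  no_null_ids: "Every order must have a non-null order_id."
  positive_amounts: "total_amount must be strictly positive."
severity: warning
```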
Functions specify all of the actions and checks required by the rule acceptances. They are generally written in Python:
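As a sketch of the idea, the functions below implement the kind of acceptances a rule might name. EvoFlow’s real function signatures are not shown in this article, so the names and structure here are assumptions:

```python
# Hypothetical EvoFlow-style check functions. Each check receives the
# loaded rows and returns (passed, message); run_checks collects the
# results for every acceptance named in a rule.

def no_null_ids(rows):
    """Fail if any row is missing its order_id."""
    bad = [r for r in rows if r.get("order_id") is None]
    return len(bad) == 0, f"{len(bad)} rows with null order_id"

def positive_amounts(rows):
    """Fail if any total_amount is zero or negative."""
    bad = [r for r in rows if r.get("total_amount", 0) <= 0]
    return len(bad) == 0, f"{len(bad)} rows with non-positive total_amount"

def run_checks(rows, checks):
    """Run every check and collect (name, passed, message) tuples."""
    return [(check.__name__, *check(rows)) for check in checks]
```

In this sketch, a failed check would be what triggers a log entry and, depending on the rule’s severity, an alert.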
Logs are stored in the SQL database, alerts are managed by Alertmanager, and results are reported as Metabase dashboards.
EvoFlow primarily handles checks for ETL and scraper automations. That said, it can also execute virtually any script and manage the corresponding alerts.
EvoFlow is currently used to:
- check the quality of the outputs of the client ETL;
- check the quality of the data resulting from automated scrapers;
- analyse tool folders and identify issues in the automations;
- import daily exchange rates into the SQL database.
If we need to expand our use of EvoFlow in the future, the platform has the flexibility necessary to accomplish those tasks, as well.
The development of the EvoFlow platform has been a game-changer within the EVO organisation. With this new piece of infrastructure, we have been able to:
- identify and correct issues and bugs in the automations at least four times faster;
- create new checks on data and processes and monitor them by using dashboards;
- alert all people involved in the use of a tool, not just the developers;
- create self-standing flows that can live outside of our most-used pipelines;
- automatically evaluate code quality before ever running its flow.
To evaluate EvoFlow’s performance, we wanted to quantify its impact. We have primarily measured the time spent on a task and the number of alerts raised. Time spent on these tasks was reduced by two to four times, depending on the task. The number of alerts has dropped by approximately 10% per week.
The number of alerts has continuously dropped primarily thanks to the integration of EvoFlow with Alertmanager. This integration allows us to manage alerts according to the name of the associated project and the severity level of the alert (warning, error, critical). It also automatically creates GitLab issues for triggered alerts, ensuring that each alert is properly prioritized and that the tasks to resolve it are delegated to the right team members.
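Routing by project name and severity could be expressed in Alertmanager roughly as follows; the label names (project, severity) and receiver names are illustrative assumptions, not the actual configuration:

```yaml
# Hypothetical Alertmanager routing fragment: alerts are grouped by
# project and dispatched by severity. Receiver names are made up.
route:
  receiver: default-team
  group_by: [project]
  routes:
    - matchers:
        - severity = "critical"
      receiver: on-call-team
    - matchers:
        - project = "client-etl"
      receiver: etl-team
```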
Our early results show promise for the young EvoFlow platform. It has already delivered significant ROI, and we are optimistic that we will see even better performance soon. While we cannot suggest that a simplified platform like EvoFlow is the right solution for every team, we are pleased with its success so far.
About the author
Tobia Tudino is a Data Scientist at Evo. After obtaining his PhD in climate science at the University of Exeter, Tobia worked as a Data Engineer in London. He specialises in the analysis of customers’ digital experience and in the creation of complex machine learning algorithms using R, Python, Amazon SageMaker, and Google DataLab. He is particularly interested in translating complicated concepts into simpler terms. In his free time, Tobia loves swimming, reading, and exploring new challenges.