ETL Revelations: Navigating Data Migration with Foresight and Agility
- Mohamed Macow
- ETL, Data migration, Database management
- February 8, 2024
Unveiling the Hidden Chapters of ETL Migration
ETL (Extract, Transform, Load) processes can often feel like navigating through a dense forest, with each step presenting new challenges and discoveries. In this blog post, I’m going to share some of the pivotal lessons and experiences I gained from being responsible for a system and data migration.
Utilize Expert Knowledge
In every migration project, the data you’re handling is more than just rows and columns; it’s a narrative. It’s crucial to involve those with intimate knowledge of this narrative. Domain experts can provide context and help map out the complex relationships within the database. Their insights ensure that the data migration accurately reflects the underlying business logic, which might not be immediately apparent from the data alone.
Understanding old data can be very tough, especially when dealing with incredibly old data or data models that have changed rapidly. Maybe there have been drastic changes to the business logic, meaning some data is allowed to look like X while other data must look like Y.
- Involve domain experts early in the process.
- Document the data narrative and business logic.
- Validate your data model with those who know it best.
Business layer or database?
In the journey of data migration, the road less traveled—direct database interaction—often leads to a more precise destination. Direct manipulation via raw SQL not only grants us the control needed to navigate the complex terrain of data migration but also helps us sidestep the pitfalls that business logic may conceal.
While the business layer might initially seem like a guide, it can quickly turn into a maze of workarounds and “magic spells” needed to counteract its unintended effects. Yes, I know what you’re thinking: it does make it easier to ensure the data fits the actual criteria, but the tradeoff is a lot of cleanup across different tables, extra variables acting as flags, and finished data mixed with unfinished data. How would you import an entire entity as a single transaction if the backend sits in the middle of everything?
Loading historical data is also very important, as that data is often the reason the business exists. These entities can easily be stamped with an extra field called legacy_id, which will be very helpful when dealing with creation or update flows.
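As a rough sketch, assuming a PostgreSQL database accessed through Go’s database/sql (the customers and customer_emails tables here are hypothetical, and a Postgres driver such as github.com/lib/pq is assumed to be registered elsewhere), importing one entity as a single raw-SQL transaction could look like this:

```go
package etl

import "database/sql"

// importCustomer writes one legacy entity into the new schema as a single
// transaction: either every row lands, or none of them do.
func importCustomer(db *sql.DB, legacyID int64, name, email string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	// Roll back automatically unless we reach Commit below.
	defer tx.Rollback()

	var customerID int64
	err = tx.QueryRow(
		`INSERT INTO customers (legacy_id, name) VALUES ($1, $2) RETURNING id`,
		legacyID, name,
	).Scan(&customerID)
	if err != nil {
		return err
	}

	// Related rows join the same transaction, so a failure here also
	// rolls back the customer row above.
	if _, err := tx.Exec(
		`INSERT INTO customer_emails (customer_id, email) VALUES ($1, $2)`,
		customerID, email,
	); err != nil {
		return err
	}

	return tx.Commit()
}
```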
Avoid deleting as much as possible :)
If you’re dealing with tons of data, deleting and starting over might be calling your name - but please keep in mind the tradeoff you’re going to be making here. Yes you will work on a clean slate, but a CREATE and a DELETE is often more expensive than an UPDATE. Perhaps you might not even need the UPDATE. Maybe a SELECT and comparing programatically shows the entity hasn’t changed at all, so you can continue.
Before making a decision on whether to update an existing entity or create a new one, consider leveraging the legacy_id. This identifier serves as a checkpoint: if it’s absent, it indicates the need to construct a new entity from scratch. On the flip side, the presence of a legacy_id can streamline your process, sparing considerable effort and significantly enhancing your ETL script’s efficiency by avoiding redundant operations.
- Use raw SQL for direct database manipulation.
- Avoid the potential pitfalls of ORM frameworks during migration.
- Maintain control and improve performance with tailored queries.
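To make the legacy_id checkpoint concrete, here is a minimal sketch, again against a hypothetical customers table: look the entity up by legacy_id, compare, and only write when something actually changed.

```go
package etl

import (
	"database/sql"
	"errors"
)

type Customer struct {
	LegacyID int64
	Name     string
	Email    string
}

// upsertByLegacyID creates the entity if its legacy_id is unknown, updates it
// if it changed, and does nothing at all if it is identical to what we have.
func upsertByLegacyID(db *sql.DB, c Customer) error {
	var existing Customer
	err := db.QueryRow(
		`SELECT legacy_id, name, email FROM customers WHERE legacy_id = $1`,
		c.LegacyID,
	).Scan(&existing.LegacyID, &existing.Name, &existing.Email)

	switch {
	case errors.Is(err, sql.ErrNoRows):
		// No checkpoint: build the entity from scratch.
		_, err = db.Exec(
			`INSERT INTO customers (legacy_id, name, email) VALUES ($1, $2, $3)`,
			c.LegacyID, c.Name, c.Email,
		)
		return err
	case err != nil:
		return err
	case existing == c:
		// Nothing changed: skip the UPDATE entirely.
		return nil
	default:
		_, err = db.Exec(
			`UPDATE customers SET name = $2, email = $3 WHERE legacy_id = $1`,
			c.LegacyID, c.Name, c.Email,
		)
		return err
	}
}
```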
Emphasize Integration Testing
You want to be fast. You want to be precise. You want to know if you might have collapsed the house of cards.
Just as in regular system development, integration tests are going to be an integral part of your data migration. When dealing with complex data where you have to ensure everything looks correct, you avoid a lot of banging your head on the keyboard by writing some simple tests.
Personally I recommend splitting up the functions into three different segments:
- Functions responsible for extracting, where data is carefully sourced from different systems
- Functions responsible for transforming, the critical phase where you go from the old data model to the new data model
- Functions responsible for loading, the final step where you put the data safely into its new nest
This way you can be sure that each step behaves as expected. Put extra emphasis on the transforming functions.
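As a minimal, hypothetical sketch of that split (the LegacyOrder and Order types and the status mapping are made up for illustration):

```go
package etl

import "database/sql"

// LegacyOrder mirrors a row in the old system; Order is the new model.
type LegacyOrder struct {
	ID     int64
	Status string // free-text in the old system
}

type Order struct {
	LegacyID int64
	State    string // constrained enum in the new system
}

// extractOrders pulls raw rows from the old database. No business logic here.
func extractOrders(oldDB *sql.DB) ([]LegacyOrder, error) {
	rows, err := oldDB.Query(`SELECT id, status FROM orders`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []LegacyOrder
	for rows.Next() {
		var o LegacyOrder
		if err := rows.Scan(&o.ID, &o.Status); err != nil {
			return nil, err
		}
		out = append(out, o)
	}
	return out, rows.Err()
}

// transformOrder maps the old model onto the new one. This is where the
// interesting (and test-worthy) decisions live.
func transformOrder(in LegacyOrder) Order {
	state := "unknown"
	switch in.Status {
	case "open", "pending":
		state = "active"
	case "done":
		state = "completed"
	}
	return Order{LegacyID: in.ID, State: state}
}

// loadOrder writes the transformed entity into the new database.
func loadOrder(newDB *sql.DB, o Order) error {
	_, err := newDB.Exec(
		`INSERT INTO orders (legacy_id, state) VALUES ($1, $2)`,
		o.LegacyID, o.State,
	)
	return err
}
```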
Rapid and safe iteration is the name of the game.
Before I wrote tests, I often found myself accidentally breaking things or introducing regressions. Don’t be like me.
- Create a suite of integration tests.
- Automate testing to run with every change.
- Catch issues early and maintain data integrity.
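A sketch of what such a test could look like, here a plain table-driven Go test against the hypothetical transformOrder function from the sketch above; fuller integration tests would also exercise the extract and load steps against a test database.

```go
package etl

import "testing"

// TestTransformOrder pins down how legacy statuses map onto the new enum,
// so a refactor that changes the mapping fails loudly instead of silently.
func TestTransformOrder(t *testing.T) {
	cases := []struct {
		name string
		in   LegacyOrder
		want string
	}{
		{"open maps to active", LegacyOrder{ID: 1, Status: "open"}, "active"},
		{"done maps to completed", LegacyOrder{ID: 2, Status: "done"}, "completed"},
		{"unrecognized falls back to unknown", LegacyOrder{ID: 3, Status: "???"}, "unknown"},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got := transformOrder(tc.in)
			if got.State != tc.want {
				t.Errorf("transformOrder(%q).State = %q, want %q", tc.in.Status, got.State, tc.want)
			}
			if got.LegacyID != tc.in.ID {
				t.Errorf("legacy_id not carried over: got %d, want %d", got.LegacyID, tc.in.ID)
			}
		})
	}
}
```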
Transparency
When the script reaches the point where it is running remotely on a server node, you need to know where things go wrong. Locally you can run your debugger and see errors, but for longer migration runs dealing with tons of data, that transparency quickly fades. You need to know when an ETL process has started, what its outcome was and, most importantly, what went wrong. Something will go wrong, so you had better be prepared with actionable error logs. Personally, what has seemed to help a lot is having two supporting tables.
etl_process_logs
This table will be responsible for summarizing your runs. Ideally it would also contain information about state changes, for example how many entities were created, how many were updated and how many errors made your transactions quit.
A row in this table can also easily function as a continuation point if you want to work with buffer periods. It is one thing to run the script the first time, when entities are created, but you also need to update! This is where the continuation point is very useful.
When pulling out your data, you could perhaps look at an updated_at timestamp and fetch based on that, with respect to your buffer period, so you also catch entities that were created during the last run.
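A sketch of how that continuation point could work, assuming (my assumption, not a fixed schema) that etl_process_logs has a started_at column and the hypothetical source table exposes updated_at:

```go
package etl

import (
	"database/sql"
	"errors"
	"time"
)

// fetchChangedSince uses the previous run (logged in etl_process_logs) as a
// continuation point, minus a buffer, so entities touched during that run
// are picked up again rather than silently skipped.
func fetchChangedSince(oldDB, newDB *sql.DB, buffer time.Duration) (*sql.Rows, error) {
	var lastStart time.Time
	err := newDB.QueryRow(
		`SELECT started_at FROM etl_process_logs ORDER BY started_at DESC LIMIT 1`,
	).Scan(&lastStart)
	switch {
	case errors.Is(err, sql.ErrNoRows):
		// First run: take everything.
		return oldDB.Query(`SELECT id, name, updated_at FROM customers`)
	case err != nil:
		return nil, err
	}

	cutoff := lastStart.Add(-buffer)
	return oldDB.Query(
		`SELECT id, name, updated_at FROM customers WHERE updated_at >= $1`,
		cutoff,
	)
}
```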
etl_error_logs
This table will save you tons of grey hairs. The idea is to create a row for each error that made a transaction quit. The row in the etl_error_logs table should have a foreign key to the process in which the error occurred. That gives you more transparency regarding introduced regressions, or which segment of the data has certain issues. Feel free to also have a column where you save the error message, plus the primary key and perhaps the source of the data. Later, when debugging, you can focus on pulling out only that specific entity and see where the chain jumps off.
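Recording such an error row can be as small as this sketch (column names are my own, illustrative choices):

```go
package etl

import "database/sql"

// logETLError records one failed entity: which run it belonged to, which
// source row it was, and what the database or transform step complained about.
// The migration then moves on to the next entity instead of dying.
func logETLError(db *sql.DB, processID int64, source string, sourceID int64, cause error) error {
	_, err := db.Exec(
		`INSERT INTO etl_error_logs (process_id, source, source_id, error_message, created_at)
		 VALUES ($1, $2, $3, $4, NOW())`,
		processID, source, sourceID, cause.Error(),
	)
	return err
}
```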
Performance!
Which engineer doesn’t love tinkering with the engine to see if the wheels can move a little faster? If you follow my advice from the previous segment regarding the separation of function responsibilities, you’re already walking right up the alley where your code becomes more scalable.
The magic word here is concurrency. Ideally, each phase of the ETL process should operate independently and concurrently. This isn’t always straightforward, but even partial concurrency can yield substantial improvements. In my experience, an application of concurrent processing slashed processing times from a sluggish 5-8 seconds down to a brisk 500 milliseconds for each entity.
Identifying bottlenecks is your first step. Sometimes, you’re network-bound, waiting on responses from external APIs or systems. Here, concurrency shines, allowing other parts of your pipeline to process data while awaiting these responses. It’s about doing more in the same amount of time.
Look for opportunities to parallelize tasks. Even simple adjustments can lead to significant throughput improvements. Tools like Go’s goroutines and channels offer straightforward paths to improving speed. And speed can be very important if you are dealing with lots of data that might also be getting updated a lot.
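A sketch of that kind of partial concurrency, reusing the hypothetical transformOrder and loadOrder functions from earlier: a small pool of workers transforms and loads entities in parallel while the caller keeps feeding the channel (the worker count is purely illustrative).

```go
package etl

import (
	"database/sql"
	"log"
	"sync"
)

// processConcurrently fans legacy orders out to a fixed pool of workers, each
// transforming and loading independently, so a slow INSERT or network call on
// one entity no longer blocks the whole pipeline.
func processConcurrently(newDB *sql.DB, orders []LegacyOrder, workers int) {
	jobs := make(chan LegacyOrder)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for legacy := range jobs {
				order := transformOrder(legacy)
				if err := loadOrder(newDB, order); err != nil {
					log.Printf("load failed for legacy_id=%d: %v", legacy.ID, err)
				}
			}
		}()
	}

	for _, o := range orders {
		jobs <- o
	}
	close(jobs)
	wg.Wait()
}
```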
Leverage SQL’s explain feature
This is an underutilized gem. Running EXPLAIN on your queries can unveil missing indices or other inefficiencies slowing down your database interactions. Adding a well-placed index can turn a slow query into a lightning-fast operation.
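You can even run it from the migration code itself; a small sketch that prints PostgreSQL’s plan for one of the hot queries (the query and table are hypothetical):

```go
package etl

import (
	"database/sql"
	"fmt"
)

// explainQuery prints the execution plan for a query so missing indices and
// sequential scans show up before they slow the whole migration down.
// Note that EXPLAIN ANALYZE actually executes the statement.
func explainQuery(db *sql.DB) error {
	rows, err := db.Query(`EXPLAIN ANALYZE SELECT * FROM customers WHERE legacy_id = 42`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return err
		}
		fmt.Println(line)
	}
	return rows.Err()
}
```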
Enhancing performance in ETL isn’t just about speed; it’s about making your data pipelines more resilient, scalable, and efficient. With the right approach to concurrency and a keen eye for bottlenecks, you can transform a sluggish process into a high-performance engine that powers your data infrastructure forward.
Remember, the goal isn’t just to make your current processes faster; it’s to build a foundation that can handle the increasing scale and complexity of data workloads with grace and efficiency.
Benchmarks of Success
How do you know your ETL process is successful? By using comparison metrics. Comprehensive comparison metrics. These are the signposts that guide you through the fog of data transformation, offering clear evidence of whether your migration has hit its mark. By meticulously comparing key performance indicators (KPIs) before and after the migration, you can understand the impact of your work, ensuring that no detail is overlooked.
Analyze KPIs
KPIs are the north star of your analysis, guiding the evaluation of your ETL process. Compare these indicators across both systems to gauge improvements or declines in performance, user satisfaction, and business outcomes. This comparison not only validates the migration’s success but also identifies areas for further optimization.
Benchmark and validate
Validation through data ensures that your ETL process has not only transferred information from point A to point B but has also enhanced the value of your data, making it more accessible, accurate, and actionable than ever before.
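A small sketch of what that validation can look like in its simplest form, comparing row counts between the old and the new system (the customers table is hypothetical; the same pattern works for sums of amounts or other KPIs):

```go
package etl

import (
	"database/sql"
	"fmt"
)

// compareCounts is a crude but effective benchmark: if the old and the new
// system disagree on how many customers exist, something was dropped or
// duplicated along the way.
func compareCounts(oldDB, newDB *sql.DB) error {
	var oldCount, newCount int64
	if err := oldDB.QueryRow(`SELECT COUNT(*) FROM customers`).Scan(&oldCount); err != nil {
		return err
	}
	if err := newDB.QueryRow(`SELECT COUNT(*) FROM customers`).Scan(&newCount); err != nil {
		return err
	}
	if oldCount != newCount {
		return fmt.Errorf("count mismatch: old=%d new=%d", oldCount, newCount)
	}
	return nil
}
```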
The Historical Data Conundrum
Sometimes, the strict data constraints that maintain integrity under normal operations can impede the migration of historical data. In such cases, features like PostgreSQL’s deferred constraint checks become invaluable, allowing for the temporary relaxation of rules to facilitate a smooth transition.
- Use PostgreSQL’s deferred constraint checks for greater flexibility.
- Maintain data integrity while accommodating the needs of historical data.
- Balance the preservation of data relationships with the practicalities of migration.
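A sketch of what deferred checks can look like in practice, assuming the relevant constraints were declared DEFERRABLE: within a transaction, the checks are postponed until COMMIT, so historical rows that reference each other can be inserted in whatever order they arrive.

```go
package etl

import "database/sql"

// loadHistoricalBatch defers constraint checks until commit, so rows that
// reference each other can be inserted out of order within one transaction.
func loadHistoricalBatch(db *sql.DB, stmts []string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Only affects constraints that were created as DEFERRABLE.
	if _, err := tx.Exec(`SET CONSTRAINTS ALL DEFERRED`); err != nil {
		return err
	}

	for _, stmt := range stmts {
		if _, err := tx.Exec(stmt); err != nil {
			return err
		}
	}

	// All deferred foreign-key checks run here, at commit time.
	return tx.Commit()
}
```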
Final thoughts…
Data migration is as much an art as it is a science. It requires a deep understanding of both legacy and new systems, and an ability to be skeptical of the quality of the data flowing through the pipeline.
For those of you facing the daunting task of ETL migration, I hope my experiences can serve as a beacon, guiding you to approach these challenges with confidence and agility.
Should you be burning with questions, feel free to contact me or the team; we’re open to having a chat.