Synchronization of huge amount of data between two systems can sometimes be problematic, especially if one of the systems is accessed through a slow API.
The service we had to integrate with was exposed through SOAP API. The API itself was slow and it had ~4,5 seconds time-lag per 10 records, but there was a bigger problem - Python SOAP libraries/implementations are really, really slow, f.e. +20 sec per request to process them.
So in a case of 500,000 records, it would take approximately one week to get all of the data. So, we had to find a way to boost the synchronization time.
Here are all the details about what we had done to find a suitable solution for this problem.
Attempt 1: Finding a suitable Python library
The first idea as mentioned earlier - was to find a suitable Python library. We found the suds library that allows SOAP to Python objects transformation. Unfortunately, suds library’s last update was from five years ago. After a thorough search, we found a suds-jurko, a fork of the original suds. In theory, module did a query to the server and got the objects. Then we would have to map the received objects to our database. However, the matter of slow synchronization lasted. In practice, the module needed 20 seconds to parse a single API response. Add 4,5 seconds from servers application and you get almost 25 seconds per 10 records. We wanted to faster synchronization, so we abandoned the idea of using suds library.
Attempt 2: Implementing a Java wrapper
The second approach was in some way similar to the first – we could utilise some Java tools and generate whole POJO client. In fact InteliJ IDEA was very helpful with that - it tool 20 minutes to generate simple POC. But it would add another layer of abstraction and technology to the system. So finally we abandoned this approach.
Attempt 3: Dump all XML responses and work with them locally
Finally, we decided to store all API responses (raw XMLs) the way SOAP api delivered them and then think later about import method.
The data dump took roughly 35 hours; for some reason,some of the data had two endpoints on the SOAP server. Finally, we had all raw XML files in our database (PostgreSQL with XML native types). All we had to do is to come up with some kind of mapping and an importer.
We used Python xmltodict library to transform XML fields into a Python dictionary.
Then, the newly formed Python dictionary would be stored JSONB field in Postgres.
Seems we hit a bullseye. It turns out PostgreSQL is quite good at filtering over nested JSONB fields:
syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin'; +--------+ | count | |--------| | 198762 | +--------+ SELECT 1 Time: 3.279s
syncDB> create index on jbjorder ((data -> 'client' -> 'delivery_address' ->> 'city')); CREATE INDEX Time: 3.910s syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin'; +--------+ | count | |--------| | 198762 | +--------+ SELECT 1 Time: 0.018s
The use of JSONB allowed us to gradually transfer our data from document-like JSONB field into old, good relational schema.
Splitting responsibility for data fetching and transformation was a good decision. First was to store all of the raw XMLs to save time before we even came up with import implementation - it is always easier to work with data locally. The second was the application of the JSONB field in PostgreSQL.
So general cue in tackling such problems would be to divide it into smaller problems and attacking them one by one. These way you can focus on simple solutions and engage more team members in creative process of problem solving!