In this final part of our multi-part post (see part 1 and part 2), I will discuss the various transports which can be used to synchronise data between applications.
Unlike the previous two parts, the transports discussed in this post are not limited to synchronisation; they apply equally to integration.
File transfer
File transfer is the old-school method of data exchange and has existed for as long as computers have exchanged data. One application creates a file and places it in an agreed location; the receiving application then reads, processes and does something with the file.
Pros
Secure: File transfer can be easily locked down, making it a relatively secure means to exchange data. SFTP (over SSH) and FTPS are both solid protocols, meaning remote exchange over the internet can be made secure.
Cons
Difficult to audit: Compared with other mechanisms, files typically offer no natural means of auditing or traceability.
Extra infrastructure: Remote servers require an FTP/SSH solution to exchange files, adding another link to the chain of things that can fail.
Fragile: The two sides can easily get out of sync, and a failure can require significant manual intervention to re-establish normal operation.
CSV and other column-based formats such as fixed-width text and Excel suffer from brittleness, where file schema modifications (field additions and deletions) require significant re-configuration on the receiving end.
Tips for use
One transaction per file: It’s much easier to identify a particular transaction, and re-processing is easy should something go awry.
For CSV and other column-based formats, always add new fields to the end of the file and never remove existing ones; better still, don't use these formats at all!
Use a resilient data format with good API support, such as Xml or JSON. Do not go overboard, however.
Try to avoid Xml schema where possible. We feel schemas add unnecessary constraints to the format, where a substantial amount of the inherent flexibility of the format is lost. They add a level of complexity not required in the SME space.
Always use a recognised API/library when consuming and producing Xml/JSON data. This will at minimum ensure the output is standards-compliant. We've seen several instances where organisations (that should know better) provide us with improperly escaped Xml, simply because they weren't using an API to produce it.
After files are processed move them to an archive directory; don’t delete.
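The escaping problem described in the tips above can be demonstrated with a short Python sketch (the customer name is a hypothetical value): the standard library's xml.etree.ElementTree escapes special characters that naive string concatenation leaves in place.

```python
import xml.etree.ElementTree as ET

# Naive string concatenation: special characters are not escaped,
# producing malformed Xml that a compliant parser will reject.
name = 'Smith & Sons <Wholesale>'
naive = '<customer><name>' + name + '</name></customer>'

# Using the standard library API: characters are escaped automatically.
customer = ET.Element('customer')
ET.SubElement(customer, 'name').text = name
proper = ET.tostring(customer, encoding='unicode')

print(proper)  # <customer><name>Smith &amp; Sons &lt;Wholesale&gt;</name></customer>
```

Round-tripping `proper` through any standards-compliant parser recovers the original name, whereas parsing `naive` fails on the bare ampersand.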
Database & staging tables
Our preferred means for both integration and synchronisation: the use of databases & staging tables provides a much more robust transport than file exchange.
Staging tables are an intermediate set of database tables where the sender records pending data for the recipient(s) to later process.
Pros
SQL: Provides a vendor- and format-independent means to query and update a database, and table schema changes are usually non-breaking.
Audit and Traceability: SQL/Databases provide an easy means to record and query any key data required for auditing purposes e.g. timestamps, error & warning messages.
Durability and Recoverability: Databases are naturally durable and far less vulnerable to corruption. In the event of a failure recovery can be made quickly assuming a backup is available.
Cons
Security: Where interchange takes place over the internet, the database needs to be open to the outside world, leading to potential attacks. These can be mitigated by restricting the connecting IP addresses and using complex passwords, amongst other measures.
Tips for Use
Always have the following fields in a staging table:
Status: Used to record whether a record is pending or processed. Always include an error status which allows you to identify erroneous transactions quickly. It is recommended that an index is set up on the status field to help speed querying.
Last Modified Date: A timestamp to record when the last update was made.
Last Update By: Used to record which application made the last update.
Reason/error/message: Where there’s an error or a warning generated as part of the integration or synchronisation it should be recorded against the original record. This then gives you a robust audit trail which can be used later to trace problems.
Foreign Record Id: Where a recipient application generates a transaction or record id (such as an Order, Invoice, or even Customer number) update the staging table with this reference.
Backup frequently: Depending on the volume of data and the location of the database, i.e. for high-volume sites or where the database is publicly accessible, this may mean several times a day.
Always update the staging table immediately after writing to the destination application.
Where processing takes a long time, consider setting the status field to an intermediate value to prohibit other integration processes from processing the same data.
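As a minimal sketch of the tips above, assuming an in-memory SQLite database and hypothetical table, field and application names:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE order_staging (
        id                INTEGER PRIMARY KEY,
        payload           TEXT NOT NULL,                     -- the pending data
        status            TEXT NOT NULL DEFAULT 'PENDING',   -- PENDING/PROCESSING/PROCESSED/ERROR
        last_modified     TEXT DEFAULT CURRENT_TIMESTAMP,    -- when the last update was made
        last_update_by    TEXT,                              -- which application made the update
        message           TEXT,                              -- reason/error/warning for the audit trail
        foreign_record_id TEXT                               -- id assigned by the recipient application
    )""")
# Index the status field to help speed querying.
conn.execute("CREATE INDEX ix_order_staging_status ON order_staging(status)")

# Sender records a pending row.
conn.execute("INSERT INTO order_staging (payload, last_update_by) VALUES (?, ?)",
             ('{"order": 1001}', 'web-shop'))

# Receiver claims pending rows via an intermediate status, prohibiting
# other integration processes from processing the same data.
conn.execute("UPDATE order_staging SET status = 'PROCESSING', last_update_by = 'sync' "
             "WHERE status = 'PENDING'")

# ... write to the destination application, then update the staging table
# immediately, recording the reference the recipient generated.
conn.execute("UPDATE order_staging SET status = 'PROCESSED', foreign_record_id = ? "
             "WHERE status = 'PROCESSING'", ('SO-5001',))
conn.commit()
```

The same pattern applies regardless of database vendor; only the timestamp and indexing syntax varies.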
Message queue
A message queue is a transport level asynchronous protocol where a sender places a message into a queue until the receiver removes the message to process it. One of the key attributes of a message queue, especially for data synchronisation, is that messages are guaranteed to be delivered in FIFO order.
Message queues have traditionally been limited to larger scale integration and synchronisation and where both sender and receiver exist on the same network. However, with the explosion of cloud computing and an agreed standard (AMQP), there is an increasing prevalence of low cost services:
Storm MQ: Currently in beta, this 'free' service looks promising.
Pros
FIFO: Messages are guaranteed to be delivered in order.
Robust and Transactional: Message queues can be backed up, and are naturally transactional.
Cons
Message Queues are a Transport: The data format of the message still needs to be agreed between sender and recipient.
Tips for use
Use a flexible data format such as Xml.
File and database recovery is well understood; message queues less so. Understand how to administer your message queue technology and how to recover in the event of a failure.
Prevent message build up by having an application which is constantly listening for incoming messages.
Use different queues for each data flow; don’t try to multiplex multiple message types onto a single queue.
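A real deployment would use an AMQP broker, but FIFO delivery and one-queue-per-data-flow can be illustrated with Python's standard library queue standing in (the message ids are hypothetical):

```python
import queue

# One queue per data flow; don't multiplex message types onto a single queue.
order_queue = queue.Queue()
customer_queue = queue.Queue()

# The sender places messages; a FIFO queue preserves their order.
for order_id in (101, 102, 103):
    order_queue.put({'type': 'order', 'id': order_id})

# The receiver drains messages in exactly the order they were sent.
received = []
while not order_queue.empty():
    received.append(order_queue.get())

print([m['id'] for m in received])  # [101, 102, 103]
```

With a broker, the receiver would instead block on the queue (a constantly listening consumer), preventing message build-up.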
No interchange
In this scenario there is no intermediate data form, instead each application holds its pending data until the recipient application retrieves it.
Unless an application specifically integrates with another, this style of interchange requires some form of middleware to mediate the interaction, since:
Each application has its own interface, e.g. a web service or a COM/Java/.Net API.
The data usually requires some transformation for it to ‘fit’ with the other.
Summary
If a choice can be made, use databases/staging tables as they are by far the easiest and most robust strategy.
We expect an increased use of message queues, especially with hosted infrastructure environments such as Azure and Amazon.
CSV: Dear oh dear! It may be easy to produce and simple to read, interpret and load into Excel. But it’s 2013! And there are more sophisticated and better means for exchanging data, so why are we still using this archaic format?
In Part I of this multi-part blog some of the most common issues for synchronisation were discussed.
This second post will describe the strategies available for synchronising data between applications, their advantages and pitfalls.
The strategy is the means to determine when a record should be inserted, updated, or deleted.
Data Structures
It is necessary to first understand the issues surrounding more complex data structures.
Simple single level data structures, such as customer records, can be readily synchronised. However, as data structures become more complex so does the strategy.
Consider, as an example, an order containing order lines, where each line carries a set of serial numbers. Both order and order lines must be synchronised; an order without lines is not an order. However, the third level of data, the serial numbers, adds several complications:
Is it necessary to synchronise them? If they are used for transactional purposes, yes, but a view-only application may not require them.
What do the serial numbers mean when an order line is updated? Are they a fresh set of serials, or just additions? While this question also applies to lines, the deeper the structure the more pronounced the issue becomes, because the way an update is applied at one level feeds into how each level below it must be handled.
Means of Identifying Updates
Inserting & Updating Records
Inserts and updates follow a simple pattern where the destination application is queried to identify if it contains an existing record. An update is performed where a match is found; otherwise when no matches are found, an insert is performed.
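This insert-or-update pattern can be sketched in Python, using a plain dict to stand in for the destination application and a hypothetical customer_code matching key:

```python
def synchronise_record(source_record, destination):
    """Query the destination for an existing record; update on a match,
    otherwise insert."""
    key = source_record['customer_code']
    existing = destination.get(key)             # query the destination
    if existing is not None:
        existing.update(source_record)          # match found: update
    else:
        destination[key] = dict(source_record)  # no match: insert

destination = {}
synchronise_record({'customer_code': 'C001', 'name': 'Acme'}, destination)
synchronise_record({'customer_code': 'C001', 'name': 'Acme Ltd'}, destination)
# destination now holds a single record carrying the updated name
```

In practice the lookup is a query against the destination application's API or database rather than a dict access, but the branch is the same.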
Deleted Records
Handling deletions is a much harder issue, primarily because the delete first needs to be identified. There are three approaches:
Tombstoning
Tombstoning is a technique which, instead of actually deleting the record, marks it as deleted via a field within the record's table. What makes this a preferred method for handling deletes is that the tombstone makes the delete operation explicit, which in turn reduces the complexity of the synchronisation.
Tombstoning however requires the source application to provide such functionality (usually only a feature when data synchronisation/integration is integral to the application e.g. CRM).
For applications that don’t support tombstoning, it is necessary to record the record deletion(s) in a separate interface table or file, usually by way of a trigger (to be discussed in part 3 of this blog).
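The trigger approach can be sketched with SQLite (table and trigger names are hypothetical): each delete on the source table is captured into a separate interface table for the synchroniser to replay later.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    -- Interface table recording deletions for the synchroniser to pick up.
    CREATE TABLE customer_deletions (
        id         INTEGER,
        deleted_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- The trigger captures each delete so it can later be replayed
    -- against the destination application.
    CREATE TRIGGER trg_customer_delete AFTER DELETE ON customer
    BEGIN
        INSERT INTO customer_deletions (id) VALUES (old.id);
    END;
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
conn.execute("DELETE FROM customer WHERE id = 1")
deleted = conn.execute("SELECT id FROM customer_deletions").fetchall()
print(deleted)  # [(1,)]
```

The deletion table effectively gives a tombstone to an application that doesn't provide one natively.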
Record Comparison
Record comparison involves comparing a set of data from the source application to the set of data in the destination application. A delete is performed when a record is found in the destination but not in the source.
In contrast to ‘Tombstoning’ this suffers from three primary problems:
Complex: Record comparison is implicit; it requires the ‘data synchroniser’ to determine if a record has been deleted.
Auditing: There is a lack of an audit trail where records just disappear, with no facility to check why/when a deletion occurred.
OK for child records, not for the primary entity: Using comparison for child records in a hierarchical structure, e.g. order lines within an order, can be appropriate where there are limited records and the detection of deleted records is cheap and relatively straightforward.
Using comparison for a primary entity (e.g. the order) requires the examination of a large set of data to determine whether a delete is required.
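The comparison itself reduces to a set difference; a minimal sketch with hypothetical order-line ids:

```python
# Record comparison: a delete is implied for any key present in the
# destination but absent from the source.
source_line_ids = {1, 2, 3}
destination_line_ids = {1, 2, 3, 4}

to_delete = destination_line_ids - source_line_ids
print(to_delete)  # {4}
```

The cost of the approach lies not in this difference but in fetching the two key sets, which for a primary entity means examining a large amount of data on both sides.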
Re-create Everything
A third alternative is to delete and re-create all records on each 'sync'. This strategy is the easiest to implement, covering inserts, updates and deletes, since it doesn't require any logic to detect and handle changes. However, its use is limited to the following scenarios:
Data is loaded at a database level, e.g. reporting & data warehouses and CRMs. Applications with a sophisticated data structure and/or API usually frown on this approach, as business logic and data validation are side-stepped, which can lead to integrity issues.
The destination application is view only and non-transactional.
There are no (or very limited) foreign-key constraints: Foreign keys enforce relationships between records. When a record is deleted so must all the associated records. For example, deleting a customer record also requires deleting all the customer’s associated order & invoice records.
Data size: Deleting and re-inserting all records in a large dataset may take a long time to complete and can be expensive in terms of computing power, particularly with hosted environments where there can be CPU cycle costs. Pruning the dataset, usually by date, is an effective strategy to limit the load size.
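A minimal sketch of delete-and-re-create with date pruning, assuming a SQLite reporting table and hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE report_orders (id INTEGER, order_date TEXT, total REAL)")

def reload(conn, source_rows, cutoff):
    """Delete and re-create within one transaction, pruned by date
    to limit the load size."""
    with conn:  # the connection context manager commits (or rolls back) atomically
        conn.execute("DELETE FROM report_orders WHERE order_date >= ?", (cutoff,))
        conn.executemany(
            "INSERT INTO report_orders VALUES (?, ?, ?)",
            [r for r in source_rows if r[1] >= cutoff])

# Only rows on or after the cutoff are reloaded; older history is untouched.
reload(conn,
       [(1, '2013-01-05', 99.0), (2, '2012-11-01', 50.0)],
       '2013-01-01')
```

Wrapping the delete and re-insert in a single transaction means a reader never sees the half-loaded state.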
Handling Application Logic
When synchronising data between applications it is necessary to negotiate and handle the restrictions or logic the API applies to the data.
Business logic is usually far stricter than any database-level enforcement.
Transactional data moves through a process, where the status of the transaction (or line) dictates whether modifications can be made.
Deletes are never easy – Particularly with ERP applications, deletes are handled in a number of ways: master items may need to be marked as inactive (a form of tombstoning); transactions can either be deleted or cancelled.
Strategies for handling application logic
Atomicity: An update should be all or nothing; never have a situation where some updates to a single transaction or master item fail and some succeed. Whilst this should apply whether synchronising at data level or application level, it’s easier to fall into a situation where a partial update is made due to lack of understanding of the API.
Is your synchronisation dumb or smart: You will need to decide how much logic to embed in your synchronisation to cover certain scenarios or use cases; often it starts dumb and evolves to smart. At the dumb end, there is no logic and a simple try/catch block is placed around your processing to catch any exceptions. Conversely, at the smart end you may check for an entity's existence, raise workflow events, or handle specific edge cases.
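The two ends of the spectrum might be sketched as follows; the destination API here is a hypothetical stub, standing in for a real application's interface:

```python
class StubApi:
    """Hypothetical destination application API, for illustration only."""
    def __init__(self):
        self.records = {}
    def exists(self, rid):
        return rid in self.records
    def is_posted(self, rid):
        return self.records[rid].get('posted', False)
    def insert(self, record):
        self.records[record['id']] = dict(record)
    def update(self, record):
        self.records[record['id']].update(record)

# 'Dumb' end: no logic, just a try/catch around the processing.
def sync_dumb(record, api):
    try:
        api.insert(record)
    except Exception as exc:
        print('failed:', exc)   # record the failure and carry on

# 'Smart' end: check for existence and handle known edge cases explicitly.
def sync_smart(record, api):
    if api.exists(record['id']):
        if api.is_posted(record['id']):
            return              # posted transactions must not be modified
        api.update(record)
    else:
        api.insert(record)

api = StubApi()
sync_smart({'id': 'SO-1', 'total': 10.0}, api)   # insert
sync_smart({'id': 'SO-1', 'total': 12.5}, api)   # update in place
```

The smart variant embodies the application logic discussed above: it respects the transaction's status rather than blindly pushing the update.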
Summary
This post hopefully has helped to explain the most common synchronisation strategies and pitfalls. It is by no means complete and we would like to hear your opinions.
The next post will discuss the various means i.e. file transfer, database, etc. available for data synchronisation.
In this first of a series of posts, I will lay out some of the common issues encountered specifically when synchronising data between applications.
Synchronisation is the replication of data between two or more applications. It is usually more than a one-way push. Synchronisation is the continued updating of data to keep all applications reflecting the same view.
Firstly, there are two broad categories, each with its own set of properties, that data falls into:
Master Data
Master Data is conceptualised as key entities used within an organisation. Customers, Employees, Items & Suppliers are all considered master data. At the simple end of synchronising, Master Data records involve a single record of data per entity. However, Master Data often has a number of child attributes such as addresses for customers, bank details for employees, and a range of attributes such as locations, suppliers and units of measure for items.
It is increasingly expected that Master Data be updatable from multiple applications where each application holds its own copy of a common ‘truth’.
Transactional Data
Whilst sometimes more complex due to their header/detail structure, transactions are usually easier to synchronise due to the following characteristics:
Simple direction/flow – One application is the master and the other(s) are slaves thus eliminating issues such as conflicts.
Transactions are less likely to be updated – Once a transaction is posted it has reached its final state and requires only a single push to the target application. There are, however, a couple of notable exceptions: Sales & Purchase Orders often need to be synchronised to reflect their current 'state', and because of their more complex structure and heavy business logic they require careful attention. Further, Customer & Supplier Payments can have a complex many-to-many interaction with their corresponding invoice documents.
There is a third category: Master Data with transactional properties. An example of this is a project structure, where the project goes through several stages throughout its life and where each stage has different business rules affecting its behaviour.
Issues Surrounding Synchronisation
Data Structure Complexity – More complex data involving headers, details, sub-details, etc. needs strategies which address changes at each 'level' of data.
Synchronisation & Conflict Resolution
Synchronisation needs to be considered at each level of data.
Deciding which application, if any, is the master or whether bi-directional integration will be supported.
Conflict Resolution – What rules should be followed where there is a conflict, and is there sufficient data available to resolve a conflict and/or sufficient data storage to mediate conflicts?
Data Impedance & Mismatch – The greater the resistance, or the more transformation required, the harder the data is to synchronise.
Field Level Transformation – Destructive transformation, where a derived or calculated field cannot be accurately reversed, can prevent bi-directional synchronisation.
Field Lengths and Type Differences – Two common issues arise: truncation of textual fields such as names, addresses & comments, and the mapping of enumerated fields, i.e. drop-down fields, where the choice is stored as an integer value within the source or target application.
Data Structure Mapping – Cases where the structure of the source data does not match the target and where transformation is required to massage the data into the required shape.
Data Interchange – As the sophistication, complexity and/or frequency of the synchronisation increases so does the requirement for more sophisticated means for transferring data.
Data Take On – Finally, issues around the amount of historical data, whether the load time for data is significant and whether data cleansing for old or historical data needs to take place.
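Of the issues above, the field-length and enumerated-field mismatches are easily illustrated with a short sketch (the destination field limit and status mapping are hypothetical):

```python
# The destination field is shorter than the source field.
DEST_NAME_LENGTH = 30
# A drop-down choice stored as an integer in the target application.
STATUS_MAP = {'Open': 1, 'Closed': 2}

def transform(record):
    return {
        'name': record['name'][:DEST_NAME_LENGTH],   # truncate to fit
        'status': STATUS_MAP[record['status']],      # map the enumeration
    }

out = transform({'name': 'A very long customer name that will not fit',
                 'status': 'Open'})
print(out)
```

Note that truncation is a destructive transformation: the original name cannot be recovered from the target, which is exactly the sort of field-level issue that blocks bi-directional synchronisation.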
In subsequent posts I will describe the various strategies available to resolve these issues. Part II will describe strategies available for synchronising data between applications, their advantages and pitfalls.
Contact
Realisable Software Ltd provides code-free, cost-effective applications integration solutions for SMEs. Our core IMan product is designed to integrate almost any application with a number of Sage solutions and online payment processors.
Looking to purchase IMan, please see our resellers here.