<img height="1" src="https://www.facebook.com/tr?id=&quot;1413357358800774&quot;&amp;ev=PageView&amp;noscript=1" style="display:none" width="1">

Not so long ago, I was challenged with a specific task in one of our projects. We needed to share some data between two external applications. In particular, we needed to synchronize data with the same structure in both apps, so that it was accessible in both of them. The requirement was to keep the same structure and the same values, and any creation of a new data set, or even an update, in one of the apps should have an impact on both projects. It sounds like the usual kind of thing developers work with from time to time.

 


 

But there were a few more restrictions in our case, so the solution was not obvious. Let's dive deeper into this problem and think about how we can achieve this kind of functionality, and at the end, I will present the solution I came up with for this particular task.

Table of Contents:

1. Requirements and Restrictions.

2. An application programming interface.

3. Let's go out of the box.

4. It's all about extension.

5. Tutorial: How to implement data synchronization?

6. Data Synchronization - Summary.

Requirements and Restrictions

For our considerations, let's simplify this task a bit to make it easier to understand and to keep the focus on the problem we are solving.

As I mentioned before, we should have the same synchronized data in two separate applications.

So the main goal is: data should be synchronized.

Now let's add some additional conditions.

  • Applications are running on two separate servers. 
  • Both applications are written in Rails. 
  • Both are using Postgres to store data. 
  • We have a small time scope to deliver this functionality.

So we have two applications on separate servers. Both in Rails and with Postgres.

Now let's think about what we will be synchronizing. We can reduce our example to only one table. So let's have a products table, as simple as possible, with the following schema:

products

  id bigint,
  name varchar(255),
  description varchar(255),
  price integer,
  created_at timestamp,
  updated_at timestamp
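
If you want to follow along in a fresh database, here is a minimal sketch of how such a table could be created in plain SQL - roughly what a Rails migration would produce. Note that the id column is backed by a sequence (products_id_seq), which will matter later on.

  CREATE TABLE products (
      id          bigserial PRIMARY KEY, -- backed by the products_id_seq sequence
      name        varchar(255),
      description varchar(255),
      price       integer,
      created_at  timestamp NOT NULL,
      updated_at  timestamp NOT NULL
  );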

 

I suppose we cannot define anything simpler than that. Let's assume that we have this table, with some records, in the first of our applications, and the same table, but with different data, in the second application. I think, or at least I hope, that at this point most developers have an obvious solution to this problem in mind.

 

An application programming interface

Obvious? Building a REST API is a rather easy solution that doesn't take much effort. Yes, in most situations it can be a good choice. But there is one catch in our example, and even so, I will show you a much faster way to solve this problem. Also, remember that we have a time scope, and a small one. So, will you be able to build an API on both sides to send and accept requests, and to create triggers in both applications that send data between them?

 

Maybe, but what if I told you that one of those applications is a legacy app that nobody has developed for a long time, and nobody knows how it works?

I hope that right now you feel shivers down your spine at the thought of working with legacy code that all the previous developers have tried to forget.

At this point, you are probably thinking about how much time you would have to spend getting to know the legacy app just to avoid destroying it completely after adding new code.

 

Let's go out of the box

So, I have probably discouraged you from thinking about building APIs. "What magic solution do you have, then?", you may ask. We need to think about this problem at a really low level. All we need is the data in the database - nothing more, nothing less. So we can keep all the work at the database level. The applications could know about our work, but it's unnecessary and would only complicate the whole process and waste time. This approach also solves the problem of the legacy app: we can skip all of its layers and work only with the database.

 

It's all about extension

All that we need is some knowledge about Postgres extensions. 

Ladies and gentlemen, I present to you the postgres_fdw extension that will take care of communication between two databases. 

To be exact, it is a foreign-data wrapper that can be used to access data stored in external PostgreSQL servers. Now that we have the introductions behind us, we can focus on coding. Shall we?

 

Tutorial: How to implement data synchronization?

The first step will be installing our extension; all of the following steps should be done for both applications. In our example, I will focus on bringing this functionality to our new application, with the legacy app treated only as a data provider. In the end, if you need two-way synchronization, you simply go through the same steps for the legacy app.

To do this, all we need is to log into the Postgres console (a superuser or a user with proper privileges is required here) and execute the following instruction:

CREATE EXTENSION postgres_fdw;

Following the official documentation of the postgres_fdw extension (https://www.postgresql.org/docs/13/postgres-fdw.html), we should create a foreign server. To complete this step, we need some information about the remote server, like the host, the port number and, obviously, the database name. In our case, the legacy app host is legacy_host, the database name is legacy_database, and the port, let's say, is 5432.

 

CREATE SERVER legacy_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'legacy_host', dbname 'legacy_database', port '5432');

 

Now, we can check what our foreign server looks like. Let’s type \des+. I see:

 

                              List of foreign servers
     Name      |  Owner   | Foreign-data wrapper |                        FDW options
---------------+----------+----------------------+--------------------------------------------------------------
 legacy_server | postgres | postgres_fdw         | (host 'legacy_host', dbname 'legacy_database', port '5432')
(1 row)

 

And last but not least, we need to take care of setting up a proper user mapping for our external database. The legacy app's database user is someone called abcdef, with iwillnotrevealmypassword as the password. I suppose we can simply map the current user that we use in the Postgres console to that remote user and let it act on the legacy database through it. It will be totally fine for the purpose of our example.

 

CREATE USER MAPPING FOR current_user SERVER legacy_server OPTIONS (user 'abcdef', password 'iwillnotrevealmypassword');
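
One thing worth keeping in mind (this is an assumption about your setup, not something the extension configures for you): postgres_fdw connects to the legacy database as the mapped remote user, so abcdef needs the usual privileges on the products table over there. If it doesn't have them yet, something along these lines, executed on the legacy database, should be enough:

-- run on the legacy database; abcdef is the user from the mapping above
GRANT SELECT, INSERT, UPDATE, DELETE ON products TO abcdef;
-- needed later, when we start pulling values from the products sequence
GRANT USAGE ON SEQUENCE products_id_seq TO abcdef;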

Basically, we are now ready to read and write data in the external database. But how do we access the desired table from the legacy database? I'm really glad you asked!

 

All you need is... table

For most readers, it will be obvious that we need a representation of the legacy table in our database. So we need one more small step to achieve it, and then we can play with it like a usual Postgres table and query it as much as we like.

As I mentioned earlier, in both databases, we have products tables with an identical data structure.

To avoid a name collision, we will use a legacy prefix for the products from the legacy app. The instruction for creating the new table with the legacy products is very similar to creating a regular table, with only a few differences. Let's take a look at this example:

 

  CREATE FOREIGN TABLE legacy_products (
      id          bigint,
      name        varchar(255),
      description varchar(255),
      price       integer,
      created_at  timestamp without time zone NOT NULL,
      updated_at  timestamp without time zone NOT NULL
    ) server legacy_server
  OPTIONS (table_name 'products');

 

Yeah, I think it should be understandable. We create a foreign table named legacy_products with the above data structure, we point to where the original table lives - on legacy_server - and in the options we define the original table name - table_name 'products'. Now we can do whatever we want with our new table - query, insert, delete or update records - all the data is accessible just like in a regular table. Also, at the application level, we can create a mapping for the legacy_products table and define the corresponding ActiveRecord model, without even telling anyone on the team, and they will use it as if it were part of the regular schema.
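
To show that the foreign table really behaves like a local one, here is a quick example; the values are made up purely for illustration:

  -- read the legacy data as if it lived in our own database
  SELECT id, name, price FROM legacy_products ORDER BY id LIMIT 5;

  -- or write to it directly
  INSERT INTO legacy_products (id, name, description, price, created_at, updated_at)
  VALUES (1001, 'Example product', 'Only for illustration', 100, now(), now());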

Of course, now is the time to do this same thing on the legacy app side; we just need to use the connection data of our new application.
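
As a rough sketch, the mirrored setup executed on the legacy database could look like the snippet below. All of the names here (new_app_server, new_app_host, new_app_database, new_app_user and so on) are placeholders I made up for illustration; the resulting new_products table is the one I will come back to later when we talk about sequences.

CREATE EXTENSION postgres_fdw;

CREATE SERVER new_app_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'new_app_host', dbname 'new_app_database', port '5432');

CREATE USER MAPPING FOR current_user SERVER new_app_server OPTIONS (user 'new_app_user', password 'new_app_password');

  CREATE FOREIGN TABLE new_products (
      id          bigint,
      name        varchar(255),
      description varchar(255),
      price       integer,
      created_at  timestamp without time zone NOT NULL,
      updated_at  timestamp without time zone NOT NULL
    ) server new_app_server
  OPTIONS (table_name 'products');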

 

Same data on both sides aka data synchronization

At this moment, since we have access to both tables, we can easily fill them with the same data - it's a matter of a couple of SQL queries and inserts. We can do it manually or write a small script, but I think it's a trivial task, so let's assume that everyone will handle it without any problem. The more challenging task will be automatic synchronization in the future.
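
Just to illustrate that "couple of queries" (a sketch, assuming the id ranges of the two tables don't overlap and the columns match on both sides), the initial backfill on the new app side could be as simple as:

  -- copy the legacy rows we don't have yet into the local products table
  INSERT INTO products (id, name, description, price, created_at, updated_at)
  SELECT id, name, description, price, created_at, updated_at
  FROM legacy_products
  WHERE id NOT IN (SELECT id FROM products);

  -- after a backfill like this, remember to bump the local sequence as well
  SELECT setval('products_id_seq', (SELECT COALESCE(max(id), 1) FROM products));

The same statements, with products and legacy_products swapped, cover the opposite direction.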

Let's go back for a minute to our requirements, and I will quote myself:

“(...) any creation of a new data set, or even an update, in one of the apps should have an impact on both projects.”

In general, in both cases the solution will follow almost the same pattern, but let's do it step by step and start with creating new records.

Let's try to create a cause-and-effect sequence of what should happen.

A new record is added to the products table -> a new record should be added to the legacy_products table. So, if we think about it at the database level, we can say that one action should trigger another. One insert should trigger another insert into the second table. Insert triggers insert. Trigger. And this is our solution.

 

Pull the trigger

To achieve our goal, we need to define a trigger in our database like this:

 

  CREATE TRIGGER AfterProductsInsert
  AFTER INSERT
    ON products
    FOR EACH ROW
    EXECUTE PROCEDURE insert_legacy_products();

 

It means that we create a trigger for the insert action on the products table which, for every single row, executes the insert_legacy_products() function. Now let's define the function that will contain the INSERT statement for our legacy products table.

 

        CREATE OR REPLACE FUNCTION insert_legacy_products()
          RETURNS trigger AS
        $$
        BEGIN
          INSERT INTO legacy_products(
            id,
            name,
            description,
            price,
            created_at,
            updated_at
          )
          values(
            NEW.id,
            NEW.name,
            NEW.description,
            NEW.price,
            clock_timestamp(),
            clock_timestamp()
          );
          RETURN NEW;
        EXCEPTION
          WHEN undefined_table THEN
            RETURN NEW;
        END;
        $$
        LANGUAGE 'plpgsql';

 

CREATE OR REPLACE FUNCTION will be really helpful in case we need to modify our function later. The EXCEPTION block can also be helpful when there is a problem with a non-existing table, but you should treat it as a concept rather than a ready-made answer; I suppose it should be fitted to your specific case.

 

Keep in sync

Now that we have set up the insert trigger, we need to talk about keeping all products in sync. What should we do when a record is updated? Let's say we would now like to sell one of our products at a higher price. After updating the product, we would also like to have the same price for this product in the legacy application. It is reasonable to use the same approach, so we will prepare another function and trigger to handle it.

The structure of both will be the same as for INSERT before.

 

 CREATE OR REPLACE FUNCTION update_legacy_products()
   RETURNS trigger AS
 $$
 BEGIN
   UPDATE legacy_products SET
     name        = NEW.name,
     description = NEW.description,
     price       = NEW.price,
     updated_at  = NEW.updated_at
   WHERE id = NEW.id;
   RETURN NEW;
 EXCEPTION
   WHEN undefined_table THEN
     RETURN NEW;
 END;
 $$
 LANGUAGE 'plpgsql'; 

 

The logic here is the same; we only need to use an UPDATE statement instead of an INSERT. Of course, we need to define which record should be updated by using the WHERE clause to find the row with the corresponding id. The trigger is defined in the same way, but it should fire after an update.

 

  CREATE TRIGGER AfterProductsUpdate
  AFTER UPDATE
    ON products
    FOR EACH ROW
    EXECUTE PROCEDURE update_legacy_products();

 

In this way, we can be sure that all the data will be the same in both of our applications after inserting a new record or updating an existing one.

Rake task to rule them all

As I am a Rails developer, and this problem was related to a Rails project, let's put all the triggers and functions in one place and add some automation, so that other developers have the tools to work with this in other environments. I decided to put it into a rake task, to have a free hand to set up synchronization whenever I need it and, to be honest, to not forget to run all the necessary commands. The whole task lives in a db_connection namespace, where I define the remote_server and legacy_products namespaces like this:

 

namespace :db_connection do
  namespace :remote_server do
  end
  namespace :legacy_products do
  end
end

 

In the remote_server namespace, I added tasks for creating and dropping, surprise, a remote server. Take a quick look at an example:

 

  namespace :remote_server do
    task create: :environment do
      execute_sql "
        CREATE EXTENSION postgres_fdw;
        CREATE SERVER new FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '#{dbconfig[:host]}', dbname '#{dbconfig[:database]}', port '#{dbconfig[:port]}');
        CREATE USER MAPPING FOR CURRENT_USER SERVER new OPTIONS (user '#{dbconfig[:user]}', password '#{dbconfig[:password]}');
      "
    end

    task destroy: :environment do
      execute_sql "
        DROP SERVER new CASCADE;
        DROP EXTENSION postgres_fdw;
      "
    end
  end

 

And here is a quick look at those helper methods:

 

  private

  def execute_sql(sql)
    ActiveRecord::Base.connection.execute(sql)
  end

  def dbconfig
    {
      host: ENV["LEGACY_DATABASE_HOST"],
      port: ENV["LEGACY_DATABASE_PORT"],
      database: ENV["LEGACY_DATABASE"],
      user: ENV["LEGACY_DATABASE_USER"],
      password: ENV["LEGACY_DATABASE_PASSWORD"],
    }
  end

 

First, execute_sql is just a wrapper for the ActiveRecord execute command, whereas dbconfig defines a hash with the external database configuration stored in environment variables. It's a much cleaner solution and fully customizable for any environment - just the way we like to have it in the Rails world. For the legacy_products namespace, we will define a couple of tasks to manage creating and deleting the table, triggers and functions. We will use the same execute_sql method to execute the SQL described earlier in this article.

 

  namespace :legacy_products do
    task create_foreign_table: :environment do
      ...
    end

    task drop_foreign_table: :environment do
      ...
    end

    task create_insert_trigger: :environment do
      ...
    end

    task drop_insert_trigger: :environment do
      ...
    end

    task create_or_update_insert_function: :environment do
      ...
    end

    task drop_insert_function: :environment do
      ...
    end
  end

 

Now we can easily run rake with a specific task and execute any required action on our database. But to make it even simpler, we can group the tasks, and with this one simple trick we can connect or disconnect database synchronization.

 

task create: :environment do
  Rake::Task["db_connection:remote_server:create"].execute
  Rake::Task["db_connection:legacy_products:create_foreign_table"].execute
  Rake::Task["db_connection:legacy_products:create_or_update_insert_function"].execute
  Rake::Task["db_connection:legacy_products:create_insert_trigger"].execute
end

task destroy: :environment do
  Rake::Task["db_connection:legacy_products:drop_insert_trigger"].execute
  Rake::Task["db_connection:legacy_products:drop_insert_function"].execute
  Rake::Task["db_connection:legacy_products:drop_foreign_table"].execute
  Rake::Task["db_connection:remote_server:destroy"].execute
end

 

The important thing is to keep the proper order of the subtasks, so that we don't execute actions on structures that don't exist yet. I see several advantages in this approach: we keep all the related SQL instructions in one file in the project, we can easily create and destroy the connection between the databases, and we can trigger this rake task during deployment - in short, we can make our life easier.

 

Keep all in sequence

The last challenge I encountered while working on data synchronization came quite unexpectedly, and I have to admit that it was something I didn't anticipate beforehand. Everything was going according to plan - records created in the new app added entries in the legacy app, updates changed the data in both tables - until a new product was created on the legacy app side. The database answered with an error message about a duplicate value in the id column.

After a quick investigation, I realized what had happened. With every insert of a new product on the new app side, the products sequence was incremented only in the new app's database; on the legacy side, the sequence stayed where it was. My first idea was to create a foreign sequence for the legacy product ids, but with the postgres_fdw extension we cannot define this kind of object, so after rethinking the problem I reached for another solution.

On the legacy app side, I prepared a view like this:

 

CREATE VIEW products_id_seq_view AS SELECT nextval('products_id_seq') as next_id;

So now, every SELECT on this view will automatically increment products_id_seq. In the new app, we just need to create a proper foreign table that will be a representation of products_id_seq_view from the legacy app. We will use the same syntax as before to define the new foreign table.

 

  CREATE FOREIGN TABLE foreign_products_id_seq (next_id bigint)
  server legacy_server
  OPTIONS (table_name 'products_id_seq_view');

 

Now, after the INSERT into the legacy products table in our trigger function, we need to execute a query on our newly defined table. We can simply run:

 

  PERFORM next_id FROM foreign_products_id_seq;

 

As we don't need any returned value, we can use PERFORM instead of SELECT (PERFORM is available inside PL/pgSQL functions such as our trigger function). To fully maintain the sequences in both databases, we also need the mirror image of this: a view for the new app, a foreign table for the legacy app, and then a PERFORM instruction after the INSERT into the new_products table. From now on, we can stop worrying about the products sequence.
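
To make the placement of that PERFORM concrete, here is the insert_legacy_products() function from earlier with the extra line added (this is the new app side; the symmetric change goes into the trigger function on the legacy side):

        CREATE OR REPLACE FUNCTION insert_legacy_products()
          RETURNS trigger AS
        $$
        BEGIN
          INSERT INTO legacy_products(id, name, description, price, created_at, updated_at)
          VALUES (NEW.id, NEW.name, NEW.description, NEW.price, clock_timestamp(), clock_timestamp());
          -- keep the legacy products_id_seq in step with the id we have just written
          PERFORM next_id FROM foreign_products_id_seq;
          RETURN NEW;
        EXCEPTION
          WHEN undefined_table THEN
            RETURN NEW;
        END;
        $$
        LANGUAGE 'plpgsql';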

 

Data Synchronization - Summary

As far as I know, some developers prefer to see code to understand how something works or to grasp certain concepts better. I'm also on that side, so here you go: on our Railwaymen GitHub you will find a working example with two dockerized apps that emulate two applications with two separate database servers. More details on running this example can be found in the readme on the GitHub page.

At this point, we can end our journey of solving this not-so-common problem. The topic could be extended further, for example with keeping a high security level for the database connection, or with an automatic recovery solution for synchronization in case one of our databases goes down for any reason.

The most important message of this article is that the most obvious solution will not always be the most suitable one in your circumstances. It's always worth splitting the problem, looking at it from a different perspective, and trying to predict what will and won't work in this particular case before deciding how to approach it. As my physics teacher used to say, always think ten times before you do even the smallest piece of work.

As we all know, developers need to expand their knowledge constantly. For more code tutorials, check our previous articles:

 

Railwaymen's knowledge hub about web development