Replication is a process that keeps two or more collections of computerized information identically synchronized. It facilitates:
- Load reduction: Keeping a complete or partial copy of a collection on a different server reduces the load on the main server.
- Improved service: Accessing a copy of the data can provide better service to users than having them access the original data, for example when the copy is closer to them or hosted on a less heavily loaded server.
- Restricted data access: If some users should only have access to a subset of data, replicating only part of a collection makes it easy to enforce security restrictions.
- Geographic distribution: Making only a subset of data relevant to a specific node (or location) available is beneficial in widely distributed enterprises (such as a chain of retail stores or warehouses). You can still make all data available at a central location for less frequent use.
- Disaster recovery: Keeping a copy of the main data available makes it possible to set up clusters with rapid fail-over (the capability to switch over to a redundant or standby computer server in case the main system fails).
- "Cloud" computing: Replicating data allows for implementing what is commonly known as cloud computing (the on-demand storage, management, and processing of Internet-based data).
The replicated information is stored either as files or in a database. In the case of files, the structure and content of a file are known only to the specialized programs that use the file. Databases are managed by database management systems (DBMS) that make use of standardized descriptions of the structure of the information (such as tables, columns, rows, and data types). These descriptions are known collectively as metadata and allow a general-purpose replicator to carry out relevant operations (for example, filtering and data transformations) without needing to know anything about the contents or “meaning” of the data. Because file systems do not carry this kind of structural metadata, the operations available when replicating files are more limited.
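As a rough illustration of how metadata lets a general-purpose replicator filter and reshape data without understanding it, the following sketch copies a table between two SQLite databases using only the column descriptions in the catalog. The function name `replicate_table`, its parameters, and the sample `orders` table are hypothetical and chosen for this example; a real replicator would work incrementally rather than recopying the table.

```python
import sqlite3


def replicate_table(source, target, table, where=None, exclude_columns=()):
    """Copy `table` from source to target using only catalog metadata.

    Hypothetical helper for illustration. `table` and `where` are inserted
    into SQL text, so they must come from trusted configuration.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    # for each column; the replicator never needs to know what they mean.
    columns = [
        (row[1], row[2])
        for row in source.execute(f'PRAGMA table_info("{table}")')
        if row[1] not in exclude_columns
    ]
    names = ", ".join(f'"{name}"' for name, _ in columns)
    column_defs = ", ".join(f'"{name}" {decl}' for name, decl in columns)

    # Recreate the (possibly narrower) table on the target and refresh it.
    target.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    target.execute(f'DELETE FROM "{table}"')

    query = f'SELECT {names} FROM "{table}"'
    if where:
        query += f" WHERE {where}"  # row filter
    rows = source.execute(query).fetchall()

    placeholders = ", ".join("?" for _ in columns)
    target.executemany(
        f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})', rows
    )
    target.commit()
    return len(rows)


# Sample data: replicate only the "north" store and drop a sensitive column.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER, store TEXT, amount REAL, card_number TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "north", 10.0, "4111"), (2, "south", 25.5, "4222"), (3, "north", 7.25, "4333")],
)
target = sqlite3.connect(":memory:")
copied = replicate_table(
    source, target, "orders",
    where="store = 'north'", exclude_columns=("card_number",),
)
```

The same function works for any table because every decision is driven by the catalog. A comparable operation on an arbitrary file would need a program that understands that file's format.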
During replication, a collection of data is copied from system A to system B, where A is known as the source (for this collection) and B is known as the target. A system can be a source, a target, or both (with certain restrictions). A complex replication topology has a number of sources, targets, and data collections defined.
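One possible way to picture such a topology is as a list of links, each naming a collection, its source, and its target. The sketch below is only a model for discussion: the `Link` class, the `roles_by_collection` function, and the system names are assumptions, and it does not encode the restrictions mentioned above, which this section does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One replication path: `collection` is copied from `source` to `target`."""
    collection: str
    source: str
    target: str


def roles_by_collection(links, system):
    """Return {collection: {"source" and/or "target"}} for one system."""
    roles = {}
    for link in links:
        if link.source == system:
            roles.setdefault(link.collection, set()).add("source")
        if link.target == system:
            roles.setdefault(link.collection, set()).add("target")
    return roles


# Hypothetical retail topology: prices flow out from headquarters,
# sales flow back in from each store.
topology = [
    Link("prices", source="hq", target="store_1"),
    Link("prices", source="hq", target="store_2"),
    Link("sales", source="store_1", target="hq"),
    Link("sales", source="store_2", target="hq"),
]

hq_roles = roles_by_collection(topology, "hq")
store_roles = roles_by_collection(topology, "store_1")
```

Here `hq` is the source for `prices` and the target for `sales`, which shows how one system can hold both roles, each relative to a particular collection.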
The replication process must account for the fact that source data may change while it is being copied. Copies cannot be made or maintained instantaneously, and it is usually not acceptable to stop the source computer in order to “freeze” the information. Therefore, replication must account for:
- Integrity: The target data must reflect the complete result of all changes made to the source data during the replication process.
- Consistency: If a change affects different tables, rows, or files, the copy must reflect these changes consistently across all affected tables, rows, or files.
- Latency: The replication process should keep latency (the delay between a change at the source and its appearance at the target) to a minimum. Ideally, it should not exceed a few seconds.
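The consistency and latency requirements above can be sketched as follows, assuming the replicator has already captured the changes of one source transaction as an ordered list of SQL statements. The function `apply_transaction`, the `accounts` table, and the use of SQLite are illustrative assumptions. The latency figure is only meaningful if the source and target clocks are synchronized.

```python
import sqlite3
import time


def apply_transaction(target, changes, committed_at):
    """Apply all changes of one source transaction to the target atomically.

    Returns the observed latency in seconds, or None if the batch failed
    and was rolled back. Hypothetical helper for illustration.
    """
    try:
        # The connection context manager commits on success and rolls back
        # on an exception, so the target never shows a partial change set.
        with target:
            for sql, params in changes:  # keep the source order
                target.execute(sql, params)
    except sqlite3.Error:
        return None
    return time.time() - committed_at


target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
target.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
target.commit()

# A transfer touches two rows; both updates must arrive together.
transfer = [
    ("UPDATE accounts SET balance = balance - ? WHERE id = ?", (30, 1)),
    ("UPDATE accounts SET balance = balance + ? WHERE id = ?", (30, 2)),
]
latency = apply_transaction(target, transfer, committed_at=time.time())
after_transfer = target.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()

# The second statement violates the CHECK constraint, so the first one
# must be undone as well.
broken = [
    ("UPDATE accounts SET balance = balance + ? WHERE id = ?", (500, 2)),
    ("UPDATE accounts SET balance = balance - ? WHERE id = ?", (500, 1)),
]
failed = apply_transaction(target, broken, committed_at=time.time())
after_failure = target.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

Grouping changes by source transaction keeps related rows consistent on the target, and comparing the source commit time with the apply time gives a simple way to monitor whether latency stays within a few seconds.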