Correlation database

A Correlation database is a database management system (DBMS) that is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. It was developed in 2005 by database architect Joseph Foley, whose background includes more than 30 years in data warehousing and business intelligence research and development work across a variety of industries.

Unlike relational database management systems, which use a records-based storage approach, or column-oriented databases which use a column-based storage method, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values.

Structure of the Correlation DBMS

Because a correlation DBMS stores each unique data value only once, the physical database size is significantly smaller than relational or column-oriented databases, without the use of data compression techniques. Above approximately 30GB, a correlation DBMS may become smaller than the raw data set.

The VBS model used a CDBMS consists of three primary physical sets of objects that are stored and managed:

  • a data dictionary (metadata);
  • an indexing and linking data set (additional metadata); and
  • the actual data values that comprise the stored information.

In the VBS model, each unique value in the raw data is stored only once; therefore, the data is always normalized at the level of unique values. This eliminates the need to normalize data sets in the logical schema.

Data values are stored together in ordered sets based on data types: all integers in one set, characters in another, etc. This optimizes the data handling processes that access the values.

In addition to typical data values, the data value store contains a special type of data for storing relationships between tables. This functions similarly to foreign keys in RDBMS structures, but with a CDBMS, the relationship is known by the dictionary and stored as a data value, making navigation between tables completely automatic.

The data dictionary contains typical metadata plus additional statistical data AbOUT the tables, columns and occurrences of values in the logical schema. It also maintains information about the relationships between the logical tables. The index and linking storage includes all of the data used to locate the contents of a record from the ordered values in the data store.

While not a RAM-based storage system, a CBMDS is designed to use as much RAM as the operating system can provide. For large databases, additional RAM improves performance. GeneRally, 4GB of RAM will provide optimized access times up to about 100 million records. 8GB of RAM is adequate for databases up to 10 times that size. Because the incremental RAM consumed decreases as the database grows, 16GB of RAM will generally support databases containing up to approximately 20 billion records.

Comparison of DBMS Storage Structures

The sample records shown below illustrate the physical differences in the storage structures used in relational, column-oriented and correlation databases.

The record-based structure used in an RDBMS stores data in table form. Variations like clustered indexing may change the sequence of the rows, but all rows, columns and values will be stored as in the table.

Cust ID

Name

City

State

Region

12222

ABC Corp

Minneapolis

MN

Central

19434

A1 Mfg

Duluth

MN

North

20523

J&J Inc

St. Paul

MN

28495

Acme

Minneapolis

MN

Central

30023

XYZ Corp

Rochester

MN

South

The next set of tables represent the physical storage of a column-based structure in column-oriented databases. With basic optimization for searching, the storage would be structured as shown below. Each of the separate blocks in this diagram represent separate storage areas, either as separate files or separate areas of a large block of storage managed by the DBMS. In the column-based structure, consecutive duplicates within a single column are automatically removed, and null values are not recorded since the missing record ID implies a null value.

Cust ID

Record

Value

1

12222

2

19434

3

20523

4

28495

5

30023

Name

Record

Value

1

ABC Corp

2

A1 Mfg

3

J&J Inc.

4

Acme

5

XYZ Corp

City

Record

Value

2

Duluth

1,4

Minneapolis

5

Rochester

3

St. Paul

State

Record

Value

1-5

MN

Region

Record

Value

1

Central

2

North

4

Central

5

South

In the VBS structure used in a CDBMS, the physical storage is very different from relational or column-based designs. Each unique value is stored only once, regardless of the number of occurrences or locations in the original data set. As with column-based storage, separate indexing and linking information are stored, but the data storage is structured as shown below.

ID

Value

1

12222

2

19434

3

20523

4

28495

5

30023

6

A1 Mfg

7

ABC Corp

8

Acme

9

Central

10

Duluth

11

J&J Inc

12

Minneapolis

13

MN

14

North

15

Rochester

16

St. Paul

17

South

18

XYZ Corp

Advantages and Disadvantages of the CDBMS

For analytical data warehouse applications, a CDBMS has several advantages over alternative database structures. First, because the database engine itself indexes all data and auto-generates its own schema on the fly while loading, it can be implemented quickly and is easy to update. There is no need for physical pre-design and no need to ever restructure the database. Second, a CDBMS enables creation and execution of complex queries such as associative queries ("show everything that is related to x") that are difficult if not impossible to model in SQL. The Primary advantage of the CDBMS is that it is optimized for executing ad hoc queries - queries not anticipated during the data warehouse design phase.

A CDBMS has two drawbacks in comparison to database alternatives. Unlike relational databases, which can be used in a wide variety of applications, a correlation database is designed specifically for analytical applications and does not provide transaction management features; it cannot be used for transactional processing. Second, because it indexes all data during the load process, the physical load speed of a CDBMS is slower than relational or column-oriented structures. However, because it eliminates the need for logical or physical pre-design, the overall "time to use" of a CDBMS is generally similar to or somewhat faster than alternative structures.