Technology Definitions

Databases & Data Storage Terms

Organizing and storing digital information.

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).

A Database Management System (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS software additionally encompasses the core facilities provided to administer the database.

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).

A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.

Relational Database

A relational database is a type of database that stores and provides access to data points that are related to one another. Relational databases are based on the relational model, an intuitive, straightforward way of representing data in tables.

In a relational database, a table is a collection of related data held in a structured format within a database. It consists of columns and rows.

In a database table, a row, also called a record, represents a single, implicitly structured data item. Every row in a table has the same structure: one value for each of the table's columns.

In a database table, a column is a set of data values of a particular simple type, one for each row of the table. The columns provide the structure according to which the rows are composed.

A primary key is a specific choice of a minimal set of attributes (columns) that uniquely specify a tuple (row) in a relation (table). In simple terms, it's a unique identifier for each record.

A foreign key is a column or a set of columns in a table that refers to the primary key of another table. It acts as a cross-reference between tables and establishes a link between them.
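As an illustration of how primary and foreign keys work together, the following is a minimal sketch using Python's built-in sqlite3 module. The authors and books tables are hypothetical, chosen only for the example. Note that SQLite enforces foreign keys only after the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: each book row points at one author row.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    )
    """
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO books (title, author_id) VALUES ('Notes', 1)")

# Author 99 does not exist, so the foreign key rejects this row.
try:
    conn.execute("INSERT INTO books (title, author_id) VALUES ('Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

book_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

In this sketch only the book that references an existing author is stored; the insert that points at a missing author raises an integrity error.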

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure.
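One way to observe the effect of an index is to ask the database how it plans to run a query before and after the index exists. The sketch below does this with SQLite's EXPLAIN QUERY PLAN statement through Python's sqlite3 module. The users table and the index name idx_users_email are assumptions made for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with enough rows to make a lookup meaningful.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

query = "SELECT name FROM users WHERE email = ?"
params = ("user500@example.com",)


def plan(sql, args):
    """Return the human-readable plan text (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query, params)  # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(query, params)   # the plan now mentions the index by name
```

The trade-off described above still applies: every later insert or update to the email column must also update the index.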

A query is a request for data or information from a database. The information is retrieved from the database tables and is presented to the user in a structured way.

Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy. It involves dividing larger tables into smaller, well-structured tables and defining relationships between them.

Denormalization

Denormalization is a strategy used on a previously normalized database to increase performance. It tries to improve read performance, at the expense of some write performance, by adding redundant data or by grouping data.

A SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as is. A join is a means for combining columns from one (self-join) or more tables by using values common to each.
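A small example may make the idea concrete. The sketch below, using Python's sqlite3 module, joins two hypothetical tables (customers and orders) on the value they share, the customer id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: orders.customer_id holds the id of a customers row.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
    """
)

# Combine columns from both tables wherever the customer id matches.
joined_rows = conn.execute(
    """
    SELECT c.name, o.id, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
    """
).fetchall()
```

Each result row carries a column from customers (the name) next to columns from orders (the order id and total).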

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps.

A database transaction is a sequence of operations performed as a single logical unit of work. A transaction must be "all or nothing": either all of its operations are executed, or none of them are.
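The "all or nothing" behavior can be sketched with Python's sqlite3 module, whose connection object commits on success and rolls back on an exception when used as a context manager. The accounts table, the CHECK constraint, and the transfer function are hypothetical and exist only to force a failure halfway through a transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 50), (2, 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst; return True only if both updates committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


overdraft_ok = transfer(1, 2, 100)  # second update violates the CHECK, so both are undone
transfer_ok = transfer(1, 2, 30)    # both updates succeed and are committed together
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

In the failed transfer the credit to account 2 had already been applied when the debit failed, and the rollback removes it again, so no money appears from nowhere.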

CRUD is an acronym for Create, Read, Update, and Delete. These are the four basic functions of persistent storage. It describes the fundamental operations that can be performed on data in a database.

A data warehouse is a large, centralized repository of data that is collected from a variety of sources. It is designed specifically for fast querying and analysis, and often contains large amounts of historical data.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics.

ETL, which stands for Extract, Transform, and Load, is a data integration process that combines data from multiple data sources into a single, consistent data store which is loaded into a data warehouse or other target system.

Online Transaction Processing (OLTP) is a class of systems that support transaction-oriented applications. "Online" here refers to interactive, real-time processing rather than to the Internet. In OLTP, databases are read, written, and updated in real time. These systems are designed for a large number of users conducting a large number of small transactions.

Online Analytical Processing (OLAP) is a category of software that allows users to analyze information from multiple database systems at the same time. It is a technology that enables analysts to quickly and easily examine and manipulate large amounts of data from many perspectives.

Document Database

A document database (or document store) is a type of NoSQL database that is designed to store and query data as JSON-like documents. Documents are self-contained and can have different structures.

Key-Value Store

A key-value store, or key-value database, stores data as a collection of key-value pairs in which each key serves as a unique identifier for its value.

A graph database is a type of NoSQL database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. The relationships between data are treated as first-class citizens.

Column-Family Database

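A column-family (or wide-column) store is a type of NoSQL database. It organizes data into rows and columns, but unlike a relational table, the names and format of the columns can vary from row to row within the same table. It can be seen as a two-dimensional key-value store, where each row key maps to one or more columns.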

Time Series Database

A time series database (TSDB) is a database optimized for storing and serving time series data, which are sequences of data points indexed in time order. Each data point has a timestamp associated with it.

Object storage is a data storage architecture that manages data as objects, as opposed to other storage architectures like file systems or block storage. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.

Block storage is a technology that is used to store data files on Storage Area Networks (SANs) or cloud-based storage environments. It breaks a file into individual, evenly sized blocks and stores them as separate pieces of data, each with a unique address.

File storage, also called file-level or file-based storage, is a hierarchical storage methodology used to organize and store data. In this system, data is stored in files, files are organized in folders, and folders are organized under a hierarchy of directories and subdirectories.

A database schema is the "blueprint" of a database that describes how the data is organized, how the relations among the data are defined, and the constraints that apply to the data.

Data modeling is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of data objects, the associations between different data objects, and the rules that govern them.

Database replication is the process of creating and maintaining multiple copies of a database. The copies, known as replicas, are kept in sync with the primary database to provide redundancy and improve data availability.

Sharding is a type of database partitioning that separates one table’s rows into multiple different tables, known as partitions or shards. Each shard has the same schema, but a different subset of the data. Shards are spread across multiple servers.
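A common way to decide which shard holds a given row is to hash the row's key. The sketch below shows that idea in plain Python, with dictionaries standing in for separate servers. The shard count, the shard_for function, and the choice of SHA-256 are assumptions for illustration; real systems may instead use range-based or directory-based schemes.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical number of shards


def shard_for(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Each dictionary stands in for a separate server holding one shard.
shards = {i: {} for i in range(SHARD_COUNT)}


def put(key, row):
    shards[shard_for(key)][key] = row


def get(key):
    return shards[shard_for(key)].get(key)


for user_id in ("alice", "bob", "carol", "dave"):
    put(user_id, {"user_id": user_id})
```

Because the hash is deterministic, a read for a key is routed to the same shard that received the write. A plain modulo scheme like this one moves most keys when the shard count changes, which is one reason some systems prefer consistent hashing.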

Partitioning is the process of dividing a large database table into smaller, more manageable pieces, called partitions. Unlike sharding, the partitions typically remain on a single database server, and the database still treats the table as a single logical entity.

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability, and Partition Tolerance.

Data Consistency

Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Data written to a database must be valid according to all defined rules, including constraints, cascades, and triggers.

Data Availability

In the context of distributed systems, availability means that the system is able to process requests and provide a response. An available system is one that is responsive, even if some of its nodes are down or it is experiencing a network partition.

MySQL is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter, and "SQL", the acronym for Structured Query Language.

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance.

MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.

Redis (Remote Dictionary Server) is an in-memory data structure store, used as a database, cache, and message broker. It supports various data structures such as strings, hashes, lists, sets, and sorted sets.

Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Object-Relational Mapping (ORM) is a programming technique for converting data between a relational database and the objects of an object-oriented programming language, two environments whose type systems are otherwise incompatible. This creates a "virtual object database" that can be used from within the programming language.

A query planner (or query optimizer) is a component of a database management system that attempts to determine the most efficient way to execute a given query. It does this by considering the possible query plans and choosing the one with the lowest estimated cost.

Database Migration

In software development, a database migration refers to the management of incremental, reversible changes to a relational database schema. Migrations are used to keep the database schema in sync with the application code.

Connection Pooling

Connection pooling is a technique used to maintain a cache of database connections that can be reused for future requests. Opening a new database connection for every request is an expensive and time-consuming operation.
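The idea can be sketched in a few lines of Python. The ConnectionPool class below is a hypothetical, minimal pool built on the standard queue and sqlite3 modules; production pools usually add health checks, timeouts, and connection recycling, none of which are shown here.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical minimal pool: opens all connections up front and reuses them."""

    def __init__(self, database, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by other threads.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it to the pool instead of closing it


pool = ConnectionPool(":memory:", size=1)

with pool.connection() as first:
    first_id = id(first)
    answer = first.execute("SELECT 1 + 1").fetchone()[0]

with pool.connection() as second:
    reused = id(second) == first_id  # the same connection object is handed out again
```

With a pool size of one, the second request receives the very connection the first request returned, so no new connection is opened.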

ACID (Atomicity, Consistency, Isolation, Durability) and BASE (Basically Available, Soft state, Eventually consistent) are two different models for database design. ACID prioritizes consistency, while BASE prioritizes availability.

BASE (Basically Available, Soft state, Eventually consistent) is a data system design model that prizes availability over consistency. It is often used in distributed systems where high availability is critical.

Eventual Consistency

Eventual consistency is a consistency model used in distributed computing that guarantees that, if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value.

Strong Consistency

Strong consistency is a consistency model where all accesses to a data item are guaranteed to see the most recent completed write. Once a write is complete, any subsequent read will see that value.

A Relational Database Management System (RDBMS) is a program that allows you to create, update, and administer a relational database. Most RDBMSs use the SQL language to access the database.

In database theory, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query can be used to simplify complex queries or to restrict access to data.
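As a sketch of the access-restriction use case, the example below defines a view over a hypothetical employees table that exposes names and departments but leaves out the salary column. It uses Python's sqlite3 module, and the table and view names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', 'eng', 120), (2, 'Grace', 'eng', 130), (3, 'Linus', 'ops', 100);
    -- The view exposes names and departments but hides the salary column.
    CREATE VIEW employee_directory AS
        SELECT name, dept FROM employees;
    """
)

# The view is queried exactly like a table.
directory = conn.execute(
    "SELECT name FROM employee_directory WHERE dept = 'eng' ORDER BY name"
).fetchall()

# Column names visible through the view.
view_columns = [
    col[0] for col in conn.execute("SELECT * FROM employee_directory").description
]
```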

Stored Procedure

A stored procedure is prepared SQL code that is saved in the database so that it can be reused over and over again. It is a subroutine available to applications that access a relational database system.

A database trigger is a special stored procedure that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for maintaining the integrity of the information on the database.
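The following sketch shows a trigger that records every price change in an audit table. It uses SQLite through Python's sqlite3 module; the products and price_audit tables and the trigger name log_price_change are hypothetical, and trigger syntax differs between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER);
    CREATE TABLE price_audit (product_id INTEGER, old_price INTEGER, new_price INTEGER);

    -- Runs automatically after any update that touches the price column.
    CREATE TRIGGER log_price_change
    AFTER UPDATE OF price ON products
    BEGIN
        INSERT INTO price_audit VALUES (OLD.id, OLD.price, NEW.price);
    END;

    INSERT INTO products VALUES (1, 100);
    UPDATE products SET price = 120 WHERE id = 1;
    """
)

audit_rows = conn.execute("SELECT * FROM price_audit").fetchall()
```

The application only issued an UPDATE; the audit row was written by the database itself.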

Data integrity is the maintenance of, and the assurance of the accuracy and consistency of, data over its entire life cycle. It is a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data.

In SQL, a constraint is a rule that is applied to a column or a table to limit the type of data that can go into it. This ensures the accuracy and reliability of the data in the database.

Referential Integrity

Referential integrity is a property of data stating that all its references are valid. For a relational database, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children.

A composite key is a primary key that consists of two or more columns. Each column on its own may not be unique, but the combination of the columns is guaranteed to be unique.
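For example, in a hypothetical enrollments table a student may appear many times and a course may appear many times, but each student-course pair may appear only once. A minimal sketch with Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the primary key is the pair (student_id, course_id).
conn.execute(
    """
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,
        course_id INTEGER NOT NULL,
        grade TEXT,
        PRIMARY KEY (student_id, course_id)
    )
    """
)
conn.execute("INSERT INTO enrollments VALUES (1, 101, 'A')")
conn.execute("INSERT INTO enrollments VALUES (1, 102, 'B')")  # same student, new course
conn.execute("INSERT INTO enrollments VALUES (2, 101, 'C')")  # same course, new student

try:
    conn.execute("INSERT INTO enrollments VALUES (1, 101, 'F')")  # duplicate pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
```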

A surrogate key is a unique identifier for a record in a database that has no business meaning. It is typically an auto-incrementing integer or a UUID (Universally Unique Identifier).

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Its developers describe it as the most used database engine in the world, built into all mobile phones and most computers.

Vector Database

A vector database is a type of database designed to store, manage, and search high-dimensional vector embeddings. These embeddings are mathematical representations of data, often generated by machine learning models.

Database Embedding

In machine learning, an embedding is a learned representation for text, images, or other data where items that have a similar meaning are positioned close to each other in a high-dimensional vector space.
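Closeness between embeddings is often measured with cosine similarity, and a vector search returns the stored items whose vectors are most similar to a query vector. The sketch below illustrates this in plain Python. The three-dimensional vectors are hand-written toy values, not the output of any real model, and the brute-force scan stands in for the approximate indexes that vector databases typically use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (assumed values); real embeddings come from a model
# and usually have hundreds or thousands of dimensions.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}


def nearest(query, k=1):
    """Return the k stored names most similar to the query vector (brute force)."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


top_matches = nearest([0.85, 0.15, 0.05], k=2)
```

Here the two vectors that point in nearly the same direction as the query are returned, while the unrelated one is left out.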

In-Memory Database

An in-memory database (IMDB) is a database management system that primarily relies on main memory (RAM) for data storage, in contrast to databases that store data on disk or SSDs.

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) workloads while still maintaining the ACID guarantees of a traditional database system.

Federated Database

A federated database system is a type of meta-database management system which transparently maps multiple autonomous database systems into a single federated database. It provides a unified interface to query and manage data from multiple, potentially heterogeneous, data sources.

Database Deadlock

A deadlock is a situation in which two or more transactions in a database are waiting for each other to release locks, preventing any of them from proceeding. Each transaction is waiting for a resource that the other transaction holds.

Database Locking

Locking is a mechanism used by database management systems to prevent a piece of data (like a row or a table) from being accessed in conflicting ways by more than one transaction at the same time, for example two transactions writing the same row simultaneously. This is essential for preventing data corruption in a multi-user environment.

Query Optimization

Query optimization is the process of choosing the most efficient way to execute a query in a database. This is handled by a component of the DBMS called the query optimizer or query planner.

Full-Text Search

Full-text search is a technique for searching a single computer-stored document or a collection in a full-text database. It examines all of the words in every stored document as it tries to match search criteria.

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. It is the most popular data structure used in full-text search engines.
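A minimal sketch of the structure in Python: each word maps to the set of document ids that contain it, and a multi-word search intersects those sets. The sample documents and the whitespace tokenizer are assumptions; real search engines add steps such as stemming, stop-word handling, and relevance ranking.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document id.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog tricks",
}


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


index = build_inverted_index(documents)


def search(*words):
    """Return sorted ids of documents that contain every given word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*postings)) if postings else []
```

A lookup such as search("quick", "dog") touches only the two posting sets for those words instead of scanning every document.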

Elasticsearch is a distributed, free and open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on top of the Apache Lucene library.

ACID Transactions

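An ACID transaction is a transaction that satisfies all four ACID properties: Atomicity, Consistency, Isolation, and Durability. The acronym itself does not contain the word "transaction", so the term is not redundant; it distinguishes transactions that carry these guarantees from operations in systems that offer weaker ones.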

A data mart is a subset of a data warehouse that is focused on a specific business line or team. It is a smaller, more manageable version of a data warehouse that is designed to serve the needs of a particular department.

CDC (Change Data Capture)

Change Data Capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. It captures row-level changes (inserts, updates, deletes) in a database and makes them available in a stream of events.

Transaction Log

A transaction log (also known as a write-ahead log or WAL) is a file in which the database management system records all changes made to the database before they are written to the main data files. It is a history of actions executed by a DBMS.

A dirty read is a situation where a transaction reads data that has been written by another transaction that has not yet been committed. If the writing transaction is later rolled back, the data that the reading transaction saw never officially existed, and any work based on it becomes invalid.

Transaction Isolation

Transaction isolation is one of the four key properties of a database transaction (as part of ACID). It determines how and when changes made by one transaction become visible to others. Higher isolation levels prevent more concurrency issues, but can reduce performance.

Data migration is the process of transferring data from one storage system, data format, or computer system to another. This can involve moving from an old database to a new one, or from an on-premises server to a cloud-based database.

NoSQL Search Database

A search database is a specialized type of NoSQL database optimized for full-text search and analysis of large volumes of text-based or other unstructured data. It uses an inverted index to provide fast query responses.

A database cursor is a control structure that enables traversal over the records in a database. Cursors facilitate subsequent processing in conjunction with the traversal, such as retrieval, addition and removal of database records.

Pessimistic Locking

Pessimistic locking is a locking strategy where a resource is locked from the time it is first accessed by a transaction until the transaction is finished. It assumes that concurrent transactions will likely conflict, so it locks resources preemptively.

Optimistic Locking

Optimistic locking is a concurrency control strategy that assumes multiple transactions can complete without affecting each other. Instead of locking a record, it checks for conflicts at the time of commit. If a conflict is detected, the transaction is rolled back.
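One widely used way to implement the conflict check is a version column: a write succeeds only if the row still has the version the writer originally read. The sketch below shows this with Python's sqlite3 module. The items table, the version column, and the helper functions are hypothetical, and here a stale write is simply reported to the caller, which could then retry or abort.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: version is incremented on every successful write.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 10, 1)")
conn.commit()


def read_item(item_id):
    return conn.execute(
        "SELECT stock, version FROM items WHERE id = ?", (item_id,)
    ).fetchone()


def update_stock(item_id, new_stock, expected_version):
    """Write only if nobody changed the row since it was read."""
    cursor = conn.execute(
        "UPDATE items SET stock = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_stock, item_id, expected_version),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows updated means the version was stale


# Two writers read the same version, then both try to write.
stock_a, version_a = read_item(1)
stock_b, version_b = read_item(1)
first_write = update_stock(1, stock_a - 1, version_a)   # succeeds, version becomes 2
second_write = update_stock(1, stock_b - 2, version_b)  # stale version, no row updated
final_row = read_item(1)
```

No lock is held between the read and the write; the conflict is detected only when the second writer tries to save its change.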

Horizontal Scaling

Horizontal scaling, or scaling out, means adding more machines to your pool of resources to spread the load. For a database, this involves distributing the data and queries across multiple servers.

Vertical Scaling

Vertical scaling, or scaling up, means adding more power (CPU, RAM, Storage) to an existing machine to handle more load.

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
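The classic illustration is a word count. The sketch below runs the three conceptual phases (map, shuffle, reduce) in a single Python process; the function names are hypothetical, and a real framework would run the map and reduce steps in parallel across the machines of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


lines = ["to be or not to be", "to do is to be"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across many workers without changing the result.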

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. The data is often characterized by the "Three Vs": Volume, Velocity, and Variety.

Database Administrator (DBA)

A Database Administrator (DBA) is an IT professional responsible for the installation, configuration, upgrading, administration, monitoring, maintenance, and security of databases in an organization.