June 09, 2024

Types of Databases: Relational vs Columnar vs Document vs Graph vs Vector vs Key-value

As a DevOps engineer, database decisions land on your plate more often than you might expect. Development teams ask for recommendations. A service is slow and the root cause is the wrong database type for the access pattern. A new project needs infrastructure and the team has not thought through persistence yet. Understanding the database landscape means you can have an informed opinion instead of just provisioning whatever is requested.

This post covers the seven major database types, when each one fits, and the operational realities of running them in a cloud environment.

Relational Databases

Relational databases are the default choice for most data-driven applications. Data lives in tables: columns define attributes, rows define records. Tables are linked through primary and foreign keys, and queries are written in SQL.

The defining feature is ACID transactions: atomicity, consistency, isolation, and durability. Every operation is treated as all-or-nothing. The canonical example is a bank transfer: debiting one account and crediting another must both succeed or neither should happen. Relational databases enforce this guarantee at the engine level.

When to use: Structured data with clearly defined relationships, anywhere ACID compliance is required, transactional workloads, applications where data integrity matters more than raw write throughput.

Operational considerations: Most production relational workloads run on managed services (RDS, Aurora, Cloud SQL) to avoid managing replication, failover, backups, and engine upgrades manually. Scaling reads is straightforward through read replicas. Scaling writes is harder and eventually requires sharding or a different database type. Schema migrations on large tables require careful planning: use online schema change tools (gh-ost, pt-osc, or native ALGORITHM=INPLACE) to avoid locking the table.

Popular options: PostgreSQL, MySQL, MariaDB Cloud-managed: Amazon RDS, Amazon Aurora, Azure SQL, Cloud SQL

Columnar Databases

Traditional relational databases read data row by row. For analytical queries that aggregate across billions of rows but only touch a handful of columns, that row-by-row scan is extremely wasteful. Columnar databases read data column by column, touching only the columns a query needs.

The performance difference is significant. A query calculating the average transaction amount across 10 billion rows reads only the amount column in a columnar store, versus scanning the entire row width in a relational engine. Columnar databases also apply compression aggressively per column (data within a single column is typically repetitive), which reduces both storage cost and I/O.

When to use: Analytics and reporting over large datasets, data warehouses, queries that aggregate across many rows but select few columns. Not a replacement for relational databases: columnar stores are optimised for read-heavy analytical workloads, not transactional writes.

Operational considerations: Columnar databases typically ingest data in batches rather than individual row inserts. In AWS environments, Redshift is the standard choice, often loaded from S3 via COPY. Retention and partitioning strategies matter: old data that is never queried should be archived or tiered to reduce cost.

Popular options: Apache Cassandra, ClickHouse, Apache Druid, Amazon Redshift, Google BigQuery

Document Databases

Document databases store data as JSON-like documents rather than rows and columns. Each document can have a different structure, and nested objects and arrays are first-class citizens. There is no schema to define or migrate before inserting data.

The real benefit is development speed for data with irregular or evolving shapes. A product catalogue where each product has a different set of attributes maps naturally to documents: add a new attribute to one product type without touching any schema definition.

When to use: Unstructured or semi-structured data, schemas that evolve frequently, catalogues, content management, user profiles, event data with variable fields.

Operational considerations: Document databases scale horizontally through sharding, which is one reason they became popular for high-write-volume applications. MongoDB Atlas automates much of the operational complexity. The trade-off is that complex relational queries (multi-table joins, strict consistency guarantees) are harder or impossible to express cleanly. Plan your document structure around your access patterns from the start. Retrofitting a document model is expensive.

Popular options: MongoDB, Amazon DocumentDB, Couchbase Cloud-managed: MongoDB Atlas, Amazon DocumentDB

Graph Databases

Graph databases model data as nodes (entities) and edges (relationships between entities). Edges have a direction and can carry properties. The query engine is optimised for traversing these relationships efficiently, even across many levels of indirection.

Relational databases struggle with multi-hop relationships. Finding all engineers who share a manager with another engineer, or tracing a dependency chain five levels deep, requires recursive CTEs or multiple self-joins that perform badly at scale. Graph databases handle these queries natively.

When to use: Social graphs, fraud detection (finding connected accounts), recommendation engines (related products, similar users), network topology (infrastructure dependency mapping), access control graphs (who can reach what through which path).

Operational considerations: Graph databases are specialised tools. The data modelling approach is fundamentally different from relational thinking: you design around relationships and traversal patterns, not around normalised tables. Running graph databases in production requires specific operational expertise. Most teams use managed services (Neo4j AuraDB, Amazon Neptune) and treat the graph store as a specialised layer rather than a primary store.

Popular options: Neo4j, Amazon Neptune

Vector Databases

Vector databases store data as high-dimensional vectors: numerical representations of content (text, images, audio, embeddings) generated by machine learning models. Instead of exact matches, they answer similarity queries: find the 10 vectors closest to this query vector in the embedding space.

This is the foundational data layer for Retrieval-Augmented Generation (RAG) systems, semantic search, and recommendation engines. When an LLM-powered application needs to retrieve relevant context from a knowledge base, it converts the user query into an embedding vector and queries the vector database for similar documents.

When to use: Semantic search, AI/ML applications with embedding-based retrieval, recommendation engines based on content similarity, image or document similarity matching.

Operational considerations: Vector databases are relatively new in the managed services landscape. Latency matters: vector search is approximate nearest-neighbour search, and index type (HNSW, IVF) affects the speed/recall trade-off. For many use cases, the pgvector PostgreSQL extension is sufficient and simpler to operate than a standalone vector database, keeping your data in a familiar system with full SQL support alongside the vector search.

Popular options: Milvus, Pinecone, Weaviate, pgvector (PostgreSQL extension), Amazon OpenSearch with k-NN

Key-Value Databases

Key-value databases store data as simple key-value pairs. The key is a unique identifier, the value can be anything from a string to a complex object. Lookup is always by key, and operations are extremely fast because there is no query parser, no join engine, and no schema validation overhead.

When to use: Session management (store session data with a TTL), rate limiting (increment counters per user per time window), caching (cache the result of an expensive database query or API call), feature flags, distributed locking.

Operational considerations: Redis is the dominant choice for in-memory key-value storage. It supports rich data structures (lists, sets, sorted sets, hashes) beyond simple strings, which is why it can serve as a queue, a pub/sub broker, and a cache simultaneously. Persistence modes (RDB snapshots, AOF logs) trade durability against performance. For production Redis, always run at least a primary plus one replica, and configure Sentinel or use a managed service (ElastiCache, Redis Cloud) for automatic failover. The most common operational problem is memory pressure: Redis evicts keys when memory fills up, and the eviction policy you choose determines which keys survive.

Popular options: Redis, Amazon DynamoDB, Memcached Cloud-managed: Amazon ElastiCache, Amazon DynamoDB, Redis Cloud

Time-Series Databases

Time-series databases are optimised for data indexed primarily by timestamp: measurements, events, and metrics collected continuously over time. They provide fast ingestion of high-cardinality time-stamped data, efficient downsampling for older data, and built-in retention policies that automatically drop data older than a defined window.

In infrastructure and operations contexts, time-series databases are everywhere. Prometheus stores your cluster and application metrics. InfluxDB is commonly used for IoT sensor data and custom application metrics. Grafana queries these stores to build dashboards.

When to use: Infrastructure metrics (CPU, memory, disk, network), application performance monitoring, IoT sensor data, financial tick data, any data where the timestamp is a primary dimension and you need fast aggregation over time windows.

Operational considerations: Time-series databases handle ingestion and aggregation efficiently but are not designed for arbitrary relational queries or long-term retention of raw data. Prometheus has a retention window (typically 15-30 days) and is designed for operational monitoring, not historical analysis. For long-term storage, Thanos or Cortex extends Prometheus by writing metrics to object storage (S3) and enabling global querying across multiple Prometheus instances.

Popular options: Prometheus, InfluxDB, TimescaleDB (PostgreSQL extension), Amazon Timestream

Summary

Each database type has a specific strength. Using the wrong one for the job does not just cause performance problems: it creates operational debt that compounds over time.

Database Type	Best For	Managed Services
Relational	Structured data, ACID transactions, complex queries	RDS, Aurora, Cloud SQL
Columnar	Large-scale analytics, data warehousing	Redshift, BigQuery, ClickHouse Cloud
Document	Flexible schema, high write throughput, nested data	MongoDB Atlas, DocumentDB
Graph	Multi-hop relationship traversal, connected data	Neo4j Aura, Amazon Neptune
Vector	Semantic search, embedding-based retrieval, AI/ML	Pinecone, Weaviate, pgvector
Key-Value	Caching, sessions, rate limiting, distributed locking	ElastiCache, DynamoDB
Time-Series	Infrastructure metrics, monitoring, IoT data	Prometheus, InfluxDB, Timestream

The practical DevOps takeaway: most systems use more than one database type. A typical production stack might have PostgreSQL for the transactional core, Redis for caching and sessions, Prometheus for metrics, and an OpenSearch cluster for log indexing. The skill is knowing which database type solves which problem, and keeping the operational footprint as small as possible by using managed services wherever the trade-offs make sense.