Why Are There So Many Databases?
When I first started out as a data analyst, I honestly had no idea why so many different databases existed. I just need somewhere to store and query my data, right? Does it really matter whether I pull the information from this database or that one?
But as I got more involved in the technical side of things, I realised that database choice can matter a great deal. I've seen people lose their jobs more than once because of poor choices. Sometimes those poor engineers just needed a simple pocket guide to point them in the right direction. This article tries to be that guide: a brief explanation of every current database architecture (as of this writing), real databases you can experiment with, and examples of when to use each one, and when to avoid it at all costs! My writing is fairly opinionated and based on my own experience, so I'll be disappointed if fewer than three people are upset with me in the comments.
I divide contemporary database systems into nine major categories. Each category has its own section below.
1. Data warehouse
2. Data lake
3. Transactional database
4. Document database
5. Key-value store
6. Graph database
7. Time-series database
8. Vector database
9. Multi-model database
Data Warehouse
Data warehouses have become the go-to option for most modern data engineers thanks to the growth of big data over the past ten to fifteen years.
Data warehouses are built to handle heavy analytical workloads: you pipe all your analytical data sources into them, and they excel at running complex aggregate queries over enormous volumes of data.
Cloud-native systems are by far the most popular type of data warehouse, letting you scale processing power and storage up and down to your heart's content. There are plenty of vendor-specific options, such as Synapse Analytics (Azure), Redshift (AWS), and BigQuery (Google). The easiest way to pick between them is to look at where most of your data comes from: use BigQuery if it's from a Google service, Redshift if it's from AWS, and so on.
That said, Snowflake is a popular option right now and isn't tied to a single vendor. Its ease of use helped it gain popularity, but the cloud vendors' own solutions have been catching up lately. I expect Snowflake's popularity to decline over the next five years.
DuckDB is an intriguing data warehouse because it's designed to run locally rather than in the cloud. That's also its challenge: data warehouses only beat conventional relational databases when they have to process large amounts of data, and most local machines can only handle around 10GB at a time. I'm still trying to find a good use case for it.
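To give a feel for what an analytical aggregate query looks like, here's a minimal DuckDB sketch in Python. The `events.parquet` file and its columns are hypothetical, and I'm assuming the `duckdb` package is installed.

```python
# Minimal DuckDB sketch: run an analytical aggregate over a local Parquet file.
# Assumes `pip install duckdb` and a hypothetical events.parquet in the working directory.
import duckdb

rows = duckdb.sql("""
    SELECT country, count(*) AS events, avg(duration_ms) AS avg_duration
    FROM 'events.parquet'
    GROUP BY country
    ORDER BY events DESC
""").fetchall()

for country, events, avg_duration in rows:
    print(country, events, avg_duration)
```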
When to use it: If your company prioritises business intelligence and data analysis and has a large amount of data (at least 100 terabytes).
When to avoid: If you don't have big data.
Data Lake
Data lakes are usually thought of as an alternative to data warehouses. A data lake is more of an architectural pattern than a database in the traditional sense, and it's quickly gaining traction thanks to Apache Iceberg. The basic concept is really straightforward: store all of your raw data in inexpensive file storage (like AWS S3) and use specialised engines to query it directly.
The modern data lake approach usually consists of three elements:
- Storage: This is almost always cloud object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage. It's far cheaper than most database storage, which is one of the main reasons to move from a warehouse to a lake. For instance, 1TB of uncompressed data costs roughly $25 per month in AWS S3 versus roughly $180 per month in AWS Redshift. With a petabyte of data, that's a saving of a little under $2 million a year (see the quick calculation after this list).
- Table formats: These give plain files database-like capabilities. The top contenders in this space are Iceberg, Delta Lake, and Hudi, which offer features like ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema evolution, and time travel (querying previous versions of the data).
- Query engines: Programs that let you run SQL or other analyses directly on your stored data. Common options include Trino, Spark, and cloud-native solutions like Athena or BigQuery External Tables.
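Here's the back-of-the-envelope storage maths from the bullet above as a few lines of Python. The per-TB prices are the rough figures quoted earlier, not official pricing.

```python
# Back-of-the-envelope storage cost comparison using the rough per-TB figures quoted above.
S3_PER_TB_MONTH = 25        # ~$25/TB/month in S3
REDSHIFT_PER_TB_MONTH = 180  # ~$180/TB/month in Redshift

terabytes = 1000  # one petabyte
monthly_saving = (REDSHIFT_PER_TB_MONTH - S3_PER_TB_MONTH) * terabytes
annual_saving = monthly_saving * 12

print(f"Monthly saving: ${monthly_saving:,}")  # $155,000
print(f"Annual saving:  ${annual_saving:,}")   # $1,860,000 -- a little under $2M
```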
Revolut was the first place I saw a data lake implemented well, and I must admit that as an end user, I found it to be quite practical. It was great for dismantling data silos and allowing analysts to access data from any location (as long as they had permission, which was Revolut's major problem). If done correctly, the expenses can also be far lower than those of traditional data warehouses.
However, the setup of data lakes is far more complicated. A data lake is not necessary for the typical company, and I've read of several "data swamps" where businesses have thrown everything onto S3 without adequate governance or a well-defined query strategy. Additionally, I've had problems with Trino, the most widely used query engine at the moment. When it comes to larger-than-memory operations, it has several irksome limits when compared to purpose-built databases. It can occasionally simply fail by reaching its stage count limit and can be really slow.
While I'm at it, I should mention the concept of a "data lakehouse". It started out as a marketing buzzword from companies like Databricks. It's essentially just a data lake with a stronger focus on querying capabilities and a few extra data warehouse features bolted on. Given how little it differs from a conventional data lake, giving it its own name seems absurd, but hey, if it sells, it sells.
When to use it: When you're working with so much data that traditional warehouses become prohibitively expensive. It's also excellent for businesses with a variety of data types that don't fit neatly into a warehouse's columnar structure, or when you want to keep raw data in its original format while still being able to query it.
When to avoid: When you need consistently low-latency query performance or when dealing with small datasets (less than a few TB). Even for basic queries, query initialisation latency in a data lake can be several seconds. Also stay away if your company lacks dedicated data engineering staff, because implementing a data lake successfully requires a high level of technical know-how.
Transactional Database
These databases have existed for ages and are likely to outlive every other database type on this list. Transactional databases, also known as relational databases, are typically used to run your actual applications, while data warehouses are used to analyse your data.
They provide ACID guarantees, which are crucial when working with things like financial transactions, since they ensure your data stays consistent even when several operations are happening at once.
Since PostgreSQL is open source, incredibly powerful, and comes with built-in support for JSON, spatial data, full-text search, and more, I think it's the best option for the majority of applications. Another popular choice is MySQL, often selected because "that's what we've always used." MySQL is still perfectly fine for 99 per cent of applications, but PostgreSQL overtook it in functionality years ago.
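To show what those ACID guarantees buy you in practice, here's a minimal PostgreSQL sketch using the psycopg2 driver: a money transfer that either fully commits or fully rolls back. The connection string and the accounts table are made up for illustration.

```python
# Minimal sketch of an atomic money transfer in PostgreSQL via psycopg2.
# The DSN and the `accounts(id, balance)` table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
try:
    with conn:                      # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (100, 1),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (100, 2),
            )
    # Either both updates are now visible to other connections, or neither is.
finally:
    conn.close()
```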
SQLite deserves a special mention: it's an embedded database that runs directly inside your application process rather than as a client-server database. Because your entire database lives in a single file, backups and deployment are very easy. It's widely regarded as the most deployed database in the world and the go-to embedded option for desktop and mobile apps.
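Because SQLite ships with Python's standard library, the single-file idea is easy to see for yourself (the file name here is purely illustrative):

```python
# SQLite: the whole database lives in one local file (app.db here, purely illustrative).
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("remember to back up app.db",))
conn.commit()

for row in conn.execute("SELECT id, body FROM notes"):
    print(row)

conn.close()
```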
"Distributed SQL" systems, such as CockroachDB, have been more popular recently. They are intended to address the scaling issues of conventional relational databases, primarily by automatically dividing your data among several servers. Theoretically nice, however it adds a lot of complexity and is only required if
- Even with appropriate optimisation, sharding, and read replicas, you're still running into serious scalability issues with normal PostgreSQL (usually 5TB+ of data with high write throughput).
- Even in the event of regional outages, your application must continue to function (multi-region resilience).
- You must keep user latency low across several geographical areas.
- Your company has the engineering know-how to solve problems with distributed systems.
The operational difficulty of administering a distributed SQL database is substantial, and the majority of systems never achieve this scale.
When to use it: For almost any application that requires dependable transactions and structured data. Anywhere data consistency is important, including banking systems, e-commerce websites, and reservation platforms.
When to avoid: If your application needs extreme write performance at a scale larger than a single node can handle, or if you're working with unstructured data that doesn't fit cleanly into tables. That said, most businesses that claim to have "outgrown" SQL haven't; they simply haven't optimised their database properly.
Document Database
Document databases gained a lot of popularity between 2010 and 2015 thanks to the whole "NoSQL" movement. The core principle is straightforward: instead of arranging data into rigid tables, you store flexible JSON-like "documents", each of which can have a completely different shape.
The largest player in this space is MongoDB, and they've done a great job of attracting engineers who don't want to deal with database schemas. It does remind me of the Coachman luring boys to Pleasure Island, though, considering the havoc that schema-less databases can create.
All of this flexibility eventually catches up with you, and as your application gets serious, you'll probably need to untangle the mess and put schema validations in place to prevent future data inconsistencies.
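For a feel of what "schema-less" looks like day to day, here's a small pymongo sketch: two documents with completely different shapes go into the same collection without complaint, which is exactly the flexibility that later needs taming with validation. The connection URI and collection names are assumptions for illustration.

```python
# Schema-less in practice: two differently shaped documents in one MongoDB collection.
# Connection URI, database, and collection names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["app"]["users"]

users.insert_one({"name": "Ada", "email": "ada@example.com"})
users.insert_one({"name": "Bob", "prefs": {"theme": "dark"}, "tags": ["beta"]})

# Both come back fine -- but nothing stopped the shapes from diverging.
for doc in users.find({}, {"_id": 0}):
    print(doc)
```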
Firestore, Google's document database, is closely tied to their Firebase platform. I got to know it while working at a small mobile app company. It offers real-time updates by default, which is a genuinely great and often overlooked feature. It also turns out to be expensive: turning that feature off cut our bill significantly.
In my opinion, document databases are far more common than they ought to be, mostly because "NoSQL" still looks good on an engineer's CV. I expect usage of this database type to decline over the next five years.
When to use it: For content management systems, quick prototyping, or applications where your data structure is genuinely unpredictable and ever-changing. Great when your data naturally fits a document structure (hierarchical, nested, variable attributes).
When to avoid: If you're only using it because you can't be bothered to structure your data.
Key-Value Store
These are the simplest and fastest databases around. It all comes down to this: you give me a key, I give you back a value. That's it. Simple lookups, no relationships, no complex querying. If you know Python, a key-value store is basically a persistent, distributed dictionary.
On-disk and in-memory key-value stores are the two primary varieties.
Redis is the industry leader for in-memory stores, with a market share of more than 94%. Redis has probably been used in some capacity in every business application I've worked on over the past ten years. It has evolved (somewhat) beyond the basic key-value structure to support data structures like lists, sets, and sorted sets, and it's most commonly used for caching, session storage, or rate limiting.
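As a sketch of the typical caching and rate-limiting usage, here's what that looks like with the redis-py client; the host, key names, and the 100-requests-per-minute limit are made up for illustration.

```python
# Redis as a cache and a naive per-user rate limiter (redis-py client).
# Host/port, key names, and the 100-requests-per-minute limit are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache: store a rendered value for 5 minutes.
r.setex("cache:user:42:profile", 300, '{"name": "Ada"}')
print(r.get("cache:user:42:profile"))

# Rate limit: count requests in the current one-minute window.
key = "ratelimit:user:42"
count = r.incr(key)
if count == 1:
    r.expire(key, 60)          # start the 60-second window on the first hit
if count > 100:
    print("429 Too Many Requests")
```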
The other popular key-value product is DynamoDB. In contrast to Redis's in-memory focus on speed, DynamoDB is a disk-based distributed database designed for scalability and durability over raw speed. Despite both being key-value stores, they serve very different purposes. DynamoDB can be useful when you need to store large volumes of simple, schema-less data with consistent access patterns, such as device logs, game state, or user preferences that are always accessed by user ID.
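By contrast, a DynamoDB sketch with boto3 shows the "simple access pattern" idea: everything is written and read through a single key. The table name and its key schema are assumptions for illustration.

```python
# DynamoDB via boto3: put and get items by a single key.
# The table name "user_preferences" and its "user_id" key are hypothetical.
import boto3

table = boto3.resource("dynamodb").Table("user_preferences")

table.put_item(Item={"user_id": "42", "theme": "dark", "language": "en"})
resp = table.get_item(Key={"user_id": "42"})
print(resp.get("Item"))
```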
When to use it: Redis should be your first choice for caching, session management, rate limiting, and basic real-time features like leaderboards or counters. DynamoDB makes sense when you have large-scale requirements but simple access patterns.
When to avoid: If you need to run complex queries or join your data. You'll essentially end up building a worse relational database by bolting an awkward query layer on top of your key-value store.
Graph Database
Graph databases are the OCaml of databases: nearly everyone finds them fascinating, but very few people actually need them. Their speciality is managing data where the relationships between things matter as much as the things themselves.
The difficulty with graph databases is that they are really good at solving a certain set of problems, but those problems are less common than most developers initially believe. I actually used Neo4j, the most popular graph database, for a personal project before switching to PostgreSQL after realising how specialised graph databases really are.
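For completeness, here's the kind of query graph databases are built for, via the official neo4j Python driver: a friend-of-a-friend lookup in Cypher. The connection details and the (:User)-[:FOLLOWS] schema are illustrative assumptions.

```python
# Friend-of-a-friend suggestions in Neo4j via the official Python driver.
# URI, credentials, and the (:User)-[:FOLLOWS]->(:User) schema are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (me:User {name: $name})-[:FOLLOWS]->(:User)-[:FOLLOWS]->(suggestion:User)
WHERE NOT (me)-[:FOLLOWS]->(suggestion) AND suggestion <> me
RETURN DISTINCT suggestion.name AS name
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query, name="Ada"):
        print(record["name"])

driver.close()
```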
When to use it: The best-known example is social networks, for instance finding connection paths between users or detecting clusters of connected users. Fraud detection is another great use case, where you need to spot suspicious patterns across transactions that don't appear to be related.
When to avoid: Almost everything else.
Time-Series Database
If you thought graph databases were specialised, wait until you meet time-series databases. These databases were built for one purpose only: managing data points collected over time. Think stock prices, sensor readings, monitoring metrics, or anything else where you're regularly recording a number with a timestamp.
The main innovation in time-series databases is how they organise and compress data. They're usually optimised for append-only workloads (you rarely update historical data) and time-based querying. They also typically come with built-in features for retention policies (automatically removing old data) and downsampling (aggregating data at coarser time granularities).
InfluxDB has led the category for many years. Its proprietary query language, Flux, is powerful but has a steep learning curve. I've found their cloud offering very pricey compared to the alternatives, but the open-source version is solid.
TimescaleDB takes a different approach: rather than being a standalone database, it's built as a PostgreSQL extension. This is my personal favourite approach, since you get all the time-series optimisations while keeping PostgreSQL's full capabilities for your other data needs.
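Because TimescaleDB is just PostgreSQL underneath, the time-series features are plain SQL. Here's a hedged sketch; the connection string and the `metrics(time, device_id, value)` table are made up, and it assumes the TimescaleDB extension is already installed.

```python
# TimescaleDB sketch: turn a regular table into a hypertable and downsample with time_bucket.
# Connection string and the `metrics(time, device_id, value)` table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=app host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            time      TIMESTAMPTZ NOT NULL,
            device_id TEXT        NOT NULL,
            value     DOUBLE PRECISION
        )
    """)
    # Partition the table by time (TimescaleDB's core feature).
    cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)")
    # Downsample: average value per device in 5-minute buckets.
    cur.execute("""
        SELECT time_bucket('5 minutes', time) AS bucket, device_id, avg(value)
        FROM metrics
        GROUP BY bucket, device_id
        ORDER BY bucket DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```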
When to use it: When collecting metrics from IoT devices, infrastructure monitoring, or any other system where you need to track numbers as they change over time and look for patterns. Especially useful when you have a large volume of data and need to store and query it efficiently.
When to avoid: For ordinary application data that happens to contain timestamps. Just because your data has a "created_at" column doesn't mean you need a time-series database; almost every database handles timestamps with ease. Time-series databases are specialised tools for ingesting large volumes of metrics, not general-purpose data stores.
Vector Database
The AI explosion has made vector databases the hottest new database category. They specialise in storing and searching numerical representations of content that capture semantic meaning.
If you've used ChatGPT or Claude, you've interacted with systems that translate language into these vector representations. Being able to search for similar vectors efficiently is what enables semantic search: finding texts with similar meaning rather than just matching keywords.
Pinecone was a pioneer in this space with its fully managed vector database and API-first approach. Milvus is the leading open-source alternative, offering more configuration options for different vector index types.
As you'd imagine, this market is moving very quickly. Most conventional database providers have started adding vector capabilities; PostgreSQL, for instance, now offers respectable vector support through the pgvector extension. For many simpler use cases this is enough, with no need for a specialised vector database at all.
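If pgvector covers your needs, the whole thing stays inside PostgreSQL. A minimal hedged sketch follows; the table name and the tiny 3-dimensional vectors are purely illustrative (real embeddings have hundreds or thousands of dimensions), and it assumes the pgvector extension is available on the server.

```python
# pgvector sketch: store embeddings in PostgreSQL and find the nearest neighbour.
# Table name and the tiny 3-dimensional vectors are illustrative; real embeddings are much larger.
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        SERIAL PRIMARY KEY,
            body      TEXT,
            embedding vector(3)
        )
    """)
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s), (%s, %s)",
        ("about cats", "[0.9, 0.1, 0.0]", "about finance", "[0.0, 0.2, 0.9]"),
    )
    # `<->` is pgvector's Euclidean-distance operator; smallest distance = most similar.
    cur.execute(
        "SELECT body FROM documents ORDER BY embedding <-> %s LIMIT 1",
        ("[0.8, 0.2, 0.1]",),
    )
    print(cur.fetchone())
conn.close()
```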
When to use it: When building a retrieval-augmented generation (RAG) system for LLMs, an AI-powered search engine that needs to understand meaning rather than just match keywords, or any other application that needs to find "things like this other thing" using embeddings from an AI model.
When to avoid: When you don't need "fuzzy" matching or semantic understanding. If exact matching, filtering, or basic keyword search covers your needs, stick with traditional databases and save yourself the trouble and expense.
Multi-Model Database
These databases try to be everything to everyone. The basic idea is intriguing: why use multiple specialised databases when one database can support multiple data models? They usually combine key-value, document, graph, and sometimes relational capabilities in a single system.
FaunaDB is arguably the most ambitious of them, offering document storage with strong consistency guarantees, graph-like relationships, and temporal versioning. The drawback? You'll need to learn a fully proprietary query language (FQL). Oh, and it's a jack of all trades, master of none.
I don't see many use cases for these databases, because the most popular options in the other categories are so easy to integrate with one another. Multi-model databases often provide a watered-down version of each model, and you end up running into limitations that wouldn't exist in a database built specifically for that purpose.
When to use it: Honestly, I'm not sure.
When to avoid: When your use case is obviously dominated by one data model, or when performance is crucial.
In conclusion
That was my personal tour through the vast and ever-growing world of databases. Don't worry if you're feeling overwhelmed by all the choices. The database landscape has evolved a lot over the past ten years, and keeping up with every new alternative probably isn't worth your time. Instead, make sure you have a general understanding of each category, and only become an expert in a particular database when you actually need to.
Despite all the options, when starting a new application I would still recommend PostgreSQL, just as an expert would have twenty years ago. It will handle 90% of use cases flawlessly. Only look at alternative databases when you have a specific, validated need that PostgreSQL can't meet for your minimum viable product (MVP). Once your application has matured, you can look for optimisations, which may include replacing PostgreSQL.
Keep in mind that any database choice is a long-term commitment. The "boring" option is often the best one, especially for startups and smaller teams. Save your attention for the parts of your system that will actually set you apart from the competition.
Finally, if someone suggests using a blockchain as your database, gently but firmly escort them off the premises.