Multiple Catalog
- RFC PR: datafuselabs/databend#8255
- Tracking Issue: datafuselabs/databend#0000
Summary
Allow users to maintain multiple catalogs for the databend.
Motivation
Databend organize data in three layers:
catalog -> database -> table
catalog
: the biggest layer for databend, contains all databases and tables, provided byCatalog
database
: the container of tables, provided byDatabase
table
: the smallest unit in databend, provided byTable
By default, all databases and tables will be stored in default
catalog (powered by metasrv
).
Databend supports multiple catalogs now, but only in a static way.
To allow accessing the hive
catalog, users need to configure hive
inside databend-query.toml
in this way:
[catalog]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"
Users can't add/alter/remove the catalogs during runtime.
By allowing users to maintain multiple catalogs for the databend, we can integrate more catalogs like iceberg
more quickly.
Guide-level explanation
After this RFC has been implemented, users can create new catalogs like:
CREATE CATALOG my_hive
TYPE='hive'
CONNECTION = (URL='<hive-meta-store>' THRIFT_PROTOCOL=BINARY);
SELECT * FROM my_hive.DB.table;
Besides, users can alter or drop a catalog:
DROP CATALOG my_hive;
Users can add more catalogs like:
CREATE CATALOG my_iceberg
TYPE='iceberg'
CONNECTION = (URL='s3://my_bucket/path/to/iceberg');
SELECT * FROM my_iceberg.DB.table;
With this feature, users can join data from different catalogs now:
select
my_iceberg.DB.purchase_records.Client_ID,
my_iceberg.DB.purchase_records.Item,
my_iceberg.DB.purchase_records.QTY
from my_hive.DB.vip_info
inner join my_iceberg.DB.purchase_records
on my_hive.DB.vip_info.Client_ID = my_iceberg.DB.purchase_records.Client_ID;
Reference-level explanation
Databend has a framework for multiple catalogs now. The only change for us is to store catalog-related information in metasrv instead.
To make it possible to start a query without metasrv
, we will also support configuring catalogs in config like:
[catalogs.my_hive]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"
[catalogs.my_iceberg]
URL = "s3://bucket"
Static catalogs will always be loaded from configs and can't be altered or dropped.
Drawbacks
None.
Rationale and alternatives
None.
Prior art
Presto
Presto is an open-source SQL query engine that's fast, reliable, and efficient at scale. It doesn't have persisted states, so all its connectors will be configured.
Take iceberg as an example:
connector.name=iceberg
hive.metastore.uri=hostname:port
iceberg.catalog.type=hive
While using:
USE iceberg.tpch;
CREATE TABLE IF NOT EXISTS ctas_nation AS (SELECT * FROM nation);
DESCRIBE ctas_nation;
Unresolved questions
None.
Future possibilities
Iceberg Catalog
Discussed in RFC Iceberg External Table
Delta Sharing Catalog
Discussed in Tracking issues of integration with delta sharing