Show HN: MyDuck Server – Supercharge MySQL and Postgres Analytics with DuckDB (github.com/apecloud)
32 points by fanyang01 on Nov 28, 2024 | 6 comments
Hello HN!

We're excited to announce MyDuck Server, an open-source project that seamlessly integrates the analytical power of DuckDB with your existing MySQL & Postgres databases.

*Backstory*

Currently, there are no fully satisfactory open-source OLAP solutions for MySQL & Postgres. In the MySQL ecosystem, HeatWave offers close integration, but it's a proprietary, commercial product from Oracle. The Postgres community has seen promising DuckDB-based extensions emerge, including the official pg_duckdb. However, extensions can introduce isolation concerns in mission-critical environments.

Consequently, many organizations resort to setting up complex and costly data movement pipelines using tools like Debezium, Flink, or other commercial solutions to replicate data from MySQL & Postgres to OLAP systems (e.g., Snowflake, BigQuery, ClickHouse) or Lakehouses (e.g., Delta Lake + Spark). This approach introduces significant operational overhead and expense.

Another emerging strategy is the zero-ETL approach, increasingly advocated by cloud providers. This model simplifies data integration by having the cloud provider manage the ETL pipelines, but it ties you to specific cloud ecosystems and services.

*Key features*

MyDuck Server offers a real-time analytical replica that leverages DuckDB's native columnar storage and processing capabilities. It operates as a separate server, ensuring isolation and minimizing impact on your primary database. Key features include:

- Easy Zero-ETL: Built-in real-time replication from MySQL & Postgres with no complex pipelines to manage. It feels like a standard MySQL replica or Postgres standby. With the Docker image, passing a connection string is enough.

- MySQL & Postgres Protocol Compatibility: We take protocol compatibility seriously and are working to make this project integrate well with the existing MySQL & Postgres ecosystem. Standard MySQL and PostgreSQL clients in many programming languages can already connect to MyDuck.

- HTAP Support: A standard database proxy can be deployed in front of a MySQL/Postgres primary and its MyDuck replica to route write operations to the primary and read operations to the replica. It just works.

- DuckDB SQL & Columnar I/O over Postgres Protocol: It's unnecessary to restrict ourselves to MySQL/Postgres's SQL expressiveness and row-oriented data transfer. The Postgres port accepts all DuckDB-valid SQL queries, and you can retrieve query results in columnar format via `COPY (SELECT ...) TO STDOUT (FORMAT parquet/arrow)`.

- Standalone Mode: It does not need to be run as a replica. It can also act as a primary server that brings DuckDB into server mode and accepts updates from multiple connections, breaking DuckDB's single-process limitation.
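To make the HTAP feature concrete, here is a minimal sketch of the read/write split that a database proxy performs in front of a primary and a MyDuck replica. This is illustrative only: real deployments would use an off-the-shelf proxy (e.g. ProxySQL or pgcat), and the prefix-based classifier below is a deliberate simplification.

```python
# Toy read/write router: writes go to the MySQL/Postgres primary,
# reads go to the MyDuck analytical replica. A real proxy handles
# transactions, CTEs, and prepared statements far more carefully.

WRITE_PREFIXES = ("insert", "update", "delete", "create", "alter", "drop", "truncate")

def route(sql: str) -> str:
    """Return which backend a statement should be sent to."""
    stripped = sql.lstrip()
    head = stripped.split(None, 1)[0].lower() if stripped else ""
    return "primary" if head in WRITE_PREFIXES else "replica"

print(route("INSERT INTO orders VALUES (1, 'book')"))  # primary
print(route("SELECT count(*) FROM orders"))            # replica
```

Anything the classifier routes to the replica is answered from DuckDB's columnar storage, while writes continue to flow through the primary and are replicated over.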

*Relevant Previous HN Threads*

- pg_duckdb [1] (https://news.ycombinator.com/item?id=41275751) is the official Postgres extension for DuckDB. It uses DuckDB as an execution engine to accelerate analytical queries by scanning Postgres tables directly.

- pg_mooncake [2] (https://news.ycombinator.com/item?id=41998247) is a Postgres extension that adds columnstore tables for PG. It uses pg_duckdb under the hood but stores data in Lakehouse formats (Iceberg & Delta Lake).

- BemiDB [3] (https://news.ycombinator.com/item?id=42078067) is also a DuckDB-based Postgres replica. Unlike us, they focus on storing data in Lakehouse format.

We believe MyDuck Server offers a compelling solution for those seeking high-performance analytics on their MySQL & Postgres data without the complexity and cost of traditional approaches. We're eager to hear your feedback and answer any questions you might have!

[0] https://github.com/apecloud/myduckserver

[1] https://github.com/duckdb/pg_duckdb

[2] https://github.com/Mooncake-Labs/pg_mooncake

[3] https://github.com/BemiHQ/BemiDB



*Update from the Authors:*

We have successfully tested this project with the official Model Context Protocol (MCP) server for Postgres. For those interested in integrating this project with LLMs, you can find the guide here: https://github.com/apecloud/myduckserver/blob/main/docs/tuto...

We were pleasantly surprised that the integration worked seamlessly on our first attempt. Moreover, we observed some fascinating behavior: DuckDB's clear and suggestive error messages proved incredibly helpful for the model. During our testing, the free-tier Claude Haiku model initially hallucinated table and column names, but it impressively corrected them based on DuckDB's suggestions.

This integration highlights the conveniences offered by this project:

- DuckDB’s native SQL support: We can create a table directly from a CSV URL without manually determining the schema.

- Standard Postgres protocol support: The official Postgres MCP server works out-of-the-box.
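The first convenience above can be sketched as follows. DuckDB's `read_csv_auto` infers the schema from the file itself, so a single statement creates a populated table; the table name and URL here are hypothetical, and the helper just builds the DDL that MyDuck's Postgres port would accept.

```python
# Build the one-statement "table from a CSV URL" DDL that DuckDB
# understands. read_csv_auto infers column names and types, so no
# manual schema is needed. Table name and URL are placeholders.

def create_table_from_csv(table: str, url: str) -> str:
    """Return DuckDB DDL that creates `table` from a remote CSV."""
    return f"CREATE TABLE {table} AS SELECT * FROM read_csv_auto('{url}')"

sql = create_table_from_csv("taxi_trips", "https://example.com/trips.csv")
print(sql)
# A standard Postgres client (e.g. psycopg) could then execute `sql`
# against MyDuck's Postgres port directly.
```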

We’d love to hear your thoughts and questions about LLM + MyDuck integration!


Bridging the gap between TP databases like PostgreSQL and AP databases like DuckDB is really amazing. Maybe the next step is building a serverless AP service on top of this and object storage.


I recently learned about pg_duckdb and I'm excited about this developing ecosystem.

Can you explain a bit about the difference between how this project works and what pg_duckdb offers?


Thanks for your interest! TL;DR: pg_duckdb is an extension that runs on the same server as Postgres, while MyDuck operates as a standalone replica that can run on a separate server.

Details:

pg_duckdb uses DuckDB as an execution engine while keeping the data stored in Postgres in its usual row-oriented format. It identifies queries that can be optimized using DuckDB and executes them there. This is achieved by registering a replacement scan in DuckDB, allowing it to directly scan Postgres data and convert it into an analytics-optimized columnar format on the fly.

MyDuck, on the other hand, functions as a standby server replicating changes from a primary Postgres server. The replicated data is stored in DuckDB's native columnar format. This approach leverages both DuckDB's optimized columnar storage and its execution engine, generally resulting in faster query performance than scanning Postgres data directly.

Additionally, pg_duckdb is currently limited to full table scans: it doesn't yet support skipping unnecessary data even with restrictive WHERE conditions (though there are plans to add index-based scans: https://github.com/duckdb/pg_duckdb/issues/243). In contrast, DuckDB's native storage, used by MyDuck, includes extensive support for data skipping (a.k.a. filter pushdown).
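The data skipping mentioned above can be illustrated with a toy model. Columnar storage keeps min/max statistics per row group, so a filter like `WHERE x > 100` can discard entire groups without reading them; this is a simplified sketch, not DuckDB's actual storage code.

```python
# Toy zone-map pruning: each row group records min/max for a column,
# and a predicate `x > threshold` skips groups whose max is too small.

def prune_row_groups(groups, threshold):
    """Keep only row groups that could contain rows with x > threshold."""
    return [g for g in groups if g["max"] > threshold]

groups = [
    {"id": 0, "min": 1,   "max": 50},
    {"id": 1, "min": 51,  "max": 120},
    {"id": 2, "min": 121, "max": 300},
]
survivors = prune_row_groups(groups, 100)
print([g["id"] for g in survivors])  # [1, 2]
```

Here the scan touches two of the three row groups instead of all three; on large tables with selective filters, the savings compound.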

Hope this clarifies the differences! Feel free to reach out with any more questions.


The current system diagram implies using DuckDB's default storage format directly. I wonder how well this would actually work with the proposed zero-ETL design of basically treating this as a live replica. I was under the impression that, as an OLAP solution, DuckDB makes performance compromises when it comes to writes, so wouldn't live replication become problematic?


Great question! Improving the speed of writing updates to DuckDB has been a significant focus for us. Early in the project, we identified that DuckDB is quite slow for single-row writes, as discussed in this issue: https://github.com/apecloud/myduckserver/issues/55

To address this, we implemented an Arrow-based columnar buffer that accumulates updates from the primary server. This buffer is flushed into DuckDB at fixed intervals (currently every 200ms) or when it exceeds a certain size threshold. This approach significantly reduces DuckDB's write overhead.
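The buffering scheme can be sketched like this. The sketch is a toy model under stated assumptions: the real implementation uses Arrow columnar buffers and DuckDB's bulk-ingest path, while here the sink, the row threshold, and the plain Python list stand in for those.

```python
# Toy model of the buffered write path: replicated row changes
# accumulate in memory and are flushed to the analytical store either
# when a size threshold is reached or when the flush interval (200ms
# in MyDuck) elapses, turning N single-row writes into one bulk write.

import time

class ReplicationBuffer:
    def __init__(self, flush_rows=4, flush_interval=0.2, sink=None):
        self.rows = []
        self.flush_rows = flush_rows
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.sink = sink or (lambda batch: None)  # stand-in for a DuckDB bulk insert
        self.flushed_batches = 0

    def append(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.flush_rows
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(self.rows)  # one bulk write for the whole batch
            self.flushed_batches += 1
            self.rows = []
        self.last_flush = time.monotonic()

buf = ReplicationBuffer()
for i in range(10):
    buf.append((i, f"row-{i}"))
buf.flush()  # drain the remainder
print(buf.flushed_batches)  # 3
```

With a threshold of 4 rows, ten appends produce two full batches plus one final partial flush, so the store sees three bulk writes instead of ten single-row ones.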

Additionally, we developed dedicated replication message parsers that write directly to the Arrow buffer, minimizing allocations.

A community contributor has validated and enhanced our approach: https://github.com/apecloud/myduckserver/pull/207

We plan to publish benchmarks on replication latency and query performance in the coming weeks. Stay tuned!





