Show HN: MyDuck Server – Supercharge MySQL and Postgres Analytics with DuckDB (github.com/apecloud)
32 points by fanyang01 on Nov 28, 2024 | 6 comments
Hello HN!

We're excited to announce MyDuck Server, an open-source project that seamlessly integrates the analytical power of DuckDB with your existing MySQL & Postgres databases.

*Backstory*

Currently, there are no fully satisfactory open-source OLAP solutions for MySQL & Postgres. In the MySQL ecosystem, HeatWave offers close integration, but it's a proprietary, commercial product from Oracle. The Postgres community has seen promising DuckDB-based extensions emerge, including the official pg_duckdb. However, extensions can introduce isolation concerns in mission-critical environments.

Consequently, many organizations resort to setting up complex and costly data movement pipelines using tools like Debezium, Flink, or other commercial solutions to replicate data from MySQL & Postgres to OLAP systems (e.g., Snowflake, BigQuery, ClickHouse) or Lakehouses (e.g., Delta Lake + Spark). This approach introduces significant operational overhead and expense.

Another emerging strategy is the zero-ETL approach, increasingly advocated by cloud providers. This model simplifies data integration by having the cloud provider manage the ETL pipelines, but it ties you to specific cloud ecosystems and services.

*Key features*

MyDuck Server offers a real-time analytical replica that leverages DuckDB's native columnar storage and processing capabilities. It operates as a separate server, ensuring isolation and minimizing impact on your primary database. Key features include:

- Easy Zero-ETL: Built-in real-time replication from MySQL & Postgres with no complex pipelines to manage. It feels like a standard MySQL replica or Postgres standby. With the Docker image, passing a connection string is enough.

- MySQL & Postgres Protocol Compatibility: We take protocol compatibility seriously and are working to make this project integrate well with the existing MySQL & Postgres ecosystem. Standard MySQL and PostgreSQL clients in many programming languages can already connect to MyDuck.

- HTAP Support: A standard database proxy can be deployed in front of a MySQL/Postgres primary and its MyDuck replica to route write operations to the primary and read operations to the replica. It just works.

- DuckDB SQL & Columnar I/O over Postgres Protocol: It's unnecessary to restrict ourselves to MySQL/Postgres's SQL expressiveness and row-oriented data transfer. The Postgres port accepts all DuckDB-valid SQL queries, and you can retrieve query results in columnar format via `COPY (SELECT ...) TO STDOUT (FORMAT parquet/arrow)`.

- Standalone Mode: It does not need to be run as a replica. It can also act as a primary server that brings DuckDB into server mode and accepts updates from multiple connections, breaking DuckDB's single-process limitation.
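To make the HTAP feature concrete, here is a minimal sketch of the read/write split that a database proxy performs in front of a primary and a MyDuck replica. This is illustrative only: real deployments would use an off-the-shelf proxy (e.g. ProxySQL or pgcat), and the prefix-based classifier below is a deliberate simplification.

```python
# Toy read/write router: writes go to the MySQL/Postgres primary,
# reads go to the MyDuck analytical replica. A real proxy handles
# transactions, CTEs, and prepared statements far more carefully.

WRITE_PREFIXES = ("insert", "update", "delete", "create", "alter", "drop", "truncate")

def route(sql: str) -> str:
    """Return which backend a statement should be sent to."""
    stripped = sql.lstrip()
    head = stripped.split(None, 1)[0].lower() if stripped else ""
    return "primary" if head in WRITE_PREFIXES else "replica"

print(route("INSERT INTO orders VALUES (1, 'book')"))  # primary
print(route("SELECT count(*) FROM orders"))            # replica
```

Anything the classifier routes to the replica is answered from DuckDB's columnar storage, while writes continue to flow through the primary and are replicated over.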

*Relevant Previous HN Threads*

- pg_duckdb [1] (https://news.ycombinator.com/item?id=41275751) is the official Postgres extension for DuckDB. It uses DuckDB as an execution engine to accelerate analytical queries by scanning Postgres tables directly.

- pg_mooncake [2] (https://news.ycombinator.com/item?id=41998247) is a Postgres extension that adds columnstore tables for PG. It uses pg_duckdb under the hood but stores data in Lakehouse formats (Iceberg & Delta Lake).

- BemiDB [3] (https://news.ycombinator.com/item?id=42078067) is also a DuckDB-based Postgres replica. Unlike us, they focus on storing data in Lakehouse format.

We believe MyDuck Server offers a compelling solution for those seeking high-performance analytics on their MySQL & Postgres data without the complexity and cost of traditional approaches. We're eager to hear your feedback and answer any questions you might have!

[0] https://github.com/apecloud/myduckserver

[1] https://github.com/duckdb/pg_duckdb

[2] https://github.com/Mooncake-Labs/pg_mooncake

[3] https://github.com/BemiHQ/BemiDB



*Update from the Authors:*

We have successfully tested this project with the official Model Context Protocol (MCP) server for Postgres. For those interested in integrating this project with LLMs, you can find the guide here: https://github.com/apecloud/myduckserver/blob/main/docs/tuto...

We were pleasantly surprised that the integration worked seamlessly on our first attempt. Moreover, we observed some fascinating behavior: DuckDB's clear and suggestive error messages proved incredibly helpful for the model. During our testing, the free-tier Claude Haiku model initially hallucinated table and column names, but it impressively corrected them based on DuckDB's suggestions.

This integration highlights the conveniences offered by this project:

- DuckDB’s native SQL support: We can create a table directly from a CSV URL without manually determining the schema.

- Standard Postgres protocol support: The official Postgres MCP server works out-of-the-box.
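The first convenience above can be sketched as follows. DuckDB's `read_csv_auto` infers the schema from the file itself, so a single statement creates a populated table; the table name and URL here are hypothetical, and the helper just builds the DDL that MyDuck's Postgres port would accept.

```python
# Build the one-statement "table from a CSV URL" DDL that DuckDB
# understands. read_csv_auto infers column names and types, so no
# manual schema is needed. Table name and URL are placeholders.

def create_table_from_csv(table: str, url: str) -> str:
    """Return DuckDB DDL that creates `table` from a remote CSV."""
    return f"CREATE TABLE {table} AS SELECT * FROM read_csv_auto('{url}')"

sql = create_table_from_csv("taxi_trips", "https://example.com/trips.csv")
print(sql)
# A standard Postgres client (e.g. psycopg) could then execute `sql`
# against MyDuck's Postgres port directly.
```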

We’d love to hear your thoughts and questions about LLM + MyDuck integration!


Bridging the gap between TP databases like PostgreSQL and AP databases like DuckDB is really amazing. Maybe the next step is building a serverless AP service on top of this and object storage.


I recently learned about pg_duckdb and I'm excited about this developing ecosystem.

Can you explain a bit about the difference between how this project works and what pg_duckdb offers?


Thanks for your interest! TL;DR: pg_duckdb is an extension that runs on the same server as Postgres, while MyDuck operates as a standalone replica that can run on a separate server.

Details:

pg_duckdb uses DuckDB as an execution engine while keeping the data stored in Postgres in its usual row-oriented format. It identifies queries that can be optimized using DuckDB and executes them there. This is achieved by registering a replacement scan in DuckDB, allowing it to directly scan Postgres data and convert it into an analytics-optimized columnar format on the fly.

MyDuck, on the other hand, functions as a standby server replicating changes from a primary Postgres server. The replicated data is stored in DuckDB's native columnar format. This approach leverages both DuckDB's optimized columnar storage and its execution engine, generally resulting in faster query performance than scanning Postgres data directly.

Additionally, pg_duckdb is currently limited to full table scans: it doesn't yet support skipping unnecessary data even with restrictive WHERE conditions (though there are plans to add index-based scans: https://github.com/duckdb/pg_duckdb/issues/243). In contrast, DuckDB's native storage, used by MyDuck, includes extensive support for data skipping (a.k.a. filter pushdown).
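The data skipping mentioned above can be illustrated with a toy model. Columnar storage keeps min/max statistics per row group, so a filter like `WHERE x > 100` can discard entire groups without reading them; this is a simplified sketch, not DuckDB's actual storage code.

```python
# Toy zone-map pruning: each row group records min/max for a column,
# and a predicate `x > threshold` skips groups whose max is too small.

def prune_row_groups(groups, threshold):
    """Keep only row groups that could contain rows with x > threshold."""
    return [g for g in groups if g["max"] > threshold]

groups = [
    {"id": 0, "min": 1,   "max": 50},
    {"id": 1, "min": 51,  "max": 120},
    {"id": 2, "min": 121, "max": 300},
]
survivors = prune_row_groups(groups, 100)
print([g["id"] for g in survivors])  # [1, 2]
```

Here the scan touches two of the three row groups instead of all three; on large tables with selective filters, the savings compound.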

Hope this clarifies the differences! Feel free to reach out with any more questions.


The current system diagram implies using DuckDB's default storage format directly. I wonder how well this would actually work with the proposed zero-ETL design of basically treating this as a live replica. I was under the impression that, as an OLAP solution, DuckDB makes performance compromises when it comes to writes, so wouldn't live replication become problematic?


Great question! Improving the speed of writing updates to DuckDB has been a significant focus for us. Early in the project, we identified that DuckDB is quite slow for single-row writes, as discussed in this issue: https://github.com/apecloud/myduckserver/issues/55

To address this, we implemented an Arrow-based columnar buffer that accumulates updates from the primary server. This buffer is flushed into DuckDB at fixed intervals (currently every 200ms) or when it exceeds a certain size threshold. This approach significantly reduces DuckDB's write overhead.
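The buffering scheme can be sketched like this. The sketch is a toy model under stated assumptions: the real implementation uses Arrow columnar buffers and DuckDB's bulk-ingest path, while here the sink, the row threshold, and the plain Python list stand in for those.

```python
# Toy model of the buffered write path: replicated row changes
# accumulate in memory and are flushed to the analytical store either
# when a size threshold is reached or when the flush interval (200ms
# in MyDuck) elapses, turning N single-row writes into one bulk write.

import time

class ReplicationBuffer:
    def __init__(self, flush_rows=4, flush_interval=0.2, sink=None):
        self.rows = []
        self.flush_rows = flush_rows
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.sink = sink or (lambda batch: None)  # stand-in for a DuckDB bulk insert
        self.flushed_batches = 0

    def append(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.flush_rows
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(self.rows)  # one bulk write for the whole batch
            self.flushed_batches += 1
            self.rows = []
        self.last_flush = time.monotonic()

buf = ReplicationBuffer()
for i in range(10):
    buf.append((i, f"row-{i}"))
buf.flush()  # drain the remainder
print(buf.flushed_batches)  # 3
```

With a threshold of 4 rows, ten appends produce two full batches plus one final partial flush, so the store sees three bulk writes instead of ten single-row ones.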

Additionally, we developed dedicated replication message parsers that write directly to the Arrow buffer, minimizing allocations.

A community contributor has validated and enhanced our approach: https://github.com/apecloud/myduckserver/pull/207

We plan to publish benchmarks on replication latency and query performance in the coming weeks. Stay tuned!





