tjgreen's comments

tjgreen · 2026-03-31T21:39:20 1774993160

A little birdie told me that efforts are underway to support the extension in Alloy, at least!

7thpower · 2026-04-01T01:26:51 1775006811

I hope there is an even smaller bird that will bring this to cloudsql for us plebs.

tjgreen · 2026-03-31T21:11:19 1774991479

There is indeed such a tradeoff. The architecture is designed with an eye towards making this tradeoff tunable (frequency of memtable spills, aggressiveness of compaction) but the work here is not yet finished. We chose to prioritize optimizing bulk-indexing and query performance for GA, since this is already enough for many applications. I'm excited to get to the point where we have brag-worthy benchmark numbers for high-frequency updates as well!

tjgreen · 2026-03-31T21:05:59 1774991159

Yes, hybrid search is one of the main current use cases we had in mind when developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies like Cohere devoted to just this step!). We're leaving this part, at least for now, up to application developers.
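For readers wondering what the "combine" step can look like on the application side, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a semantic ranking. This is not something the extension provides or prescribes; the function name, the document ids, and the constant k=60 are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; 60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists: ids from a keyword (BM25) query and a vector query.
keyword_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that BM25 scores and vector distances are on different scales. Whether it is the right choice for a given application is exactly the kind of judgment call being left to developers here.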

tjgreen · 2026-03-31T20:56:53 1774990613

You'll have to ask Supabase!

tjgreen · 2026-03-31T20:40:43 1774989643

Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.

tjgreen · 2026-03-31T20:43:30 1774989810

For a 2M-document dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap GitHub runners.

tjgreen · 2026-03-31T20:24:42 1774988682

Okay then!

nathanmills · 2026-03-31T23:19:46 1774999186

If you agree with something, you can just upvote it. I don't see what your comment adds to the conversation.

Ultimatt · 2026-04-01T07:06:42 1775027202

The irony of there being a downvote button too.

tjgreen · 2026-03-31T20:24:04 1774988644

I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.

Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.
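A rough sketch of the post-filtering pattern described above, written application-side in Python: fetch the top N ranked rows, keep only those passing the filter, and grow N when too few survive. The helper name, the `fetch_ranked` callback (a stand-in for a ranked query with a LIMIT), and the doubling strategy are all assumptions for illustration, not part of the extension.

```python
def top_k_with_post_filter(fetch_ranked, keep, k, max_limit=10_000):
    """Return up to k ranked rows that pass `keep`, over-fetching as needed.

    fetch_ranked(limit): returns rows ordered best-first, at most `limit` of them
                         (hypothetical stand-in for ORDER BY score ... LIMIT n).
    keep(row):           the post-filter predicate.
    """
    limit = k
    while True:
        rows = fetch_ranked(limit)
        matches = [row for row in rows if keep(row)]
        exhausted = len(rows) < limit  # the source has no more rows to give
        if len(matches) >= k or exhausted or limit >= max_limit:
            return matches[:k]
        limit = min(limit * 2, max_limit)  # widen the window and retry


# Toy data, already in ranked order; a real caller would run a query instead.
docs = [
    "postgres search",
    "postgres full text search",
    "text search",
    "full text index",
    "postgres text search tips",
]
required = {"postgres", "text"}
top = top_k_with_post_filter(
    fetch_ranked=lambda n: docs[:n],
    keep=lambda doc: required <= set(doc.split()),
    k=2,
)
print(top)
```

The retry loop is the "play with the LIMIT" part: the more selective the filter, the more rounds it takes, which is why pre-filtering becomes the better plan for very selective conditions.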

All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.

mbreese · 2026-03-31T21:14:02 1774991642

While we’re talking about filtering, is there a way to set a WHERE clause when you’re setting up the index? I’ve been working on this a lot recently for a hybrid vector search in pg. One of the things that I’m running up against is setting up a good BM25 index for a subset of a table (the WHERE clause). I have document subsets with very different word frequencies, so I’m trying to make sure that the search works within a given subset.

I think I can also set up partitions for this, but while you’re here… I’m very excited to start to roll this out.

tjgreen · 2026-03-31T21:24:34 1774992274

Partitions would be one option, and we've got pretty robust partitioned table support in the extension. (TimescaleDB uses partitioning for hypertables, so we had to front-load that support.) Expression indexes would be another option; that isn't done yet, but there is a community PR in flight: https://github.com/timescale/pg_textsearch/pull/154