Hacker Newsnew | past | comments | ask | show | jobs | submit | tjgreen's commentslogin

A little birdie told me that efforts are underway to support the extension in Alloy, at least!


I hope there is an even smaller bird that will bring this to cloudsql for us plebs.


There is indeed such a tradeoff. The architecture is designed with an eye towards making this tradeoff tunable (frequency of memtable spills, aggressiveness of compaction) but the work here is not yet finished. We chose to prioritize optimizing bulk-indexing and query performance for GA, since this is already enough for many applications. I'm excited to get to the point where we have brag-worthy benchmark numbers for high-frequency updates as well!


Yes, hybrid search is one of the main current use cases we had in mind developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies like Cohere devoted to just this step!). We're leaving this part, at least for now, up to application developers.


You'll have to ask Supabase!


Yep, there are numbers in the blog post and repo. We are able to index MS-MARCO v2 (138M documents, around 50GB of raw data) in a bit under 18 minutes.


For 2M scale dataset, you should be able to index in about 1 minute on low-end hardware. See the MS-MARCO v1 (8M documents) numbers, measured on cheap Github runners.


Okay then!


If you agree with something, you can just upvote it. I don't see what your comment adds to the conversation.


The irony of there being a downvote button too.

I actually don't love this example either, for the reasons you mention, but at some point we had questions about how to filter based on numeric ranking. Thanks for the reminder to revisit this.

Re filtering, there are often reasonable workarounds in the SQL context that caused me to deprioritize this for GA. With your example, the workaround is to apply post-filtering to select just matches with all desired terms. This is not ideal ergonomics since you may have to play with the LIMIT that you'll need to get enough results, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering by those conditions and then ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.

All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.


While we’re talking about filtering — is there a way to set a WHERE clause when you’re setting up the index? I’ve been working on this a lot recently for a hybrid vector search in pg. One of the things that I’m running up against is setting a good BM25 index for a subset of a table (the where clause). I have a document subsets with very different word frequencies, so I’m trying to make sure that the search works on a set subset.

I think I can also setup partitions for this, but while you’re here… I’m very excited to start to roll this out.


Partitions would be one option, and we've got pretty robust partitioned table support in the extension. (Timescaledb uses partitioning for hypertables, so we had to front-load that support). Expression indexes would be another option, not yet done but there is a community PR in flight: https://github.com/timescale/pg_textsearch/pull/154


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: