Method Chaining in Pandas: Bad Form or a Recipe for Success? (davidamos.dev)
13 points by da12 on Nov 2, 2022 | 14 comments


The problem with this snippet isn't really the chaining; it's all the inlining. All the lists, and the many lambdas used, could be variables. Does this approach make it "professional code"?

The responses seem out of context, too:

>David: What's the elevator pitch for writing pandas code the way that you do?

>Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled1.ipynb and Untitled2.ipynb[...]. My goal is to help with that so (...) you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive. Your code will be easier to read[...].

This is a tweet. Filenames aren't even argued. This doesn't answer the interviewer's question either. Writing code != naming files.

>David: What is it that separates beginner pandas code from professional pandas code?

>Matt: I would say that if you want to write good pandas code (...) you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking (...) is super useful in pandas world.

Absolutely. But professionals use variables, too. Possibly even more so.


> In my 20-plus years of working with data, I have multiple steps and I don't care about the intermediate steps.

Oh boy, do I care about every single intermediate step though!

Especially in pandas, where we play "where's the NaN" all the damn time.


I've done chaining myself and seen other people do it as well. The folks writing these massive functions may think they are gurus, but it makes the functions virtually impossible to debug in prod. It flies in the face of the wisdom of "make your functions small."

I think this is one area where pandas and Polars can be improved. How do you write long chains and slot in breaks and testing?
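
For what it's worth, one way to slot checks into a long chain is pandas' built-in DataFrame.pipe, which hands the intermediate frame to any function and carries on with whatever that function returns. A minimal sketch; the checkpoint and no_nans helpers and the toy columns are made up for illustration and are not from the article:

```python
import pandas as pd


def checkpoint(df, label, log):
    """Record shape and total NaN count at this point in the chain, then pass df through."""
    log.append((label, df.shape, int(df.isna().sum().sum())))
    return df


def no_nans(df, column):
    """Fail fast if the named column contains NaN; otherwise pass df through unchanged."""
    assert df[column].notna().all(), f"NaN found in {column!r}"
    return df


# Hypothetical toy data with one missing price.
raw = pd.DataFrame({"price": [100.0, None, 250.0], "qty": [1, 2, 3]})
log = []

result = (
    raw
    .pipe(checkpoint, "raw", log)            # inspect the input
    .dropna(subset=["price"])
    .pipe(checkpoint, "after dropna", log)   # inspect an intermediate step
    .assign(total=lambda d: d["price"] * d["qty"])
    .pipe(no_nans, "total")                  # test inside the chain
)
```

Afterwards, log holds the shape and NaN count at each checkpoint, so the intermediate steps can be examined without breaking the chain apart.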


I had a whole rant queued up on "Pandas and its consequences have been a disaster for the human race" (well, at least for newbie programmers), but I think instead I want to focus on the damn dictionary splats. I just don't get it - it's pure "clever" code in the pejorative Dijkstra sense. It's hard to edit, it's hard to typecheck. Why not pay the very low whitespace tax to give each key/value pair its own longhand line:

  .astype({
    'central_air': bool,
    'ms_subclass': 'uint8',
    ...
  })
Now if, say, ms_subclass and overall_qual need different types, that's an easy diff to read. Ah, but I suppose that wouldn't be as Twitter-friendly.
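
To make the comparison concrete, here is a rough sketch of both spellings on a toy frame. The data values are invented, and the splat version is only an assumption about the kind of pattern being criticized, not a quote from the article:

```python
import pandas as pd

# Hypothetical values; only the column names come from the discussion above.
df = pd.DataFrame({
    "central_air": [1, 0],
    "ms_subclass": [20, 60],
    "overall_qual": [7, 5],
})

# Splat style: compact, but every column in the inner list is tied to one dtype.
splat = df.astype({
    "central_air": bool,
    **{col: "uint8" for col in ["ms_subclass", "overall_qual"]},
})

# Longhand style: one key/value pair per line, so changing a single
# column's dtype later is a one-line diff.
longhand = df.astype({
    "central_air": bool,
    "ms_subclass": "uint8",
    "overall_qual": "uint8",
})
```

Both calls produce the same dtypes; the difference is purely in how the next edit reads.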


I’m on the same page about formatting, and it’s not a pandas thing. (It doesn’t actually look like you’re saying it’s only a pandas thing.) The same formatting patterns show up in every language I’ve seen, and quite often in SQL.

It’s just a bad programming habit thing.


Random lists of strings are hard to decipher. What is that set of values supposed to represent? And it interrupts the flow of figuring out what’s going on.

I prefer assigning lists like that to an informatively named variable rather than leaving them the subject of speculation. It’s easier to add clarifying comments that way too.

In SQL or pandas, long lists of values that aren’t broken up are hard to read. It’s easy to scan down a column with a single value on each row, not values of random length spread randomly across the screen.

Also, that is chaining far too much in a single go.
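
A small sketch of what I mean by naming the list and putting one value per row. The variable name, the neighborhood values, and the toy frame are all hypothetical:

```python
import pandas as pd

# Hypothetical: neighborhoods excluded from the report.
# One value per row, so additions and removals are easy to scan and diff.
EXCLUDED_NEIGHBORHOODS = [
    "NAmes",
    "OldTown",
]

houses = pd.DataFrame({
    "neighborhood": ["NAmes", "Gilbert", "OldTown"],
    "price": [150_000, 210_000, 120_000],
})

# The filter now says what it does instead of showing a bare list of strings.
kept = houses.loc[~houses["neighborhood"].isin(EXCLUDED_NEIGHBORHOODS)]
```

The name and the comment carry the intent, and the filtering line stays short enough to read in one pass.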


I personally don’t like method chaining in Pandas because it makes troubleshooting difficult for me. On the other hand I love piping functions in tidyverse in R. I think there are a few libraries in Python that bring pipes to Pandas. I haven’t used any though so can’t comment on their usefulness.

Edit: Here is a library that brings pipes to pandas https://github.com/pwwang/datar


I love the function chaining. It's basically functional programming with "immutable" intermediates (yes I know they're not really immutable, but we don't modify them in place).

Another good example of this style is tf.data pipelines. Also a very nice API.


Why does pandas code often feel ugly and clunky compared to the equivalent SQL? Is there no better way to do this?


I find Pandas and SQL to be complementary, rather than an either-or type of situation. For anything in the tens of GB range or smaller, it’s easy enough to move between the two with read_sql_query and to_sql.

The general strategy is to build the core of any dataset as a SQL query that handles joins and performance-sensitive parts of the query, then polish/plot/yeet into weird shapes with Pandas since it offers much greater expressivity.
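
Roughly, that workflow looks like the sketch below. It assumes an in-memory SQLite database via the standard library's sqlite3 module; the table and column names are invented for illustration:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical source tables, loaded with to_sql so the example is self-contained.
pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "amount": [5.0, 7.5, 2.0],
}).to_sql("orders", conn, index=False)
pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["east", "west"],
}).to_sql("customers", conn, index=False)

# Core dataset: let SQL handle the join.
query = """
    SELECT c.region, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
"""
core = pd.read_sql_query(query, conn)

# Polish in pandas: aggregate per region and add each region's share of the total.
summary = (
    core
    .groupby("region", as_index=False)["amount"].sum()
    .assign(share=lambda d: d["amount"] / d["amount"].sum())
)

# Write the result back and confirm it landed.
summary.to_sql("region_summary", conn, index=False)
row_count = pd.read_sql_query("SELECT COUNT(*) AS n FROM region_summary", conn)
conn.close()
```

With a real database you would swap the sqlite3 connection for whatever connection your setup uses; the read_sql_query and to_sql calls stay the same shape.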


What bugs me about pandas is that it is so copy-heavy. I just wanted to know if there was some Pythonic way to get performance without just writing normal SQL.


Any specific examples you have in mind?



Quite frankly this is unreadable and unmaintainable code.

He doesn't articulate any of the virtues of it either, aside from some hand waving about 'memory' that doesn't get fleshed out.



