The problem with this snippet isn't really the chaining; it's all the inlining. All the lists, and the many lambdas used, could be variables. Does this approach make it "professional code"?
The responses seem out of context, too:
>David: What's the elevator pitch for writing pandas code the way that you do?
>Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled1.ipynb and Untitled2.ipynb[...]. My goal is to help with that so (...) you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive. Your code will be easier to read[...].
This is a tweet. Filenames aren't even argued. This doesn't answer the interviewer's question either. Writing code != naming files.
>David: What is it that separates beginner pandas code from professional pandas code?
>Matt: I would say that if you want to write good pandas code (...) you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking (...) is super useful in pandas world.
Absolutely. But professionals use variables, too. Possibly even more so.
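To make the point about variables concrete, here is a minimal sketch of a chain where the list and the lambdas have been pulled out and named. The data, column names, and helper names are all hypothetical, made up for illustration; this is not the snippet under discussion.

```python
import pandas as pd

# Hypothetical data and column names, purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "east"],
        "units": [10, 3, 7, 5],
        "unit_price": [2.5, 4.0, 2.5, 1.0],
    }
)

# A named list instead of an anonymous inline literal.
REGIONS_OF_INTEREST = ["north", "south"]


def in_regions_of_interest(df: pd.DataFrame) -> pd.Series:
    """Boolean mask for rows in the regions we care about."""
    return df["region"].isin(REGIONS_OF_INTEREST)


def revenue(df: pd.DataFrame) -> pd.Series:
    """Revenue per row: units sold times unit price."""
    return df["units"] * df["unit_price"]


# The chain itself stays short because the details live behind names.
revenue_by_region = (
    sales
    .loc[in_regions_of_interest]   # .loc accepts a callable that returns a mask
    .assign(revenue=revenue)       # .assign accepts a callable per new column
    .groupby("region")["revenue"]
    .sum()
)
```

The chaining is still there; what changed is that each piece can be read, commented, and tested on its own.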
I've done chaining myself and seen people do it as well. The folks writing these massive functions may think they are gurus, but it makes functions virtually impossible to debug in prod. It flies in the face of the wisdom of "make your functions small".
I think this is one area where pandas and Polars can be improved. How do you write long chains and slot in breaks and testing?
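One possible answer, at least on the pandas side, is `DataFrame.pipe` with small pass-through helpers. The sketch below assumes two hypothetical helpers, `check` and `peek`, which are not part of pandas; the data is made up as well.

```python
import pandas as pd


def check(df: pd.DataFrame, predicate, message: str) -> pd.DataFrame:
    """Assert something about the intermediate frame, then pass it through unchanged."""
    assert predicate(df), message
    return df


def peek(df: pd.DataFrame, label: str, log: list) -> pd.DataFrame:
    """Record the shape at this point in the chain; also a handy place for a breakpoint."""
    log.append((label, df.shape))
    return df


# Hypothetical input, for illustration only.
raw = pd.DataFrame({"user": ["a", "b", "b", None], "spend": [10.0, 5.0, 7.0, 3.0]})
shapes = []

spend_per_user = (
    raw
    .pipe(peek, "raw", shapes)
    .dropna(subset=["user"])
    .pipe(check, lambda df: df["user"].notna().all(), "null users survived dropna")
    .pipe(peek, "after dropna", shapes)
    .groupby("user", as_index=False)["spend"]
    .sum()
    .pipe(check, lambda df: df["user"].is_unique, "duplicate users after groupby")
)
```

This doesn't make the chain smaller, but it does give you places to stop, inspect, and fail loudly without breaking the chain apart.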
I had a whole rant queued up on "Pandas and its consequences have been a disaster for the human race" (well, at least for newbie programmers), but I think instead I want to focus on the damn dictionary splats. I just don't get it - it's pure "clever" code in the pejorative Dijkstra sense. It's hard to edit, it's hard to typecheck. Why not pay the very low whitespace tax to give each key/value pair its own longhand line:
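As a rough sketch of that contrast (not the code from the article; the frame and column names are hypothetical), here is the same `assign` call written both ways:

```python
import pandas as pd

# Hypothetical frame and column names, for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 3]})

# "Clever" version: build the keyword arguments in a comprehension and splat them.
# Note the col=col default, needed to dodge Python's late-binding closures.
splatted = df.assign(
    **{f"{col}_doubled": lambda d, col=col: d[col] * 2 for col in ["price", "qty"]}
)

# Longhand version: one key/value pair per line.
longhand = df.assign(
    price_doubled=lambda d: d["price"] * 2,
    qty_doubled=lambda d: d["qty"] * 2,
)
```

Both produce the same frame. The longhand version costs one extra line here, and each new column name is visible to a reader, a grep, and a type checker.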
I’m on the same page about formatting; it’s not a pandas thing. (It doesn’t actually look like you’re saying it’s only a pandas thing.) The same formatting patterns show up in every language I’ve seen, SQL quite often.
Random lists of strings are hard to decipher. What is that set of values supposed to represent? And it interrupts the flow of figuring out what’s going on.
I prefer assigning lists like that to informatively named variables rather than leaving them the subject of speculation. It’s easier to add clarifying comments that way too.
In SQL or pandas, long lists of values that aren’t broken up are hard to read. It’s easy to scan down a column with a single value on each row, not random-length values spread randomly across the screen.
I personally don’t like method chaining in Pandas because it makes troubleshooting difficult for me. On the other hand I love piping functions in tidyverse in R. I think there are a few libraries in Python that bring pipes to Pandas. I haven’t used any though so can’t comment on their usefulness.
I love the function chaining. It's basically functional programming with "immutable" intermediates (yes I know they're not really immutable, but we don't modify them in place).
Another good example of this style is tf.data pipelines. Also a very nice API.
I find Pandas and SQL to be complementary, rather than an either-or type situation. For anything in the tens of GB range or smaller, it’s easy enough to move between the two with read_sql_query and to_sql.
The general strategy is to build the core of any dataset as a SQL query that handles joins and performance-sensitive parts of the query, then polish/plot/yeet into weird shapes with Pandas since it offers much greater expressivity.
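A minimal sketch of that round trip, assuming an in-memory SQLite database and made-up tables (`customers`, `orders`) so it runs anywhere; a real setup would point at an actual database connection instead:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical tables, for illustration only.
pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]}).to_sql(
    "customers", conn, index=False
)
pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.0]}).to_sql(
    "orders", conn, index=False
)

# The join and aggregation happen in SQL.
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = pd.read_sql_query(query, conn)

# The polishing happens in pandas.
report = totals.assign(share=lambda d: d["total"] / d["total"].sum())

# And the result can be written back for the next query to use.
report.to_sql("customer_report", conn, index=False)
conn.close()
```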
What bugs me about pandas is that it is so copy-heavy. I just wanted to know if there is some Pythonic way to get performance without just writing normal SQL.