
I have been on the saving end of some of these AWS mistakes (billing/usage alerts are important, people), but alerts alone aren't always enough to prevent them when the spend ramps up fast:
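For context, the billing alert I mean is just a CloudWatch alarm on AWS's EstimatedCharges metric. A minimal sketch of the alarm parameters (the threshold and SNS topic ARN are made up; you'd pass the dict to boto3's `put_metric_alarm`):

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build kwargs for a CloudWatch alarm on estimated monthly charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",
        "Namespace": "AWS/Billing",      # billing metrics live in us-east-1 only
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,                 # 6h; the metric updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn], # e.g. an SNS topic that pages someone
    }

# Usage (hypothetical account/topic):
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(
#     **billing_alarm_params(500, "arn:aws:sns:us-east-1:123456789012:billing"))
```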

- Not stopping AWS Glue with an insane number of DPUs attached when it wasn't in use. Quote: "I don't know, I just attached as many as possible so it would go faster when I needed it."

- Bad queueing of jobs that got deployed to production before the end-of-month reports came out. Ticket quote: "sometimes jobs freeze and stall, kick up 2 retry instances, then fail." Not a big problem in the middle of the month, when there was only one job a week. Then the end of the month comes along, 100+ users auto-upload data to be processed, and 100+ c5.12xlarge instances spin up; each should finish in ~2 minutes, but instead they hang overnight and spin up 2 retry instances apiece.

- Bad queueing of data transfer (I am sensing a queueing problem) that drove database usage so high that the r5.24xlarge (one big DB for everything) autoscaled to its limit of 40 instances.
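The fix for the retry storm was a hard ceiling plus a per-attempt timeout, so a stalled job fails fast instead of holding a c5.12xlarge overnight. A rough sketch of the pattern (function and parameter names are mine, not from the actual codebase):

```python
def process_with_limits(job, max_retries=2, attempt_timeout_s=180):
    """Run `job` with a retry ceiling and a per-attempt timeout.

    `job` is expected to raise TimeoutError when it stalls past
    attempt_timeout_s.  After max_retries extra attempts we surface
    the failure to a human instead of spawning more instances.
    """
    last_err = None
    for attempt in range(1 + max_retries):
        try:
            return job(timeout=attempt_timeout_s)
        except TimeoutError as err:
            last_err = err  # stalled attempt: kill it and maybe retry
    # 1 original + max_retries attempts exhausted: stop, don't respawn
    raise RuntimeError(f"gave up after {1 + max_retries} attempts") from last_err
```

With 100 uploads and 2 retries each, the ceiling caps the blast radius at 300 short-lived attempts rather than 300 instances hanging until morning.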


