this post was submitted on 22 Jul 2024
258 points (95.1% liked)

Programming

[–] quinkin@lemmy.world 141 points 3 months ago (1 children)

If a single person can make the system fail then the system has already failed.

[–] corsicanguppy@lemmy.ca 21 points 3 months ago (1 children)

If a single outsider can make your system fail, then it's already failed.

Now consider that in the context of supply-chain tuberculosis like npm.

[–] slazer2au@lemmy.world 11 points 3 months ago

Left-pad, right?

[–] eager_eagle@lemmy.world 134 points 3 months ago (14 children)

Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.

[–] over_clox@lemmy.world 19 points 3 months ago* (last edited 3 months ago) (1 children)

Hey, why not just ask Dave Plummer, former Windows developer...

https://youtube.com/watch?v=wAzEJxOo1ts

The numbers I've read so far vary significantly, anywhere from 8.5 million to over a billion systems went down. Either way, that's far too much failure for a single borked update to a kernel-level driver, one not even made by Microsoft.

[–] eager_eagle@lemmy.world 7 points 3 months ago

that's a huge sign that their rollout process is garbage
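For what it's worth, the non-garbage version is well known: a staged ("canary") rollout that halts on the first health-check failure. A hypothetical sketch (the ring sizes and the health callback are made up, and this says nothing about CrowdStrike's actual pipeline):

```python
import random

# Hypothetical staged rollout: push an update to successively larger
# rings, and only promote it if every host updated so far stays healthy.
ROLLOUT_RINGS = [0.001, 0.01, 0.10, 1.00]  # fraction of the fleet per stage

def rollout(fleet, is_healthy):
    """Push to each ring in turn; abort on the first failure signal."""
    shuffled = random.sample(fleet, len(fleet))  # random ring assignment
    done = 0
    for fraction in ROLLOUT_RINGS:
        target = int(len(fleet) * fraction)
        for host in shuffled[done:target]:
            host["updated"] = True
        if not all(is_healthy(h) for h in shuffled[:target]):
            return False  # halt: most of the fleet was never touched
        done = target
    return True
```

With rings like these, an update that bricks every machine it lands on stops after 0.1% of the fleet instead of 100%.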

[–] people_are_cute@lemmy.sdf.org 19 points 3 months ago (1 children)

That means spending time and money on developing such a system, which means increasing costs in the short term, which is kryptonite for current-day CEOs.

[–] eager_eagle@lemmy.world 8 points 3 months ago* (last edited 3 months ago)

Right. More than money, I say it's about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.

[–] azertyfun@sh.itjust.works 18 points 3 months ago* (last edited 3 months ago) (8 children)

I strongly believe in no-blame mindsets, but "blame" is not the same as "consequences", and lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution if the problem lies in that "low likelihood, big impact" corner.
Because odds are that when the problem happens (again), they'll be able to sweep it under the rug (again), or will have moved on to greener pastures.

What the author of the article suggests is actually a potential fix; if developers (in a broad sense of the word and including POs and such) were accountable (both responsible and empowered) then they would have the power to say No to shortsighted management decisions (and/or deflect the blame in a way that would actually stick to whoever went against an engineer's recommendation).

[–] solrize@lemmy.world 82 points 3 months ago (1 children)

Note: Dmitry Kudryavtsev is the article author and he argues that the real blame should go to the Crowdstrike CEO and other higher-ups.

[–] mac@programming.dev 23 points 3 months ago

Edited the title to have a "by" in front to make that a bit more clear

[–] iAvicenna@lemmy.world 52 points 3 months ago (1 children)

Sure, it is the dev who is to blame, and not the clueless managers who evaluate devs by the number of commits/reviews per day, or the CEOs who think such managers are on top of their game.

[–] Kissaki@programming.dev 7 points 3 months ago (1 children)

Is that the case at CrowdStrike?

[–] iAvicenna@lemmy.world 12 points 3 months ago (1 children)

I don't have any information on that; this was more a criticism of where the world seems to be heading.

[–] Kissaki@programming.dev 51 points 3 months ago (4 children)

CrowdStrike ToS, section 8.6 Disclaimer

[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]

It's about safety, but it's truly ironic how it mentions aircraft-related systems twice, plus communication systems (very broad).

It certainly doesn't inspire confidence in the overall stability. But it's also general ToS-speak, and may only be noteworthy now, after the fact.

[–] goferking0@lemmy.sdf.org 10 points 3 months ago (1 children)

Weren't the issues at airports caused by the ticketing and scheduling systems going down, not anything to do with aircraft?

[–] sukhmel@programming.dev 7 points 3 months ago (1 children)

That's just covering up, like a disclaimer that your software is intended to only be used on 29ᵗʰ of February. You don't expect anyone to follow that rule, but you expect the court to rule that the user is at fault.

Luckily, it doesn't always work that way, but we will see how it turns out this time

[–] ByteOnBikes@slrpnk.net 45 points 3 months ago (2 children)

It's never a single person who caused a failure.

[–] jonne@infosec.pub 4 points 3 months ago

Yeah, exactly. You'd think they'd have a test suite to run before pushing an update, or do a staggered rollout where they only push it to a sample of machines first. Just blaming one guy because you had an inadequate UAT process is ridiculous.

[–] over_clox@lemmy.world 40 points 3 months ago* (last edited 3 months ago)

I hope this incident shines more light on the greedy rich CEOs and the corners they cut, the taxes they owe, the underpaid employees and understaffed facilities, and now probably some hefty fines, just a slap on the wrist, of course.

[–] polle@feddit.org 27 points 3 months ago (1 children)
[–] trolololol@lemmy.world 13 points 3 months ago (1 children)

OMG, the article conflates kernel API calls with kernel drivers, which is what CrowdStrike actually ships. I refuse to read it to the end.

[–] MTK@lemmy.world 25 points 3 months ago* (last edited 3 months ago) (1 children)

If only we had terms for environments that were meant for testing, staging, and early release, before moving over to the servers that are critical...

I know it's crazy, really a new system that only I came up with (or at least I can sell that to CrowdStrike as it seems)

[–] gedhrel@lemmy.world 22 points 3 months ago (1 children)

Check Crowdstrike's blurb about the 1-10-60 rule.

You can bet that they have a KPI that says they can deliver a patch in under 15m; that can preclude testing.

Although testing would have caught it, what happened here is that 40 kB of nulls got signed and delivered as config. Which means that unparseable config on the path from C&C to ring 0 could cause a crash, and that path was never covered by a test.

It's a hell of a miss, even if you're prepared to accept the argument about testing on the critical path.

(There is an argument that in some cases you want security systems to fail closed; however that's an extreme case - PoS systems don't fall into it - and you want to opt into that explicitly, not stumble into it via a test omission.)
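A user-space validator for that path takes only a handful of checks. This is a sketch with an invented header format, not CrowdStrike's actual channel-file layout; the point is that a blob of nulls should fail closed at parse time instead of reaching kernel-mode code:

```python
import struct

MAGIC = b"CHNL"  # hypothetical 4-byte magic for an update/channel file

class BadChannelFile(ValueError):
    """Raised when an update blob fails validation in user space."""

def parse_channel_file(blob: bytes) -> bytes:
    """Validate an update blob before handing it to a privileged component.

    A 40 kB run of zeroes fails the very first content check (bad magic)
    instead of being dereferenced by ring-0 code.
    """
    if len(blob) < 8:
        raise BadChannelFile("truncated header")
    magic, length = struct.unpack_from("<4sI", blob)  # magic + payload length
    if magic != MAGIC:
        raise BadChannelFile("bad magic")
    if length != len(blob) - 8:
        raise BadChannelFile("length mismatch")
    return blob[8:]  # payload, now known to be well-formed at this level
```

Rejecting garbage here means the worst case is a skipped update, not a boot loop.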

[–] Mischala@lemmy.nz 20 points 3 months ago (1 children)

That's the crazy thing. This config can't ever have been booted on a win10/11 machine before it was deployed to the entire world.

Not once, during development of the new rule, or in any sort of testing CS does. Then once again, never booted by MS during whatever verification process they (should) have before signing.

The first win11/10 machine to execute this code in the way it was intended to be used was a customer's machine.

Insane.

[–] gedhrel@lemmy.world 5 points 3 months ago (4 children)

Possibly the thing that was intended to be deployed was tested. What got pushed out was 40 kB of all zeroes; it could've been corrupted somewhere down the CI chain.

[–] Kissaki@programming.dev 24 points 3 months ago* (last edited 3 months ago) (2 children)

It's a systematic multi-layered problem.

The simplest, least-effort thing that could have prevented this scale of issues is not installing updates automatically, but waiting four days and triggering the install afterwards if no issues have surfaced.

Automatically forwarding updates means automatically forwarding risk. The bigger the impact area, the more worthwhile safeguards become.

Testing/staging or partial successive rollouts could also have mitigated a large number of issues, but they require more investment.
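A delay policy like that fits in a few lines. A minimal sketch, assuming a hypothetical four-day soak window and a recall list; this reflects nothing about CrowdStrike's actual update mechanism:

```python
from datetime import datetime, timedelta

SOAK_DAYS = 4  # hold updates back this long before auto-installing

def should_install(published_at: datetime, now: datetime,
                   known_bad: set[str], update_id: str) -> bool:
    """Install only once an update has soaked without being recalled."""
    if update_id in known_bad:
        return False  # vendor pulled it during the soak window
    return now - published_at >= timedelta(days=SOAK_DAYS)
```

The trade-off the thread discusses is real: the soak window is also four extra days of exposure to whatever the update defends against.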

[–] wizardbeard@lemmy.dbzer0.com 11 points 3 months ago (2 children)

The update that crashed things was an anti-malware definitions update. Crowdstrike offers no way to delay or stage these (they are downloaded automatically as soon as they are available), and there's a good reason for not wanting to delay definition updates: it leaves you vulnerable to known malware for longer.

[–] merc@sh.itjust.works 7 points 3 months ago

And there's a better reason for wanting to delay definition updates: this outage.

[–] Gestrid@lemmy.ca 5 points 3 months ago (2 children)

Four days for an update to malware definitions is how computers get infected with malware. But you're right that they should at least do some sort of simple test. "Does the machine boot, and are its files not getting overzealously deleted?"

[–] BB_C@programming.dev 23 points 3 months ago

Yesterday I was browsing /r/programming

:tabclose

[–] Seasm0ke@lemmy.world 20 points 3 months ago

Reading between the lines, Crowdstrike is certainly going to be sued for damages. Putting a dev on the hook means nobody gets - or pays - anything, so long as one guy's life gets absolutely ruined. Great system.

[–] hamid@vegantheoryclub.org 17 points 3 months ago* (last edited 3 months ago)

Crowdstrike CEO should go to jail. The corporation should get the death sentence.

Edit: For the downvoters: they for real negligently designed a system that kills people when it fails. The CEO, as an officer of the company, holds liability. If corporations want rights like people, then when they are grossly negligent they should be punished like people. We can't put them in jail, so they should be forced to divest their assets and be "killed." This doesn't even sound radical to me; this sounds like a basic safeguard against corporate overreach.

[–] luciole@beehaw.org 14 points 3 months ago (3 children)

That is a lot of bile, even for a rant. Agreed that it's nonsensical to blame the dev, though. This is software; human error should not be enough to cause such massive damage. The real question is: what's wrong with the test suites? Did someone consciously decide the team would skimp on them?

As for blame, if we take the word of Crowdstrike's CEO, then there is no individual negligence nor malice involved. Therefore it is the company's responsibility as a whole, plain and simple.

[–] thingsiplay@beehaw.org 10 points 3 months ago

Real question is: what’s wrong with the test suites?

This is what I'm asking myself too. If they had tested it, and they should have, then this massive error would not have happened: a) controlled test suites and machines in their labs, b) at least one test machine connected through the internet and acting like a customer, checked by a real human, c) updates rolled out in waves throughout the day. They can't tell me that they did all three of these steps.

[–] Hector_McG@programming.dev 9 points 3 months ago

Therefore it is the company's responsibility as a whole.

The governance of the company as a whole is the CEO's responsibility. Thus a company-wide failure is 100% the CEO's fault.

If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.

[–] Mubelotix@jlai.lu 13 points 3 months ago

I blame the users for using that software in the first place

[–] corsicanguppy@lemmy.ca 11 points 3 months ago (1 children)

We don't blame the leopards who ate the guy's face. We blame the guy who stuck his face near the leopards.

[–] MonkderVierte@lemmy.ml 6 points 3 months ago* (last edited 3 months ago)

That one, then go up the chain of command.
