this post was submitted on 19 Jul 2024

1191 points (99.5% liked)

Major IT outage affecting banks, airlines, media outlets across the world (www.abc.net.au)

submitted 4 months ago* (last edited 4 months ago) by [email protected] to c/[email protected]


All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It's all very exciting, personally, as someone not responsible for fixing it.

Apparently caused by a bad CrowdStrike update.

Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We'll see if that changes over the weekend...


[–] [email protected] 216 points 4 months ago (8 children)

Reading into the updates some more... I'm starting to think this might just destroy CrowdStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don't see how they survive this. Just absolutely catastrophic on all fronts.

[–] [email protected] 127 points 4 months ago (1 children)

If all the computers stuck in boot loop can't be recovered... yeah, that's a lot of cost for a lot of businesses. Add to that all the immediate impact of missed flights and who knows what happening at the hospitals. Nightmare scenario if you're responsible for it.

This sort of thing is exactly why you push updates to groups in stages, not to everything all at once.
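For anyone who hasn't seen one, here's a minimal sketch of what a staged rollout could look like. Everything in it is hypothetical (the `staged_rollout` function, the `deploy` and `is_healthy` callbacks, the stage fractions); it is not how CrowdStrike or any particular vendor ships updates, just the general idea of widening the rollout only while the hosts updated so far stay healthy.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Iterable[float],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to progressively larger groups, halting at the first unhealthy stage.

    Returns the hosts that received the update. All names are illustrative.
    """
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative target: 0.01 means the first 1% of hosts, 1.0 means all of them.
        target = max(1, int(len(hosts) * fraction)) if hosts else 0
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        # Check every host updated so far before widening the rollout.
        if not all(is_healthy(h) for h in updated):
            break
    return updated


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(100)]
    # Simulate a bad update: every host that receives it becomes unhealthy.
    hit = staged_rollout(fleet, (0.01, 0.1, 1.0), lambda h: None, lambda h: False)
    print(f"{len(hit)} of {len(fleet)} hosts affected")
```

With a bad update and a 1% first stage, the damage stops at the first group instead of reaching the whole fleet.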

[–] [email protected] 77 points 4 months ago (2 children)

Looks like the laptops are able to be recovered with a bit of finagling, so fortunately they haven't bricked everything.

And yeah staged updates or even just... some testing? Not sure how this one slipped through.

[–] [email protected] 130 points 4 months ago (1 children)

Not sure how this one slipped through.

I'd bet my ass this was caused by terrible practices brought on by suits demanding more "efficient" releases.

"Why do we do so much testing before releases? Have we ever had any problems before? We're wasting so much time that I might not even be able to buy another yacht this year"

[–] [email protected] 25 points 4 months ago (1 children)

At least nothing like this happens in the airline industry

[–] [email protected] 42 points 4 months ago

Certainly not! Or other industries for that matter. It's a good thing executives everywhere aren't just concentrating on squeezing the maximum amount of money out of their companies and funneling it to themselves and their buddies on the board.

Sure, let's "rightsize" the company by firing 20% of our workforce (but not management!) and raise prices 30%, and demand that the remaining employees maintain productivity at the level it used to be before we fucked things up. Oh and no raises for the plebs, we can't afford it. Maybe a pizza party? One slice per employee though.

[–] [email protected] 3 points 4 months ago

One of my coworkers, while waiting on hold for 3+ hours with our company’s outsourced helpdesk, noticed after booting into safe mode that the Crowdstrike update had triggered a snapshot that she was able to roll back to and get back on her laptop. So at least that’s a potential solution.

[–] [email protected] 48 points 4 months ago (2 children)

Agreed, this will probably kill them over the next few years unless they can really magic up something.

They probably don't get sued - their contracts will have indemnity clauses against exactly this kind of thing, so unless they seriously misrepresented what their product does, this probably isn't a contract breach.

If you are running CrowdStrike, it's probably because you have some regulatory obligations and an auditor to appease. You aren't going to be able to just turn it off overnight, but I'm sure there are going to be some pretty awkward meetings when it comes to contract renewals in the next year, and I can't imagine them seeing much growth.

[–] [email protected] 22 points 4 months ago (2 children)

Nah. This has happened with every major corporate antivirus product. Multiple times. And the top IT people advising on purchasing decisions know this.

[–] [email protected] 13 points 4 months ago

Yep. This is just uninformed people thinking this doesn't happen. It's been happening since AV was born. It's not new, and this will not kill CS; they're still king.

[–] [email protected] 2 points 4 months ago

At my old shop we still had people giving money to checkpoint and splunk, despite numerous problems and a huge cost, because they had favourites.

[–] [email protected] 6 points 4 months ago* (last edited 4 months ago)

Don't most indemnity clauses have exceptions for gross negligence? Pushing out an update this destructive without it getting caught by any quality control checks sure seems grossly negligent.

[–] [email protected] 40 points 4 months ago (4 children)

It's just amateur hour across the board. Were they testing in production? No code review or even a peer review? They rolled out on a Friday? It's like basic-level startup "here's what not to do" type shit that a junior dev fresh out of university would know. It's like "explain to the project manager with crayons why you shouldn't do this" type of shit.

It just boggles my mind that if you're rolling out an update to production that there was clearly no testing. There was no review of code cause experts are saying it was the result of poorly written code.

Regardless, if you're low-level security then apparently you can just boot into safe mode and rename the CrowdStrike folder, and that should fix it. Higher level, not so much, because you're likely on BitLocker, which... yeah, don't get me started on that bullshit.

Regardless, I called out of work today. No point. It's Friday, generally nothing gets done on Fridays (cause we know better), and especially today nothing is going to get done.

[–] [email protected] 11 points 4 months ago

explain to the project manager with crayons why you shouldn't do this

Can't; the project manager ate all the crayons

[–] [email protected] 3 points 4 months ago (2 children)

Why is it bad to do on a Friday? Based on your last paragraph, I would have thought Friday is probably the best week day to do it.

[–] [email protected] 21 points 4 months ago

Most companies, mine included, try to roll out updates at the start or middle of the week. That way, if there are issues, the full team is available to address them.

[–] [email protected] 5 points 4 months ago (1 children)

Because if you roll out something to production on a Friday, who's there to fix it on Saturday and Sunday if it breaks? Friday is the WORST day of the week to roll anything out. You roll out on Tuesday or Wednesday; that way, if something breaks, you've got people around to jump in and fix it.

[–] [email protected] 2 points 4 months ago

And hence the term read-only Friday.
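If you wanted to enforce that in a pipeline rather than by convention, a minimal sketch could look like the one below. The `deploy_allowed` function and the choice of blocked days are assumptions made up for illustration, not any real tool's policy.

```python
from datetime import date

# date.weekday() numbers Monday as 0, so Friday, Saturday, Sunday are 4, 5, 6.
# Blocking the weekend as well as Friday is an illustrative choice.
BLOCKED_WEEKDAYS = {4, 5, 6}


def deploy_allowed(day: date) -> bool:
    """Return True if a production rollout may start on the given day."""
    return day.weekday() not in BLOCKED_WEEKDAYS


if __name__ == "__main__":
    # 19 Jul 2024, the day of the outage, was a Friday.
    print(deploy_allowed(date(2024, 7, 19)))
    # The Tuesday of that same week.
    print(deploy_allowed(date(2024, 7, 16)))
```

A real gate would probably also need an override for emergency fixes, which is left out here.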

[–] [email protected] 1 points 4 months ago (1 children)

Was it not possible for MS to design their safe mode to still “work” when Bitlocker was enabled? Seems strange.

[–] [email protected] 3 points 4 months ago

I'm not sure what you'd expect to be able to do in a safe mode with no disk access.

[–] [email protected] 1 points 4 months ago

rolling out an update to production that there was clearly no testing

Or someone selected "env2" instead of "env1" (#cattleNotPets names) and tested in prod by mistake.

Look, it's a gaffe and someone's fired. But it doesn't mean fuck ups are endemic.

[–] [email protected] 23 points 4 months ago (1 children)

I think you're on the nose, here. I laughed at the headline, but the more I read the more I see how fucked they are. Airlines. Industrial plants. Fucking governments. This one is big in a way that will likely get used as a case study.

[–] [email protected] 13 points 4 months ago

The London Stock Exchange went down. They're fukd.

[–] [email protected] 18 points 4 months ago (1 children)

Yeah saw that several steel mills have been bricked by this, that's months and millions to restart

[–] [email protected] 10 points 4 months ago (2 children)

Got a link? I find it hard to believe that a process like that would stop because of a few windows machines not booting.

[–] [email protected] 13 points 4 months ago (1 children)

a few windows machines with controller application installed

That's the real kicker.

[–] [email protected] 15 points 4 months ago (4 children)

Those machines should be airgapped and no need to run Crowdstrike on them. If the process controller machines of a steel mill are connected to the internet and installing auto updates then there really is no hope for this world.

[–] [email protected] 4 points 4 months ago (1 children)

But daddy microshoft says i gotta connect the system to the internet uwu

[–] [email protected] 2 points 4 months ago

No, regulatory auditors have boxes that need checking, regardless of the reality of the technical infrastructure.

[–] [email protected] 2 points 4 months ago

I work in an environment where the workstations aren't on the Internet; there's a separate network. There's still a need for antivirus, and we were hit with BSODs yesterday.

[–] [email protected] 1 points 4 months ago

then there really is no hope for this world.

I don't know how to tell you this, but....

[–] [email protected] 1 points 4 months ago

There is no less safe place than an isolated network. AV and XDR are not optional in industry, healthcare, etc.

[–] [email protected] 2 points 4 months ago* (last edited 4 months ago)

There are a lot of heavy manufacturing tools that are controlled and have their interface handled by Windows under the hood.

They're not all networked, and some are super old, but a more modernized facility could easily be using a more modern version of Windows and be networked to have flow of materials, etc more tightly integrated into their systems.

The higher precision your operation, the more useful having much more advanced logs, networked to a central system, becomes in tracking quality control.

Imagine if after the fact, you could track a set of .1% of batches that are failing more often and look at the per second logs of temperature they were at during the process, and see that there's 1° temperature variance between the 30th to 40th minute that wasn't experienced by the rest of your batches. (Obviously that's nonsense because I don't know anything about the actual process of steel manufacturing. But I do know that there's a lot of industrial manufacturing tooling that's an application on top of windows, and the higher precision your output needs to be, the more useful it is to have high quality data every step of the way.)

[–] [email protected] 16 points 4 months ago (1 children)

Testing in production will do that

[–] [email protected] 10 points 4 months ago (1 children)

Not everyone is fortunate enough to have a separate testing environment, you know? Manglement has to cut costs somewhere.

[–] [email protected] 9 points 4 months ago

Manglement is the good term lmao

[–] [email protected] -1 points 4 months ago (2 children)

Don't we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don't we have a better recovery system?

[–] [email protected] 11 points 4 months ago

Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.

[–] [email protected] 3 points 4 months ago* (last edited 4 months ago)

This didn't go through Windows Update. It went through the CrowdStrike software directly.

[–] [email protected] -1 points 4 months ago (2 children)

What lawsuits do you think are going to happen?

[–] [email protected] 5 points 4 months ago (1 children)

They can have all the clauses they like but pulling something like this off requires a certain amount of gross negligence that they can almost certainly be held liable for.

[–] [email protected] -1 points 4 months ago (1 children)

Whatever you say my man. It's not like they go through very specific SLA conversations and negotiations to cover this or anything like that.

[–] [email protected] 1 points 4 months ago (1 children)

I forgot that only people you have agreements with can sue you. This is why Boeing hasn't been sued once recently for their own criminal negligence.

[–] [email protected] -2 points 4 months ago (1 children)

👌👍

[–] [email protected] 1 points 4 months ago

😔💦🦅🥰🥳

[–] [email protected] 1 points 4 months ago (1 children)

Forget lawsuits, they're going to be in front of congress for this one

[–] [email protected] -2 points 4 months ago

For what? At best it would be a hearing on the challenges of national security with industry.