A single DNS race condition brought AWS to its knees (go.theregister.com)
from mhzawadi@lemmy.horwood.cloud to selfhosted@lemmy.world on 23 Oct 06:52
https://lemmy.horwood.cloud/post/881296

#selfhosted

IsoKiero@sopuli.xyz on 23 Oct 06:56 next collapse

So it is always DNS

Dubiousx99@lemmy.world on 23 Oct 07:10 next collapse

It’s always DNS

mhzawadi@lemmy.horwood.cloud on 23 Oct 07:11 next collapse

can confirm, it’s always DNS. Even when it looks like a network issue, it’s DNS

aarRJaay@lemmy.world on 23 Oct 09:23 collapse

Spotted the Network guy

ramble81@lemmy.zip on 23 Oct 12:09 collapse

Oh man. At one of my old companies, the devs would always blame the network, even after we spent a year upgrading and removing all SPOFs. They’d blame the network……

“Your application is somehow producing 2 billion packets per second and your SQL queries are returning 5GB of data”…. “See! The network is too slow and it has problems”

fushuan@lemmy.blahaj.zone on 24 Oct 01:50 next collapse

They might be referring to their brain network being too slow and having problems.

rumba@lemmy.zip on 24 Oct 02:24 next collapse

Dev: My app’s getting a 400 hitting the server. Your firewall changes broke it.

Me: You’re getting to the server, it’s giving you back a malformed request error. Most likely it’s a problem in your client.

Dev: it worked fine until you made that change in QA.

Me: Your server is in production.

After that, I just get too busy to look at it for a while… They figure it out eventually.

lka1988@lemmy.dbzer0.com on 24 Oct 08:03 collapse

Ah, klugerblickdummkopf

ijhoo@lemmy.ml on 23 Oct 07:50 next collapse

isitdns.com

NickwithaC@lemmy.world on 23 Oct 08:15 next collapse

I always view the source of websites like this and this is one of the worst I’ve seen. 217 lines of code (including inline JavaScript?!) and a Google tag for some reason, all to put the word YES in green on black.

Cyber@feddit.uk on 23 Oct 08:32 next collapse

Agreed, could be static HTML and a GIF.

Thanks, I won’t click that link.

ijhoo@lemmy.ml on 23 Oct 09:13 next collapse

Did not think of doing that.

I guess I never expected anyone to have fcking JavaScript on a page as simple as that

Randelung@lemmy.world on 24 Oct 02:05 collapse

How else would you center a div??

NickwithaC@lemmy.world on 24 Oct 05:24 collapse

Xylight@lemdro.id on 23 Oct 21:43 next collapse

this made me mad so i made a single, ultra minimal html page in 5 minutes that you can just paste in your url box

data:text/html;base64,PCFkb2N0eXBlaHRtbD48Ym9keSBzdHlsZT10ZXh0LWFsaWduOmNlbnRlcjtmb250LWZhbWlseTpzYW5zLXNlcmlmO2JhY2tncm91bmQ6IzAwMDtjb2xvcjojMmYyPjxoMT5JcyBpdCBETlM/PC9oMT48cCBzdHlsZT1mb250LXNpemU6MTJyZW0+WWVz

source code:

<!doctypehtml><body style=text-align:center;font-family:sans-serif;background:#000;color:#2f2><h1>Is it DNS?</h1><p style=font-size:12rem>Yes
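
For anyone curious how a page like that ends up in a `data:` URL, here is a minimal Python sketch. It assumes the payload is just the UTF-8 bytes of the markup, base64-encoded, behind a `data:text/html;base64,` prefix; the helper names `to_data_url` and `from_data_url` are made up for illustration.

```python
import base64

# The one-line page from the comment above, copied as-is.
HTML = (
    "<!doctypehtml><body style=text-align:center;font-family:sans-serif;"
    "background:#000;color:#2f2><h1>Is it DNS?</h1>"
    "<p style=font-size:12rem>Yes"
)


def to_data_url(html, mime="text/html"):
    """Base64-encode the markup and prepend the data: URL header."""
    payload = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return f"data:{mime};base64,{payload}"


def from_data_url(url):
    """Reverse the encoding: split off the header, decode the payload."""
    _header, payload = url.split(",", 1)
    return base64.b64decode(payload).decode("utf-8")


url = to_data_url(HTML)
print(url)
```

Decoding the URL with `from_data_url` should give back the same markup, which is a quick way to check what a pasted `data:` link contains before opening it.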
aBundleOfFerrets@sh.itjust.works on 25 Oct 01:28 collapse

Your website no longer uses DNS invalidating its use as a diagnostic tool lmao

Xylight@lemdro.id on 25 Oct 10:51 collapse

i never thought about that. i assumed the first page was just a joke website like “days since last JavaScript framework” always being zero

hexagonwin@lemmy.sdf.org on 24 Oct 00:18 next collapse

lmao, considering some of the meaningless comments there i’m starting to think it’s “vibe coded”.

rumba@lemmy.zip on 24 Oct 02:31 collapse

There have been 209 versions of that site

web.archive.org/web/…/www.isitdns.com/

It predates AI, but seems to have had some AI cleanup.

If it was truly just vibecoded, the comments would usually be on every element.

rumba@lemmy.zip on 24 Oct 02:26 collapse

I just did the same f’ing thing and came here to write your comment!

well done.

[deleted] on 23 Oct 11:02 collapse

.

AtariDump@lemmy.world on 24 Oct 04:28 collapse

<img alt="" src="https://lemmy.world/pictrs/image/fbe8d16d-38ae-4b13-84f7-815f6bf3f968.png">

HeartyOfGlass@piefed.social on 23 Oct 07:04 next collapse

Racist DNS!

Flax_vert@feddit.uk on 23 Oct 07:19 next collapse

Makes sense. DNS is quite a single point of failure

non_burglar@lemmy.world on 23 Oct 07:55 next collapse

It’s true.

It comes up at work, it comes up in discussions on Linux podcasts I listen to, it comes up here…

We have a big, dangerous impending problem in DNS.

Flax_vert@feddit.uk on 23 Oct 08:28 collapse

The issue here isn’t DNS. The issue here is a large portion of the internet relying on a single data centre on the US East coast. Ideally, a lot of competing hosting companies would exist so if one goes down, it’s just one service and very few people notice.

Onomatopoeia@lemmy.cafe on 23 Oct 08:49 next collapse

So much this.

Why is Signal hosted in one location on AWS, for example? That’s the sort of thing that should be in multiple places around the world with automatic fail over.

victorz@lemmy.world on 23 Oct 08:55 next collapse

I hope they work towards mitigating this risk from now on.

Flax_vert@feddit.uk on 23 Oct 10:04 collapse

I prefer end to end encrypted xmpp

chisel@piefed.social on 23 Oct 17:43 collapse

I prefer face to face communication, speaking in code and whispering in each other’s ears so nobody else can hear.

aBundleOfFerrets@sh.itjust.works on 25 Oct 01:31 collapse

Get a little tongue in there, maybe.

non_burglar@lemmy.world on 23 Oct 10:31 collapse

Yes, that’s true, I guess it’s a separate issue. But the way DNS currently runs is a problem waiting to happen.

possiblylinux127@lemmy.zip on 23 Oct 11:14 collapse

It is designed not to be. The RFC literally warns against single points of failure

slothrop@lemmy.ca on 23 Oct 07:21 next collapse

I DNS see that coming.

MadMadBunny@lemmy.ca on 23 Oct 07:31 next collapse

They got off sync.

falseWhite@lemmy.world on 23 Oct 07:40 next collapse

That’s what you get when you let go hundreds of employees from your cloud computing unit in favour of AI.

I hope they end up having to compensate all the billions of losses they caused to all the businesses and people.

otacon239@lemmy.world on 23 Oct 07:49 next collapse

Consequences? For Amazon?

lol… lmao even

falseWhite@lemmy.world on 23 Oct 07:58 next collapse

They do have contracts and are obligated to provide a certain “up time”, which is usually 99% or so. If they fail to provide that, they are liable to compensate for the losses.

Or do you think that Amazon is above the law and no other company could sue them?

It all depends on what kind of contracts they have.

BCsven@lemmy.ca on 23 Oct 08:33 next collapse

Most services have a clause that they are not liable for unforeseen issues… Depends how good the lawyers were when formalizing the contracts.

Passerby6497@lemmy.world on 23 Oct 10:48 collapse

Good luck arguing that a missed config counts as an ‘unforeseen issue’. If they go that route, people will be all over them for not being SOC compliant wrt change control.

BCsven@lemmy.ca on 23 Oct 14:33 collapse

They can try to argue that the latency issue and the stale state were an unknown / unanticipated problem. Like when half of Canada’s Rogers network went down, affecting most debit payment systems. Testing of routing showed it OK, the real-world flip went haywire.
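
The "stale state" part is the classic shape of a check-then-act race: a worker checks that its plan is still current, gets delayed, and then writes after a newer plan has already landed. The toy below is a generic illustration of that pattern, not a reconstruction of AWS's system; `Record`, `apply_unsafe`, `apply_safe` and the endpoint names are all invented, and the "delay" is simulated by ordering the calls by hand instead of using threads.

```python
from dataclasses import dataclass


@dataclass
class Record:
    version: int = 0
    target: str = ""


def apply_unsafe(record, version, target, checked_ok):
    # Acts on a freshness check made earlier, which may be stale by now.
    if checked_ok:
        record.version, record.target = version, target


def apply_safe(record, version, target):
    # Re-checks freshness at write time, so an older plan can never win.
    if version > record.version:
        record.version, record.target = version, target
        return True
    return False


def demo():
    # Unsafe: the slow worker checks first, is delayed, and writes last.
    unsafe = Record()
    slow_check = 1 > unsafe.version  # True, nothing has been applied yet
    apply_unsafe(unsafe, 2, "new-endpoint", 2 > unsafe.version)
    apply_unsafe(unsafe, 1, "old-endpoint", slow_check)

    # Safe: same ordering, but the check happens at the moment of writing.
    safe = Record()
    apply_safe(safe, 2, "new-endpoint")
    apply_safe(safe, 1, "old-endpoint")
    return unsafe, safe


unsafe, safe = demo()
print(unsafe.target, safe.target)
```

In a real concurrent system the check and the write in `apply_safe` would also have to be atomic (a lock or a compare-and-set), otherwise the same window reopens.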

BakerBagel@midwest.social on 23 Oct 08:33 next collapse

Amazon has more money than most countries. They can outlast any company in court, or just ban you from their services in the future.

Onomatopoeia@lemmy.cafe on 23 Oct 08:48 next collapse

Depends on who we’re talking about. Companies like finance orgs are all about legal contracts and would be able to hold their feet to the fire.

You don’t want to go to court against a finance company or any very large org where contract law is their bread and butter (basically any large/multinational corp).

Amazon’s not hosting just small operations.

fushuan@lemmy.blahaj.zone on 24 Oct 01:59 collapse

Most banks have their data on Amazon/Azure. You don’t want to enrage banks.

Onomatopoeia@lemmy.cafe on 23 Oct 08:44 next collapse

Much of this stuff is automatic - I’ve worked with such contracted services where uptime is guaranteed. The contracts dictate the terms and conditions for refunds; we see them on a monthly basis when uptime is missed, and it’s not done by a person.

I imagine many companies have already seen refunds for outage time, and Amazon scrambled to stop the automation around this.

They’ll have little to stand on in court for something this visible and extensive, and could easily lose their shirt with fines and penalties when a big company sues over breach when they choose to not renew.

Just cause they’re big doesn’t mean all their clients are small or don’t have legal teams of their own.

Passerby6497@lemmy.world on 23 Oct 10:46 next collapse

99% uptime in a year gives you 3.65 days of downtime, which I think would still be within SLA (assuming nothing else happened this year). Though once you get to three-nines reliability (99.9%), you’ve only got a shift and change (about 8.76 hours a year) you can be down before you breach SLA.

If their reliability metrics are monthly, 99% gets you less than a shift of down time, so they’d be out of SLA and could probably yell to get money back.
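
The arithmetic behind those numbers is easy to check. A small sketch, assuming the SLA is a plain percentage of wall-clock time over the measurement window (real contracts often carve out maintenance windows and other exclusions); `downtime_budget_hours` is a made-up helper name.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(uptime_percent, period_days):
    """Hours of downtime allowed in the period before the target is missed."""
    return period_days * HOURS_PER_DAY * (100 - uptime_percent) / 100


for label, percent, days in [
    ("99% over a year", 99, 365),
    ("99.9% over a year", 99.9, 365),
    ("99.99% over a year", 99.99, 365),
    ("99% over a 30-day month", 99, 30),
]:
    print(f"{label}: {downtime_budget_hours(percent, days):.2f} h")
```

That works out to roughly 87.6 hours (3.65 days), 8.76 hours, 0.88 hours (about 53 minutes) and 7.2 hours respectively, so whether the window is yearly or monthly matters a lot for an outage that lasts most of a day.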

phoenixz@lemmy.ca on 23 Oct 10:56 collapse

I worked at a datacenter that sold clients 99.99% uptime.

Fun times with a maximum of about one hour of downtime per year for hundreds of servers

WASTECH@lemmy.world on 23 Oct 11:47 next collapse

These contracts do not stipulate reimbursement for lost revenue. The “uptime guarantee” just gets you a partial discount or service refund for the impacted services.

It is on the customer to architect their environment for high availability (use multiple regions or even multiple hyperscalers, depending on the uptime need).

Source: I work at an enterprise that is bound by one of these agreements (although not with AWS).

village604@adultswim.fan on 23 Oct 12:07 next collapse

It’s not at all uncommon for fines to be built into an SLA

CheezyWeezle@lemmy.world on 23 Oct 12:30 collapse

SLA contracts can have a plethora of stipulations, including fines and damages for missing SLO. It really depends on how big and important the customer is. For example, you can imagine government contracts probably include hefty fines for causing downtime or data loss, although I am not involved with or familiar with public sector/ government contracts or their terms.

You can imagine that a customer that is big enough to contract a cloud provider to build new locations and install a bunch of new hardware just for them, would also be big enough to leverage contract terms that include fines and compensation for extended downtime or missing SLO.

I work at a data center for a major cloud provider, also not AWS

87Six@lemmy.zip on 24 Oct 04:12 collapse

Oh yea, other companies will sue them, and when Amazon completely fails they will be bailed out with consumers’ tax money. Or did we already forget that’s what happens?

SeeMarkFly@lemmy.ml on 23 Oct 08:06 collapse

They have ORANGE ass makeup on their lips. How did THAT get there???

Zwuzelmaus@feddit.org on 23 Oct 08:03 next collapse

That’s what you get when you let go hundreds of employees

OK but then… what happens when their boss jerk fires hundreds of thousands?

lemmy.ca/post/53821900

Auli@lemmy.ca on 23 Oct 09:32 next collapse

Silly peon, rich people don’t suffer consequences.

phoenixz@lemmy.ca on 23 Oct 10:53 next collapse

Was it proven that AI was the cause?

I’m not saying it wasn’t, just that if it really was, I’d like a source for that claim

jaybone@lemmy.zip on 23 Oct 13:10 next collapse

There was an article in my lemmy all feed yesterday claiming so. But it was a super questionable shady site, which people were calling out.

Serinus@lemmy.world on 23 Oct 17:25 next collapse

No, but it clearly wasn’t the solution. They likely could have used some of those people they fired for that.

FreedomAdvocate@lemmy.net.au on 23 Oct 23:12 collapse

There was never any evidence to even suggest that AI was the cause, but as you’re on lemmy I’m sure you know that AI is currently blamed for pretty much everything.

phoenixz@lemmy.ca on 24 Oct 08:16 collapse

Just because this may NOT have been caused by AI doesn’t mean that AI in 99% of places isn’t absolute horse shit

FreedomAdvocate@lemmy.net.au on 24 Oct 16:20 collapse

Saying it was caused by AI despite zero evidence of AI causing it is dumb. It wasn’t AI, it was a DNS change made by a person.

The whole thing has nothing to do with AI, other than people who hate AI trying to make it about AI.

possiblylinux127@lemmy.zip on 23 Oct 11:12 next collapse

Mistakes happen with or without AI

The problem is that the current internet is structured in a way that creates high risk systems that can cause a massive outage. We went from having thousands of independent companies to a handful of massive ones. A mistake by a single company shouldn’t be able to black out half the internet.

bigboitricky@lemmy.world on 23 Oct 12:17 collapse

Oops! All slop!

ReedReads@lemmy.zip on 23 Oct 07:40 next collapse

Ironically, my pihole is blocking that link. So here’s a clean one: www.theregister.com/…/amazon_outage_postmortem/

joeldebruijn@lemmy.ml on 23 Oct 07:41 next collapse

<img alt="" src="https://lemmy.ml/pictrs/image/b97c82f9-9311-412d-8c5b-4d6e7646b524.jpeg">

Laser@feddit.org on 23 Oct 08:10 collapse

Luckily, it’s not the entire Internet, just the unfun part.

amino@lemmy.blahaj.zone on 23 Oct 10:05 collapse

Signal is definitely part of the fun internet, they just decided to rely on AWS due to techbro culture I assume?

dubyakay@lemmy.ca on 23 Oct 19:06 collapse

They rely on AWS due to a favourable hosting contract, and also as a proof of concept that they can be hosted securely on a hostile provider, without the provider having any clue what data is being sent between the parties.

amino@lemmy.blahaj.zone on 23 Oct 19:53 collapse

sure, proving to the audience that you can kick yourself in the nuts over and over while maintaining the privacy of your testicle’s innards is impressive from a biological standpoint but it still looks stupid to a normal person. I don’t hate signal, I will continue using it but this and their crypto scam makes me doubt some of their choices and how they’ll operate in the future

dubyakay@lemmy.ca on 23 Oct 19:56 collapse

Huh? What crypto scam?

amino@lemmy.blahaj.zone on 23 Oct 20:13 collapse

Moxie works for MobileCoin and implemented a crypto wallet for it inside of Signal which could be an attempt to sneakily monetize the Signal userbase

amino@lemmy.blahaj.zone on 23 Oct 20:51 collapse

after looking further into it, he might’ve also been involved in the MobileCoin pump and dump

dubyakay@lemmy.ca on 23 Oct 23:57 collapse

I finished reading both articles and they are very heavy on conjecture. It does raise an eyebrow though, but still.

Yes, in an ideal world we would be using SimpleX, XMPP, or similar. However it was already hard enough getting ten non-tech contacts to switch to Signal. The barrier to entry is higher on other platforms or protocols.

amino@lemmy.blahaj.zone on 24 Oct 02:51 collapse

you just described the same exact reason why I’m on signal. it’s friendly to normies and most other options aren’t even though they should be

dubyakay@lemmy.ca on 24 Oct 13:09 collapse

This thread has set me on the path of getting two of my nerds over onto simplex or XMPP. Maybe matrix.

magic_lobster_party@fedia.io on 23 Oct 08:02 next collapse

It’s not DNS

There’s no way it’s DNS

It was DNS

the_q@lemmy.zip on 23 Oct 08:44 next collapse

<img alt="" src="https://lemmy.zip/pictrs/image/8205977a-ac09-4ba8-b6fb-8fc9434b1cb8.gif">

possiblylinux127@lemmy.zip on 23 Oct 09:52 next collapse

That and BGP

MelodiousFunk@slrpnk.net on 23 Oct 11:47 collapse

If I had a nickel for every time clearing the ARP tables fixed a problem, I’d have a shitload of nickels.

possiblylinux127@lemmy.zip on 23 Oct 12:35 collapse

If clearing the ARP tables fixes the issue you have bigger problems

MelodiousFunk@slrpnk.net on 23 Oct 12:48 collapse

These things happen when a skinflint company contracts out network setup for a decade, gets acquired by another skinflint company who axes the contractors and doesn’t hire on-site network personnel, gradually builds out infra on top of the unsupported foundation, and then hires c suite buddies who want to bring in their own people to further muddy the waters.

sleepmode@lemmy.world on 24 Oct 11:05 collapse

Like every MSP ever. When your CEO that started the company in college suddenly shows up in a green Lamborghini it is time to spruce up the resume.

evidences@lemmy.world on 23 Oct 11:34 collapse

<img alt="" src="https://lemmy.world/pictrs/image/64905b99-811c-497d-acc5-46f4d05e3fca.jpeg">

popcornpizza@lemmy.blahaj.zone on 23 Oct 08:12 next collapse

So, in the end they turned off the thing that caused this whole mess and everything is still working.

What’s the point of having it, then?

Cyber@feddit.uk on 23 Oct 08:37 next collapse

I’m glad these things happen… it keeps everyone aware that cloud is fragile and Plan B should be considered for mission critical tasks.

I’m also hoping that it will improve cloud resiliency because a complete / partial restart of cloud systems needs a whole different approach than maintaining a running system.

possiblylinux127@lemmy.zip on 23 Oct 09:51 collapse

Many different companies abruptly realized they need a DR plan for cloud outages

TommySoda@lemmy.world on 23 Oct 09:20 next collapse

This is purely anecdotal, but I have been running into a lot of DNS issues over the past couple months where I work. 3 of the computers and even one of the laptops for remote work were having DNS issues that needed to be fixed. One even needed Windows reinstalled after fixing the DNS issue (Which was probably unrelated, but worth mentioning)

I’m honestly starting to think that the internet in general might be imploding. Not sure why, but replacing so many developers and programmers with AI might be responsible. Who knows, but it’s definitely very strange.

possiblylinux127@lemmy.zip on 23 Oct 09:50 next collapse

The biggest issue is how centralized the internet has become. It went from a bunch of local servers to a handful of cloud providers.

We need to spread things out again

metaStatic@kbin.earth on 23 Oct 14:26 next collapse

That's not how capitalism works though

Canopyflyer@lemmy.world on 24 Oct 06:47 collapse

But but Bezos has to pay for another rocket and yacht and he just got married!!! Think about his quarterly statement! My god are you heartless!!!

/s

(just in case it’s not obvious)

ubergeek@lemmy.today on 23 Oct 12:16 collapse

A huge problem is developers who lack a fundamental understanding of how the internet even works. I’ve had to explain how short, unqualified names resolve vs how FQDNs resolve. Or why you may not even be able to reach another node in your proverbial cluster, because they are on different subnets. Or why using GUIDs as hostnames is a generally bad idea, and will cause things to fail in unpredictable ways, especially with deeply nested subdomains.
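
To make the short-name vs FQDN point concrete, here is a rough Python model of how a stub resolver expands a name using a search list. It is a simplified sketch of the `search` / `ndots` behaviour described in resolv.conf(5), not a real resolver; `candidate_names` and the `.example` domains are invented for illustration, and actual resolvers differ in the details.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    absolute = name + "."
    with_search = [f"{name}.{suffix.rstrip('.')}." for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + with_search
    # Too few dots: walk the search list first, then try the bare name.
    return with_search + [absolute]


search_list = ["corp.example", "example"]
for query in ["db", "db.prod", "db.corp.example."]:
    print(query, "->", candidate_names(query, search_list))
```

The same short name can therefore land on different hosts depending on the search list of the machine asking, which is one reason a lookup that works on one box fails on another.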

GreenKnight23@lemmy.world on 23 Oct 13:20 next collapse

I have worked with too many devs that didn’t even know what the 7 layers/OSI are or why they exist.

they didn’t know what a network port was used for and why it’s important to not expose 3306 to the internet.

they couldn’t understand that fragmentation of a message bus occurs when you don’t dedupe the contents.

you know, morons.

metaStatic@kbin.earth on 23 Oct 14:27 collapse

Ah, the common clay of the new Web

Appoxo@lemmy.dbzer0.com on 23 Oct 15:23 collapse

GUIDs?
Could you expand on that topic? :)

ubergeek@lemmy.today on 23 Oct 16:33 collapse

guids like these: guidgenerator.com

aesthelete@lemmy.world on 23 Oct 20:48 next collapse

Why the fuck would anyone use a guid as a hostname?

My favorite I’ve seen in the category was when they had hostnames that were basically the IP address decorated with some bullshit. Like yeeeeeeeeah, that totally makes fucking sense. 😆

Appoxo@lemmy.dbzer0.com on 23 Oct 22:52 collapse

I’ve seen those with public routing servers.
Example: IP-127.0.0.1.dtag.de

Makes sense there or for webservers.
But anywhere else? Lol not really

Appoxo@lemmy.dbzer0.com on 23 Oct 22:51 collapse

Why would someone want that as their hostname???
I’d understand mountpoint but that?

oeuf@slrpnk.net on 23 Oct 10:09 next collapse

They should check out YUNOhost.

WhatsHerBucket@lemmy.world on 23 Oct 11:13 next collapse

It was the best race anyone has ever seen 🫲🍊🫱

BrianTheeBiscuiteer@lemmy.world on 23 Oct 15:01 collapse

Let’s be honest, not all races are equal
🫲🍊🫱

Kolanaki@pawb.social on 24 Oct 02:34 collapse

Worst Race: Daytona 500.

Best Race: Kentucky Derby.

GreenKnight23@lemmy.world on 23 Oct 13:14 next collapse

oh sure, when they fuck up DNS it’s a “race condition”.

when I fuck up DNS it’s a “fireable offense”.

sommerset@thelemmy.club on 24 Oct 01:13 collapse

It’s funny the AWS report didn’t mention that 40% of sysops were replaced by AI. blog.stackademic.com/aws-just-fired-40-of-its-dev…

StopSpazzing@lemmy.world on 24 Oct 07:22 next collapse

Wasn’t that source from a year ago?

sommerset@thelemmy.club on 24 Oct 08:55 collapse

No

StopSpazzing@lemmy.world on 24 Oct 10:09 collapse

You are right, it was from July, and there have been no other confirmed layoffs from credible sources since.

sommerset@thelemmy.club on 25 Oct 03:49 collapse

Are you saying that you believe America has fair and open media that would publish some of this against Bezos?

ZILtoid1991@lemmy.world on 25 Oct 02:00 next collapse

They need to uphold the AI hype, at any cost possible.

finitebanjo@lemmy.world on 25 Oct 02:49 collapse

I KNEW IT. It feels good to have my suspicions validated like this. The biggest companies are the ones most hyped over useless AI, and it’s going to destroy them.

sommerset@thelemmy.club on 24 Oct 01:13 next collapse

It’s funny the AWS report didn’t mention that 40% of AWS sysops people were replaced by AI right before. blog.stackademic.com/aws-just-fired-40-of-its-dev…

aBundleOfFerrets@sh.itjust.works on 25 Oct 01:26 collapse

this is unconfirmed and unlikely

sommerset@thelemmy.club on 25 Oct 02:36 collapse

“Leave a billion dollar company alone, leave it alone” Bro, it’s the likeliest thing ever

regedit@lemmy.zip on 24 Oct 05:33 next collapse

Unbelievable, racism even exists in networking!

StopSpazzing@lemmy.world on 24 Oct 07:20 next collapse

Beat me to it!

Zron@lemmy.world on 25 Oct 01:47 collapse

Those damn ones

theoriginalcows@lemmings.world on 24 Oct 05:59 next collapse

I love it when meme-tech fails.

pokexpert30@jlai.lu on 24 Oct 06:05 collapse

Just one more layer bro, just one more automated planning system bro and this time it will be entirely faultless please bro one more layer

HurlingDurling@lemmy.world on 24 Oct 18:11 collapse

I know a dude that talks like this… Like I hear his voice when I read this.