Overview
So I made this forum to work on one specific piece of software that I
think could benefit Lemmy (and the overall fediverse community)
substantially. I'll lay out what I want to make and why, in some
detail. I apologize for the length, but I can't really do this without some level of support and agreement from the community, so hopefully the wall of text is worth it
if it resonates with some people and they're swayed to support the
idea.
If something like this already exists please let me know. I looked
and couldn't find it, which is why I'm making this extensive pitch
about it being a good idea. But, if it's already in the works, I'd be
just as happy working on existing tech instead of reinventing it.
So:
The Problem
In short, the problem is that you have to pay for hosting. Reddit
started as a great community, just like Lemmy is now, but because it
was great it got huge, which meant they had to pay millions of
dollars to run their infrastructure, and now all of a sudden they're
not a community site anymore. They're a business, whether they like
that or not. Fast forward fifteen years and look how that turned out.
I think this will impact Lemmy in the future, in very different ways
but still substantially. It's actually already, at this very early
stage, impacting Lemmy: There are popular instances that are
struggling under the load, and people are asking for donations because
they have hosting bills. Sure, donations are great, and I'm sure these
particular load problems will get solved -- but the underlying
conflict, that someone who wants to run a substantial part of the
network has to make a substantial financial investment, will remain.
Because of its federated nature, Lemmy is actually a lot better
positioned to resist this problem. But, it'll still be a problem on
some level (esp. for big instances), and wouldn't it be better if we
just didn't have to worry about it?
The Solution
Basically, I propose that all users help run the network. Lemmy is a
big step forward because a lot more of the users can help than before, but
even in Lemmy, only a small fraction of people will choose to make
instances, and you'll still have big instances serving lots of
content. I propose to make it trivially easy for the end-users to
carry the load. They can install an app on their phones, or a browser
plugin, or run something on their home computer, but they have
absolutely trivial ways to use their hardware to add load capacity. I
think instance load will drop substantially just from that option
existing. I would actually argue for taking it a
step further and having instance operators be able to require
load-carrying by their users, but that's a choice for the individual
operators and the community, based on observation of how this all
plays out in practice.
One Implementation
It's easy to talk in generalities. I'm going to describe one
particular way I could envision this being implemented. This proposed
approach is actually not specific to Lemmy -- it would benefit Lemmy
quite a lot I think, but you could just as easily use this technology to
carry load for a Mastodon instance or a traditional siloed web
site. It's complementary to Lemmy, but not specific to it. Also, this
is going to be somewhat technical, so feel free to just skip to the
next section if you're just interested in the broad picture.
So like I said, I propose to make peer software that provides capacity
to the system to balance out the load you're causing as an
end-user. The peer is extremely simple -- mostly it runs a node in a
shared data store like IPFS or Holepunch, and it serves
content-addressable chunks of data to other users. You can run it as
an app on your phone if you have unlimited data, you can run it as a
browser plugin (which speeds up your experience as a user, since it'll
have precached some of the data the app will need), you can run it on your
computer back at home while you access Lemmy from the road, etc. The
peer doesn't need to be trusted (since it's serving
content-addressable data that gets double-checked), and it doesn't
need to be reliable or always on. The system keeps rough track of
how much capacity your peer(s) have added, and as long as that roughly
balances what you've consumed as a user, you're fine if your peer goes
away for a couple of days or something.
When you, as a user, open your Lemmy page served by the instance, what
you get served back is tiny: Just a static chunk of bootstrapping javascript, a
list of good peers you can talk to, and a content hash of the "root"
of the data store. What the bootstrapping code does is start at
the "root" of what it got told was the current state of the content,
and walk down from there through the namespace, fetching everything it
needs (both the data and the Lemmy app to render it and interact with
it) by making content-addressable requests to peers. Since it all
started with a verified content hash, it's all trustable.
It's important that the bootstrapping code in the browser verifies
everything that it gets from every peer. You can't trust anything you
get from the peers, so you verify it all. Also, you don't trust the
peers to be available -- the bootstrapping code keeps track of which
ones are providing good performance, and doesn't talk to just a single
one, so if one is overloaded or suddenly drops out, the user's
experience isn't ruined. Also, you're able to configure a peer you're
running to always keep a full mirror of some part of the data store
that you're invested in. That's vital, because this system can't
magically make all data always available without anyone
thinking about it -- it just decouples (1) an instance you can always
reach, which is probably on paid hosting, from (2) a peer which
provides the heavy lifting of load capacity, but might drop out at any
time, i.e. can run on unmetered consumer internet. You as a moderator
still need to ensure that (1) and (2) are both present if you want to
ensure that your content is going to exist on the system.
The end result of this is that the end-user's interaction with the
system only places load on the instance when it first fetches the
bootstrapping package. My hope is that this load is small enough that you can run a fairly
busy instance on a $20/month hosting package, instead of paying
hundreds or thousands of dollars a month. Also, like I said, I think culturally it would be way better if running a peer
was a requirement to access the instance. That's up to the individual
instance operators, obviously, but to me people shouldn't just be
entitled to use the system. They have to help support it if they're
going to add load (since it's become trivial enough that that's
reasonable to ask). Aside from ensuring load capacity, I actually
think that would be a big step up culturally -- look at the moderation
problems every online forum has right now because people are empowered
to come onto shared systems and be dicks. I think having your use
of the system contingent on fulfilling a social contract is going to
empower the operators of the system a lot. If someone's being
malicious, you don't have to play whack-a-mole with their IP addresses
to try to revoke their entitlement to be there -- you just remove
their status as a peer and their privilege to even use the system
you've volunteered to make available in the first place.
I've handwaved aside some important details to paint the broad
picture. How do updates to the content happen? How do you index the
data or make it relational so you make real apps on top of this? How
do you prevent malicious changes to the data store? How is a peer that's
port-restricted or behind NAT still able to help? These are obviously
not minor issues, but they're also not new or extraordinary
challenges. This is already long enough, so I'll make a separate post
addressing more of the nitty-gritty details.
What's the Result?
So to zoom back out: One result, hopefully, is that the experience
becomes faster from the end-user perspective. Hopefully. I believe
that the increase in capacity will more than make up for the slowness
introduced by distributing the data store, but that's just theory at
this point. I would also argue that this will start to open up
possibilities like video streaming that are hard to do if instances host all the content. But regardless of
that, I think big popular instances not having to pay ever-increasing
hosting costs is huge. It's necessary. It's not a trivial
benefit. And, in addition to that and the cultural issues, I think
this improves the overall architecture of the system in one more very
significant way:
Because the Lemmy app itself becomes static (AJAX-utilizing javascript
which exists fully within the shared data store), it becomes trivial
to make your own custom changes to the app even if you don't want to
run an instance. You
can clone the Lemmy app in the data store, make revisions, and then
tell the system that you want to see your same data but rendered with
the new version of the web app. Ultimately the entire system becomes a
lot more transparent and flexible from a tech-savvy user's
perspective. You don't have to interact with "the Lemmy API" in the
same way people had to interact with "the Reddit API" -- your modified or independent app just
interacts directly with the data. This is a huge shift further in the same
direction that started with federating the servers in the first
place. Part of the further future, beyond this document, is the
possibility of opening up a lot more tinkering for
tech-savvy end users, and expanding what even non-techy end users
would be able to do with the apps they're interacting with.
Getting It Done
I think I'm hitting a length limit, so I'll fill in the details of the first steps I want to take down in the comments.