Routing Working Group
23rd May 2024
2 p.m.
BEN CARTRIGHT-COX: Okay, I think we are ready to get things rolling. Hello, welcome to the Routing Working Group in glorious Krakow, Poland. First of all, meet your Chairs; Ignas is not with us right now, but I am sure he will be around on Meetecho. So let's look at the line-up today. Today we have bgpipe, we have a talk by me and Zbynek, we have Job Snijders on validation reconsidered, then we finish up with low latency RPKI validation, and we will have some administrative matters with the Chair replacement procedures. Please rate the talks as you go; this should be a reminder to open your laptop or phone and log in, please do this. We normally get about three ratings per talk, and I would really like to have more data points than three, so please rate them as you go, it makes a massive difference.
Quick poll: we clash this session with the Open Source Working Group, so we and Open Source are curious to know whether you would have gone to Open Source if Routing was not at the same time. If you could just raise your hands. I am going to say 25% maybe; that's good to know, thank you.
With that, I will take it to the next talk, which is BGP pipe.
PAWEL FOREMSKI: Hello, and that was a Polish word you can learn, it means "hi" in Polish. I came here from a city close to Krakow, so I probably had to travel the least out of all of us here.
I work at the Institute of Theoretical and Applied Informatics of the Polish Academy of Sciences, which is the meaning of the abbreviation here. Besides that, I have also been writing open source software for networks for quite a while, and currently, besides my academic employment, I am also employed in the cybersecurity industry.
My scientific interests are the protocols that run the Internet, that is DNS, BGP and IPv6, and this is actually my second RIPE talk.
Previous was a few years ago.
So, I am here today to talk about bgpipe. As you all know what BGP means, bgpipe is here to bridge the gaps. A while ago I was working on a paper about an attack against BGP called Kirin, and with my co-authors we wanted to provide a practical fix besides just theory, so this is one of the reasons why I started bgpipe. But the goal is broader; with this project I want to enable more innovation in BGP, that is, to bring theory and ideas to life in a laboratory, but the ultimate goal is also better protection of routing.
You probably remember ExaBGP presented at the last meeting, but bgpipe is different: it doesn't require modification to the BGP speakers, it's an old-style proxy, so you can deploy it pretty easily.
In case someone finds some grave flaws in BGP routers and your vendors are not willing to cooperate, the hope is that bgpipe helps you extinguish everyday fires when you need to. And finally, I wanted a better alternative to existing common tools in open source BGP.
So, what is bgpipe? The tag-line is: a reverse proxy and firewall for BGP. What does that mean? It's a proxy that allows you to build pipelines for processing messages between BGP peers. For instance, right now you can build a man-in-the-middle proxy that dumps all the conversation between the peers in JSON format, just for archiving or to build some metrics, for instance. It can also translate BGP to JSON and back, so you can actually speak JSON to it and it will translate that into real BGP messages. You can already tunnel BGP sessions over WebSockets; maybe that's a little bit of a crazy idea, but you can move part of your BGP processing to some remote machine.
I will show why it might be useful.
And finally, it allows you to pipe your sessions through external programs like Python scripts or sed scripts (please don't do that). In my talk I will show how to add RPKI validation to possibly old routers, and in the future it could enable even more kinds of validation once they take off.
The general goal is broader. I want this project to be a universal processor for tinkering with BGP, and the ambitious goal is to make it reliable enough that you can feel confident that, if you have this fire and you need to extinguish it, bgpipe might be the tool to do that.
It consists of two main parts. The one that is not so visible is the library that took me most of the time to implement, called bgpfix. And bgpipe, on top of that, is a command line tool that makes use of that library. It started last year and is licensed under the MIT licence, so it's pretty flexible.
I believe it's ready for experimentation, for playing with it, and I am looking for early adopters and feedback and what to do in future with this project.
Quick overview of how bgpipe works. Surprise: imagine two BGP peers, left and right. My vision is that I will be able to use transparent proxy techniques so you will not need to make any modifications to the configuration on the left and right, and the speakers won't even notice there is a man-in-the-middle proxy. For now, you need to modify just the IP address on one of the peers. bgpipe listens on a socket on one side and connects on the other side, but it doesn't by default modify any BGP attributes, so it's not like an ordinary BGP router; it is invisible on the BGP layer.
The basic idea is simple: whenever a BGP message arrives from a peer, bgpipe can run this message through a pipeline of processing stages, and those stages are what you configure; you can build your pipeline as you want, as you need to. A stage can be a background process operating on the JSON representation of this BGP message; such a stage can, for instance, modify the message, drop it or create new ones. Once it is done, bgpipe will marshal this JSON representation back to the wire format and send the modified message onwards, and this obviously happens in both directions.
The back-end for each stage is the bgpfix library I mentioned. It's a piece of software that allows you to register a set of callback functions to get called for matching BGP messages; such a function can modify, drop or create BGP messages. It can also emit events; an event is broadcast to all stages, a stage can subscribe to particular events, and an event can carry a reference to a BGP message. For instance, if there is an event that the session is established, you also have a reference to the BGP OPEN message with all its attributes and capabilities.
What can you already do with bgpipe? On the left-hand side you see the stages that are already implemented; these are kind of plug-ins that you configure into a pipeline. You can see a few categories. Obviously there are TCP/IP listen and connect stages, with support for TCP MD5. There is a category for file input/output: read/write means file read/write, and you can choose the format, so you can write MRT files, JSON files or raw BGP messages. For filtering I would highlight the stage that pipes the JSON representation of messages to the standard input of a background process; the process can do whatever it needs to with the message and then writes the outcome to its standard output, which is again marshalled and sent on to the peer or to the next stage in the pipeline. If you want to do that remotely, there is the WebSocket stage, which does pretty much the same over a possibly encrypted HTTP connection. Finally, for BGP there is a simple speaker implementation that allows you to send OPEN and keepalive messages, and the limit stage implements sophisticated IP prefix limits.
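To make the filter-stage idea concrete, here is a minimal sketch of what such an external background process might look like. This is an illustration only: it assumes each BGP message arrives as one JSON object per line on stdin and that whatever is printed to stdout is marshalled back to the wire; the field names used ("type", "reach") are placeholders, not the actual bgpfix JSON schema.

```python
#!/usr/bin/env python3
"""Minimal sketch of an external bgpipe-style filter stage (assumptions noted above)."""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue  # ignore anything that is not a JSON message

    # Example transformation: drop hypothetical UPDATEs carrying no prefixes,
    # pass everything else through unchanged.
    if msg.get("type") == "UPDATE" and not msg.get("reach"):
        continue

    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()  # flush per message so the proxy sees it immediately
```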
On the right-hand side, there are a few examples. First, a quick session; it's not a proxy yet. We write bgpipe -o, which means write messages to standard output; speaker means react to OPEN messages; and connect means open a TCP connection to this IP address, where we are expecting a BGP speaker on port 179. If an OPEN message comes in, the speaker will reply in a best-effort way, that is, with the ASN and ID received plus one. It's meant for debugging, please don't kill me for doing that, I know it's not best practice.
Then, a proxy. The first stage listens on a public IP address and the second stage connects to a private IP address, but adding TCP MD5; I know from my own practice that running multi-hop and adding TCP MD5 on the last hop can be problematic.
A crazy example, sed in the middle: if you really want to, you can run a sed script and rewrite ASNs in OPEN messages with that.
A more serious example: if you want to run your own route archive, now you can. This example shows how to do remote BGP message archiving over WebSocket. On the edge, basically, you run a speaker that connects to your router; in the middle, between the speaker and the TCP connect, you put a WebSocket stage which, for every message received in the left direction, will write the JSON representation to this WebSocket URL. On the archive host you run the WebSocket listener and write the output to a JSON file with a timestamp, so every 15 minutes it will generate a new file with the JSON representation of the BGP messages received on the edge, and it will compress the output.
Here is the proof of concept of how to add RPKI validation with Routinator. Excuse me for abusing it, I know it's not the optimal solution, but it's just for the sake of argument. Here the pipe is a connection between a public and a private IP address. In the middle we have this validator script, and on the left-hand side you can see it's just 20 minutes of Python with two functions: we iterate over every line of the input, we take the origin and prefixes from the JSON representation, and for every prefix we run an RPKI check. If any of these prefixes is invalid, the announcement is withdrawn, so we remove it; of course this is not IPv6, it's for IPv4 only. However, if all prefixes are okay, we just print what we read to the standard output, and that means accept the message as is.
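For orientation, the kind of per-prefix check such a script performs might look like the sketch below. This is not the script from the talk; it assumes a local Routinator instance exposing its HTTP validity endpoint (the path and port below are assumptions, adjust to your deployment), and it handles IPv4 only, mirroring the proof of concept.

```python
"""Sketch of a per-prefix RPKI origin check against a local Routinator (assumed endpoint)."""
import json
import urllib.request

ROUTINATOR = "http://localhost:8323"  # assumed default HTTP address

def rpki_state(origin_asn: int, prefix: str) -> str:
    """Return 'valid', 'invalid' or 'not-found' for an (origin, prefix) pair."""
    url = f"{ROUTINATOR}/api/v1/validity/AS{origin_asn}/{prefix}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    return data["validated_route"]["validity"]["state"]

def should_withdraw(origin_asn: int, prefixes) -> bool:
    """Withdraw the announcement if any of its prefixes is RPKI-invalid."""
    return any(rpki_state(origin_asn, p) == "invalid" for p in prefixes)

if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(should_withdraw(64500, ["192.0.2.0/24"]))
```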
A final proof of concept example: stopping the Kirin attack. I encourage you to check out the paper, it is about remote peering. Anyway, we found out with my co-authors that maybe it is time to propose new types of max prefix limits, so this stage implements the idea we describe in the paper. I highlight here that it is a stage implemented natively, so it's different compared with the Python script I showed you; it's an example of how to implement things at a lower level in bgpipe.
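As a rough illustration of the general notion of smarter prefix limits (this is neither the bgpipe limit stage nor the exact mechanism from the Kirin paper; thresholds and structure are made up), one could track distinct announced prefixes per session and per origin AS and trip a limit when a threshold is exceeded:

```python
"""Toy sketch of per-session / per-origin prefix limits (illustrative assumptions only)."""
from collections import defaultdict

class PrefixLimiter:
    def __init__(self, per_session_limit=1000, per_origin_limit=100):
        self.per_session_limit = per_session_limit
        self.per_origin_limit = per_origin_limit
        self.session_prefixes = set()            # all prefixes seen on this session
        self.origin_prefixes = defaultdict(set)  # prefixes seen per origin AS

    def accept(self, origin_asn: int, prefix: str) -> bool:
        """Return True if the announcement stays within both limits."""
        self.session_prefixes.add(prefix)
        self.origin_prefixes[origin_asn].add(prefix)
        return (len(self.session_prefixes) <= self.per_session_limit
                and len(self.origin_prefixes[origin_asn]) <= self.per_origin_limit)
```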
All right. I encourage you to use the knowledge you just gained today. bgpipe is easy to start with; you can use the information I put on the right-hand side, and once you have it installed it's quite simple. Please star the GitHub project to spread information about it. My take-away is that I want this to be a sustainable and stable project, I don't want it to be a one-man show, so I am looking for collaboration. That means software engineering contributions like PRs, and issues are also welcome, but also research projects.
All right. That's it, thank you so much.
(Applause)
Any questions?
AUDIENCE SPEAKER: What was your experience implementing your own BGP speaker as a library? How much work was it, and how complete is your coverage of the relevant RFCs?
PAWEL FOREMSKI: Completeness is more difficult to answer; BGP is so simple, as you all know. At least it can connect to popular BGP speakers already. Once I had the translation between wire format and JSON and support for BGP attributes, it wasn't that difficult. The BGP speaker that I implemented just opens the session; I don't handle the RIB or any state handling, so it was much easier compared to an ordinary BGP speaker that you can find in BGP routers.
AUDIENCE SPEAKER: Alexander: Thank you for your input, I believe it is in line with a trend that moves the place where we do the route selection process from the edge to maybe some place inside the network, but I have a question: what happens if one of the BGP sessions, where your proxy is in between, goes down?
PAWEL FOREMSKI: Yeah, it's an open problem to solve. Currently, bgpipe will just quit, so you would need to have some script that could restart it or some process that would control bgpipe, but definitely it's open and it depends on your use case, I guess.
AUDIENCE SPEAKER: Okay, thank you.
AUDIENCE SPEAKER: Peter Hessler: I was curious whether there is the ability to modify what bgpipe is doing within the sessions. For example, you start with a sed script, and later on you determine you need to add something else. Is there a way to reload that configuration, or do you need to restart the process and, critically, all the BGP communications?
PAWEL FOREMSKI: I think I didn't catch the end of your sentence, can you repeat the last part?
PETER HESSLER: Yeah, because if you need to restart the full bgpipe, then you have to reset all your BGP sessions?
PAWEL FOREMSKI: Yes, yes, of course. So, I would imagine that you would need to have, I don't know, Redis running on the side or another database to store the information you can't lose. Currently, if bgpipe quits, you lose everything, but I am not sure that should be the goal, given there are open source solutions for handling that.
PETER HESSLER: Thank you.
BEN CARTRIGHT‑COX: I think that's all for the questions, thank you very much.
(Applause)
So, a talk from yours truly, so, I am going to talk about reclaiming 240/4.
So, 240/4, Class E space: the thing that lives between 240.0.0.0 and just before the end of the Internet, or the end of the IPv4 Internet, right after multicast space. It currently sits in a reserved state, has existed for a really long time, and it didn't really matter until IPv4 exhaustion happened and the IPv4 machine stopped providing IPv4 addresses at scale. People then go and look for other addresses, and there will always be people who think IPv6 adoption will never happen, so here we are having this conversation. Class E is not the only address space that is being eyed up for reclaiming: technically 0.0.0.0/8 could be reclaimed, and 127/8, the loopback, could also be partially reclaimed with some very obvious exceptions. So it's an interesting thing.
I think it's interesting to go through some of the hearsay of how we got here in the first place. It was not obvious at the start that the Internet was actually going to be used the way it is, and famously it was not obvious that we would run out of the 32-bit address space. So several large blocks were left unusable for unicast. There's 0.0.0.0/8, the thing that you normally bind on; normally there's only one address there that you use, but there is an entire /8 reserved for it. Similar with the loopback one, same thing. There's 240/4, which was, as far as I can understand, reserved for a mythical third type of routing that has yet to be discovered; it is still yet to be discovered. Finally, there is multicast, an entire /4. There's a draft proposal out there, which I don't think has been submitted to the IETF, that suggests maybe some of those /8s could be reclaimed, because while one chunk of multicast is used, there is very poor usage of the others.
So, let me make my views known: you should be deploying IPv6. And I think Class E and the other aforementioned blocks will likely never get into the global routing table. This is because policy is really hard, and changing end-user devices, as has been proven with IPv6, is even harder. And finally, of course, even if you do manage to change the devices today, who wants addresses that might not work for some users? We kind of already have that; it's called IPv6. So Class E, however, is an interesting idea for local addressing.
So, it turns out for a lot of companies that 10/8 is not actually as big as someone hoped it was going to be. It's very easy to subdivide when you start your organisation (AWS users normally run into this one first), where you make very large allocations in your /8 space, then expand your company and run out of 10/8. However, 240/4 is very big, and there is some precedent for using that space for this. Amazon has publicly stated on the record that they use 240/4 in some of their link nets inside AWS. It also exists in some home networks: I did some research that suggested some SMB/home users are using 240/4 space. I don't know how they are doing it and I would love to know, but some of them are. Also, RIPE Labs did a great blog post exploring RIPE Atlas data that suggests some other users, who aren't Amazon, are already using 240/4 in their link nets. And finally, Canonical has a container product that uses 240/4 in all of its link nets.
People also use it as a weird bodge: Cloudflare has a feature that hashes an IPv6 address into the host bits. In my own experience nearly no one uses this, and it's very easy for a malicious client to abuse, but it exists, so it's something you would potentially have to contend with.
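To illustrate the general "pseudo IPv4" idea (this is not Cloudflare's actual algorithm, just a sketch of the concept of hashing an IPv6 address into the host bits of 240.0.0.0/4):

```python
"""Sketch of mapping an IPv6 client address into 240/4; illustrative only."""
import hashlib
import ipaddress

def pseudo_ipv4(addr_v6: str) -> ipaddress.IPv4Address:
    digest = hashlib.sha256(ipaddress.IPv6Address(addr_v6).packed).digest()
    host_bits = int.from_bytes(digest[:4], "big") & 0x0FFFFFFF  # keep 28 host bits
    return ipaddress.IPv4Address(0xF0000000 | host_bits)        # 240.0.0.0/4

print(pseudo_ipv4("2001:db8::1"))  # some address inside 240.0.0.0/4
```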
So, if you wanted to use this, unfortunately you hit against the thing that plagues us all, and that is vendors. Most vendors, Arista and a bunch of others, have some way of dealing with this, but there's one thing that doesn't work, and that's Windows. Unfortunately, the fact that Windows doesn't work means that for the foreseeable future this is basically a no-go, because while I am sure everybody loves their corporate installation of Windows 11, there are users still on Windows 7 or XP, so this stuff is not going to change for a very, very long time.
I was interested in testing purely what the router vendors are up to, so I built a router petting zoo where they can all peer with each other on the relevant protocols (I didn't test IS-IS because not all vendors support it), to see if they can send, transmit and deal with Class E routes, and here are the basic results:
This is just a test of who can set Class E link nets. IOS XR doesn't seem to care at all, which is great. RouterOS doesn't care at all, which is great. JunOS and Arista have special options that let you override, and then you can set it just fine. The rest flat out refuse: not valid space. It's not clear whether this is a software limitation or a hardware limitation, but it doesn't work, so a different experiment is set up here: instead, we make a routing petting zoo where the end clients have RFC 1918 space and the routers around them announce Class E space.
JunOS has no problem once you configure the magic words, but the DHCP server has the final call: it will not let you hand out such addresses.
On Arista, the testing platform says the hardware is not supported for that particular option. I assume it works; the fact that the option exists and is documented on their end means it probably does work, I just can't test it. I don't have a spare Arista device.
Something I discovered which I thought was pretty interesting, and kind of fatal if you wanted to deploy this: with OSPF, and I assume other flooding protocols, the software layer will transmit your Class E prefixes in the LSAs, however it won't install them. You can get to a point where, while the prefix stays in the OSPF area, a router thinks it can route through something, but it's not installed into the hardware, which means it is going to blackhole, which is very bad, or even worse it might hit a default route and do something nuts. This is pretty lethal. As mentioned, you have to make sure every one of the devices in your network is set up correctly in order for this to work. At the bottom you can see a traceroute here towards 240 space: the system thinks it can reach it, but it's going through a router that doesn't necessarily support installing that into hardware.
Testing the vendors themselves with this, we see basically the same thing, except IOS XE, which will relay these routes for both BGP and OSPF but won't install them into the FIB. That is part of the problem: it means it advertises them, but they won't go anywhere.
So, I thought at this point all of my testing was concluded. And then, out of nowhere, I got a surprise e-mail on one of the mailing lists from Quantcom, who stated they were also interested in testing Class E space and had announced a /16 out of Class E space for one of their experiments to all of their peers, which is super interesting. So I did the logical thing, went to Atlas and tested from all of their own Atlas probes on their networks whether they could reach it. It was about 50%, which is not bad, but it's also not necessarily good; 50% of your own network being able to reach something is... They had downstreams, so I tested them as well, and funnily enough also 50%; there are six data points, so the numbers could vary in reality.
So I think the interesting question here is: who wants address space that doesn't work? We could put a whole bunch of effort into doing this; however, I have already had experience of this, and other people may have too. Pre total depletion of IPv4, one of my old employers got a /12, so it was a very big deal; it was immediately announced, we put some testing on it, and it was immediately clear some bits of the Internet could not see it, and this was after prefix list generation and things like that.
So it turned out some people have, or had, bogon filters, and since the new /12 had been a bogon while it was unallocated, it was being filtered out, and it was a pain in the ass to fix. We moved some customers to the /12 and then worked with everybody involved to try and find the last places on the Internet where this /12 was being filtered out. This took about ten weeks; nearly every offender had fixed the problem at that point, and we assumed that if anyone still discovered it didn't work, that was the ISP's problem at that point.
Imagine doing that, but where the router vendors need to be altered and the configuration needs to be significantly altered. I don't see a way of making this viable, at least in the case of the global routing table. So, trying to be as fair and balanced as possible, what are the pros and cons? Your link nets can be completely separated and moved away. It also provides a temporary fix, of course, to your internal IP spacing problems; it might temporarily let you defrag stuff. We all know how good temporary solutions are, however.
The downside is that network vendor support is pretty poor, and the end-point support is poor because Windows doesn't support it, and unfortunately most things are still Windows. The other downside is that the so-called ownership of 240/4 is up for debate, which means that, should it be reclaimed, there might be a bit of a food fight involved, with all the RIRs trying to figure out how to split what is not an even number of /8s each. I am sure that would be fun. As previously mentioned, people's existing uses of the 240 space might collide with yours, and that might be pretty bad.
In summary, in my eyes only: getting 240/4 to work inside unicast networks is not too insane; it seems to be working well for some people. For inter-network unicast, that might be a bit nuts. As for reclaiming 0.0.0.0/8, or re-subnetting the loopback and the "this host" addresses, that's a lot of work for no real gain: you only recover a /8 or two from that, and that seems like an unbelievable amount of work for gaining two /8s, no matter how bad the IPv4 crisis is.
However, of course, practice is not theory, and we actually have our friend from Quantcom here to give his view of what things look like from his end on their experiments.
ZBYNEK POSPICHAL: Thank you, Ben. Last year I saw a pretty interesting topic on the NANOG mailing list, which I sometimes read, where somebody really seriously advised people to use address space from 240/4 for their project. Okay, after five minutes of laughing, I got the idea that it would be interesting to just try it, to see whether it is feasible or not. Okay.
So, first we started with RIPE Atlas probes, of course, to see what happens, because we sent it just to the Internet. We knew when we did it that it would be dropped by many operators and also dropped by route servers at IXPs, but not on many of the peering sessions. So we found, okay, a lot of operators accepted it, and where all the routers along the path coped with it, it somehow worked on a few occasions. So we found some RIPE Atlas probes from which we were able to traceroute to the announced IP address. It was really interesting from how many countries and ASNs it was possible to reach it. My favourite one is this one: it's in Russia, about 6,000 kilometres from this location, by a lake near the Trans-Siberian railway to China, and even from that network it was possible to reach a Class E IP address. But then I got an idea, especially from Ben, to try to ping the rest of the Internet, all these 3.6 billion IP addresses. So I tried it, and I got more than 180,000 responses to my pings, which is quite a lot, but it's still 0.005%. And it worked, somehow.
Yeah, now we know it was mostly because of IOS XR support and the lack of filtering in some specific networks.
Here we have a list of some networks we have been able to reach. The most interesting was probably the Quad9 name server, which dropped such so-called support at the end of April. Then we reached, of course, networks in Romania, a Cloudflare POP, half of Russia, something in Vietnam, the United Arab Emirates, etc., etc. So that you believe me: yes, this is a DNS request to Quad9 for ISC.org. So, yeah, it's somehow possible to use it, but, yeah, it's also a pretty insane idea to think it would be useful for any commercial project.
BEN CARTRIGHT-COX: So, yeah, should you use 240? No, unless you are completely out of options, but it's an interesting experiment to see how this would pan out in a universe where Windows didn't exist.
I think that is it. Do we have any questions? There is some raw data if you are looking for it, including the raw list of addresses that did respond to the scans, but other than that I will take questions.
AUDIENCE SPEAKER: I have got one question on Meetecho. Alexander asks: "Do I get it right that we, that if we do not start using 240.0.0.0/4, if we do reach the end of the Internet..." It's two different questions combined, so I can't make heads or tails of this.
BEN CARTRIGHT‑COX: Thank you for the comment.
AUDIENCE SPEAKER: Another one from Nico: even if we could practically reclaim the prefix, is merely delaying the address exhaustion problem worth the amount of engineering work? Wouldn't even internal uses be better served by going IPv6 on link nets, etc.?
BEN CARTRIGHT-COX: Of course. Some vendors support that; I think there was a talk in the IPv6 Working Group about doing v4 next hops with v6, so you can totally get away from this. Sometimes your situation may not allow you to, and the easiest thing to do is some horrors with 240/4.
AUDIENCE SPEAKER: France-IX. One thing that is often invoked by the people who oppose IPv6 is that IPv6 doesn't work and IPv4 does work. So, let's start using 240/4 and break IPv4, so that we bring the two to the same level. With IPv4 not working and IPv6 not working, everything is the same.
BEN CARTRIGHT‑COX: Yes.
AUDIENCE SPEAKER: Move to IPv6 after we break IPv4.
BEN CARTRIGHT-COX: It's funny you say that; I think the current numbers are actually 50% as well, so we can hit feature parity.
TOM HILL: One problem with your conclusion, using it as private IP space: I don't think we should encourage this, and there's one particular reason why. We have some examples of larger organisations that have run out of RFC 1918 space combined, which is about 22 million addresses, I think, in total, so obviously very large networks. I have heard of this from some enterprise, corporate, massive global oil-and-gas and banking types as well, but certainly from a telco perspective. The solution to this many years ago was, of course, that we have got to do IPv6, and I don't want to discourage that from being the answer. It's also well worth noting, if you are looking at this and thinking Amazon have done it, definitely take a little step back (anyone who is listening, not yourself in particular) and think about how much effort Amazon have gone to to build their networking underlay.
BEN CARTRIGHT-COX: Totally. I think it's always worth pointing out that, by a large margin, most of this room is not Amazon or Microsoft or Google, and there's a saying that you can't manufacture like Apple; these entities are in a class of their own and have their own problems to go with it.
TOM HILL: If I went back nine years or so and we had had this suggestion, that we could just move from private addressing to Class E addressing to solve our internal addressing exhaustion, we wouldn't have quite as many eyeballs on IPv6.
BEN CARTRIGHT-COX: I mean, ultimately, this is only a viable solution for link nets, as pointed out, so, you know, the network isn't just there to provide BGP communication between your shiny routers; end users do exist in the end. Hence the only solution there is IPv6. So this is more of a "you could bodge it if you are the network operator", but do not bodge it if you are the end user or running services; you should just provide IPv6.
TOM HILL: I agree.
JEN LINKOVA: I also had a problem with the conclusion, but I suggest phrasing it differently, using the well-known phrase: "all my competitors are doing this". Also, I am thinking now, and I know I am probably in the wrong Working Group: we know that if something exists on the Internet it eventually will end up in DNS, and I am curious whether anyone has looked at what of this gets put in DNS; that would be very interesting.
BEN CARTRIGHT-COX: It's funny you say that. I'm not going to hit back enough times to get to that slide, but the thing I was talking about with SMB and home networks, that was learned through DNS.
PETER HESSLER: When doing the BGP announcements, were you able to create route objects or RPKI ROAs to get them into the tables?
BEN CARTRIGHT‑COX: Can I answer that on your behalf? I think the answer is ‑‑ the answer was laughter.
PETER HESSLER: That's a very loaded question, because I know the answer, everyone knows the answer; it's very clear, and it's very silly that your neighbours and upstreams are accepting unsigned and unattributed address space from your ASN.
BEN CARTRIGHT-COX: I think the interesting thing to critically point out here is that I assume everybody who accepted the route has direct peering with Quantcom and is not filtering at all, like no filter, just "accept all". So there's definitely a second question around routing security; there are routing security implications to this research.
Thank you very much.
(Applause)
PAUL HOOGSTEDER: And now it's Job with RPKI validation reconsidered.
JOB SNIJDERS: Hello, everyone. I am here to talk about our Lord and saviour, RPKI.
Reasoning about trust in the RPKI is easier if we learn some terminology. In the RPKI, there is a thing called assumed trust and derived trust. Assumed trust is that you, as a network operator, install an RPKI validator and pick which of the trust anchors you include in your validator in order to make routing decisions. Derived trust is what the RPKI validator does, based on the trust anchors that you gave it in its trust store: it then securely computes, using subjective logic, what the outcome is.
Subjective logic in this context sort of means that we as a group agree that bit 5 means you can sign CRLs and bit 6 means you can sign other certificates, or the other way around. So subjective logic does not mean that one and one is three, no; it means that we establish a set of rules for how we parse objects and conclude that they are valid.
So, in other words, it is possible, in RPKI or a system like it, to have multiple validation algorithms that produce deterministic outcomes, and then the question comes up: which of the algorithms is the right one? And that may be subjective.
So, the current RPKI validation algorithm has some very sharp edges. It is defined in RFC 3779 and RFC 6487, and the implication of the current algorithm is that if a ROA or certificate contains entries unrelated to the thing at hand that is being checked, everything contained in that ROA or certificate will be considered invalid. In the following slides I will step through this process and show how the blast radius of the current algorithm is bigger than I think we want it to be.
Why would you encounter situations where there is a so-called overclaim? This can happen if you transfer IP space from one LIR to another LIR, or from one RIR to another RIR. Transfers happen multiple times a week, and this means that with the current algorithm, operators are at risk multiple times a week.
So, in the next few slides I will argue that the validation algorithm described in RFC 6487 is disproportionate in terms of outcome in the RPKI.
So how does the current algorithm work? If you look at the blue box in the lower left corner, this is the ROA payload. The ROA contains a payload and a certificate. The payload must be contained within the resources listed on the EE certificate in the ROA. The resources on the EE certificate in the ROA must all be contained within the resources listed on its parent certificate. All resources on the parent certificate must be contained in its parent certificate. So, even though we only wanted to validate the 10 /24, as we walk up the chain towards the Trust Anchor we have to verify containment of more and more resources that are unrelated to the ROA we eventually wanted to validate.
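As a minimal sketch of this strict containment rule (simplified to IPv4 prefixes only, ignoring AS resources and the "inherit" case, and not taken from any validator's code), one overclaimed resource anywhere in the chain invalidates everything below it:

```python
"""Simplified sketch of the current, RFC 6487-style resource containment check."""
from ipaddress import ip_network

def contained(child, parent):
    """Every child resource must be covered by some parent resource."""
    parents = [ip_network(p) for p in parent]
    return all(any(ip_network(c).subnet_of(p) for p in parents) for c in child)

def chain_valid_strict(chain):
    """chain[0] is the EE certificate's resources, chain[-1] the trust anchor's."""
    return all(contained(chain[i], chain[i + 1]) for i in range(len(chain) - 1))

# One stray /24 on an intermediate certificate breaks the whole chain:
print(chain_valid_strict([["10.0.0.0/24"],
                          ["10.0.0.0/24", "172.16.0.0/24"],   # overclaims 172.16
                          ["10.0.0.0/16"]]))                   # -> False
```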
There is an alternative algorithm, which I call the George/Geoff algorithm, because George Michaelson and Geoff Huston were, as far as I understand it, the first to propose a different way of validating RPKI objects. What they argued, and this is more than a decade ago, is that when you validate a ROA payload, you only need to verify, all along the chain up to the Trust Anchor, that that specific prefix is contained. So, let's step through an example of how the old algorithm has some downsides and the new algorithm might be better.
Here we have a small structure: in the lower left corner and lower right corner there are two ROAs, each with their own payloads. Each of them chains up through the EE (end entity) certificate in the ROA, and they are signed by the same certification authority.
And that one is signed by the green box that contains even more resources.
Now, if the parent of the parent delists one /24, all ROAs underneath that node in the graph disappear. And this means an outage for two ROAs, even though only one of them was actually affected by the delisting that happened at the parent's parent.
What we want to happen is that if the parent's parent delists the 172.16 /24, only the ROA that actually said something about 172.16 /24 becomes invalid, and the 10 /24, which is unrelated to 172.16 and has a valid path all the way up to the parent's parent, should continue to exist. So, the new algorithm isolates separate resources from each other, and I think this is helpful.
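The same example, sketched under the reconsidered rule (again simplified to IPv4 and not any validator's actual code): only the specific prefix the ROA speaks about has to be contained at every level, so unrelated overclaims no longer take it down.

```python
"""Sketch of the "reconsidered" check for a single ROA payload prefix."""
from ipaddress import ip_network

def payload_valid_reconsidered(prefix, chain):
    """chain[0] is the EE certificate's resources, chain[-1] the trust anchor's."""
    net = ip_network(prefix)
    return all(any(net.subnet_of(ip_network(r)) for r in level) for level in chain)

# The 10/24 payload survives even though a sibling 172.16/24 was delisted above:
chain = [["10.0.0.0/24", "172.16.0.0/24"],   # EE certificate (still claims 172.16)
         ["10.0.0.0/16"]]                     # parent no longer lists 172.16
print(payload_valid_reconsidered("10.0.0.0/24", chain))    # True
print(payload_valid_reconsidered("172.16.0.0/24", chain))  # False
```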
Now, as I said, this new algorithm is actually more than ten years old. Right from the get-go, when RPKI was invented and went through standardisation in the IETF, it was recognised that the algorithm that all validators use today has those hairy aspects, and an alternative was proposed. The IETF had to pick one of the two proposals and ended up with what I think is the sub-optimal choice. But the new algorithm was described in RFC 8360. That approach was not deployable: it basically required all validators to implement the new algorithm and, this is the bad part, all issuers of RPKI products to reissue everything with new code points; like asking everybody to upgrade to IPv6, that becomes a very hard proposition. So, RFC 8360 did describe the algorithm we want, but it was unable to capture a deployment strategy that would actually work on the real Internet.
So, what I think the path forward would be is to deprecate RFC 8360 and say: we take the algorithm but not the deployment strategy. Update the RFCs that outline the current algorithm, and only implement the new algorithm in RPKI validators like rpki-client, Routinator and rpki-prover. By reducing the burden of work to just the RPKI validator projects, adoption of the new algorithm becomes feasible, because it doesn't require coordination across the entire Internet with every RPKI issuer.
So, this proposal has been written down in an Internet draft; it is now adopted by the SIDR Ops Working Group, which governs RPKI. The to-do list extends into projects like OpenSSL and LibreSSL, to add a flag to the validator API to disable the old algorithm so that applications can implement the new algorithm in their own code.
Then the next step is to implement this in the likes of rpki-client and other validators.
With that, I open the floor for questions, comments? I hope I was able to articulate what this is about to some degree.
AUDIENCE SPEAKER: Tim RIPE NCC, one of the co‑authors of the reconsidered draft, actually.
I can assure you that the co‑authors of that draft did want it to be deployable, but at the time that was what we could get through the IETF. Also, back in the day, this was actually implemented in the RIPE NCC RPKI validator as well. So in that sense I think we also have proof of concept that it's very much doable from that side.
I think it would be really helpful for the network operator community to speak up and say that they want this, because we might meet some of the same resistance in the IETF that was brought up then, and having support from network operators saying "we want this, we don't want to be exposed to this blast radius" I think can really help.
JOB SNIJDERS: Thank you, Tim. Any other questions?
Then, a little bit of promotional work, because the RPKI is not done and finished. You can download the slides on the website. There are many other exciting improvements passing through SIDR Ops at this point in time, and here are some references if you have trouble sleeping and want to read something.
Right, thank you all.
(Applause)
BEN CARTRIGHT‑COX: Time for our final full talk, low latency RPKI validation.
MIKHAIL PUZANOV: Good afternoon, everybody. I am Mikhail, I work for the RIPE NCC, but this talk is more on a personal account, dedicated to some work that I have been doing on a project that I maintain.
So, it's a little bit of a catchy title, so I have to explain what is meant here. Generally, when we talk about RPKI validation, we have the relying party software, also referred to as validators, which in a nutshell downloads a big bunch of cryptographically signed objects from a usually quite long list of repositories (nowadays I believe it's something like 90 to 100 of them), validates this whole hierarchy of objects, extracts the payloads, and those end up in routers in some way: directly through the RTR protocol, via some intermediate software, or something like this. This process repeats over time.
So, when we talk about latency, we mean basically two things here. The first one refers to the paper, a very interesting one to read, actually, and the literal quote from it is mentioned here: the validators introduce the biggest delay in the propagation of ROAs from repositories to the routers. The assumption here is that there is some kind of operational incentive to reduce this time. One can, in principle, say that it's not actually that important, but I would assume there are situations where that would actually be a necessary thing to have.
The other aspect of latency is, essentially, that every validator should ideally keep working fine with repositories that are well behaved, regardless of what the others are doing. There have been at least two academic papers that I'm aware of that construct specially made-up repositories to crash validators, to block them and make them halt. The things I am going to talk about are implemented in the RPKI validator that I have been maintaining for a while; it is sort of a pet project, and it has turned out to be a fully-fledged validator nowadays.
The very basic thing about repositories (and we have to talk about repositories in all this), and I am pretty sure this is implemented more or less to some extent in every validator out there, is that every fetch of every repository should probably be a separate operating system process, constrained in time and in memory. We also assume that every implementation will run multiple fetches in parallel, because fetching 100 repositories one after another is going to be extremely slow. The basic idea is that we should always make progress, so whatever happens, there should be some mechanism that does not allow one repository to completely block the whole thing.
And so, what do we do with the repositories that are timing out, that are slow, that are crashing, blocking, or whatever the hell they are doing?
Fundamentally, there are two approaches here: fetching things synchronously with validation, that is, while validating the RPKI tree, or fetching them asynchronously in some sort of separate background process. In the first case you get the best latency, because you are validating the stuff you just got from a repository. In the second case you avoid being blocked by a timing-out repository. The idea here is essentially to combine these two things and, ideally, get the best of both worlds.
So, we consider a repository suitable for a synchronised fetch if we see it for the first time (we are going to try to trust it in general), or if the repository is well behaved: it returns the content within some relatively small time, lower than some threshold, and there is no fallback between fetch methods; all the things we don't want.
So, if a repository doesn't fit these criteria, it becomes asynchronous, and there's a separate, slow, sad process that will just try it in the background with much larger time-outs and attempt to get at least anything from it.
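A minimal sketch of this classification rule, assuming a simple per-repository history of fetch attempts (this is not rpki-prover's actual code; the threshold and window are assumed values):

```python
"""Sketch of classifying repositories as synchronous or asynchronous fetch candidates."""
SYNC_THRESHOLD = 5.0   # seconds; assumed value, tune to taste

def classify(history):
    """history: recent fetch attempts, e.g. [{"seconds": 1.2, "ok": True}, ...]."""
    if not history:
        return "sync"              # never seen before: trust it for now
    recent = history[-3:]
    if all(h["ok"] and h["seconds"] <= SYNC_THRESHOLD for h in recent):
        return "sync"              # well behaved: fetch during validation
    return "async"                 # slow, timing out or failing: background it

print(classify(None))                                 # sync
print(classify([{"seconds": 42.0, "ok": False}]))     # async
```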
It turns out that's a very simple idea, but it gives you the biggest gain in terms of the time of one validation, one traversal of the tree. In this case we have five maximum parallel fetches and seven trust anchors, the five RIRs plus the two AS0 trust anchors from LACNIC and APNIC, and as you can see, this technique lets you get to very quick validations, like 10 to 15 seconds or so, and it's relatively stable: all this reclassification doesn't happen much after the initial settling.
The instance that doesn't use asynchronous fetching basically has to wade through all the repositories that are out there. The test was made probably a couple of weeks ago, so that's the actual situation in the current ecosystem at the moment. The same test was also run with a ridiculously low time-out of 30 seconds, intentionally creating a situation where a lot of repositories look slow. It becomes harder and there's a lot of reclassification going on, but still the situation is more or less the same: with this technique it stabilises to very quick iterations, and if you don't use it, it becomes pretty bad. The time-out alone is not enough to actually get through these things.
The other thing that helps to avoid bumps on the graph I just showed is that the time-outs are set based on how quickly the repository actually delivers the data. It turns out that the majority of repositories are very fast: more than 90% of them will give you all the XML within two or three seconds or so, and if it takes, let's say, 30, probably something is wrong there, so you don't have to wait for more; just put it into the asynchronous bunch and come back to it when it gets back to normal.
Okay. So that was the robustness part of it, how you avoid halting; at least some mitigation can be introduced with this whole synchronous/asynchronous fetching.
So what can we do about the delays introduced by software and all that?
Well, the obvious answer is: revalidate things more often, just iterate quicker and more frequently. But that's a bad answer, because it's expensive in terms of resource usage (it takes actually quite a lot of CPU), and the other thing is, you are going to be thrashing the repositories with requests all the time. So, can we do something about that?
We can, it turns out. For the CPU part, I am not going to go into details too much, but the general idea is that it is possible to fully validate, meaning with all the RFC checks, all the signature checks, resource checks, extensions, whatnot, only the objects that are new: if we haven't seen an object, we do the full validation for it; if we have seen the object before, we do some minimal checks on it. You have to be quite smart about which RPKI manifests to validate and which not, and there's a lot of fiddling happening there, but in general the complexity of the whole algorithm goes from full validation of all objects to full validation of only the new ones, plus some very short validation, something like checking the validity period and probably revocation, for all of the objects.
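The shape of that idea, in a heavily simplified sketch (this is not rpki-prover's implementation; the cache, the checks and the callbacks are stand-ins): cache the hashes of objects that already passed full validation, apply only the cheap checks to everything, and reserve full cryptographic validation for objects never seen before.

```python
"""Sketch of "validate fully only what is new"; illustrative, not rpki-prover's code."""
import hashlib
import time

full_validated = set()   # hashes of objects that already passed full validation

def validate(obj_bytes, not_before, not_after, revoked_hashes, full_check, quick_check):
    digest = hashlib.sha256(obj_bytes).hexdigest()
    now = time.time()
    if not (not_before <= now <= not_after) or digest in revoked_hashes:
        return False                     # cheap checks still apply to every object
    if digest in full_validated:
        return quick_check(obj_bytes)    # seen before: minimal checks only
    if full_check(obj_bytes):            # new object: full RFC-level validation
        full_validated.add(digest)
        return True
    return False
```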
So what does it give us? For the implementation that is currently in rpki-prover, it gives you about 9 to 10 times less CPU spent on every validation. It doesn't have to be exactly that; it can probably be significantly less or significantly more, I don't know, it depends on the implementation.
The obvious disadvantage, of course, is that it's a lot of complexity: in this case one and a half thousand lines of code, and it's pretty hard-core algorithmic code, so it's a lot of testing, some extra potential for bugs and all that sort of stuff. Also, validation reconsidered, all this resource checking, is not supported at the moment, because that makes it even more complex. It's possible in principle, but it's just some extra complexity.
Okay. So, the second item was: can we actually make fewer RRDP requests? If we can do these frequent validations, which we now can, how can we stop thrashing the repositories?
The basic idea here is that you want to be up to date with the repository, to keep latency low, but not more than that: basically, you want to have every delta, but you don't want to fetch more often than necessary. And you can do that by adapting to how much data you actually get from the repository: if you get five deltas from it, it means you probably skipped some of them, so you should fetch from it more frequently, and the interval between fetches should be reduced.
On the other hand, if there are no deltas in the updates, it means you are probably going there too often, so don't do that. And yeah, there are some obvious lower and upper bounds, in this case one and ten minutes, which sounds reasonable; it doesn't have to be that, it can be configured, of course. The result of that, for the same test with seven trust anchors that I already mentioned, is that you can save about 40% of all RRDP fetches. But that, of course, is not a net gain; it's not like the previous item where you just get things better and faster. It's a little bit of a sacrifice, because you are giving up some latency for infrequently updated repositories, but on average the whole thing stays about the same with fewer requests; you just make fewer of the requests that are not likely to return any updates.
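A minimal sketch of that adaptive interval rule (the constants and scaling factors are assumptions, not values from the talk): many deltas applied in one fetch means we are polling too rarely, zero deltas means we are polling too often, and the interval is clamped between the lower and upper bounds.

```python
"""Sketch of an adaptive RRDP fetch interval; constants are illustrative."""
MIN_INTERVAL = 60       # one minute
MAX_INTERVAL = 600      # ten minutes

def next_interval(current, deltas_applied):
    if deltas_applied > 1:
        current = current // 2          # we skipped updates: poll more often
    elif deltas_applied == 0:
        current = int(current * 1.5)    # nothing new: back off
    return max(MIN_INTERVAL, min(MAX_INTERVAL, current))

interval = 300
for deltas in (0, 0, 5, 1, 0):
    interval = next_interval(interval, deltas)
    print(interval)   # drifts up with no deltas, drops after a burst
```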
Yeah, there's an interesting observation from this. It turns out that the majority of the repositories nowadays actually converge to this upper bound, ten minutes in this case, and just a few of them converge to one minute. It also depends on the time of the day: during the day there are much more frequent updates, so there will be more fetches. Only a few repositories settle somewhere in between, but again, this whole picture changes with the time of the day.
Yeah. So, the conclusion is this:
It's mainly useful for people implementing relying party software, and partially for the users, because you get to know what kind of problems people have and try to solve them. It turns out there are some relatively low-hanging fruits to address the problems with validators that come from all these academic papers. The other thing is that it's possible to implement a much cheaper validation, which is also future-proof: we are expecting to have more and more objects out there, we expect ASPA to be deployed, prefix lists, whatever else comes, so having some sort of algorithmic solution whose complexity is not directly proportional to the amount of data is a pretty good idea. It also turns out that about a third of the repositories out there do not support ETag, If-Modified-Since and similar headers; that's a shame and should be fixed, because it's a very easy way to reduce the traffic.
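For illustration, here is what a cheap conditional re-fetch of an RRDP notification file looks like with those standard HTTP headers: the client sends back the ETag and Last-Modified values from the previous response, and a repository that supports them can answer 304 Not Modified with no body. The URL is a placeholder.

```python
"""Sketch of a conditional fetch of an RRDP notification file using standard HTTP headers."""
import urllib.request
from urllib.error import HTTPError

def fetch_if_changed(url, etag=None, last_modified=None):
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = resp.read()
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified   # nothing changed, nothing downloaded
        raise

# body, etag, lm = fetch_if_changed("https://rrdp.example.net/notification.xml")
```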
So, all these things are implemented in the latest rpki-prover releases; it would be nice for people to try it and give some feedback.
Yeah, and in the slides there are references to the papers that I mentioned; it's quite interesting reading. And yeah, questions? Comments?
AUDIENCE SPEAKER: The RPKI Mafia is moving towards the mic.
Tim: RIPE NCC but in this context I will say I am a recovering RPKI developer.
So, first of all, thank you for doing this; I think it's really useful to get data on this and see how things can be improved in terms of performance by doing things differently, because I think blocking repositories are an issue that needs to be addressed, and the CPU reduction is interesting to see.
One question that came to mind: you mentioned that for some repositories you fetch in a synchronous fashion and for others asynchronously. It makes me wonder: if you can do asynchronous, why not just do it for everything?
MIKHAIL PUZANOV: The same thing: latency. If you download everything in the background, it means that, basically, from a cold start you would have to wait for a few iterations of these fetches before you get down to the bottom of the tree, and the same goes for every change in publication points or anything. So yeah, if we care about being as up to date as possible, you still want to stay synchronised.
Tim: Yeah. I think that's especially true in the first run, though.
MIKHAIL PUZANOV: Yeah. Practically speaking, yes.
Tim: All right.
JOB SNIJDERS: rpki-client project. I don't think many people tell you this often enough, but I think you are doing super cool work with rpki-prover and I am very happy it exists. So thank you for that.
I have a small thing to add. What you mentioned about not having to validate the entire tree all the time is also how rpki-client does it, by virtue of what OpenSSL calls partial chains: in the context that you create when you validate a given object, you mark certain things as trusted or untrusted, and this avoids having to go all the way up to the Trust Anchor every time for every ROA. So I think that is analogous to what you do.
MIKHAIL PUZANOV: Yes.
JOB SNIJDERS: To me, I think that means you are on the right track in having implemented this.
MIKHAIL PUZANOV: It's a bit more RPKI-specific, in the sense that you look at the actual manifests and try to create a sort of diff between what was there before and what is there now, and revalidate only the objects that are different, which in the majority of cases is almost nothing, just a manifest update or something. But, yeah, probably...
JOB SNIJDERS: Thank you.
AUDIENCE SPEAKER: Ben Maddison. Just to echo what Job said, I really like that we live in a world where there's a Haskell implementation of an RP, so thank you for doing that.
I think that was a really interesting presentation, and it's really welcome that people are talking more in public about the finer details of how this fairly complex system actually works under the hood and what we can do to make it better. So thank you for taking the time to put this together.
A couple of pieces of feedback. The first is: if you do a version of this presentation again, I'm not sure that "synchronous" and "asynchronous" is the best choice of language to describe the strategies, and I suspect this is where Tim's question came from. I understood the synchronous strategy as fetching alongside validation, so you are making validation progress as you are receiving data from RRDP or rsync, rather than blocking until you hit repository boundaries; you are not waiting around for a complete repository each time before you make any progress in the validator, or possibly blocking on the whole tree. Is that a fair characterisation? Because then it's clear to me why that would be faster, if that's correct.
MIKHAIL PUZANOV: It blocks on every publication point that it hasn't seen before, or has seen but it's time to refetch it. That's the general idea while going down the tree. Asynchronous means: I know about this repository and I am going to fetch it in the background.
AUDIENCE SPEAKER: And then start reading it?
MIKHAIL PUZANOV: Yes.
AUDIENCE SPEAKER: That was my understanding. So I think the way you explained it was clear, but as I say, it's not obvious to me what choice of terms would be better; synchronous and asynchronous is just not my intuitive understanding of those words.
The stuff about dynamically adjusting the refetch frequency: it's not obvious to me that historical publication frequency should be a natural predictor of future publication frequency. For me, the solution to that part of the puzzle exists in a slightly different part of the ecosystem: I think what we need to be doing is making it super cheap to contact a publication point and determine that there's no new data, so that we can do it essentially as often as we want and it's cheap both for the RP and for the publication point. I wonder if you have got any suggestions as to how that could be improved? Because the notification file can be a bit chunky.
MIKHAIL PUZANOV: Yes, as I mentioned, ETag, If-Modified-Since and such headers are not fully supported everywhere yet; the biggest repositories do support them, but not all of them. And yeah, you are right, the assumption here is that there is some kind of intrinsic update pattern to a repository, that they are not completely random, that there is some kind of background process running and updating them. If that assumption is wrong, it becomes not very useful.
AUDIENCE SPEAKER: Maybe a solution that doesn't rely specifically on HTTP is what we need. I think ETags probably solve it, but they solve it quite narrowly; still, that's a place to do some work. Thank you.
(Applause)
BEN CARTRIGHT‑COX: Paul is going to go through our Chair replacement procedure. Paul, take it away.
PAUL HOOGSTEDER: Yeah. At our last RIPE meeting, RIPE 87, we asked for your thoughts about the mechanism for Chair selection and replacement. We did hear strong consensus about certain things, like having a democratic way for the Working Group, not the current Chairs, to select new Chairs, and welcoming new Chairs into the Working Group. There were other subjects, like term limits, where we found no consensus at all, with opinions ranging from leaving everything as it is up to and including strict one-year term limits.
Now, at our next Working Group meeting at RIPE 89 in Prague, we will have a Chair selection and I will step down when one or more new Chairs have been selected.
Based upon your input we have made a proposal, and of course this policy, if accepted by you, can be fine-tuned or completely changed at a later moment in time.
The proposal:
A call for interested parties is made on the Working Group mailing list at least every two years, or whenever needed. If all three Chair positions are filled at that moment, the longest-sitting Chair will offer their seat for selection, but can try to be selected again if they wish to do so.
This does not apply if one of the other Chairs voluntarily makes their seat available. Again, they have the chance of trying to be reselected as well.
Interested parties have two weeks to make their interest known directly to the Chairs by e-mail, not on the mailing list, because that might influence other candidates considering whether to run.
After these two weeks, the Chair or Chairs ensure that all candidates are announced on the mailing list and issue a call for discussion.
At the next Routing Working Group session at a RIPE meeting, selection among the candidates will be done using a Meetecho poll or a similar process, where both in-room and remote participants of the RIPE meeting can voice their opinion.
That's it. Can I get a show of hands if you think that's a good idea? And can I have a show of hands from people who think this is a bad idea? Okay. Then that's what we are going to do. We will publish this on the mailing list and put the new procedure on the RIPE website. Thank you.
(Applause)
I think that brings us to the end of our meeting.
BEN CARTRIGHT-COX: It does. I would like to say thank you, Paul, for your years of service on the Routing Working Group; you are going to the Connect Working Group, which shares many similar topics, but it's been great having you here, and I look forward to running the selection on the mailing list. So we will announce it...
PAUL HOOGSTEDER: We are already in contact with the RIPE Chair team, and I will step down when the selection has been done, so I can help you with the process, though not with the day-to-day running of the Working Group.
BEN CARTRIGHT‑COX: Thank you, everybody. I think this means that the Routing Working Group is closed, yes.
PAUL HOOGSTEDER: Yes.
BEN CARTRIGHT‑COX: I am doing it right, great.
PAUL HOOGSTEDER: See you in Prague.
BEN CARTRIGHT‑COX: This stuff is very complicated, you have no idea.
LIVE CAPTIONING BY AOIFE DOWNES, RPR
DUBLIN, IRELAND