Plenary session
21 May 2024
9 a.m.

WOLFGANG TREMMEL: Good morning. Find your seats. We are nearly ready to start. I am Wolfgang and I am chairing this session, together with Doris. Please keep in mind the session is recorded and the Code of Conduct applies.

Our first speakers are Carl Montanari from Nokia and Simon Peccand from OVHcloud.

TORE ANDERSON: They are talking about simulating networks at scale with Clabernetes at OVHcloud. The stage is yours.

SIMON PECCAND: I am very happy to talk to you about how we have been simulating networks over the past few months with Clabernetes. But first, I'll hand the stage to Carl.

CARL MONTANARI: I created Clabernetes. Really quickly, I'll go over our straightforward agenda here. I am going to talk about Clabernetes: where it came from, why it came to be, what it is, a little bit about how it works, and then I'll turn it back to Simon to talk about using this thing in the real world. And hopefully we'll save some time for Q&A.

To start with the Clabernetes story, we have to start at the beginning, with Containerlab. I know it's early, but maybe I could get a show of hands: who is familiar with Containerlab, maybe used it a bit? All right, that's great.
Containerlab, for anybody who isn't familiar, is a great tool. I am going to go through it quickly; definitely check it out after the talk. Containerlab, in a simplified nutshell, is about building declarative, shareable, repeatable network topologies, and as you can probably guess from the name, it relies heavily on containerised network operating systems, something like Nokia SR Linux.

But we can also support virtual machines, like your XRv 9000, all of that kind of thing, using Romain's work, vrnetlab. Containerlab is free open source software, it's got a very batteries-included, user-friendly mentality to it, and in general, it's great.

Once you have Containerlab and you have your containerised network operating systems, either natively containerised or packaged up, you are now able to build network topologies. You define these topologies in a very straightforward YAML file, nothing crazy here: it defines the nodes and links, and then you use Containerlab to spin this up in Docker on your local machine, or on a virtual machine somewhere, obviously with Docker running. This is really great, and this config file is what makes Containerlab repeatable and shareable. We build the topologies in Docker, Containerlab handles all of that. If I have a topology, I can share it with Simon and we can have the same experience at the end of the day. This is very useful.
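For readers following along, a minimal Containerlab topology file of the kind described here looks roughly like this (the node kinds, image names and interface names are illustrative, not from the talk):

```yaml
# demo.clab.yml - a two-node Nokia SR Linux topology (illustrative)
name: demo
topology:
  nodes:
    srl1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux
    srl2:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux
  links:
    # one point-to-point link between the first ethernet port of each node
    - endpoints: ["srl1:e1-1", "srl2:e1-1"]
```

You would then spin it up on a machine running Docker with something like `containerlab deploy -t demo.clab.yml`.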

But. There can be some problems with this. What happens if you want to emulate a very large topology, or you have some very resource-hungry operating system that takes a bunch of memory to boot? This is obviously problematic, right, because if you are on your laptop you are stuck with your laptop, and if you are on a Cloud VM you can pay for more, but that becomes costly and you don't want to do that. So what do we do? Containerlab has hooks for multi-node topologies, which is a really useful feature. Basically, you separate your Containerlab topology and you use VXLAN to plumb tunnels between these sub-topologies that you split up and spread out across multiple compute nodes. This works. It's a bit error-prone. It's tedious. You should definitely automate it, but obviously that's its own kind of challenge. This leads us to the question: what if there was some kind of magical compute system that could scale out, and, you know, etc.?
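As a sketch of the multi-node approach described above: you split the topology into per-host files and stitch the inter-host links together with VXLAN. Containerlab ships a helper for this; the commands and flags below are approximate and worth checking against the current Containerlab docs:

```shell
# On host A: deploy the first half of the topology
containerlab deploy -t site-a.clab.yml

# On host B: deploy the second half
containerlab deploy -t site-b.clab.yml

# On each host: plumb a VXLAN tunnel so a link spans both machines
containerlab tools vxlan create --remote <other-host-ip> --id 100 \
  --link <local-node-interface>
```

Doing this by hand for every cross-host link is exactly the tedious, error-prone part the speaker mentions.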

And Kubernetes enters. Maybe another show of hands: who is familiar with Kubernetes? Kubernetes is obviously a very popular, very big project. In a nutshell, it is a container orchestration platform; we are talking about containers and Containerlab, so Kubernetes is a natural segue here. Its main purpose in life is intelligently scheduling these workloads, which is handy. And it's very extensible in a lot of ways, but the two extension points that we care about in particular are custom resource definitions and controllers. These are kind of silly words, but a custom resource definition means we can write an OpenAPI spec for a resource, load it up into Kubernetes, and now the API server supports operations on that resource that we defined in the spec. The controller, kind of an overloaded term as well, is basically just some code that's running and that is usually interacting with these resources that you create.

So, why Kubernetes, generally and also for Containerlab? Well, generally, it's obviously very popular; everybody raised their hands saying we are at least a little familiar with it. It's prevalent, it's everywhere, you can run it on your laptop, whatever. Even if you are not running it in your organisation, you are probably a click away from running it, because you are in some Cloud that has a managed Kubernetes offering. You are very close to it even if you are not already doing it. On the previous slide we talked about having these containers, maybe a ton of containers: where do we put them? That's literally Kubernetes' job. That's what it's all about. We also want some cool points. I don't know if that train is still rolling, but if we can get some cool points from it we'll take it; we didn't want to go full cool points. The last thing about Kubernetes specifically is that once you have this Kubernetes API, you can deploy against that API wherever it lives: on premise, hosted, self-managed, whatever. This is the same experience that we have with Containerlab and Docker, where if I give Simon the topology file, he can run the same topology and have the same experience that I have in Docker on my machine. Kubernetes is basically extending that. Obviously this is relevant to Containerlab and this story. So, if you hadn't already put this together: we had Containerlab, or clab, and Kubernetes, and now we have Clabernetes, because I am terrible at naming things. I also named scrapli, for "scrape CLI", so clearly this is a bad trend that I should fix, but here we are.

So, Clabernetes is basically exactly what you would expect from listening to me so far. About a year ago, I was talking to Romain, and I said: why don't we just run this in Kubernetes? It seems like a good merge. And he said: yeah, go ahead and build the thing. Sounds good. So I started doing exactly that, and here we are. The main goal of Clabernetes is the Containerlab that we know and love, big thanks to Romain and Karim and Wim and Markus and all the contributors to Containerlab, and we want to scale it out. That's the primary goal of Clabernetes.

We also want to be able to install it in any Kubernetes cluster. There is a big asterisk here, because of course access controls, security, blah blah blah can make this a challenge, but in general, if you have a vanilla Kubernetes cluster you should be able to install and run Clabernetes.

Keeping with the batteries-included, friendly user experience of Containerlab, we want to bring that to Clabernetes, so we tried to make it so that, on the happy path, network folk like ourselves don't have to be Kubernetes experts. If you get into troubleshooting you might have to open Pandora's box, but in general you should be able to get a basic understanding of Kubernetes and get your network topologies modelled, running and connected. And the last point is really the same as the first: this should just work with Containerlab. I love Containerlab, and hopefully all of you do as well. We just want to run it in Kubernetes. That's the whole point: no mess, no fuss, put it into Kubernetes.

Then how do we go about doing this? The design idea is: be dumb smartly, or be smart dumbly, or some combination thereof. Keep it simple, stupid, if you want. Just try not to be clever. We love Containerlab; this is about making Containerlab scale out, so let's use Containerlab wherever we can. Let's be simple and not overthink this.

Then of course we want to do standard Kubernetes things. A lot of this we don't necessarily want to expose to the networking folk, who don't care, but if you have a Kubernetes admin team, we want it to look and feel like a normal thing they would install on a Kubernetes cluster. What that means is it's a simple controller, very standard stuff. It's written in Go and deployed as a Helm chart. We have a few simple CRDs, this is part of that extensibility thing we were talking about; there are only two of them and you only have to care about one. So that's simple. And we have no additional CNI or cluster requirements; it should install on any cluster.
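A minimal install along the lines described might look like this (the chart location shown is the project's OCI registry as the editor understands it; verify against the Clabernetes documentation):

```shell
# Install the Clabernetes controller into its own namespace via Helm
helm upgrade --install clabernetes \
  oci://ghcr.io/srl-labs/clabernetes/clabernetes \
  --namespace c9s --create-namespace
```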

Here on the screen we have a topology definition, so this is that custom resource I was talking about. Inside of it there is some normal Kubernetes boilerplate, familiar stuff if you know Kubernetes. Inside of that, you'll see spec, definition, containerlab. This is a totally bog-standard Containerlab YAML definition, nothing crazy. If you can't be bothered to do the Kubernetes part, there is another badly named tool, Clabverter, that can convert your Containerlab topology into this Kubernetes topology CR. Once you have this you can apply it to your cluster, and the controller, just some Go code, then reconciles it; "reconcile" is just another silly Kubernetes word meaning we run some function when things happen to this resource. That controller takes the topology definition, chunks it up into sub-topologies, each of which is a node from your Containerlab definition, makes Kubernetes deployments out of them, and connects them with VXLAN, and now you have a running Containerlab topology. If you have a load balancer in your cluster, which is really common, it will also expose those nodes with a load balancer service, so you can SSH, or whatever, to your devices just like normal. The last point here is that we try to be sane about updates to these definitions, because obviously these containers can take a long time to boot and use a lot of resources, so churn is expensive. We try to intelligently manage what things we update and when, so we don't nuke and redeploy the whole thing every time you make a single update to the topology. We try to be good stewards of our resources. That's all I have time to talk about.
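Such a topology custom resource, with the Containerlab YAML embedded under spec/definition/containerlab, looks roughly like this (the API group/version and node details are the editor's assumptions; check the CRD shipped with your Clabernetes version):

```yaml
apiVersion: clabernetes.containerlab.dev/v1alpha1   # assumed group/version
kind: Topology
metadata:
  name: demo
spec:
  definition:
    # the value here is a plain Containerlab topology, verbatim
    containerlab: |-
      name: demo
      topology:
        nodes:
          srl1:
            kind: nokia_srlinux
            image: ghcr.io/nokia/srlinux
          srl2:
            kind: nokia_srlinux
            image: ghcr.io/nokia/srlinux
        links:
          - endpoints: ["srl1:e1-1", "srl2:e1-1"]
```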

Simon is going to tell you about using this in the real world.

SIMON PECCAND: Thank you Carl. So, let me tell you the back story of how we came to use Clabernetes in the first place. In September of last year, I set up a meeting with my team to talk about all the struggles we had with virtual labbing: we were having scale issues and a lack of programmability in our labs, basically everything Carl just described. We ended the call without a really elegant solution to this, and then, two hours later actually, I read on Twitter that Romain had just announced the release of Clabernetes. It was a bit of a no-brainer for us that we had to reach out to them and start working together on this.

So, first, a little disclaimer before I start. I will be using some Cloud products in this presentation, and of course I will be using OVHcloud. You will be fine running any Cloud provider of your liking, as long as they offer the same services.

So, to create a Kubernetes cluster at the Cloud provider, it's simple. You select the location of your cluster in terms of data centre and region, you select the beefiness of your nodes in terms of CPU and RAM and the node pool size, which you can tune later if you need more or fewer resources, and a few minutes later your cluster will be up and running.

And then it's ready to install Clabernetes.

You don't need to know Kubernetes in much detail, but you will need some tools on your local machine, starting with Helm, which is the package manager for Kubernetes, and kubectl, which will be your local client for the cluster API. You give kubectl a config which contains all the credentials and the location of your cluster, so you can communicate with it easily.

Then, Clabernetes is available on GitHub, and Helm is able to fetch the latest version of the deployment directly.
So you just go with the GitHub URL. In my case I also passed some values in a YAML file, which is used to specify resources for each network OS vendor; for example, Cisco is going to require this much RAM and CPU, and so on. So that when Clabernetes does its scheduling of the nodes, it will do it in an effective manner and you don't run out of resources.
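The per-vendor resource values he describes could be expressed in a Helm values file along these lines (the key names here are a hypothetical sketch, not the chart's verified schema):

```yaml
# values.yaml - reserve sane resources per network OS kind (hypothetical keys)
globalConfig:
  deployment:
    resourcesByContainerlabKind:
      cisco_xrd:
        default:
          requests:
            cpu: "2"
            memory: 6Gi
      cisco_n9kv:
        default:
          requests:
            cpu: "4"
            memory: 10Gi
```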

Then, a few minutes later, you have your Clabernetes manager running on your cluster, ready to host topologies from Containerlab.

Another very important component that I installed is a registry, which is used to store all your images from network vendors, in any version or flavour that you want. It's especially handy if you have a large number of topologies across your team: you can just reference the images from a central place in the topologies. It's very easy to spin up; you basically just set the name and the storage requirements and you can get going pushing your images to it.

We use infrastructure as code to generate the configuration of the routers, which is based on Jinja. I don't know if any of you in the room are familiar with Jinja, but Jinja is basically placeholders all over the place that you can replace with data from JSON and YAML and so on. This is all stored in Git so we have review management, which is very handy day to day, and we have a CI/CD component, which is open source, which will watch this Git repository for any changes and will compute the configuration automatically. There are two main questions that are not resolved by this solution. The first one is: is my syntax actually valid, is it going to commit on the router? And I also don't know if I'm breaking anything on the control plane or data plane or whatever.
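To make the templating idea concrete, here is a toy stand-in for that step, using only the Python standard library to mimic Jinja's `{{ variable }}` placeholder syntax (a real pipeline would use the jinja2 package; the template text and variable names are invented for illustration):

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace Jinja-style {{ name }} placeholders with values from a dict."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return str(variables[key])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# A tiny router-config template with placeholders, as it might live in Git.
template = (
    "hostname {{ hostname }}\n"
    "interface Loopback0\n"
    " ipv4 address {{ loopback_v4 }} 255.255.255.255\n"
)

config = render(template, {"hostname": "par-r1", "loopback_v4": "192.0.2.1"})
print(config)
```

The CI/CD component then does this for every router, with the data coming from JSON or YAML files in the same repository.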

So, at the time, we had to go take the configuration, push it to the router, check manually if everything was okay, and then merge it into the production branch.

So, the first topology I wanted to simulate with Clabernetes and Containerlab was the Paris area. It's composed of all the layers that we want to simulate, starting with an AZ and the region and the network of the region, connecting to the backbone, and finally, at the top, we simulate some edge networks and peers. This is mostly based on Cisco IOS XRd, the containerised image supplied by Cisco. We also have some Nexus VMs, as it's not available as a container. Also FRR, which is open source and allows us to inject other routes, from route dumps, that we want from the peers. And finally, at every endpoint, I implemented some Ubuntu probes that I will show you later.

So here is an extract from the configuration. As I said, I'm now able to reference my registry directly in the image field of each node, as well as the startup configuration, and define the links. I had to do some tuning on my templating, because for instance the naming scheme of the virtual nodes is not the one you will find on the real routers, so I had to fine-tune this, as well as disable some features that may not be available, such as MACsec. Otherwise, the control plane at least is fine.
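An extract of that kind, with the private registry referenced in the image field and a generated startup config per node, might look like this (the registry host, image tags and interface names are invented):

```yaml
topology:
  nodes:
    r1:
      kind: cisco_xrd
      # image pulled from the team's central registry
      image: registry.example.net/cisco/xrd-control-plane:7.11.1
      # generated configuration, produced by the CI templating step
      startup-config: configs/r1.cfg
    r2:
      kind: cisco_xrd
      image: registry.example.net/cisco/xrd-control-plane:7.11.1
      startup-config: configs/r2.cfg
  links:
    - endpoints: ["r1:Gi0-0-0-0", "r2:Gi0-0-0-0"]
```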

And it's now time to deploy. We also put topologies in Git so we can share them across the team easily. You will need one last binary that Carl talked about, Clabverter: it will take the Containerlab YAML topology, and then you can pass the output to kubectl and it will deploy it on the cluster. Here is the CLI. It will actually automatically find the YAML topology in the current directory, and a few minutes later you can see your nodes spinning up. For reference, IOS XRd takes about three to four minutes to become effective. The Nexus is more like ten to fifteen minutes, because it's a VM; it takes up more resources and takes longer to boot.
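The deploy step he shows amounts to piping Clabverter's output into kubectl, roughly like this (flags are approximate; see the Clabernetes docs, and the namespace name here is a placeholder):

```shell
# Convert the Containerlab topology in the current directory into a
# Clabernetes Topology custom resource and apply it to the cluster
clabverter --stdout | kubectl apply -f -

# Watch the per-node deployments come up
kubectl get pods -n <topology-namespace> -w
```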

I don't yet know if my design is working. Is the configuration committed? Am I breaking anything? The first thing you want to do is SSH into every router like a real one. And the good news is that Containerlab makes this especially easy, because it automatically provisions the management interfaces, along with the DNS entries and so on. And Clabernetes embeds a binary that will automatically find the first container in your pod and SSH to it with default credentials.

And then, once you log in, it's like a real router, mostly, so you can give it pretty much any command that you wish.

That's fine. But to be able to tell at a glance whether the whole design is working and the whole topology is correct, I needed something more concise. So I installed a log collector called Fluent Bit, which is deployed on the cluster, and the nice thing with this is that it will automatically parse JSON output on the nodes. So if I put a loop on the probe that pings all the destinations that are supposed to be reachable, then I can format the output as JSON, and Fluent Bit will take it and send all the nicely structured fields to my log platform. And then, in this log platform, I can create dynamic tables showing every probe and its reachability.
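As a sketch of that probe output step: each Ubuntu probe loops over its target list and prints one JSON object per ping result, which Fluent Bit can then parse field by field (the target addresses and field names here are invented for illustration, not taken from the talk):

```python
import json
from typing import Optional

def format_probe_result(target: str, reachable: bool,
                        rtt_ms: Optional[float]) -> str:
    """Serialise one ping result as a single JSON line for Fluent Bit to parse."""
    return json.dumps({
        "target": target,
        "reachable": reachable,
        "rtt_ms": rtt_ms,
    })

# Example: pretend we pinged two destinations the design says must be reachable.
results = [
    format_probe_result("192.0.2.1", True, 1.4),
    format_probe_result("198.51.100.7", False, None),
]
for line in results:
    print(line)  # Fluent Bit tails this stream and forwards structured records
```

From there, the log platform can aggregate the `target`/`reachable` fields into the per-probe reachability tables he describes.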

Feedback: this has been a game changer for us in our day-to-day workflow, because we can now run as many labs as we want, as many topologies, even if they are very large. We don't have any technical issues with regard to resources. Clabernetes combined with a Cloud-managed Kubernetes cluster makes it virtually painless; it's very easy to get started with. And we would like more container-native network OSes: the Nexus from Cisco is the last one we'd like to have, and there has been a push from the vendors in the last couple of years towards more container-native OSes.
We'd like to simulate basically the whole backbone and the regions. As we can simulate every layer in Paris, we can do it for every region that we want, and for any part of the network, before we make any changes. Also, what we are going to do next is CI/CD lab validation: the CI/CD will spin up dynamic labs on our cluster with Clabernetes, install all the configuration changes that we are planning automatically, without us doing anything, and give us the results of end-to-end testing and syntax validation. If the result is okay, then we can merge into production.

So, if you want to get started with Containerlab and Clabernetes, you can head to containerlab.dev, or scan the QR code; the documentation is very nice and very clear. And there is also a chat channel if you have any more questions or technical difficulties; you can head there to ask.

That's it, that's the end of the presentation. We also have shiny stickers, so if you want one, you can go to Romain, who is here.

Thank you.


WOLFGANG TREMMEL: Thank you very much. I see people running to the microphone.

AUDIENCE SPEAKER: Peter Hessler. This looks very cool. It's very nice to see this. I was wondering, in your testing and development, if you are able to see when a change would trigger a link flap, like changing an MTU on an interface, and if you can see that in your testing?

SIMON PECCAND: MTU change will affect things like BGP obviously, especially when you push a lot of it. So this will be visible.

AUDIENCE SPEAKER: Rinse Kloek, speaking for myself. Very good presentation, very interesting. Do you also plan to simulate, for example, not your own network but your external peers, like your eBGP peers? Do you also plan to simulate those in your lab, or is it only configuration changes you are simulating?

SIMON PECCAND: Yeah, so in the Paris area topology that I showed, at the very top it was Cogent, but it could be any tier 1 network that you wish. With FRR you can inject the DFZ, so you can simulate any peer that you want and have a probe behind it.

AUDIENCE SPEAKER: It's manually created, so you manually created those extra VMs, like FRR, to simulate your neighbours?



AUDIENCE SPEAKER: Maria, BIRD. I'd like to ask what the system requirements are, let's say divided by the number of nodes, because what we ran into is that when we started simulating more than 1,000 nodes, we ran out of disk space and we ran out of memory. So my question is basically: how does it scale this way?

CARL MONTANARI: I can try to answer that. I think it would probably depend on your node pool in whatever Kubernetes cluster you are running. You could have node pools with nodes of a different profile to match whatever your estimated workload is. I don't know if that's a direct answer, because obviously it would be hard to answer. But I think the short answer is that we could probably scale it really, really big with a sufficient number of nodes in the node pool, and with sanely sized nodes to make sure that you have a sane number of containers per node. That's probably not the answer you are looking for, but hopefully it's at least useful.

AUDIENCE SPEAKER: I'm not going to run this on my laptop, okay. Thank you.

AUDIENCE SPEAKER: David. We have been running Containerlab for a while, and we have experienced some interesting problems with the big A and big J vendors, where things would work on the hardware but not in the lab, and vice versa. Do you have any plans for getting rid of the hardware lab and going Containerlab a hundred percent, or will you still use a dual topology?

CARL MONTANARI: I don't have a lab at the moment, so I run Clabernetes at my house; I have no problem with this. I think in general there are always going to be things you have to test in hardware, as I think Simon said during the presentation, things that are obviously not going to run in software. It's going to be tailored to your individual use case. If it really matters that I test MACsec, well, then you probably need hardware, I can't help you there. But as vendors ship better software implementations, hopefully we can paper over at least some of those problems.

AUDIENCE SPEAKER: I also have a small comment. Juniper has discontinued vMX, but they have recently released three new products to kind of step in and replace vMX, so...

WOLFGANG TREMMEL: Last question.

AUDIENCE SPEAKER: Hello. This is Kostas from NTT. Excellent work. I just got the pass from the previous speaker. Is there any way to connect this virtual topology to a physical one, just to be able to emulate the missing stuff on the data plane that you cannot actually simulate in the virtual world?

CARL MONTANARI: Yeah. I think the answer for the moment, in the case of Clabernetes, is probably not easily, but with Containerlab you can. Romain can probably check me here, but I think you could do something like MACVLAN to bridge out to physical interfaces. He is nodding. With Clabernetes, no, not at the moment, but it's something that's probably possible.

AUDIENCE SPEAKER: Okay, good point. Thank you.

WOLFGANG TREMMEL: Okay. Thank you very much Carl and Simon.


The next talk is from Maria Isabel Gandia.

MARIA GANDIA: Thank you all. As I said, this presentation is about the SIG-NOC tools survey results. I work for a consortium of universities in Catalonia, so today I'm talking about GEANT and the results of a tools survey run in GEANT. Before we start, a disclaimer: I'm not in favour of or against any of the tools shown here. These are just the results of the survey, not necessarily my personal opinions; it's just to have fun with the results and draw your own conclusions.

You may know GEANT because GEANT is the network, the Pan‑European network that links all the national research networks in Europe, but GEANT is more than that. GEANT is also the organisation that runs the network, it is an association, it is a project and it is a community.

So, about the GEANT project. The GEANT project is a big one. The community is more than 50 million users, because they belong to the research and education institutions. There are 500 people working there, so it's a lot of people; I am one of these 500. And this is the 9th phase of the project so far; we have been running GEANT projects for years. This is GEANT 5.1; the previous project was GEANT 4.3. It offers services to all these communities, to these 50 million users, around security, around operations, trust and identity, and also network development. And I am currently working in the network development team in the GEANT project, mainly in the network academy, offering resources for automation, like the training programme, which is open for you.

The GEANT project is also an umbrella for the SIGs. These are special interest groups, essentially birds of a feather: places or groups of people that meet, share ideas, and exchange experiences and practices around a topic. And we have SIGs on many topics: on Cloud, on trust and identity, and also on the network, of course. On the network we have SIG NGN, next generation networks, and SIG-NOC, network operation centres. So what do we do in SIG-NOC? We share experiences between network operation centres, of course. This is, in a graphical image, what we do. We have a mailing list and exchange information through it. But we are a community, so it's mainly about the community.

SIG-NOC is quite like a NOG, but for research and education institutions. And I have highlighted here in red squares what we are talking about today: the survey. The SIG-NOC community meets twice a year; we have fun talking about the autonomy of the NOCs, about security and network operations, but we also create the survey. We decide what questions go into it, what tools go in, we run the survey about the tools and standards, and then we produce the reports. The reports contain a PDF with all the graphs, the file with the graphs too, so if you want to see them in detail you have the numbers there, and also a matrix.

We have run the survey four times; this is the fourth time. In the first survey, which was run in 2011-2012, we asked about many things, because we wanted to know more details about the NOCs. We wanted to know everything; it was the first time. We asked about autonomy, about tools, about structure, about everything that has to do with a NOC. We had many open boxes there, because we felt it was great to have open boxes to let people just explain what they did. The problem is that open boxes in a survey make it very hard for respondents to answer; it's much easier to just click on the answers. And on the other hand, it's difficult for the people who analyse the results of the survey to take these free-text boxes and impose a structure on them.

So for the second survey we decided to reduce the number of questions, focus on tools and standards, and just have a list of tools.

In the first survey we asked about 14 functionalities; you will see them later. In the second survey we added a functionality to this list, DDoS mitigation, which was not in the previous survey, and then we also asked about subjective things: how important is this tool for your NOC? How good is this tool for your NOC?
So for the importance it could be from 1 to 4, from low to high.
For the rating, for the quality, it could be from poor to excellent, from 1 to 5. We did the same in the third survey, but this time we added a new functionality: orchestration, automation and virtualisation. We had noticed that some networks were automating many things and we wanted to see which tools they were using for that. Again, importance and rating.
And for the fourth survey, which was run last year, we added a new function again: training, because NOCs are the trainers of the new engineers that come to the NOC. It's not a special department; it's the NOC that teaches NOC engineers how to work at the NOC.

We had 68 responses. In fact, we had 82 responses to the survey, but only 68 of them were kept, because we tried to clean them up a little bit.

First, because we want to know who is answering the survey. We want one response per organisation, because if ten people from one organisation respond to the survey and one person from another organisation does the same, it would look like one tool is used ten times more than another. So we want to know which organisation you are working for and what tools you are using. If you don't write the organisation, if you put XXX in the name of the organisation, then the response is discarded. And we only kept duplicates for one organisation that contacted us and told us that different departments used different tools for different functions, and they were not repeating functions or tools, so that was fine.

After cleaning: 68 responses. The structure of the survey, and of the report after that, is always the same. For each one of the functions: a short definition of the function. Then we asked: is the NOC responsible for that function? If it is, then we asked about the tools; if it's not, then we jump to the next function. And for the tools, we asked about the importance and rating; all the predefined responses were tools answered by two or more NOCs in the previous survey, the 2019 survey. Open boxes are also included, so if you have other tools, that's fine, you can put them in the results.

This is the structure of the diagrams in the report. And it is complicated, because we tried to show three dimensions in a two-dimensional diagram. So we have small bubbles and big bubbles, and the bigger the bubble, the more organisations use the tool; the smaller the bubble, the less popular the tool is. So for instance here, you would have something that goes to the right, up, and with a big bubble: it's popular, because the bubble is big; it's very important, because it goes to the right; and it has a very good rating, because it's up. On the other hand, this tool here would be not very popular, just important for a few NOCs, and with a very poor rating. Usually, tools that are quite popular are around here, and if they are not popular, they are around here; you will see that in the next slides.

Then we also have a matrix, and the matrix also shows three dimensions in a two-dimensional file. Here, in the rows you have the functionalities, 17 of them, and in the columns the tools, more than 150 of them.

So for each one of these cells, you have the number of institutions that use the tool, and if you click on the cell in the Excel file, you get the list of organisations that use it. And why do we have the list of organisations? Well, because we are a community and we know each other, and if we are using a tool and we know that another organisation is using the same tool, and we have problems with that tool or we want to use some functionality of it, it's easier to poke them directly. We can do it through the mailing list, but we can also meet them at one of the meetings, or poke them by e-mail, or through one of the other channels.

The types of organisations that answered the survey were national research and education networks and campus networks, mostly. You saw we had 68 answers. Some NOCs manage two or three types of network at the same time: many also manage campus university networks, or Internet exchanges, or specific research networks. So we have organisations that manage more than one network.

And about the functions: this is the list of functions. In this report, and in this presentation, you will see that we also have the rankings for 2016, 2019 and 2023. For 2023 it's always 1, 2, 3, 4, 5, 6, 7, but for the other years you see the position in the ranking where they were in the corresponding report. So, if a function is in the same position, it's a zero; if it is going up, a green arrow and the number of positions it went up; if it's going down, a red arrow and the number of positions it went down.

What does this show us? That monitoring is always the function that NOCs are performing. NOCs are monitoring the network; they are there for that. Problem management is the second one, which is quite similar: you monitor to find the problem and then you act on it. And ticketing: you create tickets. These are the three core NOC functions.

We also see that knowledge management and documentation is slowly going up, from the 8th to the 6th to the 4th position. This means that even if NOC engineers don't like documenting very much, they understand the importance of documentation. So that's good; it's going up. We also see that communication, coordination and chat is going up. So, it is important to communicate what's happening on the network. On the other hand, we see that security management is going down six positions. Why? We think that's because security operations centres are starting to be deployed in the different organisations, so security is moving from the NOC to the SOC, and the NOC doesn't feel responsible for security any more.

Also, we see that training, even if it was in the last position before, is now in the 13th position, so it means it's quite relevant. And orchestration, automation and virtualisation are the last functions in the ranking.

Regarding the number of tools that we use: it is huge. There is no single tool for all the functions, that's clear, but there is not even a single tool for one function. Look at monitoring: more than 11 tools on average for monitoring in each organisation. This means that if you belong to a NOC at a research and education institution, and I think it's similar in other environments, you have more than 11 tools that you use daily or weekly to monitor your network: SNMP tools, flow monitoring, syslog analysis tools, ICMP monitoring tools, RIPE Atlas, NLnet log, many many tools to monitor your network. On the other hand, there is ticketing: most of us just use one tool for ticketing. It can be two, depending on the NOC, maybe an external ticketing tool or a different one for security, but you can see that we use many tools. The second one in the ranking would be communication, coordination and chat. And you name it, how many chat tools do you have on your mobile phone? Rocket.Chat and Slack and WhatsApp and Twitter and everything, so NOCs also feel that pain.

The first question we asked about tools in the survey was not about the rating, as I mentioned before, but about frequency. We asked: are you using streaming telemetry very frequently, or are you still using SNMP-based tools? We are using SNMP-based tools a lot. It is very important; it's at the top of the table, used by almost all the NOCs. That's normal; we have all these tools based on SNMP. After that we have flow-monitoring-based tools and syslog-handling tools, with a bit more frequency for syslog. Streaming telemetry has room for improvement: it's better than in the last survey, but still in quite a bottom-left position in the table.

Then we asked about the tools in particular, the list of tools. I animated this graph so you can see something here, because otherwise it's so crowded with 36 tools that it's impossible to see anything. This way you can spot your tool, but you can also go to the Excel file and take a look at it if you want. You can see that we have all sorts of tools with all sorts of importance and ratings. Grafana is the most popular one, followed by Nagios, but we have many others; if each NOC is using 11 tools, some of them are just everywhere.

I tried to zoom in on this graph. The next slide is about this zoom, but I'll let it stop here. We have all the tools here; you can see them also in the Excel file.

I zoom in, and these are the top five. The first one is Grafana; it was the most popular for most of the networks. Then Nagios, and interestingly, MRTG, which is still quite popular and has been known for a very long time. Then we have RIPE Atlas, one of the most popular tools we use. And perfSONAR, which is used by the research and education community for performance management, to see the performance of especially multi-domain circuits.

If we zoom in a little more, well, zooming out: the top ten. If we look at the top ten, we also see other tools appear, like NfSen, and so on. And if we look at the top 16: I put a top 16 here and not a top 15 because RIPE RIS was the 16th one and this is the RIPE meeting, so I wanted to show you how RIPE RIS is used. Taking into account that we have 36 tools, RIPE RIS in the 16th position is in the upper half of the table. We are using it, and even more for problem management; this is just monitoring. Maybe it's not daily, but we are using it.

We have other tools, like Icinga, appearing in this top 16. If we take a different look at the same data and see what percentage of the respondents to this question used each one of the tools, we see that Grafana is used by more than 73% of the organisations that answered this survey, so it's a very popular tool right now. Nagios is the second one, at more than 50%. You can see that Nagios and Cacti are still popular and MRTG is going up. Grafana is a newcomer, on the table for the first time in 2023 because it was mentioned by five organisations in the previous survey, and SmokePing and NFDump disappear from the top 10.

For each one of the functions, you will see the average number of tools used, but I already mentioned them in the first graph.

If we go to problem management, Confluence and the ELK stack were mentioned, but Nagios is the most important one. RIPE Atlas is quite important, and RIPE RIS, as I said, is also quite popular for problem management. We don't use so many tools here, and RIPE RIS is one of the popular ones; it's in the top ten.

For problem management we have a mix of open source tools, vendor-based tools and distributed tools. A little bit of everything. And this is a pattern that is repeated in other functions too.

There is an increasing usage of RIPE Atlas and RIPEstat. Jira goes down from the 1st to the 3rd or 4th position, maybe because they changed the way they offer the product and some organisations are moving to other tools.

For ticketing we don't have so many tools. We have RT, and the rest here. Proactiva Net and YouTrack are mentioned once. The answers are quite scattered; there is no single tool that is used by more than 50% of the respondents to this question. Request Tracker was in the first position in 2016, and it went back to the first position in 2023.

Then, the knowledge management and documentation tools. You can see here that we have many tools, most of them in the upper part of the table. Regarding the number of users, Confluence remains at 50% of the users and GitLab is about the same. GitLab, NetBox and Nextcloud appear for the first time because they were mentioned in the previous survey. And then several tools are going down, like Google Drive, wiki and ownCloud. There are other tools that disappear from the top ten.

And then reporting and statistics: Grafana is again the king, followed by the list here. I'm going to go quickly through this because you have the reports to read. I also wanted to show you the communication tools, because this time the SIG-NOC community decided to separate bidirectional communication from unidirectional communication; in previous surveys it was all together. For bidirectional communication you see we use e-mail and various tools like mailing lists, but if you look at the percentage of people who use them, around 60% of the organisations used e-mail and mailing lists. This means that nearly 40% of the organisations are not using these for bidirectional communication any more. It's different for unidirectional communication: we use mailing lists there, but not for bidirectional. Remember, the 2019 survey was pre-pandemic; now, after the pandemic, we are using more asynchronous tools, and video conferencing tools too. What we are not doing is calling by phone; we don't use the telephone any more. It looks like we don't like it. Maybe we use the mobile phone, but the landline disappears from the top ten. In previous surveys we had landlines there; people were calling the NOC. Now people only call the NOC if really needed; if not, they prefer just to drop a message.

For unidirectional communications, as I mentioned, most of the organisations use e-mail and mailing lists. So we use e-mail and mailing lists, and when we want to broadcast something, we use Twitter too. But we don't use Zoom or other things; we still use traditional e-mail-based tools for that.

For configuration management and backup, there is a clear line. This is the list here. Git is used by more than 70% of the users, and RANCID is quite popular too.

Subversion is going down compared to the previous survey.

The last graph I wanted to show you is about resource management. Why? Because I found it quite interesting that this big bubble here is Excel files. So, we have lots of tools to manage resources, but we are still using Excel files; most of the organisations use them. And you see the importance is not that high; all the tools are in the middle of the table, medium importance, and the ratings are quite scattered too. Excel is not the best rated one, but it is the most popular one: the biggest bubble.

Here, more than 60%, nearly 70%, of the organisations use Excel files for resource management. NetBox was mentioned, but it was under "other", so it will appear in the list in the next survey. And Confluence is the best rated, but it's only used by 24% of the organisations.

There are other functionalities in this survey and in the reports, but I cannot go through all of them because I don't have time.

Some interesting facts: resource management was about IP addresses, VLANs and so on; inventory management was about routers, switches and so on. And guess what? Excel files are also here. We are using Excel files; it's a popular tool.

As more institutions have SOCs, the percentage of NOCs who feel responsible for security management decreased from 63% to 45%.

Automation is still the function the fewest NOCs feel responsible for.

And you can see all the results at this link here. All the results from the previous surveys, the Excel files, everything is there.

Conclusions: you see the ecosystem is huge. It is biodiverse; NOCs work with dozens of tools. More than 150 tools are used by more than two organisations, so there are others that are used by just one organisation, and also ad hoc tools.

There is no tool that has all we need, even for just one function. Look at monitoring.

There is no tool that works for all the functions, not even Excel files.

And there is biodiversity: this mix of open source, vendor-based and distributed tools.

So if you are starting a NOC (I don't know how many newcomers are here), I would say that taking a look at the survey can give you some ideas. If you take the five or ten most popular tools, or the 20% most popular, they should probably inform your decisions. The survey helps you understand the trends and which tools are more popular, which means they have more community behind them if you have any questions, and for which functions. And also the past trends. If you have a wonderful tool that works for you, I'm not going to tell you to change it; it's up to you. Those are the results of the survey, and that is the end of my presentation. Thank you.


WOLFGANG TREMMEL: Thank you Maria Isabel. And I see people rushing to the microphones.

AUDIENCE SPEAKER: Peter Hessler. I noticed in some of the responses that you had a specific tool and a generic term that could also be applied to that tool; for example IM, which could match Slack, Teams, etc. I was wondering if you got any explanation or understanding of why respondents chose the generic term versus the more specific term?

MARIA GANDIA: In fact, the list of terms we put there was a decision by the SIG-NOC community. We discussed whether to put a specific looking glass or a specific IM, and they said no, keep the generic ones. So we kept it like this; the community is small enough that maybe we can just do that. But why people decided to choose IM instead of other tools, that is not something I know, because I don't know who responded or why they did it like this.


AUDIENCE SPEAKER: Hello. Stavros, who is not using Excel. Super interesting presentation, thank you very much. I saw in your results, however, some mixed results contradicting my impression, because I speak a lot with colleagues from other NOCs and we share experiences on tools and things like that. So, I was wondering what type of NOCs you talked to when you did your research. Maybe I missed it, but I understand that if you are a small NOC team, like the AMS-IX NOC with eight or nine people, it's like a jack of all trades: we do everything, so we use a lot of tools and so on. But in larger organisations, the NOC has a more specific role, more on the ticketing side, customer facing, and the real stuff with NetBox and the like is happening in other parts of the organisation. And in your results I could see both. So I was wondering what kind of NOCs...

MARIA GANDIA: These NOCs are from research and education organisations, mostly national research and education networks and campus networks, so the NOCs are usually not as big as in big organisations. NOCs do almost everything in some cases, but there is real diversity in the size of the networks and in the size of the managing organisations too. For instance, we have organisations that only have four people, and they do everything, and there are organisations that have 800 people, and they both belong to the national research and education community. So it's quite diverse; there is a gap between some of these organisations.

AUDIENCE SPEAKER: Okay. So the people who actually replied to the questions, most of them are doing most of the functions, if I understand correctly.

MARIA GANDIA: Yes, well, monitoring and problem management. Then automation is sometimes done by a NetDEF team, and other tools are in the SOC, so it depends on the size of the organisation.

AUDIENCE SPEAKER: My name is Andrei, from Indonesia. My question is: how did you do the survey, did you use open-ended questions or closed questions? Because it's a bit strange to me. This is already the 2023 survey, and I haven't seen anything in your presentation about NOCs using AI. In my organisation we use a lot of the APIs from Grafana and everything else, and with an LLM we can actually ask the AI rather than looking at some graph, for example when some problem happens in our network.

MARIA GANDIA: There is work to do. Okay, the first part of your question was about, let me ‑‑

AUDIENCE SPEAKER: Open‑ended questions or closed questions?

MARIA GANDIA: This was a list of tools. We used SurveyMonkey for the questions. There was a list of tools, and at the end of each one of the questions we had an open box, so you could add your own ad hoc or other tools that were not on the list.

And in the last surveys, if there was a tool that was mentioned by two or more organisations, then we added it to this survey; if the tool was only mentioned once, we removed it. And about artificial intelligence: it's not that we don't use it, we are in fact working on artificial intelligence in the GEANT project too, but this survey was quite focused on the tools. We were asking about independent tools, let's say, and you saw there were many tools used by organisations. Some organisations have APIs and use these APIs to build their own dashboards and have their own things there. But we wanted to know what was behind them, what kind of tools they were using underneath. So for instance, maybe you are using InfluxDB and you are drawing with Grafana, and that's fine, but we wanted to know: are you using Grafana? If you have your ad hoc tool, cool, but maybe it's only yours, so it's not so relevant for the community, which wants to know about the particular tools that they are also using.

WOLFGANG TREMMEL: Okay. Thank you very much.

So the next presentation is a remote presentation and it was pre‑recorded, but the author is available online for questions later and the topic is exploring the benefits of carbon aware routing by Sawsan El Zahr from the University of Oxford.

SAWSAN EL ZAHR: And today we'll talk about my work on exploring the benefits of carbon aware routing. ... British Telecom BT and my supervisor.

The goal is to achieve net zero carbon by 2050. It's true that routing may be done in different countries... energy consumption in the order of 10 to 20 per year. On the other hand, networks report less than ... per year, which may seem not very significant. However, across a huge number of ISPs these small contributions sum up, and the carbon related to networks cannot be neglected. This work is in the context of routing, so at the network layer, and it specifically addresses the CO2 emissions associated with routers.

If we look at this small network, suppose we want to send a packet from router A to router E. Previous works focused on increasing the efficiency of the network by, for example, choosing the shortest path. But routers placed at different locations will be fed from different energy sources and will have different carbon intensities. The carbon intensity is a weighted measure of the amount of CO2 emitted to produce 1 kilowatt hour of energy. For example, here it would be better to route through the upper path instead of the lower one, although the lower one is the shorter path. So this adds a geographical dimension to the routing problem.

Now, with the ability to predict this carbon intensity per region from the power grid ahead of time, we can take advantage of this information, integrate it into the routing process and explore how much carbon emissions can be reduced by carbon aware routing.

So, the first step is to clarify the building blocks of the carbon footprint of networks. It depends on the amount of energy consumed, then on the source of this energy (is it coal, gas, renewable energy?) and finally on the weighted carbon emissions associated with that source.
The last two terms can be covered by the term carbon intensity. We'll go over both components, the energy and carbon components, and then define the metrics that relate to them.

So, first of all, the energy consumption of the router has a dynamic part that is almost proportional to the utilisation, and an idle part that is built from the static power, or zero-port power, plus the power for enabling every port.
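This power model can be sketched in a few lines. The function and parameter names below are my own, not from the talk, and the figures in the example are invented for illustration:

```python
# Sketch of the router power model described above: total power is
# a static (zero-port) part, plus a per-port enable cost, plus a
# dynamic part roughly proportional to utilisation.

def router_power(static_w, port_enable_w, ports_on, dynamic_max_w, utilisation):
    """Total power draw in watts for a given utilisation in [0, 1]."""
    idle = static_w + ports_on * port_enable_w
    return idle + dynamic_max_w * utilisation

# Example: 200 W chassis, 8 active ports at 5 W each,
# and up to 100 W of dynamic power at full load.
print(router_power(200, 5, 8, 100, 0.5))  # 290.0
```

Note that even at zero utilisation the router still draws the idle part, which is why the second approach later in the talk targets ports rather than traffic.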

On the other hand, the carbon intensity is a measure of the grams of CO2 emitted per unit of energy consumed. A higher carbon intensity means that the source of energy is emitting more carbon and is less clean, so renewable energy sources have a lower associated carbon intensity; for solar and wind it can go to zero. In the figure we can see the variation of the carbon intensity per day in the UK, for different seasons. This metric varies a lot per day, per season and also per region, and we can see a noticeable change within a few hours.

Now, recent machine learning algorithms allow us to forecast this carbon intensity up to 24 or even 48 hours ahead of time. And this is actually the main motivation of this work: to plan ahead, to adapt the routing of traffic to greener paths.

So, now, knowing that the carbon footprint relates to energy and carbon, we define a set of metrics to be considered by the routing problem, and in subsequent sections we evaluate the impact of these metrics on reducing the overall carbon emissions. Starting with the energy-related metrics:

The typical power is the first metric; it is defined as the power consumption at 50% load, and it can mostly be extracted from data sheets.

The second metric is the energy rating, or labelling, of the router. This metric indicates the energy efficiency of the router. It is still not standardised, but we define it as the ratio of the typical power to the maximum packet rate. We start with this definition, and after examining a range of routers available on the market, we derived a closed range and divided it into a seven-grade scale, from A to G. Next we looked at the active dynamic power and divided it by the maximum capacity, so the unit here is watts per megabit. For the energy rating we considered the typical power, but here we look only at the effect of the dynamic power per unit of traffic. So this metric is the incremental dynamic power per unit of traffic. That's it for the energy-related metrics.
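The rating idea can be illustrated as follows. The bucket thresholds here are invented for illustration; the talk derives its real ranges from surveying routers on the market, and the rate unit is simplified to Gbit/s:

```python
# Hypothetical sketch of the energy rating: the ratio of typical power
# (watts at 50% load) to maximum rate, bucketed onto a seven-grade
# A-to-G scale. Thresholds are illustrative, not the paper's values.

def energy_rating(typical_power_w, max_rate_gbps,
                  thresholds=(0.5, 1.0, 2.0, 4.0, 8.0, 16.0)):
    ratio = typical_power_w / max_rate_gbps  # watts per Gbit/s
    for grade, limit in zip("ABCDEF", thresholds):
        if ratio <= limit:
            return grade
    return "G"  # anything above the last threshold

print(energy_rating(400, 800))   # efficient box  -> A
print(energy_rating(3000, 100))  # inefficient box -> G
```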

We move on to the carbon-related metrics. The first one is obviously the carbon intensity, which we talked about before.

And finally, the carbon emissions metric, which is different from the carbon intensity because it considers the actual energy consumption per router over a period of time: it is the product of the energy consumption and the carbon intensity.
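As a minimal worked example of this product (figures invented for illustration):

```python
# Carbon emissions = energy consumed x carbon intensity of the grid
# feeding the router.

def carbon_emissions_g(power_w, hours, intensity_g_per_kwh):
    """Grams of CO2 emitted by a device over a time window."""
    energy_kwh = power_w * hours / 1000.0
    return energy_kwh * intensity_g_per_kwh

# A 300 W router running for 24 h on a 200 gCO2/kWh grid:
print(carbon_emissions_g(300, 24, 200))  # 1440.0 grams of CO2
```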

So that's it for the metrics. But we can also have combinations of these metrics; for example, the carbon intensity combined with the typical power, or the carbon intensity combined with the energy rating, and so on.

So, how do we evaluate these metrics? We have two approaches.

The first approach is to change the link costs based on these metrics and then apply the routing algorithm. This is the basic approach. For example, if we look at this example network and take the carbon intensity as the metric for the links, then with equal-cost multipath, ECMP, we can have two paths with the same cost that are considered greener than the one in the middle. But if we then add the energy labelling to the metric, the routing tables will again be different. So that's a different optimisation target.
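A sketch of this first approach, not the paper's implementation: re-weight the links with a carbon metric and run an ordinary shortest-path computation over the re-weighted graph. The topology, the intensity values and the choice to put the intensity on the link cost are all invented for illustration:

```python
# Dijkstra over a graph whose link costs are carbon intensities
# (gCO2/kWh) instead of hop counts or latencies.
import heapq

def shortest_path(graph, src, dst):
    """graph: {node: {neighbour: cost}}; returns (cost, path)."""
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# The direct path through B is shorter in hops but "dirtier";
# the longer path through C and D wins on carbon cost.
g = {"A": {"B": 400, "C": 100}, "B": {"E": 400},
     "C": {"D": 100}, "D": {"E": 100}, "E": {}}
print(shortest_path(g, "A", "E"))  # (300, ['A', 'C', 'D', 'E'])
```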

So that is the first approach. And this one improves only on the dynamic part, because all the equipment is still on.

The second approach further reduces the idle power by shutting down unnecessary ports. It is a carbon aware traffic engineering based heuristic that we denote by CATE. After changing the link costs, it picks the links with the least utilisation and highest carbon emissions and shuts them down for a specified period of time. It then checks that the graph is still connected, checks that the current savings are still positive, and repeats this process over and over until no further links can be disabled.
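The loop just described can be sketched roughly as follows. The data structures, the candidate ordering and the per-link saving are my own simplifications for illustration, not the published CATE code:

```python
# Rough sketch of the CATE loop: repeatedly disable the link with the
# lowest utilisation and highest emissions, as long as the graph stays
# connected and the saving stays positive.

def connected(nodes, links):
    """DFS connectivity check over an undirected link set."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b); adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n); stack.append(n)
    return seen == set(nodes)

def cate(nodes, links, util, emissions):
    """links: set of (a, b); util/emissions: dicts keyed by link."""
    active = set(links)
    while True:
        # Least-utilised first; among ties, highest emissions first.
        candidates = sorted(active, key=lambda l: (util[l], -emissions[l]))
        for link in candidates:
            if emissions[link] > 0 and connected(nodes, active - {link}):
                active.discard(link)  # shut this link down
                break
        else:
            return active  # no further link can be disabled

nodes = {"A", "B", "C"}
links = {("A", "B"), ("B", "C"), ("A", "C")}
util = {("A", "B"): 0.1, ("B", "C"): 0.6, ("A", "C"): 0.5}
emissions = {("A", "B"): 50.0, ("B", "C"): 10.0, ("A", "C"): 20.0}
print(sorted(cate(nodes, links, util, emissions)))  # [('A', 'C'), ('B', 'C')]
```

The idle A-B link is disabled; the remaining two links cannot be removed without disconnecting a node, so the loop stops.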

This is a simulation-based study, and it is applied to real network topologies like British Telecom, BT, in the UK (this is the topology of BT in the figure; BT is a major ISP with more than 1,000 nodes). We also looked at the GEANT topology, which is a research and education network based in Europe.

Now, before proceeding to the results, it is worth highlighting the traffic patterns considered. We considered two traffic patterns: daytime traffic and evening traffic. The daytime traffic is dominated by business customers and is mostly somewhat symmetric, whereas the evening traffic is dominated by residential customer traffic, which is mostly downstreaming of content from caches co-located at metro nodes. Here you see the hierarchy of the BT topology: these are core nodes, metro nodes and tier 1 nodes. According to BT, caches are located at metro nodes. This downstreaming traffic peaks at around 8pm.

So now we present some results. We see the carbon and energy improvement with respect to OSPF, for day traffic on the left and evening traffic on the right. We have different combinations of the metrics introduced, and then we have CATE, which is based on shutting down ports. The first three metrics are energy metrics, and we see that the incremental dynamic power per unit of traffic leads to the highest improvement in terms of energy and carbon. This is in the context of power efficiency, because carbon is not considered for these energy metrics. Then, combinations that include the carbon intensity save more on carbon, but at the expense of a path stretch of about 5%; the path stretch is the increase in the average number of nodes traversed by a packet. We can also see that the carbon intensity combined with the incremental dynamic power per unit of traffic is the best combination of metrics.

Then CATE introduces additional carbon savings by disabling unnecessary ports. However, all of these improvements vanish for evening traffic, because the flows for downstreaming are all short and there is little room for changing the paths, so that even disabling ports brings minimal savings and can be impossible at peak times.

Moving on to the GEANT topology, we can see similar patterns in the results. Again, CATE has the highest savings. For the GEANT topology, all the nodes are considered to be the same, because we don't have the same hierarchy of core, metro and tier 1 nodes; we assume all the nodes are identical, so they all have the same energy parameters, and we only compare OSPF, carbon intensity and carbon emissions. Again CATE has the highest savings, with around 8% of the links disabled. Interestingly, you can see that the delay is very similar for all four scenarios. This is also the case for BT.

Now, back to BT. If we want to see the evolution of flow paths: on the left we have the carbon intensity heat map, and on the right we see the flow intensity in gigabits per second per region in the UK, for OSPF and for carbon aware routing. According to the carbon intensity API from which we extracted the carbon intensity information, the UK is divided into 14 regions, and they did an analysis of the energy flow between the regions to derive the carbon intensity of the energy consumed per region.

We see that with OSPF, London has the highest cross-section of traffic through it. With carbon aware routing, flows are higher in northwest England and east England, which have a lower carbon intensity, and this is visible across the map.

On the other hand, we can see that the East Midlands and south England carry fewer flows with carbon aware routing, because of their relatively higher carbon intensity.

Finally, London did not really change, and this is because of the high density of nodes in London, which leads to something of a bottleneck in the topology.

Now, looking at the flow intensity for the GEANT topology: we have similar patterns, but it's worth highlighting the change in a few regions. With carbon aware routing, we see that France, Italy and Germany have the highest intensity of flows; most of the flows go through these countries. But when we compare these values to the flow intensity with OSPF, we see that the flows increased in France and Italy but decreased in Germany. Yet Germany is still among the regions with the highest flow intensity. This highlights again the issue of bottlenecks in the topology, where some high carbon intensity regions cannot be avoided.

Finally, another aspect of the simulation was to see the impact of changing the static-to-dynamic power ratio of routers, because different routers will have different values of this ratio. For example, chassis-based routers have a very high static power for the chassis. We see that the carbon savings diminish as this ratio increases, so it is always better to invest in equipment with lower static power, and in equipment that is more power proportional.
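A toy calculation shows why the savings diminish (numbers invented for illustration): only the dynamic share of a router's power responds to where traffic is routed, so as the static share grows, the fraction of total power that any traffic-aware scheme can touch shrinks:

```python
# Share of total router power that depends on how traffic is routed.

def reroutable_fraction(static_w, dynamic_w):
    return dynamic_w / (static_w + dynamic_w)

for static in (100, 300, 900):  # growing static-to-dynamic ratio
    print(static, round(reroutable_fraction(static, 100), 3))
# 100 0.5
# 300 0.25
# 900 0.1
```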

So, in summary, some takeaways of this work. First, in terms of metrics, the carbon intensity combined with the incremental dynamic power per unit of traffic is the best combination of metrics.

Second, the energy labels are a good metric for purchasing, but in their current form they have limited routing benefits. It could be better, for example, to divide them into two separate metrics: one for the static and one for the dynamic power.

Third, the limitations include the high static power, which limits the carbon savings.

Fourth, routing bottlenecks also limit carbon savings, and it is better to place additional renewable energy sources at these bottlenecks, because they are hard for the flows to avoid.

And finally, the carbon optimisation is application-specific, because as we have seen, the downstreaming traffic is hard to reroute. One possible future direction is to look at time shifting of the flows instead of just space shifting them.

So, what do we have as next steps? First of all, we need to identify and agree on a set of sustainability metrics, energy and carbon metrics, for routers and for all of the equipment along the way.

Second, we need to establish the veracity of the reported metrics: can we really trust this carbon intensity data from the power grid?
Third, we need to define a standard reporting format.

And finally, we need to tie the electricity consumption and the carbon intensity to applications. And this is the concept of carbon tracing.

So, a final note: the code for the CATE algorithm is available on GitHub, and I am happy to take any questions. Thank you.


WOLFGANG TREMMEL: Okay. If you are online ‑‑ yes, there you are. Hello. Perhaps I should look at the camera so you can see me. Okay, there is someone at the microphone.

AUDIENCE SPEAKER: Tom Hill from BT. Quite interesting to see this; I didn't actually know that this work was going on, so that's good fun. I have been advised as well that all the information here was gleaned from the public domain, so it's important to note that this isn't internal information. There are a few problems with it, but I think it is very interesting work on the whole, and something we perhaps don't necessarily consider.

I wonder whether or not you had considered combining the metrics into a sort of threshold-based, constrained mechanism: if we use the least carbon emitting path, does it go over a certain threshold of latency, for example, or is it a path that is not necessarily as capacious as some of the others? Has that been factored into some of the research that you have been considering?

SAWSAN EL ZAHR: When we brought the carbon intensity into the path selection, there was a small additional delay, about 5%, but it wasn't that much. Again, this work was to explore how much you can gain from carbon aware routing; there are a lot of challenges that we still need to resolve.

AUDIENCE SPEAKER: I mean, not least of which, it's very difficult to actually attribute where the power was generated versus where it is then consumed. But I do feel that as network operators, and this sounds terrible of us really, we have other requirements to make sure we hit a certain level of network performance before we worry about how much carbon we're burning in providing those links. The traditional constraints that we have don't go away.

SAWSAN EL ZAHR: Yes. Some of those constraints could be added to the CATE algorithm, but we didn't put them here because we wanted to see at most how much we can improve on the carbon with fewer constraints.

AUDIENCE SPEAKER: It's interesting work. So thank you very much.

AUDIENCE SPEAKER: Peter Hessler. Very cool presentation. I like that we seem to be more aware of how our carbon use and our energy use are affected by our industry. One thing that I noted in my day job was that there are a number of energy saving settings that we can enable on, for example, Juniper PTX routers, that can save hundreds of watts per device, but they are not enabled by default. So, I would encourage all the operators who are paying attention to look at the documentation for your devices and see what you can enable. This will simply turn off ports that don't have any optics plugged into them; if there are no optics in a line card, you can turn off the entire line card and it won't affect any other ports that are in use, and you still get full performance out of them.

SAWSAN EL ZAHR: Yes, and I want to add to that: because the carbon intensity can be forecast up to 24 or 48 hours ahead of time, we can plan for this.


SAWSAN EL ZAHR: Thank you.

WOLFGANG TREMMEL: I see no more questions. So, thank you Sawsan, thank you for your presentation. And perhaps see you at one of the next RIPE meetings in person.

SAWSAN EL ZAHR: Hopefully. Thank you.

WOLFGANG TREMMEL: I have a couple of announcements.

First, please, please, please rate the talks. Rate the talks. The Programme Committee really relies on your ratings, and it helps us a lot with planning.

Also, there has been an agenda change. We moved one of the Friday talks to this afternoon, so please check the online agenda again for an update. And also this afternoon, we are going to have lightning talks.

And with that, I send you to the early coffee break. Thank you very much.

(Coffee break)