Breaking Monero Episode 05: Input Selection Algorithm
https://youtu.be/Sn44ahKxC1E
In this episode, Sarang and Justin explain more about how Monero selects decoys for ring signatures, more than just "randomly." We explain how Monero's selection algorithm has evolved over time, and how a strong input selection algorithm is critical for protective ring signatures.

An Empirical Analysis of Traceability in the Monero Blockchain
Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava, Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, Nicolas Christin

All Breaking Monero Episodes

Episode Transcription

Justin: Welcome back to Breaking Monero. Today we are talking about Monero’s ring input selection algorithm. We’ve spoken in the past about ring signatures and decoys in other previous episodes and Sarang specifically has talked about how the decoys are selected, but he used a very vague word called, a vague phrase at least called randomly selected right? And in this episode we’re gonna get far more nuanced far more specific about what we mean by this, this mysterious phrase and also the phrase itself isn’t very accurate either so it’s important to add some additional clarification here on what’s actually happening. And there’s a lot more to ’random’ than behind-the-scenes so I’m going to start with the screenshare showing an example of some of Monero’s input selection algorithms over the past. So on the top here you can see an example of a completely random distribution algorithm algorithm. So on the left you have old outputs that were generated at the very, very beginning of Monero’s history. On the right you have new outputs that are generated very, very recently, especially within the past few days or so. Let’s say that the green circle is the actual output that was spent, this is the real money that was spent, and the blue ones are the decoys that are selected. Now a completely random distribution method might sound great to begin with because you know any input could be selected for any reason but this leads to a lot of unintended, like consequences as a result. So you can make pretty strong heuristics that say people are far more likely to spend new money than old money so as a result the latest input, the green one highlighted here, is most likely to be the real one. And well you don’t necessarily have the ground truth to prove that this is true it could be tested as very reliable over time; you could make the potentially very strong heuristic there. And you can see in the example on the screen on the first on the first line there that is the case because that’s often the case right? So Monero sought to improve upon this, iterate upon this and you can see on the second line there there’s an example of a recent zone selection algorithm. So you have again the whole history of Monero’s outputs, but you have a short recent zone period where you’re more likely to select other decoys from the specific period so Monero’s code might specify for instance that the recent zone needs to be about 1.8 days and that you should select about half of the decoys from this said recent zone. So you can see on this example here that about half the decoys are selected from this recent zone and then for the rest of the tail going back to previous time, like the very beginning of Monero’s history you still have the ability to select these outputs, but they’re less common than new outputs and this helps address the specific heuristic we’re speaking about where the, the latest output is the most, the latest output in the ring is usually the true one because now you have a more, latest out, latest decoys included in this ring so therefore you have a more plausible selected outputs in this case. And the recent zone was nice and simple; it was a really easy way to implement this sort of feature and it would, definitely was an improvement over the existing completely random system Monero began with, but it’s not ideal. And so Manero has moved to what more resembles the bottom line there which is a matching distribution, one that is based on empirically observed distributions based off what we’ve, Monero and outside researchers found with Bitcoin and Monero, it’s a mathematical model. So you can see that in this case the newest outputs are even more likely to be selected for instance. So this, hopefully this diagram helps show how it’s not just about how many inputs there are in a transaction, it’s also about how you select them. And there’s a lot of implications on how these are actually selected, but it’s more than just timing as I show here; timing is just one part of how this is done and for it, to this end. Sarang is going to speak a little bit more specifically about other factors involved in the selection algorithm.

Sarang: Yeah, absolutely. I mean, and it’s worth noting that you know technically the way that we still do it and the way that we’ve always done it did have elements of randomness to it. I mean random is kind of it’s a, it’s often a really poor word right? So for example even though typically the outputs that you’ll choose kind of follow that particular mathematical model, there is an element of randomness involved. So, you know two people can definitely choose very, very different rings but you know on average they’ll follow a pattern that looks approximately like the pattern that you’ve showed so there’s always still randomness into it. But like you said timing heuristics or you know guesses that an adversary might, they might make based just on the age of the output like you said are only one heuristic. They’re a pretty big heuristic right because you know in general for a lot of old transactions you could guess what you thought the newest one was and, and you know you might be right although you couldn’t prove it you know implicitly. But timing is just one part and we’ve iterated since then to kind of mitigate against other smaller heuristics that were not timing based. So one example deals with something called coinbase outputs. If you’re not familiar with the term basically every Monero block of transactions that is generated has a special output in it that generates new money as part of the protocol. That’s kind of what helps to reward miners for doing work in part and those coinbase outputs I like to think of as you know, newly generated money. So, in general do people spend newly generated money or coin based outputs as the true spender? You know probably not necessarily as often as non-new money or non-coinbase outputs. So for example if I happen to choose a ring that contained, I don’t know 10 coin based outputs and then my true output which was not a coin base output an adversary might look at that and think "Hmm I would say it’s much more likely that this person you know didn’t spend a coinbase output because that’s all very new money." So that could be a heuristic that they might use; they might think coin based outputs are probably decoys. Well in that case that would kind of imply that we should select fewer coinbase outputs as part of our rings. How many is too many or too few? I mean that’s not a very well-defined problem with a very well-defined solution. But as we’ve iterated our selection algorithm to make it better against this you know, guess newest heuristic involving output age you know we probably introduced more coinbase outputs then some people would have liked. So we made a slight modification to the algorithm where instead of just choosing a block and then yanking a decoy out of it which tends to give us more coinbase outputs than fewer, instead we actually look at a very small window around that particular block so we’re effectively increasing the size of the bin from which we get to choose our outputs. And what that ends up meaning statistically is that we end up choosing fewer coin based outputs which kind of mitigates against this much smaller heuristic and that of course is not the only heuristic type you might come up with either. So coinbase outputs are one thing an adversary might use to look at to try to make guesses. Timing which we’ve worked on of course and have talked about might be one that an adversary might use. There’s other ones. For example if I have a transaction that has two different inputs, each of those has a separate ring, maybe the adversary is able to look at the different decoys and outputs that are in those rings and maybe the adversary will find that there’s a transaction way back when in the blockchain that generated two different outputs. And maybe one of those outputs appears in one of the inputs to my new transaction and the other output appears as an input to the other one. Again it’s just a guess because it may have happened by chance, but probably not. The adversary might try to conclude that the outputs that were generated in a previous transaction are now being spent by me and might make some conclusions based on that. Again, without external information a heuristic is not a proof, but it gives the adversary something that they might try to guess. So in general this is very, very complex. I will say right now it is pretty impossible to get rid of all possible heuristics. So we can always make our selection algorithms better and as Justin pointed out and as I’ve kind of hinted at we have done this over time we iterate to get better and better. A good way to think about this is something that Justin brought up in fact kind of with like a plugging of a hole analogy if you want to give that which I kind of liked.

Justin: Yeah, of course. So one example like we, we know of a specific heuristic for instance. The, the guessed newest heuristic might be an example. So we can iterate Monero’s selection algorithm to help counter this sort of heuristic and the actual effectiveness of it, but in doing so we’re still choosing some other way to select outputs that people can develop heuristics for. So there’s no limit to the number of heuristics that people can come up with they can continue making complicated heuristics over time pretty much no matter what we do. So we’re always plugging these holes that we’re aware of and the biggest holes we know of, but we might indirectly be making smaller holes, we might be making holes that we’re not necessarily aware of because maybe the heuristics haven’t been conceived yet especially by participants in the Monero community. so this is definitely something that will need continuous improvement, it needs continuous iteration in order to make it better. To keep patching these holes you can’t just pave a road and never expect to have potholes; you have to suffer through it like the Minnesota winters like I have where you go and have to keep patching these potholes that keep appearing right? They keep coming out, and when you think you’re done some truck’s gonna drive over it and come up with, and make a new one right? So these these sort of circumstances keep happening. Apologies for that terrible analogy. I’m trying to play Sarang here. But yeah we try to address the big ones first. But there might have been some consequences. As an example by changing the selection algorithm to follow a more mathematical distribution we actually selected more of those coinbase outputs and we needed to go back and sort of refine how we did this. Was every improvement, was every iteration still an improvement overall? From what we can see now it should be, yes. We addressed the existing heuristics and done more good than harm, but there’s always going to be some sort of trade-off that we’re going to sort of keep playing with so to speak.

Sarang: Yeah, absolutely. I mean we, we definitely receive reports all the time about people who come up with you know a particular small pattern that they might see among certain transactions or outputs based on the way that we make our selections. And some of those are very, very small. Doesn’t mean that you know, it’s optimal that we keep them in there, but you know to some extent it’s not necessarily always obvious how to make an absolute good change to counter some of those small heuristics without inadvertently kind of ruining the work that we’ve done for some of the much bigger heuristics. So, you know if we see a small heuristic pointed out and we don’t change our algorithm because of it you know it doesn’t mean that we are not concerned about those heuristics, but it means that we always have to balance the good that we’d be doing by making such a change with the inadvertent harm that might be caused by it. So, it’s always an arms race right? Financial privacy as a whole as you’re probably learning from this video series is an arms race. Analysis only gets better over time and that’s great. You know just because we receive small reports sometimes and big reports other times about different heuristics that come up doesn’t mean we don’t want to see them. I love seeing those reports I love learning more about what other people are doing with this analysis. But it just reminds me that it is an arms race. Analysis gets, gets better. We get better because of that. You know that might invite more analysis which just has kind of this spiraling effect toward us getting better.

Justin: Yes, speaking about spiraling effects towards getting better, one of the great things about ring signatures is that the selection algorithm and the ring size sort of go hand in hand in sort of a positive or negative feedback loop. Suppose you’re purposely making things worse, but as Monero increases its ring size it sort of decreases the severe negativeness of perhaps a bad selection algorithm or some limitations of a selection algorithm. If you just keep picking more and more inputs for example right? More and more decoys the you know shortcomings of a specific algorithm might decrease will or should ideally decrease. Meanwhile if you improve your selection algorithm you make better use of the decoys that you have. So these two components are really critically important and having strong privacy in Monero making the most out of its ring signatures because if you have a larger ring size with the terrible selection algorithm then you’re not going to have great privacy because you’re able to develop really strong heuristics potentially and likewise if you have a really even if you had a, some mechanism of having a perfect algorithm under every circumstance, but you had a like really small ring size for example then you’re also not great either. So these things really critically work hand in hand at providing really the privacy that ring signatures offer. It’s more than just what they say they provide out of the box which is 1 out of, in Monero’s case 1 out of 11 are possibly spent. It’s all the additional metadata, timing analysis, coinbase metadata that is associated with this and so the selection algorithm needs to adjust over time to, to compensate for this. Otherwise we could just pick the first 10 outputs that were ever generated on Monero’s blockchain and be like "oh it’s either the latest or of the first 10" and that’s obviously not very convincing. So you know it’s input selection algorithm is very important for Monero’s ring signatures. Alright Sarang do you have any last closing thoughts to leave the viewer with?

Sarang: Just that our goal still remains to provide the best plausible deniability possible with ring signatures and over time we continue to learn about better and better ways to do that and so we keep iterating and iterating and iterating to get better and we’re absolutely not perfect now but we continue to try to get better and better.

Justin: Alright, thank you Sarang. Thank you everyone for watching this episode of Breaking Monero. We will catch you in the next one. Take care.

Sarang: See ya