The basic idea is pretty straight forward. Say you have a massive guest list but only a tiny sticky note to keep track of everyone. A Bloom filter is basically that note but with a clever bit of math that lets it lie to you in one specific direction. You take a name and run it through a few different hash functions. They tell you which spots to flip from 0 to 1 in a big array of bits. When you want to check if someone is on the list later you run their name again and see if those same spots are already 1s.
The catch is that different names might end up flipping the same bits by pure coincidence. If you ask the filter if John Doe is there it might see all 1s and say yes even if John never showed up. That is a false positive and you just have to live with it. But if the filter sees even a single 0 it knows for a fact that John is not there. It never lies about a negative which is where the real magic happens for fast searches.
When you want to avoid burning time on expensive lookups in a slow database, you stick the bloom filter in front of the slow stuff as a gatekeeper. If the filter says the data is not there you trust it and move on instantly without ever having to do IO. You only do the heavy lifting if the filter gives you a maybe. Using a bit of memory is often cheaper than doing a bunch of wasted database queries.
The basic idea is pretty straight forward. Say you have a massive guest list but only a tiny sticky note to keep track of everyone. A Bloom filter is basically that note but with a clever bit of math that lets it lie to you in one specific direction. You take a name and run it through a few different hash functions. They tell you which spots to flip from 0 to 1 in a big array of bits. When you want to check if someone is on the list later you run their name again and see if those same spots are already 1s.
The catch is that different names might end up flipping the same bits by pure coincidence. If you ask the filter if John Doe is there it might see all 1s and say yes even if John never showed up. That is a false positive and you just have to live with it. But if the filter sees even a single 0 it knows for a fact that John is not there. It never lies about a negative which is where the real magic happens for fast searches.
When you want to avoid burning time on expensive lookups in a slow database, you stick the bloom filter in front of the slow stuff as a gatekeeper. If the filter says the data is not there you trust it and move on instantly without ever having to do IO. You only do the heavy lifting if the filter gives you a maybe. Using a bit of memory is often cheaper than doing a bunch of wasted database queries.
I used this for my capstone project on data deduplication. I believe I used it to check if the chunk of data has not been seen before.
it’s a really handy technique that’s really underrated I find, and dead simple to implement too