
SAF Codes: Fighting Deep Fakes with Cryptography

25 min read · Dec 3, 2023


The rate of progress in AI technology recently has been astonishing. As generative AI becomes more powerful and more ubiquitous, deep fakes are likely to become an increasingly common problem. Even without AI, malicious editing of videos has long been a tool for disinformation. For the purposes of this post, I’m not particularly interested in the distinction between these techniques. Fakes are fakes. Generative AI is simply amplifying an existing problem.

We need tools to detect these faked videos. These tools need to be reliable, and as ubiquitous and easy to use as the faking tools are. Ideally, these tools would just be baked into every video sharing platform, or your web browser, and if a fake video is detected, a big warning would pop up on the screen.

Of course, the problem is that fake videos can be extremely hard to detect, and they’re just getting better and better as AI improves. At the moment it’s still possible for humans to tell the difference between a real video and a fake, but for how much longer will that be the case?

Currently we have a patchwork of potential solutions to this problem, none of which have really been implemented at scale. In this post I’ll briefly outline some existing solutions, and then I’ll add another approach to the mix that I think could fill some gaps in that patchwork.

Existing tools

Broadly speaking, the existing suite of deep fake detection tools can be divided into two categories: consumer side and producer side.

Consumer side tools are probably what you’re most familiar with. These are tools that viewers can use to detect fakes, without any input from the people who made the video.

The most straightforward consumer side approach to detecting AI generated videos is to train other AIs to spot them. Unfortunately this is an extremely difficult task: our best detector AIs are only about 65% accurate. And there's no reason to think that accuracy will improve significantly over time, because the detector AIs will always be in an arms race with the generator AIs.

There are some other consumer side techniques that focus on specific details of the videos, and these are often a lot more effective than generic approaches. For example, in a real video, if you analyse the subtle changes in a person’s skin color, it’s possible to detect their pulse. But the fake people created by current generative AIs don’t have a detectable pulse. This technique was around 98% accurate in 2020, but does it still work as well today? Will it work in 5 years? It wouldn’t be difficult for an AI to learn to generate a pulse. We can come up with all sorts of details that AIs are currently bad at generating, but we still have an arms race problem. If we start depending on a specific detail like this to spot deep fakes, someone will inevitably train an AI that can generate that detail correctly. Still, these tools can be a useful part of the patchwork of solutions in the short term.

Public education could also be considered a consumer side tool, but studies suggest that training people doesn't significantly improve their ability to spot speech deepfakes. In any case, this isn't a long term solution, because it won't be long before generative AI can create fakes that no human can detect.

Detection after the fact is just a fundamentally harder problem than generating the fakes. This isn’t surprising. After all, it’s easier to tell a lie than to spot one.

The other category is producer side tools, where the video is augmented at creation time in some way that allows consumers to detect if it has been digitally altered. These solutions are more promising in the long term, in my opinion.

One suggestion has been for cameras to cryptographically sign the videos they produce. This would mean we could prove that the video was filmed by a real camera, and even identify the specific camera. Of course, this requires us to replace or upgrade our existing cameras to have this signing hardware. This approach could have serious privacy implications if users aren’t aware that their camera is essentially signing their name to every video they take. It also doesn’t give any control to the person being filmed, just the person filming.

Another producer side suggestion uses the blockchain, but that seems to just be the same camera signature idea, with extra steps. Ordinary cryptography should work just as well here.

There have also been proposals to regulate AI on the producer side, for example governmental mandates that all generative AIs watermark their creations. This sort of thing might help, but there are always going to be open source AIs that don’t abide by these rules and don’t add the watermarks. I still think this is an important avenue to pursue though, because if watermarking is implemented by all the major AI companies, it would at least decrease the pool of people who have access to unrestricted generative AIs to those who are willing to go to the extra effort of using an open source tool.

In a similar vein, YouTube will soon require creators to label any AI generated content they upload. That’s good, but doesn’t address non-AI fakery, and it’s also going to be difficult to enforce. It’s not clear how they plan to detect cases where the uploader has simply not bothered to apply the AI label.

Types of threat

Throughout this post, I’m going to use “target” to refer to the person being filmed, and “viewer” for the person watching the video. There are many ways that a video can be maliciously edited, with or without AI. Some examples are:

  • Chopping up and rearranging the video and audio to change what the target is saying.
  • Slowing down the video to make the target appear drunk or otherwise impaired.
  • Using AI to generate or alter the audio to change what the target is saying.
  • Using AI to generate or alter the video to make it look like the target is, for example, making an obscene or racist gesture.
  • Lightly editing a video of a lookalike actor to make them look more like the target.
  • Taking the target out of context. For example, filming an interview, then taking one of the target’s answers and putting it after a different question.
  • Taking the entire video out of context. For example, taking a video of a news event (say, a violent riot), and claiming it’s actually a video of some unrelated event (e.g. a peaceful protest).

None of our tools can protect against all these threats, so we need a patchwork of approaches.

Deep Defender

The existing tools all have their place, and my suggestion is no more a panacea than they are. It’s just one more tool in our toolbox, on the producer side. The basic idea is that the person being filmed runs the Deep Defender app on their phone, and makes sure the screen is visible to the camera (e.g. poking out of their shirt pocket). The app listens to what they’re saying, breaks the audio into chunks and generates a fingerprint of the chunk, then cryptographically signs the fingerprint and displays it on the screen as a QR code. I call the resulting QR code a signed audio fingerprint, or a SAF code.

The app will change the QR code every time there's a new chunk of audio, which in the current version is roughly every second. You can see a demo in the video below, or check out the GitHub repo. As it stands now, this app is just a proof of concept, but if this idea turns out to be useful I'm hoping people with more security and AI expertise will help improve it or build their own version.

Deep Defender demo video
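
To make the pipeline concrete, here's a minimal sketch of the generation loop in Python. This is not the actual Deep Defender code: capture_next_chunk, fingerprint, and show_qr are hypothetical stand-ins, and Ed25519 is just an example choice of signing algorithm.

```python
import struct
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# In a real app the key would be loaded from secure storage, not generated on the fly.
private_key = Ed25519PrivateKey.generate()

def make_saf_code(audio_fingerprint: bytes, timestamp_ms: int) -> bytes:
    header = (b"SAF"                              # magic bytes
              + bytes([1, 1])                     # version and algorithm (varints, both 1)
              + struct.pack(">Q", timestamp_ms))  # end time of the chunk, ms since epoch
    payload = header + audio_fingerprint
    return payload + private_key.sign(payload)    # signature covers every other field

while True:
    chunk = capture_next_chunk()                  # hypothetical: ~1 second of microphone audio
    timestamp_ms = int(time.time() * 1000)
    saf_code = make_saf_code(fingerprint(chunk), timestamp_ms)
    show_qr(saf_code)                             # hypothetical: render the blob as a QR code
```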

Cryptography refresher

Skip this section if you already know how cryptographic signatures work. If you’re not familiar with them, I won’t go into the mathematical details, but they’re built on asymmetric cryptography, where messages are encrypted using one key (the private key) and decrypted with a different key (the public key).

Essentially if Alice is sending a message to Bob, she first runs the message through a hash function. The hash function returns a pseudorandom number that is much shorter than the message. This number looks random, but it’s actually deterministic, so the same input message will always generate the same output hash. Then Alice encrypts the hash with her private key. This encrypted hash is her signature, and she sends it along with the message. Bob can verify the signature by decrypting it with Alice’s public key, hashing the message, and checking that his hash of the message matches the decrypted signature.

A diagram showing how cryptographic signatures work

Due to the nature of asymmetric cryptography, only Alice (with her private key) could have generated that signature, which proves that the message hasn’t been altered since she signed it. If someone tries to alter the message, then the hash inside the signature won’t match the message anymore, and they can’t generate a new signature for the altered message without Alice’s private key.

Real cryptographic signature algorithms are a bit more complicated and varied than this, but this core idea is sufficient for today.
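
For a concrete example, here's the Alice-and-Bob flow sketched in Python using the cryptography package. Ed25519 is an arbitrary choice (and it hashes internally anyway); the explicit hash step is kept just to mirror the description above.

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Alice's side: hash the message, then sign the hash with her private key.
alice_private = Ed25519PrivateKey.generate()
alice_public = alice_private.public_key()

message = b"Meet me at noon."
digest = hashlib.sha256(message).digest()   # short, deterministic hash of the message
signature = alice_private.sign(digest)      # only Alice's private key can produce this

# Bob's side: re-hash the message and check the signature with Alice's public key.
try:
    alice_public.verify(signature, hashlib.sha256(message).digest())
    print("Signature valid: the message hasn't been altered since Alice signed it.")
except InvalidSignature:
    print("Signature invalid: the message or signature has been tampered with.")
```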

How do SAF codes work?

QR codes can store arbitrary data, not just URLs. A SAF code (signed audio fingerprint) is a binary blob, with this structure:

A diagram showing the structure of a SAF code

The most important parts are the audio fingerprint and the signature. The signature signs every other field of the SAF code. Viewers of the video can decode the QR code and verify that the signature was created by the target and matches the rest of the SAF code. That proves that the SAF code was not altered after it was generated by the target’s phone. Then they can try to match the fingerprint to the audio in the video, and if it matches it proves that the audio hasn’t been altered either. That’s pretty much all there is to it.

If a malicious editor fabricates some audio that the target didn’t say, it won’t match the audio fingerprint anymore. If they digitally alter the QR code so that the fingerprint matches the altered audio, then the signature won’t match. And they can’t generate a new signature to match the altered fingerprint without the target’s private key.
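
Putting that together, a verifier's core check might look like the sketch below. The parse_saf_code, fingerprint, and fingerprints_match helpers are hypothetical, and the exact layout details are covered in the following sections.

```python
from cryptography.exceptions import InvalidSignature

def verify_saf_code(blob, target_public_key, audio_near_qr):
    # parse_saf_code is a hypothetical helper that splits the blob into the signed
    # payload (magic, version, algorithm, timestamp, fingerprint) and the signature.
    payload, fields, signature = parse_saf_code(blob)

    # Step 1: the signature must verify against everything else in the SAF code.
    try:
        target_public_key.verify(signature, payload)
    except InvalidSignature:
        return "signature mismatch"           # QR code altered, or signed by someone else

    # Step 2: the signed fingerprint must match the audio actually in the video.
    if not fingerprints_match(fields.fingerprint, fingerprint(audio_near_qr)):
        return "audio fingerprint mismatch"   # audio altered after signing

    return "ok"
```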

Similarly, it’s not possible for an AI to generate a deep fake video containing a valid SAF code, because not even our most advanced AIs can generate a cryptographic signature without the target’s private key. This isn’t an arms race issue either — math is math. Modern cryptography should be secure from AI attacks for the foreseeable future. There’s no reason to think AI is any more of a threat than a human attacker. The primary arms race in cryptography remains algorithm cleverness and key size vs computing power, and it’s not hard to update our algorithms and increase our key sizes to keep up with increasing compute. Fundamentally, we’re always going to have mathematical functions that are easy to evaluate, but extremely hard to invert. AI doesn’t significantly change that equation.

Interestingly, cryptographic signatures also provide non-repudiation. So the target can’t later deny that they said what they said, or claim that the video was faked. Unless their private key was stolen, the only person that could have generated those SAF codes is the target.

The audio fingerprint

The core of the SAF code is the audio fingerprint. It’s essentially a hash of the audio chunk, but it needs to be robust to the various legitimate changes that occur between filming and viewer playback:

  • Added noise
  • Changes in volume and frequency response
  • Compression artifacts
  • Differences between the microphone on the target’s phone and the camera that’s filming them.
  • The different position of the phone’s microphone and the camera microphone (the target’s speech will be much louder in the phone’s mic).

Our fingerprint algorithm needs to ignore all these differences, while still being able to distinguish “let’s eliminate rabies” from “let’s eliminate babies”.

The prototype app uses a very simple robust audio hash similar to the algorithm that your phone uses to identify what song is playing. The hash is basically a coarsely encoded spectrogram of the sound.

It also calculates the RMS volume of the chunk, and stores it in the fingerprint next to the hash. This is useful because this particular hashing algorithm becomes extremely noisy during the gaps in the target’s speech, as it’s essentially hashing the background noise. I found that the verifier would consistently reject quiet chunks as mismatches, so encoding the volume allows the verifier to be a bit more forgiving when the target isn’t speaking. Of course, that means the verifier should also check that the volume number of the matched chunk is accurate (relative to other chunks), otherwise a malicious editor could insert fake audio into the quiet parts where the verifier is being more lax.
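
As an illustration of that style of fingerprint, here's a rough numpy sketch: a coarsely thresholded spectrogram plus an RMS volume byte. The frame and band counts are arbitrary choices, not the values the prototype actually uses. Two such fingerprints would then be compared by something like Hamming distance over the bit portion, with a looser threshold when the RMS byte says the chunk was quiet.

```python
import numpy as np

def fingerprint(chunk: np.ndarray) -> bytes:
    """Coarse spectrogram hash plus an RMS volume byte (illustrative values only)."""
    frame_len, n_bands = 1024, 16
    bits = []
    for i in range(len(chunk) // frame_len):
        frame = chunk[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        # Collapse the spectrum into a few coarse bands, then encode each band as one
        # bit: is its energy above or below the median energy for this frame?
        energies = np.array([band.sum() for band in np.array_split(spectrum, n_bands)])
        bits.extend(energies > np.median(energies))

    rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
    rms_byte = min(255, int(rms * 255))       # assumes samples normalised to [-1, 1]
    return np.packbits(np.array(bits, dtype=np.uint8)).tobytes() + bytes([rms_byte])
```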

This works well enough as a proof of concept, but it might be possible to train an AI to make subtle changes to the audio that aren’t detected by this algorithm. Further research is needed to figure out how robust this basic algorithm is, and investigate the alternative fingerprinting algorithms mentioned in the appendix.

The metadata fields

Looking at the metadata fields, the SAF code starts with the literal ASCII characters “SAF”, so we can tell at a glance that this QR code is (probably) a SAF code. The version and algorithm fields are for future extensibility. They’re both varints, and are currently hard coded to 1. The algorithm field describes how the audio fingerprint is generated, as well as other details like how frequent the chunks are, and what cryptographic signing algorithm was used. It might be a good idea to have two algorithm fields in a future version, separating the choice of fingerprinting algorithm from the signing algorithm.

The timestamp field guards against various kinds of malicious editing. It’s a uint64 containing the time that the audio chunk ended (ms since unix epoch). Just like all the other fields, it can’t be altered without causing the signature to mismatch. Looking at the timestamps of successive SAF codes in the video, we expect to see them increase by a consistent amount of time that aligns with the time the audio chunk occurs in the video. If that’s not the case, we know something was edited. We can also compare these timestamp differences to the expected chunk size of the fingerprinting algorithm.
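
For reference, here's one way those header fields might be serialized: LEB128-style varints for the version and algorithm, and a big-endian uint64 for the timestamp. The varint flavour and byte order are assumptions on my part, not a published spec.

```python
import struct
import time

def encode_varint(value: int) -> bytes:
    """LEB128-style varint: 7 bits per byte, high bit set on every byte but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_header(version: int, algorithm: int, timestamp_ms: int) -> bytes:
    return (b"SAF"
            + encode_varint(version)              # currently always 1
            + encode_varint(algorithm)            # currently always 1
            + struct.pack(">Q", timestamp_ms))    # uint64, ms since unix epoch

header = encode_header(1, 1, int(time.time() * 1000))
```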

The timestamp is useful for detecting certain edits that the fingerprint might have missed:

  • If an editor chops up and rearranges chunks of the video to change what the target is saying, we’ll notice that the timestamps jump back in time, or skip too far forward.
  • If they remove a chunk of video, we’ll notice a gap in the timestamps.
  • If they slow down or speed up the video, we’ll notice that the timestamps don’t match the video time. E.g. if two SAF code timestamps differ by 1 second, but the second SAF code’s matched audio chunk occurs in the video 2 seconds after the first, we know the video has been slowed down to half speed. Of course, a 2x slow down would probably be detected as an audio fingerprint mismatch too, but the timestamps let us detect more subtle speed changes that the fingerprinter might miss.
  • The timestamp also tells us the date and time that the video was filmed, so it can’t be misattributed to a different event that happened at a different time. I suppose the video could still be misattributed to an event at the same time but in a different location, so maybe it would be a good idea to include GPS coordinates in future versions of the SAF code (as an opt-in feature of course).

Not all edits of this kind are malicious, but the timestamps allow us to detect and report them so the viewer can decide if it’s a legitimate edit. It might also be a good idea to include a sequence number in addition to the timestamp (especially if future fingerprinting algorithms use variable chunk sizes), but for now that would be a bit redundant.
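
Those checks boil down to comparing consecutive timestamps. Here's a sketch, assuming we've already collected an (encoded timestamp, video time) pair for each SAF code; the tolerances are placeholders, not tuned values.

```python
def check_timestamps(codes, chunk_stride_ms=1000, tolerance_ms=250, speed_tolerance=0.05):
    """codes: list of (encoded_ts_ms, video_time_ms) pairs, in the order they appear."""
    issues = []
    for (prev_ts, prev_vt), (ts, vt) in zip(codes, codes[1:]):
        dt_encoded, dt_video = ts - prev_ts, vt - prev_vt
        if dt_encoded < 0:
            issues.append(("out of sequence", vt))        # chunks rearranged
        elif abs(dt_encoded - chunk_stride_ms) > tolerance_ms:
            issues.append(("gap or skip", vt))            # chunks removed or repeated
        elif dt_video > 0 and abs(dt_encoded / dt_video - 1) > speed_tolerance:
            issues.append(("speed change", vt))           # video slowed down or sped up
    return issues
```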

Another detail that helps detect edits is that the audio chunks are actually overlapped. So even if the editor cut the video exactly at the boundary between two chunks, the fingerprints wouldn’t match the audio. No matter where you cut the video, you’re always slicing through a chunk. This also prevents them from inserting fake audio in between the chunks.

A diagram showing how the audio chunks overlap
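
As a tiny example, with a hypothetical 2-second chunk emitted every second (the real chunk length and overlap may differ), the schedule looks like this:

```python
def chunk_schedule(total_ms, chunk_len_ms=2000, stride_ms=1000):
    """Yield the (start, end) times of overlapping chunks. Adjacent chunks overlap,
    so any cut point in the interior of the video lands inside some signed chunk."""
    start = 0
    while start + chunk_len_ms <= total_ms:
        yield (start, start + chunk_len_ms)
        start += stride_ms

print(list(chunk_schedule(6000)))
# [(0, 2000), (1000, 3000), (2000, 4000), (3000, 5000), (4000, 6000)]
```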

For more details about how the verifier works, check out the appendix.

UI considerations

Assuming this idea gains traction, verifiers could be built into video platforms like YouTube. If a SAF code is detected, a small lock icon could appear in the corner of the UI, like the lock icon Chrome shows next to the URL if it’s using HTTPS.

If a verification error is detected, a warning message could be shown to the viewer, or the lock icon could flash and animate to be unlocked. Clicking the lock icon would show more details, such as the specific error, the fingerprint matching score, or any other details of the detected edits.

The verifier could also be implemented as a Chrome extension, if the video sharing platforms aren’t interested.

To better automate the verifier, it would probably also be a good idea to add some sort of user ID in the SAF code metadata so that the verifier can look up the target’s public key. But that ID will need to be reported to the viewer for verification, otherwise a malicious editor could just create SAF codes signed with their own private key. But I’ll leave these details for the security experts and UI designers to figure out.

Advantages of SAF codes

The thing that distinguishes SAF codes from the other approaches is that they're a proactive tool that the target can use to protect themselves. They don't require cooperation from the camera manufacturers, or the video editors, or the AI designers, or even the streaming platforms. Everyone in the chain between the target and the viewer could be a malicious actor, and a valid SAF code would still prove that the target said what they said.

Since SAF codes are based on cryptography, they’re much less susceptible to the arms-race problem that AI based consumer side deep fake detectors have.

Unlike the other producer side approaches, SAF codes don’t require special camera hardware, or any hardware really. Any ordinary phone can run the Deep Defender app.

Since the QR codes are just an ordinary part of the video, this approach doesn’t require buy-in from the people who own the content pipelines to plumb through any metadata.

Limitations of SAF codes

The biggest limitation of this approach is that it requires that the target run this app on their phone, and have the screen visible to the camera. This is a problem for adoption, because it’s going to be a bit awkward to have your phone poking out of your shirt pocket, or propped up on the table in front of you.

If a video doesn’t contain any SAF codes, this tool can’t say anything about whether it’s legitimate or not. Unless adoption becomes widespread, it won’t be suspicious for a video to not have any SAF codes. We’d have to reach a point where the target is known to always use Deep Defender, so that any videos of the target that don’t contain SAF codes are immediately suspect.

The tool only detects audio alterations. It doesn’t help with video alterations, such as changing the target’s facial expression, or gestures. A malicious editor could have the target give the camera the middle finger, and their SAF codes would still be valid.

It would also be possible to edit the video such that the target is taken out of context, without being detectable by the SAF codes, and without looking too suspicious. Interviews are often filmed with two cameras, one aimed at the target and the other at the interviewer. If you take a clip of the interviewer asking one question, you can then cut to the target answering a different question. Since the target's SAF codes aren't visible in the interviewer's camera, viewers won't be able to see any of the SAF codes that encoded the question. The shot of the interviewer would need to linger for a couple of seconds after the target starts answering to let the audio chunks flush, but that wouldn't look particularly weird. The editor would just need to make sure the swapped questions are about the same length, so the timestamps still line up.

The weakest point of the security is the audio fingerprinting algorithm. Under the current prototype, it probably wouldn’t be too hard to alter the audio without significantly altering the fingerprint. We’ll definitely want to switch to a more advanced robust audio hash, or one of the other alternatives (see appendix), before actually using the system. But even with more advanced fingerprints, this will still be the weakest link in the system.

It’s also entirely possible for the target to sign fake SAF codes using their private key. There wouldn’t be much point to signing a fake audio fingerprint, but a malicious target could fake the timestamps by just changing the clock on their phone. For instance, they could commit a crime on Monday, then on Tuesday change their phone’s clock to Monday and film themselves, and try to use the SAF codes to give themselves an alibi. Essentially, the timestamps can’t be considered any more trustworthy than the target themself.

An alternative delivery method

If you trust your cameraman and editor, there’s another option. Rather than using the Deep Defender app, you could simply generate the SAF codes after the video is filmed, and embed them directly in the video. You’d still use the same QR codes as the app, but they would be baked into the video, e.g. in the bottom left corner of the screen, or as a subtle watermark over the whole screen. The obvious advantage of this approach is that it avoids the inconvenience of using the app and keeping your phone visible.
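
Here's a sketch of that post-production flow, reading the finished video's audio as a mono 16-bit WAV export and emitting one SAF code per chunk. It reuses the illustrative make_saf_code and fingerprint helpers from earlier, and the chunk timing parameters are assumptions.

```python
import wave
import numpy as np

def saf_codes_for_file(path, start_time_ms, chunk_len_ms=2000, stride_ms=1000):
    # Assumes a mono, 16-bit PCM WAV export of the finished video's audio track.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64) / 32768.0          # normalise to [-1, 1]

    chunk_len = int(rate * chunk_len_ms / 1000)
    stride = int(rate * stride_ms / 1000)

    codes = []
    for start in range(0, len(samples) - chunk_len + 1, stride):
        chunk = samples[start:start + chunk_len]
        end_ms = start_time_ms + (start + chunk_len) * 1000 // rate
        codes.append(make_saf_code(fingerprint(chunk), end_ms))
    return codes   # each code gets embedded into the corresponding frames as a QR code
```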

Another advantage of this approach is that the audio fingerprint doesn't have to be robust to differences between the camera and phone microphones, or their positions. It still needs to be robust to compression artifacts etc, but these variations are much smaller.

This also means the verifier can be stricter when matching the audio fingerprint, making it less likely that a malicious edit will make it through the verification. This approach is more secure as a result.

In addition, it would be possible to include the video itself in the fingerprint, preventing malicious editors from changing the target’s expressions or gestures.

Generating the SAF codes after the edit also means there won’t be any out-of-sequence timestamps caused by legitimate edits. So any instance of this error that the verifier does spot would almost certainly be a malicious edit, making it a much more useful signal to report to the viewer.

Also, multiple parties could sign the SAF codes, not just the target, giving their stamp of approval to the contents after the fact. This is sort of possible with the app, if multiple people all have Deep Defender running and visible. But with this approach you could have multiple signatures on the same SAF code, which is more efficient and convenient.

If SAF codes gain traction, and the streaming platforms are willing to support them, then there’s an even cleaner alternative. Instead of showing the SAF codes on screen as a QR code, they could be stored in the video’s metadata. This would make the SAF codes invisible and immune to compression artifacts or changes in video resolution, but plumbing this custom metadata stream through to the viewers would require the cooperation of the streaming platforms. This variant of the system ends up sharing a lot of similarities with the existing proposal of having cameras sign their videos, but is opt-in and doesn’t require specialized hardware.

Summary

SAF codes (signed audio fingerprints) work by generating an audio fingerprint of what the target is saying, cryptographically signing it, then presenting the result as a QR code. They can detect if the video has been cut up and rearranged, or if sections have been removed, or the whole video has been sped up or slowed down. They can also detect if the audio has been altered to change what the target has said, or if the entire video is being misattributed to a different event at a different time.

The cryptographic signature means they can’t be faked by generative AI. The signature also provides non-repudiation, meaning the target can’t later claim that they didn’t say what they said. They can provide this protection without the cooperation of the camera operator, the camera manufacturer, the video editor, or the streaming platforms.

The current Deep Defender prototype uses a very basic robust audio hash, and this is the weakest link in the chain. We’ll need to switch to a more advanced hash, or one of the other alternatives in the appendix, before we rely on this for real world use cases.

The main downside of this system is that the target needs to run the Deep Defender app on their phone, and keep the screen visible to the camera, which is a bit inconvenient. If the QR code isn’t readable by the camera (whether obscured or too blurry or too small), the protection is lost. Similarly, if a video doesn’t contain any SAF codes, the most we can say is that it’s suspicious (exactly how suspicious depends on the rate of adoption). Additionally, the SAF code doesn’t fingerprint the video, so a malicious editor could still change the target’s facial expression or gestures to something inappropriate.

Some of these downsides can be mitigated by generating the SAF codes after the fact and embedding them in the video (either as a QR code or as metadata). This avoids the need to use the app, makes the audio fingerprinting more secure, and also allows including the video in the hash. The downside of this alternative is that the target has to trust the people filming and editing the video.

Some speculative fiction

The year is 2055, and Gaten Matarazzo has just been elected US president, thanks to the broad popularity of his universal basic income policy. These days, ignoring the economic disruption caused by AI is as ridiculous as ignoring climate change, and people are finally ready to give UBI a shot. Attempts by reactionary groups to undermine his campaign using deep fakes were largely unsuccessful.

In past elections, the best tactic for deep fakers has been to use a light touch. Start with a real video of the candidate explaining one of their policies, and subtly twist their words to gently exaggerate their position. The change needs to be large enough to put off swing voters, but subtle enough to seem like something the candidate would plausibly say. For example, changing “it’s time to end fossil fuel subsidies” to “it’s time to end fossil fuel use”. It’s a delicate art, but GPT-9 is pretty good at it. Polling suggests that the outcome of several past elections can be attributed to deep fake video ads targeted at key demographics.

SAF codes and related technologies were supposed to mitigate this problem, and they did help somewhat. But despite the fact that YouTube, Facebook, TikTok, and VibeSpoon all integrated a popular open source SAF code verification library into their platforms, uptake of the client side apps was pretty poor. It just wasn’t cool to have your phone poking out of your shirt pocket all the time. So when people saw a video where no one was using SAF codes, it didn’t strike them as suspicious, it was totally normal.

All that has changed in recent years with the rise of wearable tech, and especially e-ink fabric. Everyone wears the stuff these days. Changing the design of your clothing every day to suit your mood is just plain fun. And with generative AI, you don’t even need art skills to show off your style.

President Matarazzo is never seen in public without his trademark e-ink SAF-hoodie. The constantly shifting patterns that cover the entire jacket encode all his speech, and many of his gestures, into beautiful cryptographic art. Unless you know what you’re looking for, you would never even notice the SAF code embedded in the design. But it’s clear as day to the decoders and verifiers in all the major platforms.

Only the most disreputable fake news peddlers bother with political deep fakery anymore, because it’s just not as effective as it used to be. AI produced videos are still commonly used for entertainment and as an artistic tool, but few people are fooled into thinking they’re real.

Appendix 1: Alternative fingerprinting algorithms

More advanced robust audio hashing

The most obvious improvement to the fingerprinting algorithm would be to use a more advanced robust audio hash. There are a lot of options out there, many of which are specifically targeted at the speech use case. The algorithm I chose was just the easiest one to implement for a proof of concept.

AI based hashing

We could train an AI to hash the audio. Specifically, we would train a network with an embedding layer that becomes the hash.

A diagram showing a possible ML architecture for audio fingerprinting

We train the network on pairs of audio chunks, some of which are the same (with added noise, frequency filtering etc), and others that are different (totally different, or just cut up and rearranged). We run the two chunks through two identical copies of our network. The network outputs to a small embedding layer, and then we compare the similarity of the 2 embedding layers. We train the network to generate embeddings with a high difference score if and only if the audio chunks are actually different.

The embedding layer is our audio fingerprint, so in the app we just use one copy of the network, and have it generate embeddings for our audio chunks. We’d need to experiment with different embedding sizes to find one that works reliably, without being too big for the QR code. For example, we could use a 64 node embedding layer, and discretize the value of each node to 2 bytes, giving us a 128 byte hash.
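
Here's a small sketch of turning such an embedding into a fixed-size fingerprint and comparing two of them. The embed() function stands in for the hypothetical trained network, and this particular quantisation scheme is just one option.

```python
import numpy as np

def embedding_to_fingerprint(embedding: np.ndarray) -> bytes:
    """Quantise a 64-float embedding (assumed roughly in [-1, 1]) into 128 bytes."""
    clipped = np.clip(embedding, -1.0, 1.0)
    quantised = np.round((clipped + 1.0) * 32767.5).astype(np.uint16)
    return quantised.tobytes()                       # 64 values * 2 bytes = 128 bytes

def fingerprint_distance(fp_a: bytes, fp_b: bytes) -> float:
    a = np.frombuffer(fp_a, dtype=np.uint16).astype(np.float64)
    b = np.frombuffer(fp_b, dtype=np.uint16).astype(np.float64)
    return float(np.mean(np.abs(a - b)) / 65535.0)   # 0 = identical, 1 = maximally far

# fp = embedding_to_fingerprint(embed(audio_chunk))  # embed() is the trained network
# altered = fingerprint_distance(fp, fp_from_saf_code) > threshold
```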

This means we’re back to training detector AIs to fight faker AIs, but this time we’ve given a huge advantage to the detector AI. Deep faking is a hard AI problem, and spotting deep fakes is an extremely hard AI problem, but comparing 2 audio chunks in this way is a relatively easy AI problem. So this time the arms race is much more in our favor.

Another similar AI approach that would be worth investigating is an autoencoder, where the AI learns to encode the audio in the embedding layer, then reconstruct it. This would have the advantage that some version of the original audio could be extracted from the SAF code for human verification.

Ultra-low bandwidth audio compression

The fingerprint doesn’t actually need to be a hash. There are ultra-low bandwidth audio codecs designed for speech. Rather than hashing the chunk, we could simply compress it enough that we can store the entire chunk in the SAF code. This approach has the advantage that we can decode the audio stream and recover the chunk. So if the automatic verifier finds a mismatch, or if the viewer is just feeling suspicious, they can play back the decoded audio stream to decide for themselves if it matches.

One example of such a compression algorithm is Lyra, which produces good results as low as 3.2 kbps. So we could encode a 1 second chunk of speech in 400 bytes. That’s pushing the limits of what will fit in a reasonable QR code, but there’s no reason we have to use 1 second chunks. Shorter chunks would mean smaller QR codes more often, which is fine as long as the QR codes are still readable by the camera.

Another option is Codec 2, which can go even lower. We don't need the decompressed audio to sound good, just to be intelligible to the listener, and Codec 2 manages that at 700 bit/s. This puts a 1 second chunk at around 90 bytes, which fits comfortably in a QR code.

Speech-to-text

In a similar vein, we could just run a speech-to-text algorithm on the audio, and use the resulting string as the fingerprint. If the verifier finds a mismatch, it could report the reconstructed transcript for the viewer to review. For this approach we’d probably want chunks that are a bit longer than 1 second.

Another approach would be to store the phonemes (e.g. the IPA), rather than the text. Phoneme detection is an easier problem than speech-to-text, and is probably also more robust for this use case. But the transcript wouldn't be as easy for a human to read.

Appendix 2: Verifier implementation details

The verifier algorithm is a little more complex than what I outlined above. Each SAF code has three timestamps:

  • Encoded timestamp: The timestamp that was encoded in the SAF code.
  • Video timestamp: The frame in the video when the SAF code first appears.
  • Audio timestamp: The time within the video’s audio stream of the chunk that was fingerprinted.

The first two are given, and the main complexity of the verification algorithm is figuring out the third. The encoded timestamp is measured from the unix epoch, and the video and audio times are measured from the start of the video. So these clocks have very different offsets, and there will also be some variance between them for each SAF code. But they should all advance at the same rate, and there shouldn’t be any significant drift. So the difference between consecutive SAF codes should be roughly the same for all three clocks.

The verifier runs through the following steps, and might report errors along the way. Not all of the errors are a definite sign that someone has maliciously edited the video though.

  1. For each frame of the video, try to find a QR code. If we find one, and it’s different to the last one we saw, proceed to verification. This frame’s time is the video timestamp of the SAF code.
  2. Try to decode the SAF code. If it’s not a valid SAF code (e.g. doesn’t start with “SAF”, or is the wrong length), ignore it. It’s probably an unrelated QR code.
  3. Do we recognise the version and algorithm codes? If not, it could mean we need to update the verifier, or the SAF code is using a deprecated algorithm. In any case we can’t proceed if we don’t know the fingerprinting algorithm, so report the error.
  4. Check the cryptographic signature using the target’s public key. If it doesn’t match it means that either we are using the wrong public key, or someone has maliciously edited the video. Report this as a severe error.
  5. Check the encoded timestamp. We expect it to be roughly the same as the last one plus the chunk stride (which depends on the fingerprinting algorithm that was used). We allow some variance, but if the new timestamp is too early or too late, report that the SAF code is out of sequence. This indicates there was some sort of edit, but we don’t know if it was malicious. Report it to the viewer and they can decide if it’s legitimate.
  6. Compare the encoded timestamp with the video timestamp. The difference between consecutive encoded timestamps should be roughly the same as the difference between consecutive video timestamps. But we allow a lot of variance here, because of the latency between ending the audio chunk recording (the definition of the encoded timestamp), then fingerprinting it and displaying the QR code on the screen (definition of the video timestamp). If there’s a mismatch, report the same type of error as step 5.
  7. Next we need to locate the SAF code's audio chunk in the video (i.e. figure out the audio timestamp). Search for a match by sliding a window around the audio stream, calculating the fingerprint for each candidate chunk, and comparing it to the fingerprint in the SAF code. Use the encoded timestamp, and the audio timestamp of the previous SAF code's match, to estimate what the new SAF code's audio timestamp should be, then search around there to find the best match. We're trying to maximise the similarity between the two fingerprints. The fingerprint algorithm I'm using in the prototype has a nice peak that lets us home in on the match (see the plot below, and the matching sketch after it).
  8. How good is the match? If the score is below a threshold, it’s likely that someone has manipulated the audio, but it’s also possible that we just got unlucky with noise etc (depends on how good our fingerprinting algorithm is). Report it to the viewer as a fingerprint mismatch, and they can decide how suspicious the score is. Does the bad score correspond to a moment in the video when the target said something particularly weird or controversial?
  9. Use the encoded timestamp and audio timestamp pairs to detect video speed changes. If the difference between consecutive encoded timestamps is less than the difference between their audio timestamps, then the video has been slowed down, and vice versa. Some amount of variance in this number is expected, because there will be a little bit of slop in the chunk matching. But if the speed difference exceeds a threshold, that’s probably a malicious edit, so report it to the viewer.
  10. Otherwise, report that the video (or at least its audio stream) was not altered.
A plot showing how the fingerprint match score smoothly approaches a peak around the optimal chunk offset
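
For step 7, the core matching loop looks something like this sketch. The search window, step size, and chunk length are arbitrary placeholders, and similarity() and fingerprint() are stand-ins for whatever fingerprinting algorithm the SAF code specifies.

```python
def locate_chunk(audio, sample_rate, saf_fingerprint, expected_end_ms,
                 search_ms=500, step_ms=10, chunk_len_ms=2000):
    """Slide a window around the expected position in the audio stream and return the
    best-matching audio timestamp (end of chunk, in ms) and its similarity score."""
    chunk_len = int(sample_rate * chunk_len_ms / 1000)
    best_score, best_end_ms = -1.0, None

    for offset_ms in range(-search_ms, search_ms + 1, step_ms):
        end_ms = expected_end_ms + offset_ms
        end = int(sample_rate * end_ms / 1000)
        start = end - chunk_len
        if start < 0 or end > len(audio):
            continue
        score = similarity(saf_fingerprint, fingerprint(audio[start:end]))
        if score > best_score:
            best_score, best_end_ms = score, end_ms

    return best_end_ms, best_score   # best_score is then compared to the mismatch threshold
```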

Written by Liam Appelbe

Code monkey, board game hoarder, aspirant skeptic