
Adventures with a Raspberry Pi camera module, its inaccurate timing / timestamps, & ffmpeg

I have been toying with a Raspberry Pi 3 for a few months in my spare time. If you don’t know what that is, it’s a $35 computer built on a credit-card-sized board. It uses a system-on-a-chip (SoC) very similar to what you’d find in a smartphone. Unlike most desktop PCs, it runs on the ARM architecture, not x86. It’s a great way to learn lots of new things, mostly about Linux, scripting tasks, and the lower-level intricacies of computers. It doesn’t have much computing power... but it has enough for software-based video encoding! And the foundation behind the Pi also sells an optional camera module for it.

I wanted to create a permanent-ish video feed. I had several options in front of me; the OS comes with easy tools to facilitate the use of the camera, but it also has a build of ffmpeg and a V4L2-based camera driver!

I was also planning on having audio in this stream, but it drifted out of sync incredibly fast. I was able to partially solve this issue after many hours poring over many pages of documentation and Google results. I’m still not even sure what exactly (or which combination of things) has helped, but I hope this offers answers, or at least leads, to other people who may also be seeking a solution.

As a disclaimer, some of the more tricky details may be inaccurate and are my own guessing, but at the end of the day... this works for me.

CHAPTER 1: GETTING AUDIO SUPPORT

Make sure your build of ffmpeg comes with pulseaudio or alsa support. At first, I was using a static, pre-compiled build that didn’t even work properly in armhf (and I had to fall back to the slower armel). If I tried to call for my microphone with it, it would complain about some missing code somewhere.

Thankfully, I then realized that, since its Stretch release, Raspbian comes with a pre-compiled version of ffmpeg. It’s version 3.2, which is a little old, so, just for the sake of it, I messed with my apt-get sources in order to grab 3.4.2 from the “unstable” branch of the distribution. At some point during that process, I was one keystroke away from screwing up the entire OS by updating over a thousand more packages, so if you go down that route... be careful.

You could compile ffmpeg yourself too, but at this point in time, that goes beyond my technical know-how, unfortunately.

Here’s another very simple thing that eluded me for a while: how to call for the microphone. I saw a lot of posts telling people to ask for hw:0,1 (or other numbers), and I kept doing this with no success. It turns out the better solution is to ask ffmpeg what it sees by itself:

ffmpeg -sources alsa

That’s just it! I would have loved to show you what that looks like, but unfortunately, it is now returning a segmentation fault on both my ffmpeg builds. I think that might have been caused by a recent kernel update. But that is supposed to work... I promise. My Blue Yeti USB mic can be called like this:

-f alsa -i sysdefault:CARD=Microphone
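
If ffmpeg -sources crashes for you too, alsa-utils can tell you much the same thing: arecord -L should list PCM names such as sysdefault:CARD=Microphone directly, and arecord -l lists capture cards by their short ID. Here’s a small, hypothetical helper (alsa_capture_names is my own name, not a standard tool) that turns arecord -l output into names that -f alsa -i accepts. It assumes the usual “card N: ID [Name], device M: ...” line layout.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn `arecord -l` output into ffmpeg ALSA input names.
# Assumes each capture card appears on a line shaped like:
#   card 1: Microphone [Yeti Stereo Microphone], device 0: USB Audio [USB Audio]
alsa_capture_names() {
  # Keep only "card N: ID ..." lines, print "sysdefault:CARD=ID",
  # and drop duplicates (a card with several devices shows up several times).
  sed -n 's/^card [0-9][0-9]*: \([^ ]*\) .*/sysdefault:CARD=\1/p' | awk '!seen[$0]++'
}

# Usage on the Pi (needs alsa-utils):
#   arecord -l | alsa_capture_names
```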

You may also be able to use pulseaudio. My understanding is that Raspbian is configured to run it on system boot, but it tries to initialize too early and silently fails. So you’ll have to start it yourself, manually, without sudo:

pulseaudio --start

And right after that, you can enumerate sources just as shown above (just replace “alsa” with “pulse”). For me, the result is:

Auto-detected sources for pulse:
  alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor [Monitor of Yeti Stereo Microphone Analog Stereo]
  alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo [Yeti Stereo Microphone Analog Stereo]
* alsa_output.platform-soc_audio.analog-stereo.monitor [Monitor of bcm2835 ALSA Analog Stereo]

And the one I have to choose has, of course, “input” in the name.
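
If you want a script to pick that source automatically, here’s a hypothetical helper (pick_pulse_input is my own name). It assumes the listing has one source per line, with a leading “*” marking the default source as in the listing above, and that ffmpeg prints that listing to stdout.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the first PulseAudio capture source from the
# listing printed by `ffmpeg -sources pulse`.
# Assumption: each source sits on its own line as "[*] name [description]".
pick_pulse_input() {
  awk '{
    name = ($1 == "*") ? $2 : $1                       # skip the default marker
    if (name ~ /^alsa_input\./) { print name; exit }   # inputs, not monitors
  }'
}

# Usage on the Pi (after `pulseaudio --start`):
#   MIC=$(ffmpeg -hide_banner -sources pulse 2>/dev/null | pick_pulse_input)
#   ffmpeg -f pulse -i "$MIC" ...
```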

I don’t know if there’s really a difference between using pulse and alsa. It didn’t seem to make any difference over the course of my testing. I am using alsa as it’s allegedly the standard.

CHAPTER 2: CHOICES OF VIDEO INPUT

The cool thing about the Pi’s SoC is that, across all models, they have the same GPU... almost. They all come with H.264 hardware encoding support. It means that the encoding takes place in (fixed-function?) hardware, and therefore, for all intents and purposes, doesn’t take up resources on the CPU. However, hardware doesn’t implement all the fancy refined techniques that are present in the state-of-the-art software encoder, x264, which are used to get the most out of the bitrate: psychovisual optimizations, smarter decisions, trellis stuff, and all that. In short: it’s extremely fast, but it’s not as good.

Now, just in case, I’m going to give you a little reminder on what is probably one of the most abused sets of terms in computer video:

MPEG-4 is a set of standards (Part 10 is AVC, aka H.264, while Part 2 is MPEG-4 Visual, whose Advanced Simple Profile, ASP, is what DivX and Xvid implement)

MP4 is a container (it’s also Part 14 of the standard set)

H.264 is a video compression standard.

x264 is a software library that provides an encoder for the H.264 standard.

ffmpeg is software that can use a lot of other libraries, including x264, in order to decode, encode, etc. video and audio content.

Let’s all try not to mix up these terms. That said, they do overlap to some degree, which is understandably why a lot of people say one to mean the other.

On all the Pi boards except the 3, you’ll most likely want to work with the hardware-encoded H.264 stream, because the CPU is just too slow. The 3, however, is good enough to use x264, with some compromises. You can even get up to 720p if you’re willing to cut the framerate heavily. But we can talk about that later, because for now, we need to get the video to ffmpeg!

There are a variety of ways we can do this:

Raspivid, raw H.264

Raspivid, raw YUV/RGB/grayscale

V4L2 driver, raw video (a lot of formats)

... and others

Raspivid comes with the most settings, and is probably the easiest. Seriously, look at all this! Here’s a run-of-the-mill command line that I got started with:

raspivid -t 0 -o - \
  -b 900000 --profile high --level 4.2 \
  -md 4 -w 960 -h 720 -fps 20 \
  -awb auto -ex nightpreview -ev -2 -drc med \
  -sh 96 -co 0 -br 50 -sa 5 \
  -p 1162,878,240,180 \
  -pts

Let’s go over this:

The backslashes let me continue the command on the next line; very useful.

-t 0 makes the time “infinite”; the program won’t stop running until you stop it by yourself, by using CTRL+C when in the terminal that runs it.

-o - makes the output go to stdout, which your terminal will print. So you’ll have a whole lot of gibberish. But you’ll understand why soon.

On the second line, I give the video a bitrate of 900 kbps and have it encoded with the High profile of H.264 at level 4.2, the highest level available on the hardware encoder. The High profile compresses the video better (by allowing stuff like the 8x8 transform and 8x8 intra prediction), but makes it slower to both encode and decode. However, the High profile has very broad hardware decoding support by now, and if you have that, it doesn’t matter.

On the third line, I am manually selecting the fourth sensor mode of the camera module. It offers the full field of view and bins pixels together for a somewhat less noisy image (but also slightly less sharp). I also define the resolution and framerate.

On the fourth line, I select the white balance, scene mode, exposure bias, and dynamic range compression modes. I’m not sure how the latter works, or what it really does. It might only make a difference in bright outdoor scenes when the sensor is operating at its lowest ISO and exposure values... but that’s only a hypothesis.

On the fifth line, I tweak the image. I sharpen it (at 96%) and saturate it a little bit (+5%). I don’t change the contrast and brightness from their default values, but you can lift the blacks a little bit with -co -2 -br 51 if you wish.

On the sixth line, I am defining where the preview window sits. This will display the camera feed on your screen; it’s displayed as a hardware layer that will go above everything else, because, as I understand it, it’s drawn directly by the GPU and bypasses everything the OS does below. By default, it will take up the whole screen, and that can be troublesome. Here, I make it take up a little 240x180 rectangle in the lower-right corner, whose coordinates are adequate for a resolution of 1400x1050.

And last, -pts is supposed to provide timestamps for the feed (raspivid’s own help describes it as saving timestamps to a file, for use with mkvmerge).
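
The preview rectangle is x,y,width,height in screen pixels, so if your screen resolution differs you have to redo the subtraction. A hypothetical helper (preview_rect and its optional margin argument are my own) that computes a lower-right placement:

```bash
#!/usr/bin/env bash
# Hypothetical helper: compute a raspivid "-p x,y,w,h" preview rectangle that
# sits in the lower-right corner of the screen, with an optional margin in pixels.
preview_rect() {
  local screen_w=$1 screen_h=$2 win_w=$3 win_h=$4 margin=${5:-0}
  local x=$(( screen_w - win_w - margin ))   # left edge
  local y=$(( screen_h - win_h - margin ))   # top edge
  printf '%d,%d,%d,%d\n' "$x" "$y" "$win_w" "$win_h"
}

# Usage: raspivid ... -p "$(preview_rect 1400 1050 240 180 2)" ...
```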

So now you have a video stream... but how do you get it to ffmpeg?

You have to pipe it. With this: |

Can you believe that character is useful for more than just typing the :| face?

Computers are incredible.

When you add this between two programs, the output of one goes to the other. This is where the -o - from earlier comes in; all of that data which would otherwise get displayed as gibberish text will go straight to ffmpeg!

Let’s now take a look at the entire command:

raspivid -t 0 -o - \
  [...] -pts \
  | ffmpeg -re \
  -f pulse -i alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo \
  -f h264 -r 20 -i - \
  -c:a aac -b:a 112k -ar 48000 \
  -vcodec copy \
  -f flv [output URL]

We have to tell ffmpeg that what’s coming in is a raw H.264 stream from stdin, and to copy it without reencoding it: that’s -vcodec copy. After that, you can redirect the stream to wherever you want; an RTMP address for live-streaming (hence the FLV), or you could have it record to a local file with the container of your choice.
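
If you turn this into a script, bash arrays keep each option intact through quoting. A sketch of the same pipe under a few assumptions: MIC and OUTPUT are placeholders you must replace (the RTMP URL is hypothetical), the raspivid image options from earlier are left out for brevity, and I swapped in the ALSA input I actually use; the pulse line works the same way.

```bash
#!/usr/bin/env bash
# Sketch: the raspivid -> ffmpeg pipe, kept in bash arrays.
# MIC and OUTPUT are placeholders (assumptions), overridable from the environment.
MIC=${MIC:-sysdefault:CARD=Microphone}
OUTPUT=${OUTPUT:-rtmp://example.invalid/live/STREAM_KEY}   # hypothetical URL
FPS=${FPS:-20}

# raspivid: raw H.264 to stdout (-o -), running until stopped (-t 0).
RASPIVID=(raspivid -t 0 -o - -b 900000 --profile high --level 4.2
          -md 4 -w 960 -h 720 -fps "$FPS")

# ffmpeg: microphone via ALSA, raw H.264 from stdin, video copied as-is.
FFMPEG=(ffmpeg -re
        -f alsa -i "$MIC"
        -f h264 -r "$FPS" -i -
        -c:a aac -b:a 112k -ar 48000
        -vcodec copy
        -f flv "$OUTPUT")

run_pipeline() {
  "${RASPIVID[@]}" | "${FFMPEG[@]}"
}

# Start streaming with: run_pipeline
```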

You can also ask Raspivid to pipe in raw video, like so:

-o /dev/null \
-rf yuv -r - \

The H.264 stream will go to /dev/null (a sort of “file address” that discards everything sent to it; it’s the black hole of Linux systems), while a new stream of raw video will get piped to stdout instead. Then, on the ffmpeg side, you have to accommodate this change:

-f rawvideo \
-pixel_format yuv420p -video_size 960x720 -framerate 20 \
-i - \

And then, of course, you can’t just pipe raw stuff towards your output, so you’ll have to reencode it (but we’ll go over that later).

You remember the -pts setting that I mentioned before? It doesn’t really work, and even if it did, you can’t use it. That’s because, when piping raw feeds towards ffmpeg, raspivid seems to be adding this data as some sort of extraneous stuff, and this corrupts how ffmpeg reads it.

Have you ever tried to load an unknown audio file in Audacity, only to be prompted “give me the sample rate, the little/big endian format, the bit depth”? And if any of the settings were off, you would just get screeching distorted audio? This is kind of a similar situation. With a raw H.264 feed, I would get funny distortion on the lower 2/5ths of the video, while the raw YUV feed would just be like an out-of-phase TV, but way worse, and much much greener.
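
Raw video works the same way as that raw audio: there’s no header, so ffmpeg slices stdin into fixed-size frames computed from -video_size and -pixel_format, and a single stray byte shifts every frame after it. For yuv420p, a frame is a full-resolution Y plane plus two quarter-resolution chroma planes, i.e. 1.5 bytes per pixel. A quick sketch of that arithmetic, assuming even dimensions and no row padding:

```bash
#!/usr/bin/env bash
# Bytes per yuv420p frame: Y plane (w*h) + U and V planes (w/2 * h/2 each).
# Assumes even width and height, as yuv420p requires, and no row padding.
yuv420p_frame_bytes() {
  local w=$1 h=$2
  echo $(( w * h + 2 * (w / 2) * (h / 2) ))
}

# Bytes per second flowing through the pipe at a given (integer) framerate.
yuv420p_bytes_per_second() {
  local w=$1 h=$2 fps=$3
  echo $(( $(yuv420p_frame_bytes "$w" "$h") * fps ))
}

# Example: 960x720 at 20 fps, as in the raw YUV command above.
yuv420p_frame_bytes 960 720          # 1036800 bytes per frame
yuv420p_bytes_per_second 960 720 20  # 20736000 bytes per second (~20.7 MB/s)
```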

You can also use the video4linux2 driver for the camera. It offers slightly less control over the settings, but it has one major advantage over raspivid: once you launch it, it will keep running, and you can tweak settings over time; with raspivid, you can only input settings before the launch. Here’s how to proceed:

sudo modprobe bcm2835-v4l2
sudo v4l2-ctl -d 0 -p 20 --set-fmt-video width=768,height=432,pixelformat=YU12 \
  --set-ctrl contrast=-3 --set-ctrl brightness=51 --set-ctrl saturation=10 \
  --set-ctrl sharpness=98 --set-ctrl white_balance_auto_preset=1 \
  --set-ctrl auto_exposure_bias=14 --set-ctrl scene_mode=0
v4l2-ctl --set-fmt-overlay=width=240,height=180,top=894,left=1160
v4l2-ctl --overlay=1

The first line loads the driver. (you can use rmmod to unload it)

The second line specifies that we are, on device 0 (-d 0), setting a framerate of 20, which resolution, and which pixel format (YU12 being ffmpeg’s YUV420p).

Then we have all the other image settings that I mentioned before.

And then the overlay settings, very similar to Raspivid’s.

So, all of this works, except there’s a problem; the audio goes out of sync very very fast. In fact, it’s not even synchronized in the first place, and it only gets worse as time goes on!

You see, it turns out that our raw H.264 / YUV streams don’t have timestamps. Raspivid is unable to add any. However, the video4linux2 driver supposedly supports them.

One thing to keep in mind, however: it doesn’t really matter what framerate and resolutions you specify in your V4L2 controls; those will be “overridden” by what you have in your ffmpeg command line.
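
Since ffmpeg’s input options win anyway, one way to avoid surprises is to keep the resolution and framerate in one place and build both command lines from it. A sketch with hypothetical variable names; the v4l2-ctl arguments mirror the ones above, and -input_format is the option of ffmpeg’s v4l2 input device for requesting a pixel format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one source of truth for size and framerate, used to
# derive both the v4l2-ctl format arguments and ffmpeg's V4L2 input options.
WIDTH=${WIDTH:-768}
HEIGHT=${HEIGHT:-432}
FPS=${FPS:-20}

# Arguments for: sudo v4l2-ctl "${V4L2_ARGS[@]}"
V4L2_ARGS=(-d 0 -p "$FPS"
           --set-fmt-video "width=${WIDTH},height=${HEIGHT},pixelformat=YU12")

# Input options for: ffmpeg "${FFMPEG_V4L2_INPUT[@]}" ...
# (YU12 on the V4L2 side corresponds to yuv420p on the ffmpeg side.)
FFMPEG_V4L2_INPUT=(-f v4l2 -input_format yuv420p
                   -video_size "${WIDTH}x${HEIGHT}" -framerate "$FPS"
                   -i /dev/video0)
```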

And unfortunately, the V4L2 driver seems to have an annoying limitation; I can’t get it to go past 1280x720. It might be forced into using the wrong sensor modes, or maybe it is incorrectly assuming the first version of the camera module somehow.

CHAPTER 3: A MATTER OF TIME

Audio drift.

I have tried so many settings, folks. And so many combinations. Mostly because of that, I’m still not 100% sure what’s contributing towards solving it, or what the exact reason for this lack of audio synchronization is.

At first, I had drift right off the start. Then it was over the course of 45 minutes. Then it took 5 hours to become really bad. And now, it only starts being noticeable after around 7 hours. I haven’t managed to fully fix it yet, but unfortunately, I’m running out of ideas, and testing them is becoming more and more tedious when I have to wait several hours before the audio drift rears its ugly head.

From what I understand, the camera module has its own timing crystal, which runs at 25 MHz. On the first version of the camera module, it accidentally ran at 24.8 MHz for a while. As a result, for example, people were getting 30.4fps instead of 30.0. This was said to be fixed, and that the drift was then “less than a 1/100th of what it used to be”. But that means there was still drift. I don’t know what is up with the second version of the camera module, though.

I didn’t really take notes while solving the issue; I’m writing a lot of this from memory, so I can’t give a lot of details, and I might be misremembering a few things. A lot of this is just guesses, some of which are somewhat educated, some of which are a lot wilder.

My hypothesis is that there are separate issues that compound and/or add on top of each other, somehow. But there’s one thing that I’m 99% sure of: the framerate is inaccurate. You see, I was requesting 20fps, but I wasn’t getting 20. I was getting ~20.008. I noticed this after leaving the stream on for a while with -vsync vfr, and piping part of its output with another ffmpeg instance to a bunch of MKVs for analysis.

This checked out with the fact that the video was lagging behind audio more and more as time went on, and that, with -vsync cfr, I was dropping frames like clockwork, once every 2 minutes and 15 seconds.

So why does this happen? My best guess is that ffmpeg “naively” thinks that it is going to get the exact framerate that it’s asking from V4L2, and then assumes certain things based on that. That could explain why, even with dropping the extra frames, things still went out of sync eventually.

Unfortunately, .008 is not precise enough; a couple more decimal places are nice when streaming over many hours. Putting -vsync cfr back on, I tried .007 instead, and compared the dropped vs. the duplicated frames. If more were dropped than duplicated, then .008 was too much, and vice versa (or maybe it was the opposite). This is how I landed on .00735, which, over the course of 89 hours (I was off visiting Paris with my girlfriend in the meantime), dropped “only” 29 frames but also duplicated 3, somehow (maybe jitter in that timing, even if it ends up averaging to the right number?).

In the end, the solution is to work out what the actual framerate is yourself, and then insert it in there as both the input and output rate.
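
The dropped-frame interval gives you that number fairly directly: if ffmpeg expects f fps but one surplus frame has to be dropped every T seconds, the camera is really delivering about f + 1/T fps. A minimal sketch of that arithmetic, plus the frames-over-time version for when you count frames in a recording, using awk for the floating-point math (the function names are my own):

```bash
#!/usr/bin/env bash
# Actual framerate from the interval between dropped frames with -vsync cfr:
# one surplus frame every T seconds means the source runs 1/T fps too fast.
fps_from_drop_interval() {
  local nominal=$1 seconds=$2
  awk -v f="$nominal" -v t="$seconds" 'BEGIN { printf "%.5f\n", f + 1 / t }'
}

# Actual framerate from a frame count over a known duration in seconds.
fps_from_count() {
  local frames=$1 seconds=$2
  awk -v n="$frames" -v t="$seconds" 'BEGIN { printf "%.5f\n", n / t }'
}

# One dropped frame every 2 min 15 s (135 s) at a nominal 20 fps:
fps_from_drop_interval 20 135   # 20.00741
```

For scale: 1/135 ≈ 0.0074, which lines up with the .00735 above, and the 89-hour run’s 29 dropped minus 3 duplicated frames is a residual of about 26 frames over 320,400 s, roughly 0.00008 fps. To count the frames in one of those test MKVs, ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of csv=p=0 file.mkv should print the number (it decodes the whole file, so it takes a while).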

From what I understand, -framerate 20 -r 20.00735 as input options, right after one another, are telling ffmpeg to query 20fps from the input device, but then to actually “sample” the source at 20.00735fps. (update: I think this is actually wrong. I have removed the -r in input now.)

Chances are that, on your camera module, the number will be different; it seems probable to me that the timing error might differ across camera modules, as a sort of timing crystal lottery, akin to the silicon-overclocking-potential lottery.

There is also something else altogether, the -async option (followed by a number of samples, e.g. 10000). It’s supposed to keep the audio and video clocks synced together, but it never really seemed to do anything; I believe it’s because it’s pointless to sync 2 clocks if one of them is not working properly to begin with. That said, if it wasn’t there, it might be causing another drift on top of the existing one(s), so I’m leaving it in just to be safe.

I also use -fflags +genpts+igndts right at the start, which supposedly re-generate the timestamps properly or something like that.

The “nuclear option” to ultimately get rid of the drift is to restart ffmpeg after a certain amount of time. It’s not a solution as much as it is a workaround, really.

I don’t think you can avoid doing that, because you can’t get a precise enough timing for it to go on synced forever. This can be achieved by using the -t option right before the output, e.g. -t 12:30:25 for 12 hours, 30 minutes, 25 seconds, and after the ffmpeg command line(s), calling for your own script again. (see below)

CHAPTER 4: x264 ENCODING SETTINGS

Here’s the thing: if you’re gonna go the way of software encoding, you will need to actively cool the Pi, especially if you overclock it. Speaking of overclocking, here are my /boot/config.txt settings (which may not work on your Pi, silicon lottery, etc.):

arm_freq=1320
core_freq=540
over_voltage=5
sdram_freq=600
sdram_schmoo=0x02000020
over_voltage_sdram_p=6
over_voltage_sdram_i=4
over_voltage_sdram_c=4

It turns out you can actually push the RAM that high with the help of those other settings (schmoo controls the timings). While I’ve always had small heatsinks in place, they’re not enough to sustain a hard load for an extended period of time. At first, I used an “Arctic” USB fan, and that actually worked quite well, but I switched to a case with a tiny fan inside of it a couple of days ago. It doesn’t cool as much, and has a whine, a very quiet one, but one nonetheless. I am able to stay at ~72 °C while the CPU is hovering around 75% load.
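
To keep an eye on the temperature while you tune, the kernel exposes the SoC temperature in millidegrees Celsius in /sys/class/thermal/thermal_zone0/temp, and Raspbian’s vcgencmd measure_temp prints it in a form like temp=72.0'C. A hypothetical little logger, assuming thermal_zone0 is the SoC sensor (it usually is on the Pi):

```bash
#!/usr/bin/env bash
# Convert the kernel's millidegree reading (e.g. 72034) to degrees Celsius.
millideg_to_c() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Hypothetical logger: print a timestamped temperature every INTERVAL seconds.
# Assumes the SoC sensor is thermal_zone0.
log_temperature() {
  local interval=${1:-10}
  local zone=/sys/class/thermal/thermal_zone0/temp
  while :; do
    printf '%s %s C\n' "$(date +%T)" "$(millideg_to_c "$(cat "$zone")")"
    sleep "$interval"
  done
}
```

Run log_temperature 30 in a second terminal while encoding. vcgencmd get_throttled is also worth a look: anything other than throttled=0x0 means the firmware has throttled or detected under-voltage at some point since boot.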

My encoding settings are:

ffmpeg -re \
  -fflags +genpts+igndts \
  -thread_queue_size 1024 \
  -f alsa -i sysdefault:CARD=Microphone \
  -thread_queue_size 1024 \
  -video_size 768x432 -framerate 20 -r 20.00735 -i /dev/video0 \
  -c:a aac -b:a 128k \
  -threads 4 -vcodec libx264 -profile:v high -tune stillimage \
  -preset faster -trellis 1 -subq 1 -bf 6 -b:v 768k \
  -vsync cfr -r 20.00735 -g 40 -async 24000 \
  -vf "hqdn3d=0:0:4:4,eq=gamma=1.1,drawtext=fontfile=/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans-Bold.ttf:text='%{localtime\:%T}': fontsize=16: fontcolor=[COLOR]@[OPACITY]: x=6: y=8" \
  -f flv -t 8:00:00 [OUTPUT URL/FILE]

sleep 2s
bash ./V4L2_transcode_raw.sh

The order of the options matters a lot in ffmpeg and is quite finicky. I start from the “faster” preset. My goal is to use from 50 to 80% of the Pi’s CPU on average. In my case, the image will be pretty static, so there are a few things I can do to make the most out of the processing power available to me.

I enable the “stillimage” tuning preset, which turns down the strength of the deblocking filter and tweaks the psychovisual RD settings accordingly.

I turn the sub-pixel motion estimation quality (subq) down from the preset’s 4 to 1. Update: subq is very important because, speaking very broadly, it controls how precisely motion is refined and how carefully each macroblock is encoded, and macroblocks are the foundation of how the video is encoded. However, it’s one of the most expensive settings, if not THE most expensive, and its cost scales up with resolution very fast. Ideally you’d want at least 3 or 4, but that’s too much at 768x432 for the RPi 3 at this framerate.

Increasing from the preset’s default of 3, I allow up to 6 B-frames in a row (-bf 6). B-frames are bi-directional: they can reference future frames, instead of only the past. They are one of the cornerstones of H.264 and very efficient, but costly. The more I allow, the bigger the resource spike I get if I wave my hand in front of the camera to simulate lots of motion. You can see how many consecutive B-frames x264 ended up using in the end-of-encoding stats. In my case, they ended up being the maximum allowed amount in a row well over 90% of the time.

I also enable trellis quantization. To quote Wikipedia: it “effectively finds the optimal quantization for each block to maximize the PSNR relative to bitrate”.

I set the keyframe interval to 40 frames: at 20 fps, that’s 2 seconds, the standard compromise for streams. A longer keyframe interval would increase compression efficiency... but CPU usage too, as well as the initial delay to connect.
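
The keyframe arithmetic is just fps × seconds, and with a measured framerate like 20.00735 it still rounds to 40. A trivial helper (gop_frames is my own name), in case you change the framerate later:

```bash
#!/usr/bin/env bash
# Keyframe interval in frames for a target spacing in seconds:
# -g = fps * seconds, rounded to the nearest whole frame.
gop_frames() {
  awk -v f="$1" -v s="$2" 'BEGIN { printf "%d\n", f * s + 0.5 }'
}

gop_frames 20 2        # 40
gop_frames 20.00735 2  # 40
gop_frames 6 2         # 12
```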

I also make use of a couple video filters. The first one is the excellent hqdn3d denoiser; it runs very fast and does an excellent job. I don’t use spatial denoising at all (the first two numbers), but I dial temporal denoising up to 4.

I’ve gone up to 100 just for fun and I saw that in the absence of motion, with noise being pretty much entirely wiped out, the CPU usage stays relatively low at around 50%. The downside is that everything is a ghost and leaves a hell of a trail. You should definitely play around with hqdn3d’s settings; whatever little cost it has is far offset by not having to encode anywhere near as much noise. I’d recommend a minimum of 3.

The second video filter displays the current time in the upper-left corner.
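
Filter chains get long and fragile, especially drawtext’s escaping: the colon inside %{localtime:%T} has to be escaped so the filter parser doesn’t treat it as an option separator. A hypothetical helper that builds the chain in one place; it leaves out the font color/opacity option, so add your own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assemble the -vf chain from its parts so the quoting and
# escaping of drawtext stay in one place. Font path as in the command above.
FONT=/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans-Bold.ttf

build_vf() {
  local temporal=${1:-4} gamma=${2:-1.1}
  local denoise="hqdn3d=0:0:${temporal}:${temporal}"   # spatial off, temporal on
  local grade="eq=gamma=${gamma}"
  # The ':' inside %{localtime\:%T} is escaped so the filter parser keeps it.
  local clock="drawtext=fontfile=${FONT}:text='%{localtime\:%T}':fontsize=16:x=6:y=8"
  printf '%s\n' "${denoise},${grade},${clock}"
}

# Usage: ffmpeg ... -vf "$(build_vf 4 1.1)" ...
```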

Then I ask for a maximum of 8 hours of running time (-t 8:00:00), and have the script call itself again, as explained before. This also makes it retry over and over if my router temporarily goes down because of a DSL de-sync or any other reason. It also makes it a lot easier to iterate on encoding settings; all you have to do is press Q to make ffmpeg stop encoding, and your bash file will be read again from the top.
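
One side effect of ending the script with bash ./V4L2_transcode_raw.sh is that each restart starts a new bash while the previous one keeps waiting for it, so idle shells pile up over days. A loop avoids that. This is a minimal sketch with a hypothetical run_forever wrapper; MAX_RUNS and RESTART_DELAY are my own test hooks, not ffmpeg options.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: rerun a command forever (or MAX_RUNS times), pausing
# RESTART_DELAY seconds between runs, instead of a script calling itself.
run_forever() {
  local max_runs=${MAX_RUNS:-0}      # 0 means no limit
  local delay=${RESTART_DELAY:-2}
  local runs=0
  while :; do
    "$@"                              # e.g. the encoding script; returns after -t or Q
    runs=$((runs + 1))
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      break
    fi
    sleep "$delay"
  done
}

# Usage, from a separate launcher script (the encoding script then drops its
# last two lines):
#   run_forever bash ./V4L2_transcode_raw.sh
```

Because the loop runs bash ./V4L2_transcode_raw.sh fresh each time, pressing Q still picks up your edits. If you’d rather keep the self-call, changing that last line to exec bash ./V4L2_transcode_raw.sh should also avoid the pile-up, since exec replaces the current shell instead of nesting a new one.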

You can easily go up to 720p and up the speed vs. quality preset ladder if you’re willing to compromise on framerate. Around 5-6 fps, you should be able to use settings that are around the medium/slow preset. And of course, if bitrate is not a concern, you can use the ultra/superfast preset and up the framerate. But then you might as well be using the hardware-encoded stream :)
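
A rough way to compare those trade-offs is the number of pixels per second the encoder has to process. It’s only a proxy (x264’s cost also depends heavily on content and settings), but it shows why 720p at 6 fps lands in the same ballpark as 768x432 at 20 fps:

```bash
#!/usr/bin/env bash
# Pixels per second the encoder has to process: width * height * fps.
# A rough proxy only; real x264 cost also depends on content and settings.
pixel_rate() {
  echo $(( $1 * $2 * $3 ))
}

pixel_rate 768 432 20    # 6635520  (my main settings)
pixel_rate 1280 720 6    # 5529600  (720p at 6 fps)
pixel_rate 480 272 20    # 2611200  (the lower-res settings in the update)
```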

I can’t find the source of that information again, but x264 with the superfast preset, IIRC, beats everything else at equivalent speed, competing software encoders and especially hardware encoders. That said, I have noticed that the Pi GPU’s own hardware encoder works pretty well for very static content, even at surprisingly low bitrates. Make of that what you will! In the end, you should definitely experiment on your own (that’s what Raspberry Pi boards are for), and not take everything that I say at face value.

Update, March 3rd, 2017: better video settings for lower-res stuff:

This is with a resolution of 480x272 and a framerate of 20.

-threads 4 -vcodec libx264 \
-profile:v high -preset medium -tune stillimage \
-b:v 512k -bufsize 768k -maxrate 768k \
-subq 6 -bf 5 \

It turns out subq is a bit more important than I thought, especially at lower resolutions. To see what I mean, try setting bitrate very low, like 128k, and observe subq 1 vs. subq 6. Dialing it up to 6 really pays off. The “medium” preset has it up to 7, which is maybe a bit too much. With these settings, it’s possible to go down to lower bitrates while maintaining excellent quality, enough to potentially see your stream over unstable 3G connections if you wanted to.

It also turns out that specifying a bufsize and maxrate is very important. If your camera ends up filming something that’s very flat (e.g. pitch black night), x264 is designed to not waste bits where there’s no need. So the bitrate lowers itself way under what you specified... but x264 also interprets this as “all that bitrate I’m not using now, I can use it as soon as I have something meaningful to encode again!”. And when that happens, it will not only make the bitrate spike enormously, but also saturate the Pi’s CPU because x264 now wants to do a lot more than before... bufsize and maxrate keep this in check.
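
The numbers behind that: -bufsize roughly caps how many bits x264 can spend above the -maxrate pace in a burst, and with -bufsize equal to -maxrate that cap is about one second’s worth of bits at maxrate. At 512 kbps and 20 fps, the average budget is about 25.6 kbit per frame. A small sketch of both calculations (function names are my own; rates in kbit, as in the ffmpeg options):

```bash
#!/usr/bin/env bash
# How many seconds' worth of maxrate the VBV buffer holds: bufsize / maxrate.
# This bounds how far a burst can run above the maxrate pace.
vbv_window_seconds() {
  awk -v buf="$1" -v max="$2" 'BEGIN { printf "%.2f\n", buf / max }'
}

# Average bits available per frame at the target bitrate (kbit/s -> bits).
bits_per_frame() {
  awk -v kbps="$1" -v fps="$2" 'BEGIN { printf "%d\n", kbps * 1000 / fps }'
}

vbv_window_seconds 768 768   # 1.00
bits_per_frame 512 20        # 25600
```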

Because the resolution is so low, I do something a little bit different with the filtering chain: I crank up the temporal denoising a bit higher & add sharpening.

-vf "hqdn3d=0:0:5:5,unsharp=5:5:0.8:5:5:0.8,[......]

Be careful with the filter chain: I suspect a lot of (if not all) filters are single-threaded, and could be “semi-invisibly” holding back encoding if you ask too much of them. For example, I can’t reliably run unsharp on resolutions above 480x272 if the framerate is 20. I also tried, at one time, to keep the video input at 720p, then to resize it to the final resolution with the scale filter; turns out if you do it with lanczos instead of bicubic, it won’t be fast enough.

Ultimately, I’ll say again, read up on what settings do, experiment with their impact on both quality and speed, gauge which tradeoff you need between framerate and resolution, and you’ll find something that suits what you’re filming.

IN CONCLUSION

This was a real pain in the ass, but also interesting to play detective with. I hope this helps other Raspberry Pi users. There are definitely other things to be explored, such as encoding ffmpeg’s output in hardware again (using the OMX libraries), so that you can still do the denoising, text, etc. in ffmpeg, but still have very cheap hardware encoding (which the Pi 1, 2, and Zero models desperately need). And maybe someone else will figure out how to fix the audio desync further.

This is what the Pi and open-source things are about, anyway: sharing your discoveries and things that you made, your process... even if they’re not perfect.

My understanding is that, were I to use a device that works as both a video AND audio source, I would (maybe) not be facing this issue, as both sources would be working off of the same internal clock. However, that’d be worse than a workaround, it would have other issues (driver support?) and it defeats the point of trying to solve this on the hardware that Raspberry Pi users have :)

Thanks for reading!
