Making a slideshow with Ffmpeg

Beware! There are easier ways to do this ;-) But I’ve long wanted to learn how to use some of the more advanced features of ffmpeg, and putting together a slideshow (with an inset video) for an online course during Covid-19 was the opportunity. Now that I have a pipeline, this should not take so long next time.

1. Write the talk you normally would for a live presentation. Make a PDF version of the slides.

2. Record the talk

Either with your laptop webcam, or using a digital camera. I’ll be using a resolution of 1280 x 720 pixels, at 25 frames per second. Since the video inset will be shrunk, a smaller resolution (800x600) would be fine.

If you’re like me, you can’t do this in a single take. No problem. Also, be careful about the sound quality. One could record sound separately from the moving images, and later combine them, but this is beyond this particular HOWTO.

3. Process the talk video

Using ffplay, determine the start and stop time in each segment. (ffprobe is invaluable too.) Trim the videos and transcode to MP4. In my limited experience, working with MP4s in ffmpeg is most successful.

ffmpeg -i PA131753.MOV -ss  7 -t 509 part1.mp4 
ffmpeg -i PA131754.MOV -ss  6 -t 449 part2.mp4 
ffmpeg -i PA131755.MOV -ss  6 -t 284 part3.mp4 
ffmpeg -i PA131758.MOV -ss 80 -t 557 part4.mp4
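
ffplay shows start and stop times, but -t takes a duration, so each -t value above is stop minus start. A minimal sketch of that arithmetic, assuming a hypothetical helper named trim_cmd and an illustrative stop time of 516 s for the first clip (whole seconds only). It only prints the command, so you can check it before running anything:

```shell
# trim_cmd INPUT START STOP OUTPUT   (hypothetical helper, not part of ffmpeg)
# Prints an ffmpeg trim command. -t is a duration, so it is STOP - START.
# START and STOP are whole seconds here; shell arithmetic has no decimals.
trim_cmd() {
  printf 'ffmpeg -i %s -ss %s -t %s %s\n' "$1" "$2" "$(( $3 - $2 ))" "$4"
}

# Illustrative values: start at 7 s, stop at 516 s (the stop time is assumed).
trim_cmd PA131753.MOV 7 516 part1.mp4
```

Pipe the printed line to sh once it looks right.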

Concatenate the segments:

ffmpeg -i part1.mp4 -i part2.mp4 -i part3.mp4 -i part4.mp4 \
  -filter_complex \
    "[0:v:0][0:a:0] [1:v:0][1:a:0] [2:v:0][2:a:0] [3:v:0][3:a:0] \
     concat=n=4:v=1:a=1 [outv][outa]" \
  -map "[outv]" -map "[outa]" me.mp4

(I couldn’t get the simpler -concat function to work.) In words: “take the first video stream and the first audio stream of input 0, then those of input 1, and so on; concatenate the four segments into one video and one audio stream, then map those streams to the video and audio of the output.”

Then add a fade-in and fade-out (not necessary, of course):

ffmpeg -y \
   -i me.mp4 \
   -filter_complex \
     "color=black      : 1280x720 : d=1720            [base] ;  \
      [0:v] fade=in    : st=0     : d=2    : alpha=1  [v1]   ;  \
      [v1]  fade=out   : st=1718  : d=2    : alpha=1  [v2]   ;  \
      [base][v2] overlay                              [v3]   ;  \
      [0:a] afade=t=in : st=0     : d=2               [a1]   ;  \
      [a1] afade=t=out : st=1718  : d=2               [a2]    " \
   -map "[v3]" -map "[a2]" \
   -t 1720 \
   me_fade.mp4

In words: “make a 1280x720 black screen for 1720 s; take the video from the first input and fade in over 2 s, then take that stream and fade out for 2 s starting at second 1718; overlay the faded stream on the black; fade the audio in and out similarly, and combine; trim the final video to 1720 s.”
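
The fade-out start time (1718) is just the total duration (1720) minus the fade length (2). A small sketch of that calculation, assuming a hypothetical helper named fade_out_start; the ffprobe call in the comment is one common way to read the duration from the file, and is left commented out so the snippet runs without any video:

```shell
# fade_out_start DURATION FADE   (hypothetical helper)
# Prints the time at which a fade-out lasting FADE seconds must start
# so that it ends exactly at DURATION seconds. awk handles decimals.
fade_out_start() {
  awk -v d="$1" -v f="$2" 'BEGIN { print d - f }'
}

# The duration could be read from the file instead of typed in, e.g.:
#   dur=$(ffprobe -v error -show_entries format=duration \
#           -of default=noprint_wrappers=1:nokey=1 me.mp4)
dur=1720
fade_out_start "$dur" 2
```

The result would go into the st= option of both the fade=out and afade=t=out filters.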

4. Make the slide video

Watch the talk video and jot down the start time of each slide, and the end time of the last slide. I tried to use the concat demuxer with a script listing slides and durations, but the resultant video was faulty, skipping some slides. In the end, I made a video of each slide separately. Some awk and some shell script:

echo "
1 0
2 65
3 108
17 1644
18 1720" | \
  awk '{if ($1>1) {print "ffmpeg -loop 1 -i img" ($1-1) \
        ".jpg -vf \"fps=25,format=yuvj420p\" -t " ($2-last) \
        " img" ($1-1) ".mp4"} last = $2}'

This prints one ffmpeg command per slide (pipe the output to sh to run them):

ffmpeg -loop 1 -i img1.jpg -vf "fps=25,format=yuvj420p" -t 65 img1.mp4
ffmpeg -loop 1 -i img2.jpg -vf "fps=25,format=yuvj420p" -t 43 img2.mp4


Then combine them:

ffmpeg \
  -i img1.mp4 -i img2.mp4 -i img3.mp4 -i img4.mp4 \
  -i img5.mp4 -i img6.mp4 -i img7.mp4 -i img8.mp4 \
  -i img9.mp4 -i img10.mp4 -i img11.mp4 -i img12.mp4 \
  -i img13.mp4 -i img14.mp4 -i img15.mp4 -i img16.mp4 \
  -i img17.mp4 \
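-filter_complex \
  "[0:v:0] [1:v:0] [2:v:0] [3:v:0] [4:v:0] [5:v:0] \
   [6:v:0] [7:v:0] [8:v:0] [9:v:0] [10:v:0] [11:v:0] \
   [12:v:0] [13:v:0] [14:v:0] [15:v:0] [16:v:0] \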
   concat=n=17:v=1 [outv]" \
-map "[outv]" slides.mp4
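
Typing out seventeen inputs and seventeen stream labels by hand is error-prone, so the command can be generated instead. A sketch, assuming a hypothetical helper named slide_concat_cmd and the img1.mp4, img2.mp4, ... naming used above; it only prints the command:

```shell
# slide_concat_cmd N   (hypothetical helper)
# Prints an ffmpeg command that concatenates img1.mp4 .. imgN.mp4
# (video only) into slides.mp4 using the concat filter.
slide_concat_cmd() {
  n=$1; i=0; inputs=""; labels=""
  while [ "$i" -lt "$n" ]; do
    inputs="${inputs}-i img$(( i + 1 )).mp4 "   # files are numbered from 1
    labels="${labels}[$i:v:0]"                  # ffmpeg inputs are numbered from 0
    i=$(( i + 1 ))
  done
  printf 'ffmpeg %s-filter_complex "%sconcat=n=%s:v=1 [outv]" -map "[outv]" slides.mp4\n' \
    "$inputs" "$labels" "$n"
}

slide_concat_cmd 17
```

Note the off-by-one between file names (from 1) and input indices (from 0), which is easy to get wrong when writing the list manually.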

5. Overlay the video inset on the slides

ffmpeg \
  -y \
  -i slides.mp4 \
  -i me_fade.mp4 \
  -filter_complex \
    "[1]    scale=w=320 : h=180 [2] ; \
     [0][2] overlay=920 : 240   [3] " \
  -map "[3]" -map 1:a \

In words: “Scale the second input to 320 x 180, then overlay that on the first input, placing the inset video 920 pixels across and 240 down; use the audio from the second input.” (The output filename goes at the end of the command.)
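
The 920 in the overlay is consistent with a 320-pixel-wide inset sitting 40 pixels from the right edge of a 1280-pixel-wide slide (1280 - 320 - 40 = 920). A sketch of that placement arithmetic, assuming a hypothetical helper named inset_filter and assuming the slides really are 1280 pixels wide; it prints the filter string only:

```shell
# inset_filter MAIN_W INSET_W INSET_H MARGIN Y   (hypothetical helper)
# Prints a filter_complex string that scales input 1 to INSET_W x INSET_H
# and overlays it on input 0, MARGIN pixels from the right edge, Y pixels down.
inset_filter() {
  x=$(( $1 - $2 - $4 ))   # left edge of the inset
  printf '[1] scale=w=%s:h=%s [2] ; [0][2] overlay=%s:%s [3]\n' "$2" "$3" "$x" "$5"
}

# Assumed: 1280-wide slides, 40-pixel right margin.
inset_filter 1280 320 180 40 240
```

The overlay filter can also evaluate expressions itself, so overlay=main_w-overlay_w-40:240 should give the same placement without knowing the slide width in advance.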

Yay! Ffmpeg is pretty handy, when you finally get it. And the final file is only 58M for a 28 minute slideshow (see snippet).