Finding the durations of MP4 files without downloading the entire file

I wanted to find the durations of a bunch of MP4 files located out on the net – durations for the introduction videos for the top Kickstarter projects.

But I wanted to do this quickly. Downloading all those MP4 files would take too long. A little bit of research revealed that MP4 files set up for streaming have their metadata (the moov atom) at the beginning of the file.
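
To make the "metadata at the beginning" idea concrete, here is a minimal Python 3 sketch that walks the top-level boxes (atoms) in the first bytes of a file and reports whether moov shows up before mdat, the box that holds the media data. The function names are my own, and it only looks at whatever bytes you hand it, so treat it as an illustration rather than a full parser.

```python
import struct

def top_level_boxes(data):
    """Yield (type, size) for each top-level box whose header fits in data."""
    offset = 0
    while offset + 8 <= len(data):
        # Every box starts with a 4-byte big-endian size and a 4-byte type.
        size, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
            header = 16
        yield box_type.decode("latin-1"), size
        if size < header:
            # 0 means "runs to end of file"; anything else this small is malformed.
            break
        offset += size

def moov_before_mdat(data):
    """True if moov comes first, False if mdat does, None if neither was seen."""
    for name, _ in top_level_boxes(data):
        if name == "moov":
            return True
        if name == "mdat":
            return False
    return None
```

If this returns True for the first few hundred bytes of a file, the metadata should be reachable without fetching the rest.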

Now I need a way to read just the metadata, without getting the entire file.

More research reveals that I can use curl and dd to get just the first bytes of a file. For some reason ‘curl -r’ (curl’s byte-range option) didn’t work for me.
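
The same "read only the first bytes" trick can be sketched in Python 3 with the standard library. This is not what I ran; it is a rough equivalent. The fetch_prefix name and the opener parameter are my own inventions, and the Range header is an extra: since a server may ignore it, the code also caps the read itself and then closes the connection.

```python
import urllib.request

def fetch_prefix(url, nbytes=512, opener=urllib.request.urlopen):
    """Return at most nbytes from the start of the resource at url."""
    # Ask politely for just the first nbytes; servers are free to ignore this.
    request = urllib.request.Request(
        url, headers={"Range": "bytes=0-%d" % (nbytes - 1)}
    )
    with opener(request) as response:
        # Cap the read ourselves, then let the with-block close the connection
        # so the rest of the file is never pulled down.
        return response.read(nbytes)
```

The opener argument exists only so the function can be exercised without a network connection; in normal use you would call fetch_prefix(url) and leave it alone.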

So now we’re ready to go.

I made a file with one Kickstarter project URL per line. Here are a couple of them:

http://www.kickstarter.com/projects/formlabs/form-1-an-affordable-professional-3d-printer
http://www.kickstarter.com/projects/1523379957/oculus-rift-step-into-the-game

This script loads each Kickstarter project page and gets the URL-encoded download link for the project’s introductory video, if there is one:

$ cat ks-urls | xargs -Ifoo sh -c "curl -s foo|grep link |grep http://www.kickstarter.com/swf/kickplayer.swf |cut -d '&' -f 5| sed -e 's/amp;file=//g' " > ks-video-urls
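
For anyone who finds the grep/cut/sed chain hard to read, here is a rough Python 3 sketch of the same filtering applied to page HTML that has already been downloaded. The function name is made up, and it skips lines with fewer than five '&'-separated fields, where cut would print an empty line, so it is close to the pipeline but not byte-for-byte identical.

```python
PLAYER_URL = "http://www.kickstarter.com/swf/kickplayer.swf"

def extract_encoded_video_urls(html):
    """Mimic: grep link | grep PLAYER_URL | cut -d '&' -f 5 | sed 's/amp;file=//g'."""
    results = []
    for line in html.splitlines():
        # The two greps: keep lines that mention both "link" and the player URL.
        if "link" not in line or PLAYER_URL not in line:
            continue
        # cut -d '&' -f 5: the fifth '&'-separated field (index 4).
        fields = line.split("&")
        if len(fields) < 5:
            continue
        # sed: drop the leftover "amp;file=" prefix from the HTML-escaped '&'.
        results.append(fields[4].replace("amp;file=", ""))
    return results
```

Like the shell version, this depends entirely on the position of the file parameter in the page markup, so expect it to break whenever the page layout changes.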

Now we need to URL-decode the URLs:

$ cat ks-video-urls | python -c 'import sys, urllib; print urllib.unquote_plus(sys.stdin.read())' > ks-decoded-video-urls
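
The one-liner above is Python 2. On Python 3 the function lives in urllib.parse instead, so a minimal equivalent looks something like this (decode_lines is just a name I picked):

```python
from urllib.parse import unquote_plus

def decode_lines(lines):
    """URL-decode each non-blank line, dropping surrounding whitespace."""
    return [unquote_plus(line.strip()) for line in lines if line.strip()]
```

Passing it an open file such as open('ks-video-urls') should give back the decoded URLs one per list entry, ready to be written to ks-decoded-video-urls.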

Now we get the durations from the video URLs. You’ll need Python, pip, and virtualenvwrapper installed. We make a Python virtual environment and install the hsaudiotag module to decode the MP4 metadata:

$ mkvirtualenv mp4
$ pip install hsaudiotag
$ cat ks-decoded-video-urls| xargs -Ifoo sh -c "curl -s foo| dd count=1 2>/dev/null | python -c 'import sys, StringIO; from hsaudiotag import mp4; s=StringIO.StringIO(sys.stdin.read()); print mp4.File(s).duration'" > ks-video-durations

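This code uses curl and dd to download only the first 512-byte block of each MP4 file: dd’s default block size is 512 bytes, so count=1 stops after a single block. Once dd exits, curl can no longer write to the pipe and the transfer ends.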
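
To show why 512 bytes can be enough, here is a Python 3 sketch that pulls the duration straight out of the mvhd (movie header) box inside moov. This is my own illustration, not how hsaudiotag is implemented, and it assumes the simple case: 32-bit box sizes, with moov and mvhd both inside the bytes you pass in.

```python
import struct

def find_box(data, box_type, start=0, end=None):
    """Return (payload_start, payload_end) of the first matching box, or None."""
    end = len(data) if end is None else end
    offset = start
    while offset + 8 <= end:
        size, name = struct.unpack(">I4s", data[offset:offset + 8])
        if size < 8:
            # Sizes 0 and 1 (to-end-of-file and 64-bit) are not handled here.
            return None
        if name == box_type:
            # Clamp to the bytes we actually have; moov is usually much
            # larger than a 512-byte prefix.
            return offset + 8, min(offset + size, end)
        offset += size
    return None

def mp4_duration_seconds(data):
    """Duration from moov/mvhd as duration / timescale, or None if not found."""
    moov = find_box(data, b"moov")
    if moov is None:
        return None
    mvhd = find_box(data, b"mvhd", moov[0], moov[1])
    if mvhd is None:
        return None
    body = data[mvhd[0]:mvhd[1]]
    if not body:
        return None
    if body[0] == 1:
        # Version 1: 4 bytes version/flags, two 8-byte timestamps,
        # then a 4-byte timescale and an 8-byte duration.
        if len(body) < 32:
            return None
        timescale, duration = struct.unpack(">IQ", body[20:32])
    else:
        # Version 0: 4 bytes version/flags, two 4-byte timestamps,
        # then a 4-byte timescale and a 4-byte duration.
        if len(body) < 20:
            return None
        timescale, duration = struct.unpack(">II", body[12:20])
    return duration / timescale if timescale else None
```

In a streaming-friendly file the small ftyp box is followed by moov, and mvhd is normally the first thing inside moov, so the fields needed here tend to sit well within the first 512 bytes.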

Now we analyze the durations using a simple R script. I am on a Mac, so I need to use Homebrew to install R:

$ brew install gfortran
$ brew install R
$ R -q -e "x <- read.csv('ks-video-durations', header = F); summary(x); sprintf('standard deviation: %f', sd(x[ , 1]))"
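
If you would rather not install R, roughly the same summary can be computed with Python's statistics module (Python 3.8 or later for quantiles). This is a sketch with a function name of my own; as far as I know, the "inclusive" method uses the same interpolation as R's default quantiles, and stdev is the sample standard deviation, like R's sd.

```python
import statistics

def summarize(durations):
    """Return min, quartiles, mean, max and sample standard deviation."""
    # method="inclusive" treats min and max as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "min": min(durations),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(durations),
        "q3": q3,
        "max": max(durations),
        "sd": statistics.stdev(durations),
    }
```

Feeding it the numbers from ks-video-durations, for example [float(x) for x in open('ks-video-durations') if x.strip()], should give figures comparable to the R output.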

Output for the top 100 Kickstarter technology projects (by amount raised) – all numbers are in seconds:

 Min.   : 52.0  
 1st Qu.:145.8  
 Median :183.5  
 Mean   :203.3  
 3rd Qu.:246.5  
 Max.   :583.0
[1] "standard deviation: 90.14273"

The average duration of the top 100 Kickstarter videos is 203.3 seconds, or about 3.39 minutes (roughly 3 minutes 23 seconds).

Thanks to:

Stack Overflow for this question and answer about how to calculate statistics of numbers in a file, one per line.
The hsaudiotag team.