Downloading WWDC Videos using Nokogiri

How to download WWDC Videos for offline viewing using Nokogiri, cURL and wget

Published on September 24, 2018

Introduction

I do love a good excuse to start a new project and learn something new. There is something profoundly beautiful about the journey you need to take when trying to find a way to solve a new problem.

So here's the current excuse for a new project: I was going on a trip to New York City and needed a way to (productively) spend my time on both flights (it takes about 10 hours to get to NYC). I decided to catch up on the WWDC videos I hadn't watched yet. After visiting the Apple Developer portal, I realized I wasn't going to spend my time clicking through every link on the page to download each video, so I used it as an excuse to write a tiny video downloader script in Ruby.

I have used Nokogiri a number of times in the past when I needed to scrape data from a website, but I had never used it in conjunction with a system command to download files.

The download utility I used was wget (I also added an option for cURL in the code below). I decided not to write a download function from scratch because there was no reason to re-invent the wheel, especially when such great utilities already exist. The main learning focus was on how to effectively scrape a web page recursively and use command line utilities programmatically.
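
To give an idea of what calling a command line utility from Ruby can look like, here is a minimal sketch. It is not the code the script uses (the script uses backticks); it relies on Open3 from Ruby's standard library and passes the command as separate arguments, so no shell is involved and spaces in file names need no quoting. The helper name run_command is made up for illustration, and echo stands in for wget so the example runs without a network connection.

```ruby
require 'open3'

# Hypothetical helper: runs a command given as separate arguments (no shell)
# and returns its combined stdout/stderr output plus a success flag.
def run_command(*args)
  output, status = Open3.capture2e(*args)
  [output, status.success?]
end

# `echo` stands in for a real downloader such as wget.
output, ok = run_command("echo", "Platforms State of the Union.mp4")
puts output
puts "Succeeded: #{ok}"
```

Checking the success flag makes it possible to react to a failed download instead of silently moving on to the next file.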

I wanted the script to also be usable on any *NIX or BSD system, so I could potentially use it in a Rails application should I ever need to, with only a few small changes to some methods or logic.

Script execution and flow explanation

You can execute the script using this command: ruby [name-of-script] https://developer.apple.com/videos/wwdc2017/ .. The first argument is the page you want to scrape and the second is the directory in which to save the files. Both are optional: the defaults are the WWDC 2017 page and the current directory.
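
The way the two optional arguments fall back to their defaults can be sketched in isolation. The helper parse_arguments and the two constants are names I made up for this illustration (the script reads ARGV directly in its initializer), and /tmp/wwdc is just an example directory.

```ruby
# Hypothetical constants holding the defaults described above.
DEFAULT_PAGE_URL = "https://developer.apple.com/videos/wwdc2017/"
DEFAULT_SAVE_DIR = "."

# Hypothetical helper: takes an ARGV-like array and fills in missing values.
def parse_arguments(argv)
  {
    page_url:       argv[0] || DEFAULT_PAGE_URL,
    save_directory: argv[1] || DEFAULT_SAVE_DIR
  }
end

# No arguments: both defaults are used.
p parse_arguments([])

# Both arguments given: the defaults are ignored.
p parse_arguments(["https://developer.apple.com/videos/wwdc2017/", "/tmp/wwdc"])
```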

The basic flow of the script is as follows:

  • The script opens the main page specified by the first argument of the script (default: WWDC 2017)
  • It scrapes the index page for all video links, creates video objects and adds them to the videos array of the downloader
  • It goes through each entry in the videos array and starts downloading it, creating its parent directory (if it doesn't exist)
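
The last step of this flow can be tried out without touching the network. The sketch below is an illustration under a few assumptions: Video is a simplified stand-in for the script's class, downloadable_by_section is a made-up helper, the example.com URLs are placeholders, and nothing is actually downloaded. It keeps only the videos that have a download link, groups them by section and creates one directory per section inside a temporary folder.

```ruby
require 'fileutils'
require 'tmpdir'

# Simplified stand-in for the script's Video class (illustrative only).
Video = Struct.new(:name, :download_url, :section)

# Returns the videos that have a download link, grouped by section name.
def downloadable_by_section(videos)
  videos.reject { |v| v.download_url.empty? }.group_by(&:section)
end

# Placeholder data: the URLs are made up.
videos = [
  Video.new("Keynote", "https://example.com/101_hd_keynote.mp4?dl=1", "Featured"),
  Video.new("Session without a link", "", "Featured"),
  Video.new("What's New in Swift", "https://example.com/402_hd_swift.mp4?dl=1", "Developer Tools")
]

Dir.mktmpdir do |save_directory|
  downloadable_by_section(videos).each do |section, section_videos|
    target = File.join(save_directory, section)
    FileUtils.mkdir_p(target) # creates the directory only if it is missing
    section_videos.each { |v| puts "Would download #{v.name} into #{target}" }
  end
end
```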

The following section contains the full source code for the downloader/scraper script.

Code

module DownloadController
  require 'nokogiri'
  require 'open-uri'
  require 'awesome_print'

  # Helper functions
  # 
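  # Returns the parsed Nokogiri document for the page that is passed to it.
  # URI.open (from open-uri) is used because Kernel#open no longer opens URLs as of Ruby 3.0.
  def self.open_page(uri)
    Nokogiri::HTML(URI.open(uri))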
  end

  # Video class definition
  class Video
    attr_accessor :name, :page_url, :download_url, :section

    def initialize(name, page_url="", download_url="", section="")
      @name         = name 
      @page_url     = page_url 
      @download_url = download_url
      @section      = section
    end

    def describe
      puts "=========================================================="
      puts "Video Info:\n Name: #{@name}\n Page: #{@page_url}\n Download Link: #{@download_url}\n Section: #{@section}\n"
      puts "=========================================================="
    end  

    def set_download_url
      video_download_url = DownloadController::open_page(@page_url).css('a').select {|a| a.text == 'HD Video'}

      # Checks for an HD Video link. If none exists, select will return an empty array
      if video_download_url.empty?
        @download_url = ""
      else
        @download_url = video_download_url.first['href']
      end
    end

    # Returns true if video contains a valid download url
    def valid_download?
      @download_url != ""
    end

    # Returns the video filename removing the ?dl=1 suffix
    def filename
      @download_url.split("/").last.split("?").first
    end

    # Returns the file extension using the filename
    def file_extension
      filename.split(".").last
    end

    # Returns the number for the video session
    def video_session
      filename.split("_").first
    end

    # Returns a more readable version of the video filename
    def proper_name
      "#{@name} (Session: #{video_session}).#{file_extension}"
    end
  end

  # Downloader class
  class Downloader
    def initialize
      @base_url       = ""
      @page_url       = ARGV[0] || "https://developer.apple.com/videos/wwdc2017/"
      @save_directory = ARGV[1] || "."
      @videos         = []
    end

    # Helper functions

    # Sets the base url for the video links
    def set_base_url
      uri = URI.parse(@page_url)
      @base_url = "#{uri.scheme}://#{uri.host}"
    end

    # Scrapes the index page for all videos
    def get_videos_from_index_page

      ap "Getting index page links"

      DownloadController::open_page(@page_url).css(".collection-focus-group").each{ |c| 

        # Get the video section to name the folder
        section_name = c.css(".font-bold").text.strip

        # Go through each of the section video links and create the video objects
        c.css("a").select {|link| link.text.strip != ''}.each {|link| 

          v = Video.new(
            link.text.strip,                # name
            "#{@base_url}#{link['href']}",  # page_url
            "",                             # download_url (set below)
            section_name                    # section
          )

          ap "Scraping #{v.name}"
          v.set_download_url
          v.describe

          @videos << v
        }
      }
    end

    # Goes over every entry in the video array and downloads valid videos
    def download_files

      @videos.select {|video| video.valid_download? }.each {|v|

        # Download video file
        download_file(v)
      }

    end

    # Bootstrap method. It sets the base URL for the videos, it gets all video links from the front page
    # and proceeds to download the videos in the video array of the Downloader instance.
    def start
      set_base_url
      get_videos_from_index_page
      download_files
    end

    private 
      # Download file method. It creates the folder if it doesn't already exist
      def download_file(video)

        # Check if the section directory exists and if not create it (along with any missing parents)
        `mkdir -p "#{@save_directory}/#{video.section}"` unless Dir.exist?("#{@save_directory}/#{video.section}/")

        video.describe
        ap "Downloading file..."

        `wget -O "#{@save_directory}/#{video.section}/#{video.filename}" -N "#{video.download_url}" --show-progress`
        # `curl --url "#{video.download_url}" -o "#{@save_directory}/#{video.section}/#{video.filename}"`
      end

  end
end

# Initialize a new Downloader object and start downloading the files
d = DownloadController::Downloader.new
d.start
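
The filename helpers in the Video class work by splitting the download URL by hand. As an alternative sketch (not what the script above does), the same details can be derived with URI, File.basename and File.extname from Ruby's standard library. The helper name file_details and the example.com URL are made up for illustration; the only assumption carried over from the script is that the filename starts with the session number followed by an underscore.

```ruby
require 'uri'

# Hypothetical helper: derives file name details from a download URL.
def file_details(download_url)
  # URI#path excludes the query string, so the ?dl=1 suffix is dropped here.
  filename = File.basename(URI.parse(download_url).path)
  {
    filename:  filename,
    extension: File.extname(filename).delete_prefix("."),
    session:   filename.split("_").first
  }
end

# Placeholder URL, shaped like the links the script expects.
p file_details("https://example.com/videos/102_hd_platforms_state_of_the_union.mp4?dl=1")
```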

Resources

You can get the source code here