A light byte of Python

Tim Golden

http://timgolden.me.uk/python/iet-talk

Data Integration

Taking data from here and putting it there

Files, web pages and databases -- all of these and more are possible sources and targets for everyday data needs.

  • Python is excellent at extracting data and at issuing it.
  • Built-in support for regular expressions, URL reading and conventional IO (see the sketch below).
  • Many useful third-party packages for reading and writing other formats.
  • On Windows, links into COM and -- with IronPython -- the world of .Net.
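
As a flavour of that built-in support, a minimal sketch using only the stdlib (the URL and the pattern here are made up for illustration):

      import re
      import urllib

      # Read a web page and pull out anything which looks like a price
      text = urllib.urlopen ("http://example.com/prices").read ()
      prices = re.findall (r"\d+\.\d{2}", text)

      # Conventional IO to write the results out
      f = open ("prices.txt", "w")
      f.write ("\n".join (prices))
      f.close ()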

Our Task

  • This appears simple, and indeed it is simple.
  • Python is excellent at this kind of thing.
  • We have limited time, so we'll work through it step by step.

Extras

If time permits, we'll issue all or part of the stored list in other formats:

  • Once we have the data in one format, it's just a question of issuing it in another (a hypothetical CSV sketch follows below)
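
As one hypothetical example -- assuming the books.db database we build later -- the stored list could be re-issued as CSV with the stdlib csv module:

      import csv
      import sqlite3

      db = sqlite3.connect ("books.db")
      rows = db.execute ("SELECT title, synopsis, summary, date_added FROM whatsnew")

      f = open ("whatsnew.csv", "wb") # the csv module wants binary mode
      writer = csv.writer (f)
      writer.writerow (["Title", "Synopsis", "Summary", "Date added"])
      for title, synopsis, summary, date_added in rows:
        # encode to bytes: the Python 2 csv module can't write unicode directly
        writer.writerow ([title.encode ("utf-8"), synopsis.encode ("utf-8"), summary.encode ("utf-8"), date_added])
      f.close ()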

The Source

  <div class="whatsnew-date">

    <h2>DATE</h2>

    <div class="book-summary">
      <p><span class="book-title">TITLE</span></p>
      <img class="thumblet" src="IMAGE_URL" />

      <div class="book-synopsis-summary">
        <p class="synopsis">SYNOPSIS</p>
        <p class="summary">SUMMARY</p>
      </div>

    </div>

  </div>
  
  • I've removed some of the HTML we're not interested in
  • We need to pull out each of the capitalised fields: DATE, TITLE, IMAGE_URL, SYNOPSIS and SUMMARY

The Target

    CREATE TABLE whatsnew
    (
      id INTEGER PRIMARY KEY,
      title TEXT,
      synopsis TEXT,
      summary TEXT,
      image BLOB,
      date_added DATE
    )
  
  • The id is a surrogate primary key, automatically incremented
  • The other information comes out of the web page, including the actual image itself, stored here as a BLOB.
  • Note that the date_added is stored for each row (a sketch of creating this table follows below)
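
For reference, a minimal sketch of setting this table up with the sqlite3 module (books.db is the filename used later in the talk):

      import sqlite3

      db = sqlite3.connect ("books.db") # creates the file if it doesn't exist
      db.execute ("""
        CREATE TABLE IF NOT EXISTS whatsnew
        (
          id INTEGER PRIMARY KEY,
          title TEXT,
          synopsis TEXT,
          summary TEXT,
          image BLOB,
          date_added DATE
        )
      """)
      db.commit ()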

The Tools

Built-in modules: urllib, urlparse, datetime, sqlite3

Third-party modules: BeautifulSoup

  • Python comes supplied with a large number of useful modules
  • Sometimes, other people build on those to provide more accessible interfaces
  • Other third-party modules cover things the stdlib doesn't
  • sqlite3 is standard from Python 2.5 but exists as a third-party module for 2.4 and below
  • The stdlib has an HTMLParser but it relies on exactly correct HTML; BeautifulSoup is far more forgiving, as the sketch below shows
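
To illustrate that last point, a tiny sketch with made-up broken HTML which BeautifulSoup still makes sense of:

      from BeautifulSoup import BeautifulSoup as Soup

      # Unclosed tags and no <html> wrapper: real-world HTML, in other words
      broken = '<p><span class="book-title">Some title<p>Next paragraph'
      page = Soup (broken)
      print page.find ("span", "book-title").string
      # -> Some title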

The Outline

  1. Suck the web page into a BeautifulSoup structure
  2. Extract the information you need
    • For each date block
      • For each summary within it
        • Extract the Title, the Details, and the Image
        • Add the Date and this information to a list
  3. Write each summary in the list to the database

Getting the web page

import is the way to get hold of an external module as a namespace

      import os, sys
      import urllib
      from BeautifulSoup import BeautifulSoup as Soup

      # urlopen returns a file-like object which Soup will read directly
      web_connection = urllib.urlopen ("http://local.goodtoread.org/whatsnew")
      page = Soup (web_connection)

      # find returns the first matching element
      print page.find ("h2")
      # or, equivalently: print page.h2
      

Reading the date

      import datetime

      date_block = page.find ("div", "whatsnew-date")
      date_string = date_block.h2.string
      #
      # Wow! feature: multiple assignment unpacks the split string --
      # eg "Wednesday 14th May 2008" -- into weekday, day, month, year
      #
      w, d, m, y = date_string.split ()
      print w
      print d
      print m
      print y
      d = d[:-2] # strip the ordinal suffix: "14th" -> "14"
      date_string = "%s %s %s" % (d, m, y)
      date = datetime.datetime.strptime (date_string, "%d %B %Y").date ()
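
For instance, taking a hypothetical date of "Wednesday 14th May 2008" (%B assumes English month names from the locale):

      datetime.datetime.strptime ("14 May 2008", "%d %B %Y").date ()
      # -> datetime.date(2008, 5, 14)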
      

Reading the story

      summaries = []
      # Each book is wrapped in a div with class "book-summary"
      for summary in date_block.findAll ("div", "book-summary"):
        title = summary.find ("span", "book-title").string
        summaries.append ((date, title))
      

Getting the picture

      import urlparse

      image = summary.find ("img", "thumblet")
      # The img src is relative, so join it to the site's base URL
      href = urlparse.urljoin ("http://local.goodtoread.org/", image['src'])
      image_data = urllib.urlopen (href).read ()

      open ("temp.jpg", "wb").write (image_data)
      os.startfile ("temp.jpg") # Windows-only: open in the default viewer
      

Writing it out

      import sqlite3

      reviews = []
      for review in date_block.findAll ("div", "book-summary"):
        title = review.find ("span", "book-title").string
        synopsis = review.find ("p", "synopsis").string
        summary = review.find ("p", "summary").string
        image = review.find ("img", "thumblet")
        url = urlparse.urljoin ("http://local.goodtoread.org", image['src'])
        image_data = urllib.urlopen (url).read ()

        #
        # The tuple's order must match the INSERT's column list below;
        # sqlite3.Binary wraps the raw bytes for storage as a BLOB
        #
        reviews.append ((title, synopsis, summary, sqlite3.Binary (image_data), date))

      db = sqlite3.connect ("books.db")
      db.execute ("DELETE FROM whatsnew")
      db.executemany ("""
        INSERT INTO whatsnew (title, synopsis, summary, image, date_added)
        VALUES (?, ?, ?, ?, ?)
        """,
        reviews
      )
      db.commit ()
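
To check what went in, read a couple of columns straight back (a quick sketch; the column names are those from the CREATE TABLE earlier):

      for title, date_added in db.execute ("SELECT title, date_added FROM whatsnew"):
        print title, date_added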
    
      

The finished result

books_to_db.py
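
The script strings the snippets above together; a rough sketch of how it might be structured (the function names here are mine, for illustration):

      import datetime
      import sqlite3
      import urllib
      import urlparse
      from BeautifulSoup import BeautifulSoup as Soup

      SITE = "http://local.goodtoread.org"

      def read_date (date_block):
        # eg "Wednesday 14th May 2008" -> datetime.date(2008, 5, 14)
        w, d, m, y = date_block.h2.string.split ()
        return datetime.datetime.strptime ("%s %s %s" % (d[:-2], m, y), "%d %B %Y").date ()

      def read_reviews (page):
        reviews = []
        for date_block in page.findAll ("div", "whatsnew-date"):
          date = read_date (date_block)
          for review in date_block.findAll ("div", "book-summary"):
            title = review.find ("span", "book-title").string
            synopsis = review.find ("p", "synopsis").string
            summary = review.find ("p", "summary").string
            image = review.find ("img", "thumblet")
            image_data = urllib.urlopen (urlparse.urljoin (SITE, image['src'])).read ()
            reviews.append ((title, synopsis, summary, sqlite3.Binary (image_data), date))
        return reviews

      def write_reviews (reviews):
        db = sqlite3.connect ("books.db")
        db.execute ("DELETE FROM whatsnew")
        db.executemany ("""
          INSERT INTO whatsnew (title, synopsis, summary, image, date_added)
          VALUES (?, ?, ?, ?, ?)
        """, reviews)
        db.commit ()

      if __name__ == '__main__':
        page = Soup (urllib.urlopen (SITE + "/whatsnew"))
        write_reviews (read_reviews (page))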

Other possibilities