A light byte of Python

Tim Golden

http://timgolden.me.uk/python/iet-talk

Data Integration

Taking data from here and putting it there

Information on a web page
Values in a database
Ranges in a spreadsheet
The contents of a PDF
An RSS feed

All of these (and more) are possible sources and targets for everyday data needs.

Python is excellent at extracting data and at issuing it.
Built-in support for regular expressions, url reading and conventional IO.
Many useful third-party packages for reading and writing other formats
On Windows, links into COM and -- with IronPython -- the world of .Net.

Our Task

Open a book review web page
Extract a list of review summaries with their date added and a thumbnail
Store the list in an RDBMS,

This appears simple, and indeed it is simple.
Python is excellent at this kind of thing.
We have limited time and we have to go through step-by-step.

Extras

If time permits, issue all or part of the stored list as:

An RSS feed
A CSV file
A PDF document
A folder of thumbnails

Once we have the data in one format, it's just a question of issuing it in another

The Source

  [div class="whatsnew-date"]

  [h2]DATE[/h2]

  [div class="book-summary"]
    [p][span class="book-title"]TITLE[/span][/p]
    [img class="thumblet" src="IMAGE_URL" /]

    [div class="book-synopsis-summary"]
      [p class="synopsis"]SYNOPSIS[/p]
      [p class="summary"]SUMMARY[/p]
    [/div]

  [/div]

I've removed some of the HTML we're not interested in
We need to pull each of the highlighted fields out

The Target

    CREATE TABLE whatsnew
    (
      id INTEGER PRIMARY KEY,
      title TEXT,
      synopsis TEXT,
      summary TEXT,
      image BLOB,
      date_added DATE
    )

The id is a surrogate primary key, automatically incremented
The other information comes out of the web page, including the actual image itself, stored here as a BLOB.
Note the date_added is stored for each row

The Tools

Built-in modules

urllib & urlparse → open web pages and manipulate URLs
datetime → date/time manipulation

Third-party modules

sqlite → a simple RDBMS
BeautifulSoup → robust HTML parsing

Python comes supplied with large number of useful modules
Sometimes, other people build on those to provide more accessible interfaces
Other 3rd-party modules are for things the stdlib doesn't cover
sqlite3 is standard from Python2.5 but exists as a 3rd-party module for 2.4 and below
The stdlib has an HTMLParser but it relies on exactly correct HTML

The Outline

Suck the web page into a BeautifulSoup structure
Extract the information you need

For each date block

For each summary within it

Extract the Title, the Details, and the Image
Add the Date and this information to a list

Write each summary in the list to the database

Getting the web page

urllib.urlopen will fetch a web page
BeautifulSoup.BeautifulSoup will parse its contents

import is the way to get hold of an external module as a namespace

  import os, sys
  import urllib
  from BeautifulSoup import BeautifulSoup as Soup

  web_connection = urllib.urlopen ("http://local.goodtoread.org/whatsnew")
  page = Soup (web_connection)

  print page.find ("h2")
  # or print page.h2

Reading the date

BeautifulSoup's .find method will track down the right div
A split and string slice will pull out the day, month, year we need
The datetime's .strptime method will parse the date into a datetime structure

  from datetime import datetime

  date_block = page.find ("div", "whatsnew-date")
  date_string = date_block.h2.string
  #
  # Wow! Feature
  #
  w, d, m, y = date_string.split ()
  print w
  print d
  print m
  print y
  d = d[:-2]
  date_string = "%s %s %s" % (d, m, y)
  date = datetime.datetime.strptime (date_string, "%d %B %Y").date ()

Reading the story

BeautifulSoup's .findAll method will find all book summaries against this date
A standard Python list holds a tuple of the date and book information

  summaries = []
  for summary in date_block.findAll ("div", "book-summary"):
    title = summary.find ("span", "book-title").string
    summaries.append ((date, title))

Getting the picture

BeautifulSoup's .find method will find the img tag
urlparse.urljoin & urllib.urlopen will open the image

  import urlparse

  image = summary.find ("img", "thumblet")
  href = urlparse.urljoin ("http://local.goodtoread.org/", image['src'])
  image_data = urllib.urlopen (href).read ()

  open ("temp.jpg", "wb").write (image_data)
  os.startfile ("temp.jpg")

Writing it out

books.db already exists with the table structure shown earlier
The sqlite3 module will connect to a local file database
The connection's .executemany shortcut method will write all the data out at once

  import sqlite3

  reviews = []
  for review in date_block.findAll ("div", "book-summary"):
    title = review.find ("span", "book-title").string
    synopsis = review.find ("p", "synopsis").string
    summary = review.find ("p", "summary").string
    image = review.find ("img", "thumblet")
    url = urlparse.urljoin ("http://local.goodtoread.org", image['src'])
    image_data = urllib.urlopen (url).read ()

    reviews.append ((date, title, synopsis, summary, image_data))

  db = sqlite3.connect ("books.db")
  db.execute ("DELETE FROM whatsnew")
  db.executemany ("""
    INSERT INTO whatsnew (title, synopsis, summary, image, date_added)
    VALUES (?, ?, ?, ?, ?)
    """,
    reviews
  )
  db.commit ()

The finished result

books_to_db.py

A light byte of Python

IET • 5th July 2007

A light byte of Python

Tim Golden

http://timgolden.me.uk/python/iet-talk

Data Integration

Our Task

Extras

The Source

The Target

The Tools

The Outline

Getting the web page

Reading the date

Reading the story

Getting the picture

Writing it out

The finished result

Other possibilities