BBC World Service

A scraper for news stories from the BBC.

This is a mockup of an idea for improving drag and drop content creation. See Interesting Places for hacking wiki.

A variation would be to 'scent' a web search with sites known to parse well, in addition to the search terms.

Routing

The scraper would be trained to recognize URLs from samples that show their similarities and differences. The samples may need to be marked up somehow to simplify recognition.

http://www.bbc.com/news/world-middle-east-28069800
http://www.bbc.com/news/technology-28055909
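As a minimal sketch of what recognition from samples might look like, the Python below keeps what the two sample URLs share (their common prefix) and generalizes what differs. The function names and the shape of the tail pattern (a hyphenated section name followed by a numeric story id) are assumptions read off these two samples, not a trained result.

```python
import os.path
import re

SAMPLES = [
    "http://www.bbc.com/news/world-middle-east-28069800",
    "http://www.bbc.com/news/technology-28055909",
]


def infer_route(samples):
    """Keep the shared prefix literally and generalize the differing tail."""
    # commonprefix compares character by character, so samples whose tails
    # happen to start with the same letters would need more care than this.
    prefix = os.path.commonprefix(samples)
    # Assumption: the tail is "<section words joined by hyphens>-<digits>".
    tail = r"(?P<section>[a-z]+(?:-[a-z]+)*)-(?P<story_id>\d+)$"
    return re.compile(re.escape(prefix) + tail)


def match_route(route, url):
    """Return the captured fields if the URL fits the route, else None."""
    m = route.match(url)
    return m.groupdict() if m else None


route = infer_route(SAMPLES)
```

Calling match_route(route, url) on either sample returns its section and story id; a URL outside /news/ returns None.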

Routes will be handled by a server-side plugin. At startup it aggregates the routes found within a site and combines them with the routes from any remote pages in the lineup that carry routing.
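The aggregation step could be sketched as follows. The record shapes (a route as a dict with a "pattern" key, a lineup page as a dict with "site" and "routes") and the rule that local routes win on conflict are all assumptions made for illustration.

```python
def aggregate_routes(site_routes, lineup_pages):
    """Merge routes found in the site with routes carried by lineup pages."""
    table = {}
    for route in site_routes:
        table[route["pattern"]] = {**route, "origin": "local"}
    for page in lineup_pages:
        for route in page.get("routes", []):
            # Assumed precedence: a route already known is not overwritten.
            table.setdefault(route["pattern"], {**route, "origin": page["site"]})
    return list(table.values())


local = [{"pattern": "bbc.com/news/", "detectors": "BBC World Service"}]
lineup = [
    {"site": "routes.example.org", "routes": [
        {"pattern": "bbc.com/news/", "detectors": "Other BBC Detectors"},
        {"pattern": "example.com/blog/", "detectors": "Example Blog"},
    ]},
    {"site": "plain.example.org"},  # a page with no routing is skipped
]
routes = aggregate_routes(local, lineup)
```

Keeping the origin with each route would let a generated page say where its routing came from.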

There could be whole sites devoted to collecting and applying routes.

Parsing

We'll assume sites use modern HTML with reasonable div tags and class names.

We'll organize parsing around detectors that construct specific output elements.

PAGE h1 .story-header

IMAGE div .caption img
CAPTION div .caption span

VIDEO div .videoInStoryB object
CAPTION div .videoInStoryB .caption

PARAGRAPH p
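One way to apply detectors like these is sketched below using only Python's standard html.parser. It reads "div .caption img" as "an img inside a div with class caption", which is a guess about the notation. The class and helper names are hypothetical, the VIDEO detectors are left out for brevity, and the input is assumed to be well-nested HTML.

```python
from html.parser import HTMLParser

# Assumption: tag and class pair up, so "h1 .story-header" becomes "h1.story-header".
DETECTORS = {
    "PAGE": "h1.story-header",
    "IMAGE": "div.caption img",
    "CAPTION": "div.caption span",
    "PARAGRAPH": "p",
}
VOID = {"img", "br", "hr", "meta", "link", "input"}  # tags with no end tag


def parse_selector(selector):
    """Turn 'div.caption img' into [('div', 'caption'), ('img', None)]."""
    steps = []
    for part in selector.split():
        tag, _, cls = part.partition(".")
        steps.append((tag or None, cls or None))
    return steps


def matches(steps, path):
    """Last step must fit the current element; earlier steps fit ancestors in order."""
    def fits(step, element):
        tag, cls = step
        return (tag is None or tag == element[0]) and (cls is None or cls in element[1])

    if not path or not fits(steps[-1], path[-1]):
        return False
    remaining = list(steps[:-1])
    for element in reversed(path[:-1]):
        if remaining and fits(remaining[-1], element):
            remaining.pop()
    return not remaining


class Scraper(HTMLParser):
    def __init__(self, detectors):
        super().__init__()
        self.rules = [(name, parse_selector(sel)) for name, sel in detectors.items()]
        self.path = []       # open elements as (tag, class list)
        self.items = []      # output elements as (detector name, value)
        self.capture = None  # (name, depth, text parts) while inside a text match

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = (tag, (attrs.get("class") or "").split())
        here = self.path + [element]
        if self.capture is None:
            for name, steps in self.rules:
                if matches(steps, here):
                    if tag in VOID:
                        # Void matches (img here) yield their src attribute.
                        self.items.append((name, attrs.get("src")))
                    else:
                        self.capture = (name, len(here), [])
                    break  # first matching detector wins
        if tag not in VOID:
            self.path.append(element)

    def handle_endtag(self, tag):
        if tag in VOID or not self.path:
            return
        if self.capture and self.capture[1] == len(self.path):
            name, _, parts = self.capture
            self.items.append((name, "".join(parts).strip()))
            self.capture = None
        self.path.pop()

    def handle_data(self, data):
        if self.capture:
            self.capture[2].append(data)


SAMPLE = """
<h1 class="story-header">Headline</h1>
<div class="caption"><img src="a.jpg"><span>A caption</span></div>
<p>First paragraph.</p>
"""
scraper = Scraper(DETECTORS)
scraper.feed(SAMPLE)
```

After feeding the sample, scraper.items holds one (detector, value) pair per output element, in document order.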

Specifying detectors will require some familiarity with HTML/CSS and browser debugging tools.

The server will be required to proxy sites that do not send CORS headers, since a browser cannot fetch those pages directly.
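A minimal sketch of such a proxy, using Python's standard http.server and urllib, might look like this. The /proxy?url=... endpoint shape, the host allowlist, and the wildcard Access-Control-Allow-Origin header are assumptions for illustration; a real deployment would want tighter rules and error handling.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

ALLOWED_HOSTS = {"www.bbc.com"}  # assumption: proxy only known, well-parsed sites


def is_allowed(url):
    """Accept only http(s) URLs whose host is on the allowlist."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical endpoint shape: GET /proxy?url=<page to fetch>
        target = parse_qs(urlparse(self.path).query).get("url", [""])[0]
        if not is_allowed(target):
            self.send_error(403, "host not allowed")
            return
        with urlopen(target, timeout=10) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "text/html")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        # This header is what lets a browser script read the proxied response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the proxy on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ProxyHandler).serve_forever()
```

Calling serve() starts the proxy; the allowlist keeps it from becoming an open relay for arbitrary URLs.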

The server might apply detectors or pass them up to the client to be applied there.

Generated pages should cite the source page and the route page used to scrape it as provenance in the create action.
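As a rough illustration, a create action carrying provenance might be built like this. The "provenance" key and its "source" and "route" fields are hypothetical names, not an established schema, and the surrounding action shape (type, item, date in milliseconds) is an assumption about how the journal entry would be stored.

```python
import time


def create_action(title, story, source_url, route_page, date_ms=None):
    """Build a create action that records where the page was scraped from."""
    return {
        "type": "create",
        "item": {"title": title, "story": story},
        "date": date_ms if date_ms is not None else int(time.time() * 1000),
        # Hypothetical provenance fields: the scraped URL and the route page used.
        "provenance": {"source": source_url, "route": route_page},
    }


action = create_action(
    "Example Story",
    [{"type": "paragraph", "text": "First paragraph."}],
    "http://www.bbc.com/news/technology-28055909",
    "BBC World Service",
    date_ms=0,
)
```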