Wuclan uses Wukong (Hadoop massive-data processing made easy) and Monkeyshines (massive-scale directed scraper) to grok the deep structure of social networks. It is designed to scrape in a way that is respectful of the terms and technical limits of each site while being aggressive and efficient with your resources. We use it in practice to collect and analyze social graphs as large as 50 million nodes, 1 billion edges, and 500 GB of raw data, all of it actual data extracted in compliance with the site’s terms of service.

Currently wuclan handles:

  • Twitter — API
  • Twitter — Search
  • Twitter — Hosebird
  • Last.fm
  • Opensocial

Why?

APIs are nice and all, but they prevent any insight into a) global properties, or b) deep structure. You can’t find global word frequency and dispersion, or average clustering coefficient, or calculate PageRank, or determine weighted shortest-path connections between two people through an API call. But with a 10-machine Hadoop cluster and a good-sized collection of data, you can (and wuclan has scripts to help answer many of those questions).

Wuclan is strictly meant for such massive-scale investigations. Unless you’re planning to do your final analysis on either Hadoop or an enterprise-grade database system, it’s probably not worth the hassle.
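
To make one of the global properties named above concrete, here is a minimal in-memory sketch of the average clustering coefficient in plain Ruby. It is not one of wuclan's scripts and does not use Wukong or Hadoop. The helper names (build_graph, local_clustering, average_clustering) and the four-node toy graph are assumptions made up for illustration. The point is only that the measure needs every node's full neighbor set, which a per-user API call does not give you.

```ruby
require 'set'

# Build an undirected graph as an adjacency map: node => Set of neighbors.
# (Hypothetical toy input; a real run would stream edges from scraped data.)
def build_graph(edges)
  graph = Hash.new { |h, k| h[k] = Set.new }
  edges.each do |a, b|
    graph[a] << b
    graph[b] << a
  end
  graph
end

# Local clustering coefficient of one node: the fraction of pairs of its
# neighbors that are themselves connected. Nodes with fewer than two
# neighbors are scored 0.0 here (a common convention, chosen as an assumption).
def local_clustering(graph, node)
  neighbors = graph[node].to_a
  k = neighbors.size
  return 0.0 if k < 2
  links = neighbors.combination(2).count { |u, v| graph[u].include?(v) }
  links.to_f / (k * (k - 1) / 2)
end

# Average of the local coefficients over every node in the graph.
def average_clustering(graph)
  nodes = graph.keys
  return 0.0 if nodes.empty?
  nodes.sum { |n| local_clustering(graph, n) } / nodes.size
end

# A triangle (a, b, c) with one extra node d hanging off c.
edges = [[:a, :b], [:b, :c], [:a, :c], [:c, :d]]
graph = build_graph(edges)
puts average_clustering(graph)
```

On this toy graph the result is 7/12, roughly 0.58: nodes a and b score 1, c scores 1/3, and d scores 0. At tens of millions of nodes the same per-node computation has to be spread across a cluster, which is the scale this project targets.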

Wuclan: Scraping

The scraping code is almost ready for public use. Check back shortly.

lib/wuclan/*/models

Defines the Wukong objects we’ll most often use.

  • The user models:
      • TwitterUser
      • TwitterUserProfiles

lib/wuclan/*/request

  • Request — the basic request metadata
  • Parse — dispatches the request contents into wuclan objects
  • Wuclan::Request::Streamer ensures that the request is left alone while recordizing.

Wuclan: Analysis

Most of the analysis code still lives in the imw_twitter_friends repo.

Install

Get the code

We’re still actively developing edamame. The newest version is available via Git on GitHub:

$ git clone git://github.com/mrflip/edamame

A gem is available from gemcutter:

$ sudo gem install edamame --source=http://gemcutter.org

(don’t use the gems.github.com version — it’s way out of date.)

You can instead download this project in either zip or tar formats.

Get the Dependencies

To finish setting up, see the detailed install instructions (they also have hints about installing Tokyo*, Beanstalkd and friends), and then read the usage notes.

lib/wuclan/


More info

There are many useful examples in the examples/ directory.

Credits

wuclan was written by Philip (flip) Kromer ([email protected] / @mrflip) for the infochimps project.

Help!

Send wuclan questions to the Infinite Monkeywrench mailing list.