Wuclan uses Wukong (Hadoop massive-data processing made easy) and Monkeyshines (massive-scale directed scraper) to grok the deep structure of social networks. It is designed to scrape in a way that is respectful of the terms and technical limits of each site while being aggressive and efficient with your resources. We use it in practice to collect and analyze social graphs as large as 50 million nodes, 1 billion edges, and 500 GB of raw data, all of it actual data extracted in compliance with the site’s terms of service.
Currently wuclan handles:
- Twitter — API
- Twitter — Search
- Twitter — Hosebird
- Last.fm
- Opensocial
Why?
APIs are nice and all, but they prevent any insight into a) global properties, or b) deep structure. You can’t find global word frequency and dispersion, compute the average clustering coefficient, calculate PageRank, or determine the weighted shortest-path connections between two people through an API call. But with a 10-machine Hadoop cluster and a good-sized collection of data, you can (and wuclan has scripts to help answer many of those questions).
Wuclan is strictly meant for such massive-scale investigations. Unless you’re planning to do your final analysis on either Hadoop or an enterprise-grade database system, it’s probably not worth the hassle.
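To make one of those global properties concrete, here is a minimal in-memory sketch of the average clustering coefficient in plain Ruby. It is not taken from wuclan's scripts: the function names (`build_adjacency`, `local_clustering`, `average_clustering`) and the toy edge list are assumptions made for illustration, and the graph is treated as undirected.

```ruby
require 'set'

# Build an undirected adjacency map from [a, b] edge pairs.
# (Hypothetical helper, not part of wuclan.)
def build_adjacency(edges)
  adj = Hash.new { |h, k| h[k] = Set.new }
  edges.each do |a, b|
    next if a == b # ignore self-loops
    adj[a] << b
    adj[b] << a
  end
  adj
end

# Local clustering coefficient of one node: the fraction of pairs of its
# neighbors that are themselves connected. Nodes with fewer than two
# neighbors score 0.0.
def local_clustering(adj, node)
  neighbors = adj[node].to_a
  k = neighbors.size
  return 0.0 if k < 2
  links = neighbors.combination(2).count { |u, v| adj[u].include?(v) }
  links.to_f / (k * (k - 1) / 2)
end

# Average of the local coefficients over every node in the graph.
def average_clustering(edges)
  adj = build_adjacency(edges)
  return 0.0 if adj.empty?
  adj.keys.sum { |node| local_clustering(adj, node) } / adj.size
end

# Toy graph: a triangle (a, b, c) with one extra follower d attached to c.
edges = [%w[a b], %w[b c], %w[a c], %w[c d]]
puts average_clustering(edges)
```

The calculation needs every node's full neighbor list at once, which is exactly what a per-user API call cannot give you. On a real graph the same per-node step would have to be distributed (for example as a Hadoop job) rather than held in one process's memory as it is here.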
Wuclan: Scraping
The scraping code is almost ready for public use. Check back shortly.
lib/wuclan/*/models
Defines the Wukong objects we’ll most often use.
- The user models:
  - TwitterUser
  - TwitterUserProfiles
lib/wuclan/*/request
- Request — the basic request metadata
- Parse — dispatches the request contents into wuclan objects
- Wuclan::Request::Streamer ensures that the request is left alone while recordizing.
Wuclan: Analysis
Most of this still lives in the imw_twitter_friends repo.
Install
Get the code
We’re still actively developing edamame. The newest version is available via Git on GitHub:
$ git clone git://github.com/mrflip/edamame
A gem is available from gemcutter:
$ sudo gem install edamame --source=http://gemcutter.org
(don’t use the gems.github.com version — it’s way out of date.)
You can instead download this project in either zip or tar formats.
Get the Dependencies
To finish setting up, see the detailed setup instructions and then read the usage notes. You will need:
- beanstalkd 1.3, libevent 1.4, and beanstalk-client
- Tokyo Tyrant, Tokyo Tyrant Ruby libs, Tokyo Cabinet, and Tokyo Cabinet Ruby libs
- Gems: wukong and monkeyshines
See the detailed install instructions (they also have hints about installing Tokyo*, Beanstalkd and friends).
lib/wuclan/
More info
There are many useful examples in the examples/ directory.
Credits
wuclan was written by Philip (flip) Kromer (@mrflip) for the infochimps project.
Help!
Send wuclan questions to the Infinite Monkeywrench mailing list.