Ratr

Ratr is a command line tool that takes in a CSV of movie titles and years, then averages the Rotten Tomatoes scores (critic and audience) with the IMDB rating for each movie.

USAGE

gem install ratr
ratr

20 Feet from Stardom (2013): 9.1
12 Years a Slave (2013): 8.9
7th Heaven (1927): 8.0
7 Faces of Dr. Lao (1964): 8.4
8 Mile (2002): 6.6
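
The input is a CSV of titles and years; the exact column layout isn't shown here, but something along these lines is what I mean (illustrative, not a file from the repo):

20 Feet from Stardom,2013
12 Years a Slave,2013
7th Heaven,1927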

DEVELOPMENT

To run the test suite:

git clone [email protected]:patricksrobertson/ratr.git
cd ratr
bundle

rspec

To throw out the existing VCR cassettes:

rm -rf spec/vcr
rspec

(Note: the VCR cassettes don't seem to replay retries quite as faithfully as the real code, so re-recording may produce non-deterministic failures. For the sake of this example I suggest keeping the cassettes.)

PERFORMANCE

The application should run within the bounds of the Rotten Tomatoes (RT) expected response time, which I calculate at roughly between 3m45s and 7m50s. The idea behind the calculation is that RT rate limits on a per-second basis, so in the best case you can complete 5 requests per second (.2 sec/req), but in a failure-ridden case you are probably doubling the time per request (.4 sec/req). I'm running two parallel requests at a time against the OMDB API, as it tends to respond more slowly than the RT API and is comfortable at that level of concurrency.

The application pools both sources into independent threads and then blocks until both operations are complete. Earlier versions ran this in serial, resulting in unacceptable runtimes.
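
A minimal sketch of that fan-out-and-join, with stand-in lambdas where the real Source and Manager classes would be:

# Illustrative only -- the real Source/Manager classes differ.
sources = {
  rotten_tomatoes: ->(movie) { 8.6 },   # stand-ins for the real fetchers
  omdb:            ->(movie) { 8.9 }
}

movie = '12 Years a Slave (2013)'

threads = sources.map do |name, fetcher|
  Thread.new { [name, fetcher.call(movie)] }   # one thread per source
end

# Block until both operations are complete, then average the scores.
scores  = threads.map(&:value).to_h
average = (scores.values.sum / scores.size.to_f).round(1)
puts "#{movie}: #{average}"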

The way to speed this up would be to have multiple RT API keys, and split the movies collection up into N components. Probably not what RT had in mind while doing rate limiting.

DURABILITY

Neither API documents its error codes particularly well. The only error code I (frequently) ran into was a 403 on the RT API -- which should've been a 503 for rate limit exceeded.
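
The live code retries around that 403; here's a rough sketch of the idea (the helper name and backoff values are illustrative, not the production implementation):

require 'net/http'

# Illustrative retry around RT's 403 rate-limit response; the real backoff differs.
def fetch_with_retry(uri, attempts: 5)
  attempts.times do
    response = Net::HTTP.get_response(uri)
    return response unless response.code == '403'   # treat 403 as "slow down"
    sleep 0.4                                        # back off before retrying
  end
  raise "Still rate limited after #{attempts} attempts: #{uri}"
end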

The OMDB API scraper handles the 404 case well -- so well that I didn't realize some of the movies weren't found on OMDB until I looked at the VCR cassettes.

Given that I wasn't able to find a maintenance response code from either API, I deferred implementing specific error handling for that event. The last case to consider was timeouts when the application's network is unavailable; I ran out of time to add that handling at the end.

I would normally do that.
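
Roughly, that handling would look something like this (hypothetical helper; the actual error classes depend on the HTTP adapter in use):

require 'net/http'

# Hypothetical timeout handling -- not in the current implementation.
def fetch_or_give_up(uri, retries: 3)
  Net::HTTP.get_response(uri)
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => error
  retries -= 1
  retries.positive? ? retry : raise(error)
end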

EXTENSIBILITY

The bulk of my effort was devoted to making this slightly more extensible. I constructed a Source class that combines an HTTP processor (serial or concurrent) with an HTTP wrapper that tells the source how to fetch an individual movie and then scrub the responses down to the pertinent scores. The thread manager can take any number of sources, so to add a new source you'd do the following (a rough sketch follows the list):

  • Write an HTTP wrapper class, defining the interface of #get, #scrub, #adapter.
  • Add the source (indicating whether it is serial or concurrent) to the Application class.
  • Add the source to the sources array passed into the thread manager.
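
For example, a new wrapper might look like this. Only the #get/#scrub/#adapter interface comes from the existing code; the class name, URL, and method bodies are made up for illustration:

require 'json'
require 'uri'

# Illustrative wrapper for a hypothetical new source.
class MetacriticWrapper
  def adapter
    :concurrent                          # or :serial, to pick the HTTP processor
  end

  def get(movie)
    # How to fetch an individual movie; the URL is a placeholder.
    query = URI.encode_www_form(title: movie.title, year: movie.year)
    URI("http://metacritic.example/search?#{query}")
  end

  def scrub(response)
    # Scrub the raw response down to the pertinent score.
    JSON.parse(response.body)['score']
  end
end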

I got a little hand-wavy with my testing of these things. With the time allotment light, I aimed for a 'refactor away a spike' technique: I wrote acceptance tests and then, once those passed, started moving code around. So at the individual level the tests are a little light, and I totally avoided doubling concrete objects. In order to be able to test with doubles, I'd probably do this (sketched after the list):

  • Add tests for the specific interfaces/roles (HttpProcessor, HttpWrapper)
  • Share said tests on the concrete examples
  • Feel much more confident about doubling -- now if the interface or a concrete example fails, the test suite can catch that at the same rate as the implementation.
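
An RSpec shared example group along these lines is what I have in mind (wrapper class names here are stand-ins for whatever the concrete wrappers are actually called):

# Illustrative shared example group for the HttpWrapper role.
RSpec.shared_examples 'an HTTP wrapper' do
  it { is_expected.to respond_to(:get) }
  it { is_expected.to respond_to(:scrub) }
  it { is_expected.to respond_to(:adapter) }
end

# In each concrete wrapper's spec:
RSpec.describe RottenTomatoesWrapper do
  subject { described_class.new }

  it_behaves_like 'an HTTP wrapper'
end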

To be fair, I'm not happy with the HTTP wrappers. The scrub method smells of doing too much. I'd look at that with caution and potentially break it out if another source makes this look bad.

I didn't make the movies collection a real object -- there are some smells of accessing the movies collection like a data structure instead of sending messages to it. Making something that implements Enumerable (in addition to providing find methods) would've been preferable. The same can be said for the collection of results that the Manager returns.
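
Something like this is what I have in mind (a sketch, not code in the repo):

# Illustrative Movies collection -- not part of the current codebase.
class Movies
  include Enumerable

  def initialize(movies)
    @movies = movies
  end

  def each(&block)
    @movies.each(&block)
    self
  end

  def released_in(year)
    select { |movie| movie.year == year }   # finder built on Enumerable
  end
end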

TIME SPENT

Spike 1 (serial access) [branch pr-spike] - 45 minutes. Goal was to gain understanding of APIs, worst case runtime.

Spike 2 (parallel access) [branch pr-multi_spike] - 1 hour. Goal was to bring runtime down to the time it takes to pull results from the RT API.

Production implementation [master / pr-refactor_from_spike] - 3-4 hours. I dabbled in this throughout an evening and did a little work the following morning. Goal was to write tests that could re-run without hitting the APIs (rate limit rules everything around me) and bring the application up to an MVP.