SKINNY JEANS LOG PARSING WITH RUBY & SQLITE FOR HIPSTERS

EXAMPLE

  • your log file has lines that look like this:

0.0.0.0 - - [01/Oct/2010:00:00:00 -0700] "GET /posts/my-first-post HTTP/1.1" 200 1337 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
0.0.0.0 - - [01/Oct/2010:00:00:01 -0700] "GET /posts/my-first-post HTTP/1.1" 200 1337 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
0.0.0.0 - - [01/Oct/2010:00:00:03 -0700] "GET /posts/my-first-post HTTP/1.1" 200 1337 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
0.0.0.0 - - [02/Oct/2010:00:00:03 -0700] "GET /posts/my-first-post HTTP/1.1" 200 1337 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
  • then you get 2 SQL rows that look like:

2010-10-01, my-first-post, 3
2010-10-02, my-first-post, 1
  • note the date column truncates the timestamp, so the days are in whatever timezone your log file reports
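
  • for the curious, a minimal stdlib sketch of that truncation (no skinny_jeans involved; the strptime pattern assumes the stock Apache/nginx timestamp shown above):

require 'date'

DateTime.strptime("01/Oct/2010:00:00:00 -0700", "%d/%b/%Y:%H:%M:%S %z").to_date.to_s
# => "2010-10-01"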

WHY?

  • so you can query a database by date and path to get pageview counts, and have that data stored CHEAP

  • because i couldn’t find anything simpler and Google Analytics is limited to 10,000 API requests per day

USAGE

sj = SkinnyJeans::execute(
  logfile_path = "access.log",
  sqlite_skinny_jeans = "sqlite_skinny_jeans.db",
  path_regexp = /\s\/posts\/(.*)\sHTTP/,
  date_regexp = /\[(\d.*\d)\]/
)
sj.pageview.where("date = '2010-10-01' and path = 'my-first-post'").first
=> #<SkinnyJeans::Pageview id: 1, date: "2010-10-01", path: "my-first-post", pageview_count: 3>
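
to see exactly what the two regexps capture, here is a quick irb sketch against the first example line above (plain String#[] with a capture group, nothing gem-specific):

line = '0.0.0.0 - - [01/Oct/2010:00:00:00 -0700] "GET /posts/my-first-post HTTP/1.1" 200 1337 "-" "Mozilla/5.0" "-"'
line[/\s\/posts\/(.*)\sHTTP/, 1] # => "my-first-post"
line[/\[(\d.*\d)\]/, 1]          # => "01/Oct/2010:00:00:00 -0700"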
  1. NOTE: for now *you have to monkey-patch SkinnyJeans#parse_string_as_date* (see the first sketch after this list)

  2. Parse your oldest logs first, then run it regularly against your main log and let logrotate handle the rest (skinny_jeans remembers where it left off)

  3. ASSUMES log files are read in ascending order; it keeps track of the last line read, so you can put it on a scheduler or cron job (see the runner sketch after this list)

  4. access the 2 ActiveRecord classes: sj.pageview (returns the Pageview class) and sj.update

  5. enjoy the skinny jeans
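
a sketch of the monkey patch from item 1, assuming SkinnyJeans is a plain class (the # in SkinnyJeans#parse_string_as_date suggests an instance method) and that your date_regexp captures the stock Apache-style timestamp; swap the strptime pattern to match your own logs:

require 'date'
require 'skinny_jeans'

class SkinnyJeans
  # hypothetical override: turn whatever date_regexp captured into a Date,
  # here by parsing the full Apache timestamp and dropping the time of day
  def parse_string_as_date(date_string)
    DateTime.strptime(date_string, "%d/%b/%Y:%H:%M:%S %z").to_date
  end
end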
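
and for items 2 and 3, a minimal cron target (parse_logs.rb is a hypothetical filename and the paths are placeholders); re-running it is cheap because skinny_jeans only reads lines after the last one it parsed:

# parse_logs.rb
require 'skinny_jeans'

SkinnyJeans::execute("/var/log/nginx/access.log",
                     "/var/www/app/sqlite_skinny_jeans.db",
                     /\s\/posts\/(.*)\sHTTP/,
                     /\[(\d.*\d)\]/)

# crontab entry, every 10 minutes:
#   */10 * * * * ruby /var/www/app/parse_logs.rb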

PERFORMANCE

  • it parses 100,000 lines in < 2.5 seconds

  • persists 1,000 requests with 2 compound indexes in 15 seconds, or 10 seconds with the home_run C extension
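
  • your numbers will vary; a rough way to measure on your own logs (a sketch using the stdlib Benchmark, with placeholder paths; load home_run first, per its own docs, if you want to test that speedup):

require 'benchmark'
require 'skinny_jeans'

puts Benchmark.measure {
  SkinnyJeans::execute("access.log", "bench.db", /\s\/posts\/(.*)\sHTTP/, /\[(\d.*\d)\]/)
}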