Swineherd-fs

  • file – Local file system. Only thoroughly tested on Ubuntu Linux.
  • hdfs – Hadoop distributed file system. Uses the Apache Hadoop 0.20 API. Requires JRuby.
  • s3 – Amazon Simple Storage Service (S3).
  • ftp – FTP (Not yet implemented)
All filesystem abstractions implement the following core functions, many taken from the UNIX filesystem:
  • mv
  • cp
  • cp_r
  • rm
  • rm_r
  • open
  • exists?
  • directory?
  • ls
  • ls_r
  • mkdir_p
Note: Since S3 is a flat key-value store, it has no real notion of a directory, and empty directories cannot exist. The mkdir_p function therefore has little purpose on S3; it currently only ensures that the bucket exists. As a consequence, the directory? test succeeds only if the directory is non-empty, which differs from UNIX filesystem semantics.
Additionally, the S3 and HDFS abstractions implement functions for moving files to and from the local filesystem:
  • copy_to_local
  • copy_from_local
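The S3 directory caveat described earlier (a directory exists only while at least one key lives under it) can be illustrated with a small in-memory sketch. FlatKeyStore and its methods are hypothetical stand-ins written for this illustration; they are not Swineherd's S3 implementation and make no network calls:

```ruby
require 'set'

# Hypothetical in-memory stand-in for a flat key-value store such as an
# S3 bucket. Illustration only; not part of Swineherd.
class FlatKeyStore
  def initialize
    @keys = Set.new
  end

  # Store an object under a full key, e.g. "logs/2011/a.txt".
  def put(key)
    @keys << key
  end

  # Remove an object by its full key.
  def delete(key)
    @keys.delete(key)
  end

  # A "directory" exists only if at least one key starts with its prefix.
  def directory?(path)
    prefix = path.end_with?('/') ? path : "#{path}/"
    @keys.any? { |k| k.start_with?(prefix) }
  end

  # Nothing to create: a prefix with no keys cannot be represented.
  def mkdir_p(_path)
    true
  end
end

store = FlatKeyStore.new
store.mkdir_p('logs/2011')
puts store.directory?('logs/2011')   # false: no keys under the prefix yet
store.put('logs/2011/a.txt')
puts store.directory?('logs/2011')   # true: one key shares the prefix
store.delete('logs/2011/a.txt')
puts store.directory?('logs/2011')   # false: the "directory" vanished with its last key
```

Code that relies on mkdir_p followed by directory? behaves differently here than on a UNIX filesystem, which is the clash the note describes.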
Note: For copy_to_local and copy_from_local, the destination path and the source path, respectively, are assumed to be local, so they do not need to be prefixed with a scheme.
The Swineherd::Filesystem module implements a generic filesystem abstraction using schemed filepaths (hdfs://, s3://, file://). Currently only the following methods are supported for Swineherd::Filesystem:
  • cp
  • exists?
For example, instead of doing the following:
hdfs = Swineherd::HadoopFilesystem.new
localfs = Swineherd::LocalFileSystem.new
hdfs.copy_to_local('foo/bar/baz.txt', 'foo/bar/baz.txt') unless localfs.exists? 'foo/bar/baz.txt'
You can do:
fs = Swineherd::Filesystem
fs.cp('hdfs://foo/bar/baz.txt', 'foo/bar/baz.txt') unless fs.exists?('foo/bar/baz.txt')
Note: A path without a scheme is treated as a path on the local filesystem. Use the explicit file:// scheme for clarity. The following are equivalent:
fs.exists?('foo/bar/baz.txt')
fs.exists?('file://foo/bar/baz.txt')
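How a schemed path might be resolved to a filesystem can be sketched as follows. This is a minimal illustration under stated assumptions: split_scheme and KNOWN_SCHEMES are hypothetical names introduced here, and Swineherd's actual dispatch logic may differ:

```ruby
# Hypothetical helper, not Swineherd's code: split a schemed path into
# [scheme, path], defaulting to :file when no scheme is present.
KNOWN_SCHEMES = %w[file hdfs s3].freeze

def split_scheme(path)
  match = path.match(%r{\A([a-z][a-z0-9]*)://(.*)\z})
  # No "scheme://" prefix: treat the whole string as a local path.
  return [:file, path] unless match

  scheme = match[1]
  rest = match[2]
  unless KNOWN_SCHEMES.include?(scheme)
    raise ArgumentError, "unsupported scheme: #{scheme}"
  end
  [scheme.to_sym, rest]
end

p split_scheme('foo/bar/baz.txt')          # [:file, "foo/bar/baz.txt"]
p split_scheme('file://foo/bar/baz.txt')   # [:file, "foo/bar/baz.txt"]
p split_scheme('hdfs://foo/bar/baz.txt')   # [:hdfs, "foo/bar/baz.txt"]
```

The first two calls return the same pair, which mirrors the equivalence between a bare path and an explicit file:// path.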

Config

  • In order to use the S3FileSystem, Swineherd requires AWS S3 access credentials.
  • In ~/swineherd.yaml or /etc/swineherd.yaml:
aws:
  access_key: my_access_key
  secret_key: my_secret_key
  • Or just pass them in when creating the instance:
s3 = Swineherd::S3FileSystem.new(:access_key => "my_access_key", :secret_key => "my_secret_key")
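The two ways of supplying credentials can be combined in a small lookup routine. The sketch below is an assumption about how such a lookup could work, not Swineherd's own code: resolve_aws_credentials and default_config_paths are hypothetical names, and the precedence (explicit options first, then ~/swineherd.yaml, then /etc/swineherd.yaml) is assumed rather than documented:

```ruby
require 'yaml'

# Hypothetical helper, not part of Swineherd. The search order is an assumption.
def default_config_paths
  [File.expand_path('~/swineherd.yaml'), '/etc/swineherd.yaml']
end

# Returns a hash with :access_key and :secret_key.
# Explicit options win; otherwise the first readable config file is used.
def resolve_aws_credentials(options = {}, paths = default_config_paths)
  if options[:access_key] && options[:secret_key]
    return { :access_key => options[:access_key],
             :secret_key => options[:secret_key] }
  end

  path = paths.find { |candidate| File.readable?(candidate) }
  unless path
    raise ArgumentError, 'no AWS credentials given and no config file found'
  end

  # Expects the nested layout shown above: an "aws" mapping with two keys.
  aws = (YAML.load_file(path) || {})['aws'] || {}
  { :access_key => aws['access_key'], :secret_key => aws['secret_key'] }
end

creds = resolve_aws_credentials(:access_key => 'my_access_key',
                                :secret_key => 'my_secret_key')
puts creds[:access_key]   # my_access_key
```

Passing the paths as an argument keeps the lookup easy to exercise against a temporary file instead of a real home directory.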