Class: Mbox2CSV::MboxParser

Inherits:
Object
Defined in:
lib/mbox2csv.rb

Overview

Main class for parsing MBOX files, saving email data/statistics to CSV, and (optionally) extracting selected attachment types to disk.

Constructor Details

#initialize(mbox_file, csv_file, stats_csv_file, recipient_stats_csv_file) ⇒ MboxParser

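Initializes the MboxParser with the path to the MBOX file, the path to the output CSV file, and the paths to the CSV files for sender and recipient statistics. As a side effect, it creates a senders/ folder relative to the current working directory if one does not already exist.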

Parameters:

  • mbox_file (String)

    Path to the MBOX file to be parsed.

  • csv_file (String)

    Path to the output CSV file where parsed email data will be saved.

  • stats_csv_file (String)

    Path to the output CSV file where sender statistics will be saved.

  • recipient_stats_csv_file (String)

    Path to the output CSV file where recipient statistics will be saved.



# File 'lib/mbox2csv.rb', line 18

def initialize(mbox_file, csv_file, stats_csv_file, recipient_stats_csv_file)
    @mbox_file = mbox_file
    @csv_file = csv_file
    @statistics = EmailStatistics.new
    @stats_csv_file = stats_csv_file
    @recipient_stats_csv_file = recipient_stats_csv_file
    @senders_folder = 'senders/'
    FileUtils.mkdir_p(@senders_folder) # Create the senders folder if it doesn't exist
end

Instance Method Details

#extract_attachments(extract: true, filetypes: [], output_folder: 'attachments') ⇒ Integer

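Extracts attachments of the selected file types from the MBOX file into a folder, creating the folder if it does not exist. Errors are caught and reported on standard output rather than propagated; this includes the ArgumentError raised when extract is true and filetypes is empty, in which case the method returns 0.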

Parameters:

  • extract (Boolean) (defaults to: true)

    Whether to extract attachments. When false, the method returns 0 immediately.

  • filetypes (Array<String>) (defaults to: [])

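    Array of extensions to extract (e.g., %w[pdf jpg png]). Entries are lowercased, stripped of a leading dot, and de-duplicated before use. Must not be empty when extract is true.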

  • output_folder (String) (defaults to: 'attachments')

    Directory to write attachments into.

Returns:

  • (Integer)

    Number of files successfully written; 0 if extraction is disabled or an error occurs.
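
The normalization applied to filetypes can be tried in isolation. The following is a minimal sketch: the helper name normalize_extensions is hypothetical and not part of Mbox2CSV, while the expression inside it mirrors the one in the method source.

```ruby
# Hypothetical helper, not part of Mbox2CSV: mirrors how extract_attachments
# normalizes its filetypes argument (lowercase, no leading dot, no duplicates).
def normalize_extensions(filetypes)
  Array(filetypes).map { |e| e.to_s.downcase.sub(/\A\./, '') }.uniq
end

p normalize_extensions(%w[.PDF pdf Jpg]) # mixed case and a leading dot
p normalize_extensions(:png)             # a single symbol is wrapped by Array()
p normalize_extensions(nil)              # nil becomes an empty list
```

Because an empty result triggers the ArgumentError path described above, passing nil or [] with extract: true yields a return value of 0 and an error message on standard output.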



# File 'lib/mbox2csv.rb', line 65

def extract_attachments(extract: true, filetypes: [], output_folder: 'attachments')
    return 0 unless extract

    wanted_exts = Array(filetypes).map { |e| e.to_s.downcase.sub(/\A\./, '') }.uniq
    raise ArgumentError, "filetypes must not be empty when extract: true" if wanted_exts.empty?

    FileUtils.mkdir_p(output_folder)
    total_written = 0

    total_lines = File.foreach(@mbox_file).inject(0) { |c, _| c + 1 }
    progressbar = ProgressBar.create(title: "Extracting Attachments", total: total_lines, format: "%t: |%B| %p%%")

    File.open(@mbox_file, 'r') do |mbox|
        buffer = ""
        mbox.each_line do |line|
            progressbar.increment
            if line.start_with?("From ")
                total_written += process_attachment_block(buffer, wanted_exts, output_folder) unless buffer.empty?
                buffer = ""
            end
            buffer << line
        end
        total_written += process_attachment_block(buffer, wanted_exts, output_folder) unless buffer.empty?
    end

    puts "Attachment extraction completed. #{total_written} file(s) saved to #{output_folder}"
    total_written
rescue => e
    puts "Error extracting attachments: #{e.message}"
    0
end

#parse ⇒ Object

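Parses the MBOX file and writes one row per email (From, To, Subject, Date, Body) to the CSV file given to the constructor. It then saves sender and recipient statistics to their separate CSV files. A progress bar is displayed while parsing. Errors are caught and reported on standard output rather than propagated.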



# File 'lib/mbox2csv.rb', line 31

def parse
    total_lines = File.foreach(@mbox_file).inject(0) { |c, _line| c + 1 }
    progressbar = ProgressBar.create(title: "Parsing Emails", total: total_lines, format: "%t: |%B| %p%%")

    CSV.open(@csv_file, 'w') do |csv|
        csv << ['From', 'To', 'Subject', 'Date', 'Body']

        File.open(@mbox_file, 'r') do |mbox|
            buffer = ""
            mbox.each_line do |line|
                progressbar.increment
                if line.start_with?("From ")
                    process_email_block(buffer, csv) unless buffer.empty?
                    buffer = ""
                end
                buffer << line
            end
            process_email_block(buffer, csv) unless buffer.empty?
        end
    end
    puts "Parsing completed. Data saved to #{@csv_file}"

    @statistics.save_sender_statistics(@stats_csv_file)
    @statistics.save_recipient_statistics(@recipient_stats_csv_file)
rescue => e
    puts "Error processing MBOX file: #{e.message}"
end
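
The message-splitting rule used by #parse and #extract_attachments (start a new block at every line beginning with "From ") can be sketched with the standard library alone. The helper name split_mbox and the sample data are hypothetical and not part of Mbox2CSV; the private methods process_email_block and process_attachment_block are not shown in this section and are not reproduced here.

```ruby
require 'stringio'

# Hypothetical helper, not part of Mbox2CSV: splits an mbox stream into
# per-message strings using the same "From " separator rule as #parse.
def split_mbox(io)
  blocks = []
  buffer = +""
  io.each_line do |line|
    if line.start_with?("From ")
      # A new separator line closes the previous message, if there was one.
      blocks << buffer unless buffer.empty?
      buffer = +""
    end
    buffer << line
  end
  # Flush the final message, which has no following separator line.
  blocks << buffer unless buffer.empty?
  blocks
end

sample = <<~MBOX
  From alice@example.com Mon Jan  1 00:00:00 2024
  Subject: First

  Hello
  From bob@example.com Tue Jan  2 00:00:00 2024
  Subject: Second

  >From here on, this line stays in the body
MBOX

messages = split_mbox(StringIO.new(sample))
puts messages.length
puts messages.last.lines.last
```

Under this rule, a body line that begins with "From " and is not quoted (for example as ">From ") would be treated as the start of a new message, so the result depends on how the mbox file was written.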