Wgit Change Log
v0.0.0 - BREAKING CHANGES
### Added
- ...
### Changed/Removed
- ...
### Fixed
- ...
v0.11.0 - BREAKING CHANGES
This release is a biggie with the main headline being the introduction of robots.txt support (see below). This release introduces several breaking changes so take care when updating your current version of Wgit.
### Added
- Ability to prevent indexing via `robots.txt` and `noindex` values in HTML `meta` elements and HTTP response header `X-Robots-Tag`. See new class `Wgit::RobotsParser` and the updated `Wgit::Indexer#index_*` methods. Also see the wiki article on the subject.
- `Wgit::RobotsParser` class for parsing `robots.txt` files.
- `Wgit::Response#no_index?` and `Wgit::Document#no_index?` methods (see wiki article above).
- Added two new default extractors which extract robots meta elements for use in `Wgit::Document#no_index?`.
- Added `Wgit::Document.to_h_ignore_vars` Array for user manipulation.
- Added `Wgit::Utils.pprint` method to aid debugging.
- Added `Wgit::Utils.sanitize_url` method.
- Added `Wgit::Indexer#index_www(max_urls_per_iteration:, ...)` param.
- Added `Wgit::Url#redirects` and `#redirects=` methods.
- Added `Wgit::Url#redirects_journey` used by `Wgit::Indexer` to insert a Url and its redirects.
- Added `Wgit::Database#bulk_upsert` which `Wgit::Indexer` now uses where possible. This reduces the total database calls made during an index operation.
### Changed/Removed
- Updated `Wgit::Indexer#index_*` methods to honour index prevention methods (see the wiki article).
- Updated `Wgit::Utils.sanitize*` methods so they no longer modify the receiver.
- Updated `Wgit::Crawler#crawl_url` to always return the crawled `Wgit::Document`. If relying on `nil` in your code, you should now use `doc.empty?` instead.
- Updated `Wgit::Indexer` method logs.
- Updated/added custom class `#inspect` methods.
- Renamed `Wgit::Utils.printf_search_results` to `pprint_search_results`.
- Renamed `Wgit::Url#concat` to `#join`. The `#concat` method is now `String#concat`.
- Updated `Wgit::Indexer` methods to now write external Urls to the Database as: `doc.external_urls.map(&:to_origin)` meaning `http://example.com/about` becomes `http://example.com`.
- Updated the following methods to no longer omit trailing slashes from Urls: `Wgit::Url` - `#to_path`, `#omit_base`, `#omit_origin` and `Wgit::Document` - `#internal_links`, `#internal_absolute_links`, `#external_links`. For an average website, this results in ~30% fewer network requests when crawling.
- Updated Ruby version to `3.3.0`.
- Updated all bundle dependencies to latest versions, see `Gemfile.lock` for exact versions.
### Fixed
- `Wgit::Crawler#crawl_site` now internally records all redirects for a given Url.
- `Wgit::Crawler#crawl_site` infinite loop when using Wgit on a Ruby version >3.0.2.
- Various other minor fixes/improvements throughout the code base.
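
As a rough illustration of the external Url change above (`http://example.com/about` becomes `http://example.com`), the sketch below reduces URLs to their origin using only Ruby's standard `uri` library. The `origin_of` helper and the sample URLs are hypothetical; this is not Wgit's `#to_origin`, whose handling may differ.

```ruby
require 'uri'

# Hypothetical helper (not part of Wgit): reduce a URL to scheme://host[:port].
def origin_of(url)
  uri = URI.parse(url)
  origin = "#{uri.scheme}://#{uri.host}"
  # Keep the port only when it differs from the scheme's default port.
  origin += ":#{uri.port}" unless uri.port == uri.default_port
  origin
end

# Sample external links, invented for illustration.
external_urls = [
  'http://example.com/about',
  'http://example.com/contact?x=1',
  'https://other.example.org:8443/a/b'
]

# Several pages on the same site collapse into a single origin.
origins = external_urls.map { |url| origin_of(url) }.uniq
puts origins
```

Collapsing to origins means several links into the same external site produce one stored Url rather than one per page.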
v0.10.8
### Added
- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
- `Document.remove_extractors` method, which removes all default and defined extractors.
### Changed/Removed
- ...
### Fixed
- ...
v0.10.7
### Added
- ...
### Changed/Removed
- ...
### Fixed
- Security vulnerabilities by updating gem dependencies.
v0.10.6
### Added
- `Wgit::DSL` method `#crawl_url` (aliased to `#crawl`).
### Changed/Removed
- Added a `&block` param to `Wgit::Document#extract`, which gets passed to `#extract_from_html`.
### Fixed
- ...
v0.10.5
### Added
- `Database#last_result` getter method to return the most recent raw mongo result.
### Changed/Removed
- ...
### Fixed
- ...
v0.10.4
### Added
- `Database#search_text` method which returns a Hash of `url => text_results` instead of `Wgit::Document`s (like `#search`).
### Changed/Removed
- ...
### Fixed
- ...
v0.10.3
### Added
- ...
### Changed/Removed
- Changed `Database#create_collections` and `#create_unique_indexes` by removing `rescue nil` from their database operations. Now any underlying errors with the database client are not masked.
### Fixed
- ...
v0.10.2
### Added
- `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
### Changed/Removed
- ...
### Fixed
- ...
v0.10.1
### Added
- Support for Ruby 3.
### Changed/Removed
- Removed support for Ruby 2.5 (as it's too old).
### Fixed
- ...
v0.10.0
### Added
- `Wgit::Url#scheme_relative?` method.
### Changed/Removed
- Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
### Fixed
- Scheme-relative bug by adding support for scheme-relative URLs.
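
For readers unfamiliar with the term, a scheme-relative URL starts with `//` and inherits the scheme of the page it appears on. The sketch below shows the idea with a defaulted positional parameter, similar in shape to the signature change described above. Both helper names and the `'http'` default are assumptions made for illustration; they are not Wgit's implementation.

```ruby
# Hypothetical stand-ins, not Wgit's implementation.

# True for URLs such as '//example.com/about' that inherit the page's scheme.
def scheme_relative_str?(url)
  url.start_with?('//')
end

# The scheme is a defaulted positional parameter (assumed default: 'http').
def with_scheme(url, scheme = 'http')
  # Leave the URL alone if it already carries a scheme.
  return url if url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  # Drop a leading '//' (if any) before prefixing the scheme.
  "#{scheme}://#{url.sub(%r{\A//}, '')}"
end

with_scheme('//example.com/about')   # "http://example.com/about"
with_scheme('example.com', 'https')  # "https://example.com"
with_scheme('https://example.com')   # "https://example.com"
```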
v0.9.0
This release is a big one with the introduction of a Wgit::DSL and Javascript parse support. The README has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
### Added
- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
- `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
- `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
- `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
- `Wgit::Database#clear_db!` alias.
- `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
- `Wgit::Document#extract` method to perform one off content extractions.
- `Wgit::Indexer#index_urls` method which can index several urls in one call.
- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
### Changed/Removed
- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
- Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
- Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
- Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
- Breaking change: Refactored `Wgit::Url#relative?`, which now takes `:origin` (which takes the port into account) instead of `:base`. This has a knock on effect for some other methods too - check the docs if you're getting parameter errors.
- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
- Updated `Utils.printf_search_results` to return the number of results.
- Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
### Fixed
- Re-indexing bug so that indexing content a 2nd time will update it in the database - before it simply discarded the document.
- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
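
To illustrate what accepting a `Regexp` query in addition to a String can look like, here is a minimal standalone sketch. The `search_lines` helper and its `case_sensitive:` keyword are hypothetical and only approximate the idea; they are not the `Wgit::Document#search` implementation.

```ruby
# Hypothetical helper (not part of Wgit): filter lines of text by a
# String or Regexp query.
def search_lines(lines, query, case_sensitive: false)
  pattern =
    if query.is_a?(Regexp)
      query # Regexp queries are used as given.
    else
      # String queries are escaped so characters like '.' match literally.
      Regexp.new(Regexp.escape(query.to_s), case_sensitive ? nil : Regexp::IGNORECASE)
    end

  lines.select { |line| line.match?(pattern) }
end

text = ['Hello World', 'Version 2 released', 'goodbye']

search_lines(text, 'hello') # ["Hello World"]
search_lines(text, /\d/)    # ["Version 2 released"]
```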
v0.8.0
### Added
- To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
- `Wgit::Document#description` default extension.
- `Wgit::Url.parse_or_nil` method.
### Changed/Removed
- Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
- Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
- Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
- Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
- Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse` allowing instances of `Wgit::Url` to be returned as is. This also affects `Enumerable#to_urls` in the same way.
### Fixed
- An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
v0.7.0
### Added
- `Wgit::Indexer.new` optional `crawler:` named param.
- `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
- `Document.extensions` returning a Set of all defined extensions.
### Changed/Removed
- Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search`, `Wgit.indexed_search` etc. This brings back more relevant search results by default.
- Updated the Docker image to now include index names; making it easier to identify them.
### Fixed
- ...
v0.6.0
### Added
- Added `Wgit::Utils.process_arr` `encode:` param.
### Changed/Removed
- Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
- Breaking changes: Updated `Wgit::Crawler` redirect logic. See the docs for more info.
- Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the docs for more info.
- Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
- Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
- Improved `Wgit::Url#valid?` and `#relative?`.
### Fixed
- Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
- Bug fix in `Wgit::Document#search`.
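
The glob support for path params (e.g. `allow_paths: 'wiki/*'`) can be pictured with Ruby's built-in `File.fnmatch?`. This is only a sketch of glob-based path filtering under assumed rules: the `path_allowed?` helper is hypothetical, and Wgit's actual matching logic may differ (see the docs).

```ruby
# Hypothetical helpers (not part of Wgit).

# Treat '/wiki/*' and 'wiki/*' the same by dropping one leading slash.
def strip_leading_slash(str)
  str.sub(%r{\A/}, '')
end

# A path is crawlable if it matches an allow glob (or no allow globs are
# given) and matches none of the disallow globs.
def path_allowed?(path, allow: [], disallow: [])
  path     = strip_leading_slash(path)
  allow    = Array(allow).map { |glob| strip_leading_slash(glob) }
  disallow = Array(disallow).map { |glob| strip_leading_slash(glob) }

  allowed = allow.empty? || allow.any? { |glob| File.fnmatch?(glob, path) }
  blocked = disallow.any? { |glob| File.fnmatch?(glob, path) }

  allowed && !blocked
end

path_allowed?('/wiki/Home', allow: 'wiki/*')                             # true
path_allowed?('/about', allow: 'wiki/*')                                 # false
path_allowed?('/wiki/Private', allow: 'wiki/*', disallow: 'wiki/Priv*')  # false
```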
v0.5.1
### Added
- `Wgit.version_str` method.
### Changed/Removed
- Switched to optimistic dependency versioning.
### Fixed
- Bug in `Wgit::Url#concat`.
v0.5.0
### Added
- A Wgit Wiki! https://github.com/michaeltelford/wgit/wiki
- `Wgit::Document#content` alias for `#html`.
- `Wgit::Url#prefix_base` method.
- `Wgit::Url#to_addressable_uri` method.
- Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
- `Wgit::Url#+` as alias for `#concat`.
- `Wgit::Url#invalid?` method.
- `Wgit.version` method.
- `Wgit::Response` class containing adapter agnostic HTTP response logic.
### Changed/Removed
- Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the docs.
- Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
- Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains the `#` prefix; and the query value no longer contains the `?` prefix.
- Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
- Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
- Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
- Breaking changes: `Wgit::Indexer` no longer inserts invalid external URLs (to be crawled at a later date).
- Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response`.
### Fixed
- Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
- Several bugs in `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
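
The prefix-free convention for query and anchor/fragment values mentioned above is the same one Ruby's standard `URI` class follows, so it can be demonstrated without Wgit or Addressable. The example URL is made up for illustration.

```ruby
require 'uri'

# Example URL invented for illustration.
uri = URI.parse('http://example.com/search?q=ruby#results')

uri.query    # "q=ruby"   (no leading '?')
uri.fragment # "results"  (no leading '#')
```

Code that previously expected `?q=ruby` or `#results` needs to add the prefix back itself when rebuilding a URL string.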
v0.4.1
### Added
- ...
### Changed/Removed
- ...
### Fixed
- A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default User-Agent header. This has now been changed.
v0.4.0
### Added
- `Wgit::Document#stats` alias `#statistics`.
- `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
- `Wgit::Crawler#last_response#redirect_count` method logic.
- `Wgit::Crawler#last_response#total_time` method logic.
- `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc.
### Changed/Removed
- Breaking changes: Updated `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net::HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.
### Fixed
- ...
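
The "tries multiple key formats" behaviour described for `Wgit::Utils.fetch` can be sketched as follows. The `fetch_any` helper, and the exact set and order of key formats it tries, are assumptions for illustration and not Wgit's actual implementation.

```ruby
# Hypothetical helper (not part of Wgit): look a key up under several
# String/Symbol and upper/lower case variants before falling back.
def fetch_any(hash, key, default = nil)
  str = key.to_s
  candidates = [
    key,
    str, str.to_sym,
    str.downcase, str.downcase.to_sym,
    str.upcase, str.upcase.to_sym
  ]

  found = candidates.find { |candidate| hash.key?(candidate) }
  found.nil? ? default : hash[found]
end

fetch_any({ 'FOO' => 1 }, :foo)    # 1
fetch_any({ foo: 2 }, 'foo')       # 2
fetch_any({}, :foo, 'fallback')    # "fallback"
```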
v0.3.0
### Added
- `Url#crawl_duration` method.
- `Document#crawl_duration` method.
- `Benchmark.measure` to Crawler logic to set `Url#crawl_duration`.
### Changed/Removed
- Breaking changes: Updated data model to embed the full `url` object inside the documents object.
- Breaking changes: Updated data model by removing documents `score` attribute.
### Fixed
- ...
v0.2.0
This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; if you're having issues with the upgrade, see the documentation at: https://www.rubydoc.info/gems/wgit
### Added
- `Wgit::Url#absolute?` method.
- `Wgit::Url#relative? base: url` support.
- `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
- `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.
### Changed/Removed
- Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
- Breaking changes: All `Wgit::Indexer` methods now take named parameters.
- Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`,
- Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat` in favour of instance methods by the same names.
- Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
- Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled: `Practical Database Example` for an example.
- Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
- Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
- Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
- Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
- Breaking changes: `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
- Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behavior. `Wgit::Crawler.redirect_limit` no longer exists.
- Breaking changes: The following `Wgit::Crawler` instance methods signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
- Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
- Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
- Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the `results`. See the docs for the full details.
- `Wgit::Document` instances can now be instantiated with `String` Url's (previously only `Wgit::Url`'s).
### Fixed
- ...
v0.0.18
### Added
- `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
### Changed/Removed
- Updated certain classes by changing some `private` methods to `protected`.
### Fixed
- ...
v0.0.17
### Added
- Support for `<base>` element in `Wgit::Document`'s.
- New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
### Changed/Removed
- Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
- Breaking changes (potentially): `Wgit::Url`'s are now replaced with the redirected to Url during a crawl.
- Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
- Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
- Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
### Fixed
- Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links including those used during `Wgit::Crawler#crawl_site`.
- Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
v0.0.16
### Added
- Added `Wgit::Url.parse` class method as alias for `Wgit::Url.new`.
### Changed/Removed
- Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
### Fixed
- Several URI related bugs in `Wgit::Url` affecting crawls.
v0.0.15
### Added
- Support for IRIs (non ASCII based URLs).
### Changed/Removed
- Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.
### Fixed
- Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
v0.0.14
### Added
- `Indexer#index_this_page` method.
### Changed/Removed
- Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
### Fixed