@@ -637,19 +637,175 @@ of the day that the ``Date`` refers to in UTC.
637637Regular Expressions
638638===================
639639
640- Ruby regular expressions always have BSON regular expressions' equivalent of
641- 'm' flag on. In order for behavior to be preserved between the two, the 'm'
642- option is always added when a Ruby regular expression is serialized to BSON.
640+ Both MongoDB and Ruby provide facilities for working with regular expressions,
641+ but they use different regular expression engines. The following subsections detail the
642+ differences between Ruby regular expressions and MongoDB regular expressions
643+ and describe how to work with both.
644+
645+ Ruby vs MongoDB Regular Expressions
646+ -----------------------------------
647+
648+ The MongoDB server uses `Perl-compatible regular expressions implemented using
649+ the PCRE library <http://pcre.org/>`_, whereas `Ruby regular expressions
650+ <http://ruby-doc.org/core/Regexp.html>`_ are implemented using the
651+ `Onigmo regular expression engine <https://github.com/k-takata/Onigmo>`_,
652+ which is a fork of `Oniguruma <https://github.com/kkos/oniguruma>`_.
653+ The two regular expression implementations generally provide equivalent
654+ functionality but have several important syntax differences, as described
655+ below.
656+
657+ Unfortunately, there is no simple way to programmatically convert a PCRE
658+ regular expression into the equivalent Ruby regular expression,
659+ and there are currently no Ruby bindings for PCRE.
660+
661+ Options / Flags / Modifiers
662+ ```````````````````````````
663+
664+ Both Ruby and PCRE regular expressions support modifiers. These are
665+ also called "options" in Ruby parlance and "flags" in PCRE parlance.
666+ The meaning of ``s`` and ``m`` modifiers differs in Ruby and PCRE:
667+
668+ - Ruby does not have the ``s`` modifier. Instead, the Ruby ``m`` modifier
669+ performs the same function as the PCRE ``s`` modifier, which is to make the
670+ period (``.``) match any character including newlines. Confusingly, the
671+ Ruby documentation refers to the ``m`` modifier as "enabling multi-line mode".
672+ - Ruby always operates in the equivalent of PCRE's multi-line mode, enabled by
673+ the ``m`` modifier in PCRE regular expressions. In Ruby the ``^`` anchor
674+ always refers to the beginning of line and the ``$`` anchor always refers
675+ to the end of line.
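
The two points above can be checked with plain Ruby. The following is a
minimal sketch that uses only the core ``Regexp`` class; the variable names
are illustrative and are not part of bson-ruby:

.. code-block:: ruby

  text = "hello\nworld"

  # Without the m option, "." does not match a newline.
  no_dotall = /hello.world/.match?(text)
  # => false

  # With Ruby's m option (the PCRE s modifier), "." also matches a newline.
  dotall = /hello.world/m.match?(text)
  # => true

  # "^" matches at the start of every line, with or without options.
  line_start = (/^world/ =~ text)
  # => 6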
676+
677+ When writing regular expressions intended to be used in both Ruby and
678+ PCRE environments (including MongoDB server and most other MongoDB drivers),
679+ henceforth referred to as "portable regular expressions", avoid using
680+ the ``^`` and ``$`` anchors. The following sections provide workarounds and
681+ recommendations for authoring portable regular expressions.
682+
683+ ``^`` Anchor
684+ ````````````
685+
686+ In Ruby regular expressions, the ``^`` anchor always refers to the beginning
687+ of line. In PCRE regular expressions, the ``^`` anchor refers to the beginning
688+ of input by default and the ``m`` flag changes its meaning to the beginning
689+ of line.
690+
691+ Both Ruby and PCRE regular expressions support the ``\A`` anchor to refer to
692+ the beginning of input, regardless of modifiers.
693+
694+ When writing portable regular expressions:
695+
696+ - Use the ``\A`` anchor to refer to the beginning of input.
697+ - Use the ``^`` anchor to refer to the beginning of line (this requires
698+ setting the ``m`` flag in PCRE regular expressions). Alternatively, use
699+ one of the following constructs, which work regardless of modifiers:
700+ - ``(?:\A|(?<=\n))`` (handles LF and CR+LF line ends)
701+ - ``(?:\A|(?<=[\r\n]))`` (handles CR, LF and CR+LF line ends)
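
As a rough illustration of the recommendations above, the following sketch
compares ``^``, ``\A`` and the first portable construct in Ruby. It covers
only the Ruby side; the PCRE behavior is as described in the text. The
variable names are illustrative:

.. code-block:: ruby

  text = "42\nhello world"

  # "^" matches at the start of any line in Ruby.
  caret = (/^hello/ =~ text)
  # => 3

  # "\A" matches only at the start of the input.
  start_of_input = (/\Ahello/ =~ text)
  # => nil

  # The portable construct matches at the start of any line...
  portable_second_line = (/(?:\A|(?<=\n))hello/ =~ text)
  # => 3

  # ...including the first line of the input.
  portable_first_line = (/(?:\A|(?<=\n))42/ =~ text)
  # => 0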
702+
703+ ``$`` Anchor
704+ ````````````
705+
706+ In Ruby regular expressions, the ``$`` anchor always refers to the end
707+ of line. In PCRE regular expressions, the ``$`` anchor refers to the end
708+ of input by default and the ``m`` flag changes its meaning to the end
709+ of line.
710+
711+ Both Ruby and PCRE regular expressions support the ``\z`` anchor to refer to
712+ the end of input, regardless of modifiers.
713+
714+ When writing portable regular expressions:
715+
716+ - Use the ``\z`` anchor to refer to the end of input.
717+ - Use the ``$`` anchor to refer to the end of line (this requires
718+ setting the ``m`` flag in PCRE regular expressions). Alternatively, use
719+ one of the following constructs, which work regardless of modifiers:
720+ - ``(?:\z|(?=\n))`` (handles LF and CR+LF line ends)
721+ - ``(?:\z|(?=[\r\n]))`` (handles CR, LF and CR+LF line ends)
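
The end-of-line and end-of-input anchors can be compared in Ruby in the same
way. This is a minimal sketch covering only the Ruby side, with illustrative
variable names:

.. code-block:: ruby

  text = "hello world\n42"

  # "$" matches at the end of any line in Ruby.
  dollar = (/world$/ =~ text)
  # => 6

  # "\z" matches only at the end of the input.
  end_of_input = (/world\z/ =~ text)
  # => nil

  # The portable construct matches at the end of any line...
  portable_first_line = (/world(?:\z|(?=\n))/ =~ text)
  # => 6

  # ...including the last line of the input.
  portable_last_line = (/42(?:\z|(?=\n))/ =~ text)
  # => 12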
643722
644- There is a class provided by the bson gem, ``Regexp::Raw``, to allow Ruby users
645- to get around this. You can simply create a regular expression like this:
723+ ``BSON::Regexp::Raw`` Class
724+ ---------------------------
725+
726+ Since there is no simple way to programmatically convert a PCRE
727+ regular expression into the equivalent Ruby regular expression,
728+ bson-ruby provides the ``BSON::Regexp::Raw`` class for holding MongoDB/PCRE
729+ regular expressions. Instances of this class are called "BSON regular
730+ expressions" in this documentation.
731+
732+ Instances of this class can be created using the regular expression text
733+ as a string and optional PCRE modifiers:
734+
735+ .. code-block:: ruby
736+
737+ BSON::Regexp::Raw.new("^b403158")
738+ # => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
739+
740+ BSON::Regexp::Raw.new("^Hello.world$", "s")
741+ # => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
742+
743+ The ``BSON::Regexp`` module is included in the Ruby ``Regexp`` class, so that
744+ the ``BSON::`` prefix may be omitted:
646745
647746.. code-block:: ruby
648747
649748 Regexp::Raw.new("^b403158")
749+ # => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
750+
751+ Regexp::Raw.new("^Hello.world$", "s")
752+ # => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
753+
754+ Regular Expression Conversion
755+ -----------------------------
650756
651- This code example illustrates the difference between serializing a core Ruby
652- ``Regexp`` versus a ``Regexp::Raw`` object:
757+ To convert a Ruby regular expression to a BSON regular expression,
758+ instantiate a ``BSON::Regexp::Raw`` object as follows:
759+
760+ .. code-block:: ruby
761+
762+ regexp = /^Hello.world/
763+ bson_regexp = BSON::Regexp::Raw.new(regexp.source, regexp.options)
764+ # => #<BSON::Regexp::Raw:0x000055df62e42d60 @pattern="^Hello.world", @options=0>
765+
766+ Note that the ``BSON::Regexp::Raw`` constructor accepts both the Ruby numeric
767+ options and the PCRE modifier strings.
768+
769+ To convert a BSON regular expression to a Ruby regular expression, call the
770+ ``compile`` method on the BSON regular expression:
771+
772+ .. code-block:: ruby
773+
774+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "s")
775+ bson_regexp.compile
776+ # => /^hello.world/m
777+
778+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "")
779+ bson_regexp.compile
780+ # => /^hello.world/
781+
782+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "m")
783+ bson_regexp.compile
784+ # => /^hello.world/
785+
786+ Note that the ``s`` PCRE modifier was converted to the ``m`` Ruby modifier
787+ in the first example, and the last two examples were converted to the same
788+ regular expression even though the original BSON regular expressions had
789+ different meanings.
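
The modifier mapping described above can be sketched in plain Ruby. The
constant ``PCRE_TO_RUBY_OPTIONS`` and the helper ``pcre_to_ruby_regexp``
are hypothetical names introduced for illustration only; this is not how
bson-ruby implements ``compile``, and it assumes that unknown modifiers can
simply be ignored:

.. code-block:: ruby

  # Hypothetical mapping from PCRE modifier letters to Ruby option bits.
  PCRE_TO_RUBY_OPTIONS = {
    "i" => Regexp::IGNORECASE,
    "s" => Regexp::MULTILINE, # PCRE "s" (dot matches newline) is Ruby "m"
    "x" => Regexp::EXTENDED,
    "m" => 0,                 # Ruby anchors are always line-based
  }.freeze

  # Hypothetical helper, not part of bson-ruby. Unknown modifiers are ignored.
  def pcre_to_ruby_regexp(pattern, modifiers)
    options = modifiers.each_char.reduce(0) do |bits, flag|
      bits | PCRE_TO_RUBY_OPTIONS.fetch(flag, 0)
    end
    Regexp.new(pattern, options)
  end

  dotall = pcre_to_ruby_regexp("^hello.world", "s")
  multiline = pcre_to_ruby_regexp("^hello.world", "m")
  plain = pcre_to_ruby_regexp("^hello.world", "")

In this sketch ``multiline`` and ``plain`` compare equal, which mirrors the
loss of information noted above: the PCRE ``m`` modifier has no separate
representation in a Ruby regular expression.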
790+
791+ When a BSON regular expression uses the non-portable ``^`` and ``$``
792+ anchors, its conversion to a Ruby regular expression can change its meaning:
793+
794+ .. code-block:: ruby
795+
796+ BSON::Regexp::Raw.new("^hello.world", "").compile =~ "42\nhello world"
797+ # => 3
798+
799+ When a Ruby regular expression is converted to a BSON regular expression
800+ (for example, to send to the server as part of a query), the BSON regular
801+ expression always has the ``m`` modifier set reflecting the behavior of
802+ ``^`` and ``$`` anchors in Ruby regular expressions.
803+
804+ Reading and Writing
805+ -------------------
806+
807+ Both Ruby and BSON regular expressions implement the ``to_bson`` method
808+ for serialization to BSON:
653809
654810.. code-block:: ruby
655811
@@ -659,27 +815,31 @@ This code example illustrates the difference between serializing a core Ruby
659815 # => #<BSON::ByteBuffer:0x007fcf20ab8028>
660816 _.to_s
661817 # => "^b403158\x00m\x00"
818+
662819 regexp_raw = Regexp::Raw.new("^b403158")
663820 # => #<BSON::Regexp::Raw:0x007fcf21808f98 @pattern="^b403158", @options="">
664821 regexp_raw.to_bson
665822 # => #<BSON::ByteBuffer:0x007fcf213622f0>
666823 _.to_s
667824 # => "^b403158\x00\x00"
668825
669-
670- Please use the ``Regexp::Raw`` class to instantiate your BSON regular
671- expressions to get the exact pattern and options you want.
672-
673- When regular expressions are deserialized, they return a wrapper that holds the
674- raw regex string, but do not compile it. In order to get the Ruby ``Regexp``
675- object, one must call ``compile`` on the returned object.
826+ Both the ``Regexp`` and ``BSON::Regexp::Raw`` classes implement the ``from_bson``
827+ class method, which deserializes a regular expression from a BSON byte buffer.
828+ Both methods return a ``BSON::Regexp::Raw`` instance, which
829+ must be converted to a Ruby regular expression using the ``compile`` method
830+ as described above.
676831
677832.. code-block:: ruby
678833
834+ byte_buffer = BSON::ByteBuffer.new("^b403158\x00\x00")
679835 regex = Regexp.from_bson(byte_buffer)
680- regex.pattern #=> Returns the pattern as a string.
681- regex.options #=> Returns the raw options as a String.
682- regex.compile #=> Returns the compiled Ruby Regexp object.
836+ # => #<BSON::Regexp::Raw:0x000055df63100d40 @pattern="^b403158", @options="">
837+ regex.pattern
838+ # => "^b403158"
839+ regex.options
840+ # => ""
841+ regex.compile
842+ # => /^b403158/
683843
684844
685845Key Order