Update the nom parsing blog post

The post was based on nom 3 and uses macros that aren't available in nom 4. Additionally, I believe I have made it more self-contained: the previous version used macros defined outside both the blog post and nom. I have also added accompanying tests for many of the functions declared. I believe this update is worthwhile because nom links to the post as documentation for learning nom.
2018-06-24 21:05:17 +01:00 · 2018-06-24 21:05:17 +01:00 · 782be09796
parent 47d2d8faca
commit 782be09796
1 changed files with 360 additions and 96 deletions
--- a/_posts/2015-07-15-omnomnom-parsing-iso8601-dates-using-nom.md
+++ b/_posts/2015-07-15-omnomnom-parsing-iso8601-dates-using-nom.md
@ -50,6 +50,27 @@ The date parts are separated by a dash (`-`) and the time parts by a colon (`:`)
 We will build a small parser for each of these parts and at the end combine them to parse a full date time string.
 ### Boiler Plate
 We will need to make a lib project.
 ~~~bash
 cargo new --lib date_parse
 ~~~
 Edit `Cargo.toml` and `src/lib.rs` so that our project depends on nom.
 ~~~toml
 [dependencies]
 nom = "^4.0"
 ~~~
 ~~~rust
 #[macro_use]
 extern crate nom;
 ~~~
 ### Parsing the date: 2015-07-16
 Let's start with the sign. As we need it several times, we create a dedicated parser for it.
@ -61,6 +82,24 @@ named!(sign <&[u8], i32>, alt!(
        tag!("+") => { |_| 1 }
        )
    );
 #[cfg(test)]
 mod tests {
    use nom::Context::Code;
    use nom::Err::Error;
    use nom::Err::Incomplete;
    use nom::ErrorKind::Alt;
    use nom::Needed::Size;
    use sign;
    #[test]
    fn parse_sign() {
        assert_eq!(sign(b"-"), Ok((&[][..], -1)));
        assert_eq!(sign(b"+"), Ok((&[][..], 1)));
        assert_eq!(sign(b""), Err(Incomplete(Size(1))));
        assert_eq!(sign(b" "), Err(Error(Code(&b" "[..], Alt))));
    }
 }
 ~~~
 First, we parse either a plus or a minus sign.
@ -70,144 +109,262 @@ We can directly map the result of the sub-parsers to either `-1` or `1`, so we d
 Next we parse the year, which consists of an optional sign and 4 digits (I know, I know, it is possible to extend this to more digits, but let's keep it simple for now).
 ~~~rust
-named!(positive_year  <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32));
+use std::ops::{AddAssign, MulAssign};
-named!(pub year <&[u8], i32>, chain!(
+
-        pref: opt!(sign) ~
+fn buf_to_int<T>(s: &[u8]) -> T
-        y:    positive_year
+where
-        ,
+    T: AddAssign + MulAssign + From<u8>,
-        || {
+{
-            pref.unwrap_or(1) * y
+    let mut sum = T::from(0);
-        }));
+    for digit in s {
        sum *= T::from(10);
        sum += T::from(*digit - b'0');
    }
    sum
 }
 named!(positive_year  <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
 named!(pub year <&[u8], i32>, do_parse!(
    pref: opt!(sign) >>
    y: positive_year >>
    (pref.unwrap_or(1) * y)
 ));
 #[cfg(test)]
 mod tests {
    use positive_year;
    use year;
    #[test]
    fn parse_positive_year() {
        assert_eq!(positive_year(b"2018"), Ok((&[][..], 2018)));
    }
    #[test]
    fn parse_year() {
        assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
        assert_eq!(year(b"+2018"), Ok((&[][..], 2018)));
        assert_eq!(year(b"-2018"), Ok((&[][..], -2018)));
    }
 }
 ~~~
 There is a lot of new material here, so let's take it apart.
 ~~~rust
-named!(positive_year  <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32));
+named!(positive_year  <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
 ~~~
 This creates a new named parser that again returns the remaining input and a 32-bit integer.
-To work, it first calls `take_4_digits` and then maps that result to the corresponding integer (using a [small helper function][buftoi32]).
+To work, it first calls `take_while_m_n!` to take exactly 4 digits and then maps that result to the corresponding integer using `buf_to_int`.
-`take_4_digits` is another small helper parser. We also got one for 2 digits:
+`take_while_m_n!` is a macro provided by nom. We will also use it for 2 digits:
 ~~~rust
-named!(pub take_4_digits, flat_map!(take!(4), check!(is_digit)));
+take_while_m_n!(4, 4, nom::is_digit)
-named!(pub take_2_digits, flat_map!(take!(2), check!(is_digit)));
+take_while_m_n!(2, 2, nom::is_digit)
 ~~~
 This takes 4 (or 2) characters from the input and checks that each character is a digit.
 `flat_map!` and `check!` are quite generic, so they are useful for a lot of cases.
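What `take_while_m_n!(4, 4, nom::is_digit)` does can be pictured without nom at all. The following is only a sketch under simplifying assumptions: `take_n_digits` is a hypothetical helper, not part of nom, and it collapses every failure into `None`, whereas nom 4 distinguishes an error from `Incomplete`.

```rust
/// Hypothetical helper: splits off exactly `n` leading ASCII digits.
/// Returns `(rest, digits)`, mirroring nom's (remaining input, output) order.
/// Simplification: too-short input and non-digit input both yield `None`.
fn take_n_digits(input: &[u8], n: usize) -> Option<(&[u8], &[u8])> {
    if input.len() < n {
        return None;
    }
    let (digits, rest) = input.split_at(n);
    if digits.iter().all(|b| b.is_ascii_digit()) {
        Some((rest, digits))
    } else {
        None
    }
}
```

Calling `take_n_digits(b"2015-07", 4)` splits the input into the digits `2015` and the remainder `-07`, which the next parser in the sequence would then consume.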
 ~~~rust
-named!(pub year <&[u8], i32>, chain!(
+named!(pub year <&[u8], i32>, do_parse!(
 ~~~
 The year is also returned as a 32-bit integer (there's a pattern!).
-Using the `chain!` macro, we can chain together multiple parsers and work with the sub-results.
+Using the `do_parse!` macro, we can chain together multiple parsers and work with the sub-results.
 ~~~rust
-        pref: opt!(sign) ~
+    pref: opt!(sign) >>
-        y:    positive_year
+    y: positive_year >>
 ~~~
 Our sign is directly followed by 4 digits. It's optional though, that's why we use `opt!`.
-`~` is the concatenation operator in the `chain!` macro.
+`>>` is the concatenation operator in the `do_parse!` macro.
 We save the sub-results to variables (`pref` and `y`).
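To make the sequencing concrete, here is roughly what "optional sign, then four digits, then combine" looks like when written by hand. This is a sketch, not what the nom macros expand to: `parse_year` is a hypothetical function and `None` stands in for every kind of parse failure.

```rust
/// Hand-written equivalent of "optional sign, then four digits".
/// Returns `(rest, year)`; `None` stands in for any parse failure.
fn parse_year(input: &[u8]) -> Option<(&[u8], i32)> {
    // Step 1: the optional sign. If absent, consume nothing and default to 1.
    let (pref, rest) = match input.first() {
        Some(&b'-') => (-1, &input[1..]),
        Some(&b'+') => (1, &input[1..]),
        _ => (1, input),
    };
    // Step 2: exactly four digits must follow.
    if rest.len() < 4 || !rest[..4].iter().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let y = rest[..4]
        .iter()
        .fold(0i32, |acc, d| acc * 10 + i32::from(d - b'0'));
    // Final step: combine the sub-results into the output value.
    Some((&rest[4..], pref * y))
}
```

Each step hands its remaining input to the next one, which is the bookkeeping the macro takes care of.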
 ~~~rust
-        ,
+    (pref.unwrap_or(1) * y)
        || {
            pref.unwrap_or(1) * y
        }));
 ~~~
 To get the final result, we multiply the prefix (which comes back as either `1` or `-1`) with the year.
 Don't forget the `,` (comma) right before the closure.
 This is a small syntactic hint for the `chain!` macro that the mapping function will follow and no more parsers.
 We can now successfully parse a year:
 ~~~rust
-assert_eq!(Done(&[][..], 2015), year(b"2015"));
+        assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
-assert_eq!(Done(&[][..], -0333), year(b"-0333"));
+        assert_eq!(year(b"-0333"), Ok((&[][..], -0333)));
 ~~~
-Our nom parser will return an `IResult`. If all went well, we get `Done(I,O)` with `I` and `O` being the appropriate types.
+Our nom parser will return an `IResult`.
 ~~~rust
 type IResult<I, O, E = u32> = Result<(I, O), Err<I, E>>;
 pub enum Err<I, E = u32> {
    Incomplete(Needed),
    Error(Context<I, E>),
    Failure(Context<I, E>),
 }
 ~~~
 If all went well, we get `Ok((I, O))` with `I` and `O` being the appropriate types.
 For our case `I` is the same as the input, a buffer slice (`&[u8]`), and `O` is the output of the parser itself, an integer (`i32`).
-The return value could also be an `Error(Err)`, if something went completely wrong, or `Incomplete(u32)`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters input).
+The return value could also be an `Err(Error(..))` or `Err(Failure(..))`, if the input did not match or something went completely wrong, or an `Err(Incomplete(Needed))`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters input).
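The three outcomes can be illustrated with a small model that does not depend on nom. `ParseErr` and `four_digit_year` below are hypothetical stand-ins, much simpler than nom's real `Err` and `Needed` types; in particular, the number carried by `Incomplete` is an assumption of this sketch (bytes still missing), not a statement about what nom reports.

```rust
// Simplified stand-in for nom's error type, for illustration only.
#[derive(Debug, PartialEq)]
enum ParseErr {
    /// More input is required; carries the number of bytes still missing.
    Incomplete(usize),
    /// Input was present but did not match.
    Error,
}

fn four_digit_year(input: &[u8]) -> Result<(&[u8], i32), ParseErr> {
    // A non-digit among the bytes we do have is a mismatch.
    if input.iter().take(4).any(|b| !b.is_ascii_digit()) {
        return Err(ParseErr::Error);
    }
    // Only digits so far, but fewer than four: ask for more data.
    if input.len() < 4 {
        return Err(ParseErr::Incomplete(4 - input.len()));
    }
    let (digits, rest) = input.split_at(4);
    let year = digits
        .iter()
        .fold(0, |acc, d| acc * 10 + i32::from(d - b'0'));
    Ok((rest, year))
}
```

A caller can then `match` on the result: use the value on `Ok`, fetch more bytes on `Incomplete`, and give up or try an alternative on `Error`.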
 Parsing the month and day is a bit easier now: we simply take the digits and map them to an integer:
 ~~~rust
-named!(pub month <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
+named!(month <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
-named!(pub day   <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
+named!(day   <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
 #[cfg(test)]
 mod tests {
    use day;
    use month;
    #[test]
    fn parse_month() {
        assert_eq!(month(b"06"), Ok((&[][..], 06)));
    }
    #[test]
    fn parse_day() {
        assert_eq!(day(b"18"), Ok((&[][..], 18)));
    }
 }
 ~~~
 All that's left is combining these 3 parts to parse a full date.
 Again we can chain the different parsers and map it to some useful value:
 ~~~rust
-named!(pub date <&[u8], Date>, chain!(
+#[derive(Eq, PartialEq, Debug)]
-        y: year      ~
+pub struct Date {
-           tag!("-") ~
+    year: i32,
-        m: month     ~
+    month: u8,
-           tag!("-") ~
+    day: u8,
        d: day
        ,
        || { Date{ year: y, month: m, day: d }
 }
 named!(pub date <&[u8], Date>, do_parse!(
    year: year >>
    tag!("-") >>
    month: month >>
    tag!("-") >>
    day: day >>
    (Date { year, month, day})
 ));
 #[cfg(test)]
 mod tests {
    use date;
    use Date;
    #[test]
    fn parse_date() {
        assert_eq!(
            Ok((
                &[][..],
                Date {
                    year: 2015,
                    month: 7,
                    day: 16
                }
            )),
            date(b"2015-07-16")
        );
        assert_eq!(
            Ok((
                &[][..],
                Date {
                    year: -333,
                    month: 6,
                    day: 11
                }
            )),
            date(b"-0333-06-11")
        );
    }
 }
 ~~~
-`Date` is a [small struct][datestruct], that can hold the necessary information, just as you would expect.
+And running the tests shows it already works!
 And it already works:
 ~~~rust
 assert_eq!(Done(&[][..], Date{ year: 2015, month: 7, day: 16  }), date(b"2015-07-16"));
 assert_eq!(Done(&[][..], Date{ year: -333, month: 6, day: 11  }), date(b"-0333-06-11"));
 ~~~
 ### Parsing the time: 16:43:52
 Next, we parse the time. The individual parts are really simple, just some digits:
 ~~~rust
-named!(pub hour   <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
+named!(pub hour   <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
-named!(pub minute <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
+named!(pub minute <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
-named!(pub second <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
+named!(pub second <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
 ~~~
 Putting them together becomes a bit more complex, as the `second` part is optional:
 ~~~rust
-named!(pub time <&[u8], Time>, chain!(
+#[derive(Eq, PartialEq, Debug)]
-        h: hour      ~
+pub struct Time {
-           tag!(":") ~
+    hour: u8,
-        m: minute    ~
+    minute: u8,
-        s: empty_or!(chain!(tag!(":") ~ s:second , || { s }))
+    second: u8,
-        ,
+    tz_offset: i32,
        || { Time{ hour: h,
                   minute: m,
                   second: s.unwrap_or(0),
                   tz_offset: 0 }
 }
 named!(pub time <&[u8], Time>, do_parse!(
    hour: hour >>
    tag!(":") >>
    minute: minute >>
    second: opt!(complete!(do_parse!(
        tag!(":") >>
        second: second >>
        (second)
    ))) >>
    (Time {hour, minute, second: second.unwrap_or(0), tz_offset: 0})
 ));
 #[cfg(test)]
 mod tests {
    use time;
    use Time;
    #[test]
    fn parse_time() {
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 52,
                    tz_offset: 0
                }
            )),
            time(b"16:43:52")
        );
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 0,
                    tz_offset: 0
                }
            )),
            time(b"16:43")
        );
    }
 }
 ~~~
-As you can see, even `chain!` parsers can be nested.
+As you can see, even `do_parse!` parsers can be nested.
 The sub-parts then must be mapped once for the inner parser and once into the final value of the outer parser.
-`empty_or!` returns an `Option`. Either `None` if there is no input left or it applies the nested parser. If this parser doesn't fail, `Some(value)` is returned.
+`opt!` returns an `Option`: `None` if the nested parser fails with an error, `Some(value)` if it succeeds. Wrapping the nested parser in `complete!` turns an `Incomplete` result into an error, so `opt!` also yields `None` when the input simply ends.
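The behaviour of the optional `:SS` part can be sketched in plain Rust. The helpers `two_digits` and `optional_seconds` are hypothetical and only approximate the nom macros: on any failure the optional part consumes nothing and yields `None`.

```rust
/// Parses two ASCII digits into a number, returning `(rest, value)`.
fn two_digits(input: &[u8]) -> Option<(&[u8], u8)> {
    if input.len() < 2 || !input[..2].iter().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some((&input[2..], (input[0] - b'0') * 10 + (input[1] - b'0')))
}

/// Optional ":SS" suffix: on any failure, consume nothing and yield `None`.
fn optional_seconds(input: &[u8]) -> (&[u8], Option<u8>) {
    let parsed = match input.first() {
        Some(&b':') => two_digits(&input[1..]),
        _ => None,
    };
    match parsed {
        Some((rest, s)) => (rest, Some(s)),
        None => (input, None),
    }
}
```

The important detail is the fallback branch: when the optional part does not parse, the original input is returned untouched so that a following parser still sees it.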
 Our parser now works for simple time information:
 ~~~rust
 assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second: 52, tz_offset: 0}), time(b"16:43:52"));
 assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second:  0, tz_offset: 0}), time(b"16:43"));
 ~~~
 Our parser now works for simple time information.
 But it leaves out one important bit: the timezone.
 ### Parsing the timezone: +0100
@ -234,12 +391,13 @@ It's a simple `Z` character, which we map to `0`.
 The other case is the sign-separated hour and minute offset.
 ~~~rust
-named!(timezone_hour <&[u8], i32>, chain!(
+named!(timezone_hour <&[u8], i32>, do_parse!(
-        s: sign ~
+    sign: sign >>
-        h: hour ~
+    hour: hour >>
-        m: empty_or!(chain!(tag!(":")? ~ m: minute , || { m }))
+    minute: opt!(complete!(do_parse!(
-        ,
+        opt!(tag!(":")) >> minute: minute >> (minute)
-        || { (s * (h as i32) * 3600) + (m.unwrap_or(0) * 60) as i32 }
+    ))) >>
    ((sign * (hour as i32 * 3600 + minute.unwrap_or(0) as i32 * 60)))
 ));
 ~~~
@ -248,7 +406,7 @@ The minutes are optional (and might be separated using a colon).
 Instead of keeping this as is, we're mapping it to the offset in seconds.
 We will see why later.
-We could also just map it to a tuple like <br>`(s, h, m.unwrap_or(0))` and handle conversion at a later point.
+We could also just map it to a tuple like <br>`(sign, hour, minute.unwrap_or(0))` and handle conversion at a later point.
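If the parser returned such a tuple, the conversion could happen in a separate step. A minimal sketch, assuming a hypothetical `offset_seconds` function and the same types as the parsers above (`i32` for the sign, `u8` for hour and minute):

```rust
/// Converts a `(sign, hour, minute)` tuple into an offset in seconds.
/// The sign applies to the whole offset, hours and minutes alike.
fn offset_seconds((sign, hour, minute): (i32, u8, u8)) -> i32 {
    sign * (i32::from(hour) * 3600 + i32::from(minute) * 60)
}
```

For `-05:30` the tuple would be `(-1, 5, 30)`, which this function maps to `-19800`.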
 Combined we get
@ -256,6 +414,82 @@ Combined we get
 named!(timezone <&[u8], i32>, alt!(timezone_utc | timezone_hour));
 ~~~
 Putting this back into time we get:
 ~~~rust
 named!(pub time <&[u8], Time>, do_parse!(
    hour: hour >>
    tag!(":") >>
    minute: minute >>
    second: opt!(complete!(do_parse!(
        tag!(":") >>
        second: second >>
        (second)
    ))) >>
    tz_offset: opt!(complete!(timezone)) >>
    (Time {hour, minute, second: second.unwrap_or(0), tz_offset: tz_offset.unwrap_or(0)})
 ));
 #[cfg(test)]
 mod tests {
    use time;
    use Time;
    #[test]
    fn parse_time_with_offset() {
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 52,
                    tz_offset: 0
                }
            )),
            time(b"16:43:52Z")
        );
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 0,
                    tz_offset: 5 * 3600
                }
            )),
            time(b"16:43+05")
        );
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 15,
                    tz_offset: 5 * 3600
                }
            )),
            time(b"16:43:15+0500")
        );
        assert_eq!(
            Ok((
                &[][..],
                Time {
                    hour: 16,
                    minute: 43,
                    second: 0,
                    tz_offset: -(5 * 3600 + 30 * 60)
                }
            )),
            time(b"16:43-05:30")
        );
    }
 }
 ~~~
 ### Putting it all together
 We now have individual parsers for the date, the time and the timezone offset.
@ -263,19 +497,51 @@ We now got individual parsers for the date, the time and the timezone offset.
 Putting it all together, our final datetime parser looks quite small and easy to understand:
 ~~~rust
-named!(pub datetime <&[u8], DateTime>, chain!(
+#[derive(Eq, PartialEq, Debug)]
-        d:   date      ~
+pub struct DateTime {
-             tag!("T") ~
+    date: Date,
-        t:   time      ~
+    time: Time,
-        tzo: empty_or!(call!(timezone))
+}
-        ,
+named!(pub datetime <&[u8], DateTime>, do_parse!(
-        || {
+    date: date >>
    tag!("T") >>
    time: time >>
    (
        DateTime{
-                date: d,
+            date,
-                time: t.set_tz(tzo.unwrap_or(0)),
+            time
            }
        }
    )
 ));
 #[cfg(test)]
 mod tests {
    use datetime;
    use DateTime;
    #[test]
    fn parse_datetime() {
        assert_eq!(
            Ok((
                &[][..],
                DateTime {
                    date: Date {
                        year: 2007,
                        month: 08,
                        day: 31
                    },
                    time: Time {
                        hour: 16,
                        minute: 47,
                        second: 22,
                        tz_offset: 5 * 3600
                    }
                }
            )),
            datetime(b"2007-08-31T16:47:22+05:00")
        );
    }
 }
 ~~~
 Nothing special anymore. We can now parse all kinds of date strings:
@ -296,7 +562,7 @@ But this is fine for now. We can handle the actual validation in a later step.
 For example, we could use [chrono][], a time library, [to handle this for us][chrono-convert].
 Using chrono it's obvious why we already multiplied our timezone offset to be in seconds: this time we can just hand it off to chrono as is.
-The full code for this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing
+The full code for the previous version of this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing
 (it checks that the time and date are reasonable values, but it does not check that the date actually exists, for example).
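To give an idea of what "reasonable values" could mean, here is a range check in plain Rust. This is an assumption about the kind of validation meant, not the code from the linked repository; `is_plausible` is a hypothetical function, and the upper bounds for hour and second are choices of this sketch.

```rust
/// Hypothetical range check. It knows nothing about month lengths or
/// leap years, so it accepts dates such as February 31st.
fn is_plausible(month: u8, day: u8, hour: u8, minute: u8, second: u8) -> bool {
    (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour <= 24 // 24 is allowed here for "24:00" (end of day)
        && minute <= 59
        && second <= 60 // 60 leaves room for a leap second
}
```

Checking that a date really exists needs calendar knowledge, which is the part a time library would take over.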
 ### What's left?
@ -336,7 +602,6 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
 [nom]: https://github.com/Geal/nom
 [gcouprie]: https://twitter.com/gcouprie
 [taken]: https://github.com/badboy/iso8601/blob/master/src/macros.rs#L20-L39
 [datestruct]: https://github.com/badboy/iso8601/blob/master/src/lib.rs#L19-23
 [rdb-rs]: http://rdb.fnordig.de/
 [rsedis]: https://github.com/seppo0010/rsedis
 [rdb-rs-nom]: https://github.com/badboy/rdb-rs/tree/nom-parser
@ -348,5 +613,4 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
 [consumer]: https://github.com/Geal/nom#consumers
 [machine]: https://github.com/Geal/machine
 [microstate]: https://github.com/badboy/microstate
 [buftoi32]: https://github.com/badboy/iso8601/blob/master/src/helper.rs#L8
 [read]: http://doc.rust-lang.org/nightly/std/io/trait.Read.html