
Update the nom parsing blog post

It used to be based on nom 3.
It uses macros that aren't available in nom 4.

Additionally, I believe I have made it more self-contained:
the previous version used macros defined outside both the blog post and
nom.

I have also added accompanying tests for many of the functions declared.

I believe it is worthwhile updating this.
Nom links to it as documentation for learning nom.
Chris Couzens 2018-06-24 21:05:17 +01:00
parent 47d2d8faca
commit 782be09796


@ -50,6 +50,27 @@ The date parts are separated by a dash (`-`) and the time parts by a colon (`:`)
We will build a small parser for each of these parts and at the end combine them to parse a full date time string.
### Boilerplate
We will need to make a lib project.
~~~bash
cargo new --lib date_parse
~~~
Edit `Cargo.toml` and `src/lib.rs` so that our project depends on nom.
~~~toml
[dependencies]
nom = "^4.0"
~~~
~~~rust
#[macro_use]
extern crate nom;
~~~
### Parsing the date: 2015-07-16
Let's start with the sign. As we need it several times, we create a separate parser for it.
@ -61,6 +82,24 @@ named!(sign <&[u8], i32>, alt!(
tag!("+") => { |_| 1 } tag!("+") => { |_| 1 }
) )
); );
#[cfg(test)]
mod tests {
use nom::Context::Code;
use nom::Err::Error;
use nom::Err::Incomplete;
use nom::ErrorKind::Alt;
use nom::Needed::Size;
use sign;
#[test]
fn parse_sign() {
assert_eq!(sign(b"-"), Ok((&[][..], -1)));
assert_eq!(sign(b"+"), Ok((&[][..], 1)));
assert_eq!(sign(b""), Err(Incomplete(Size(1))));
assert_eq!(sign(b" "), Err(Error(Code(&b" "[..], Alt))));
}
}
~~~ ~~~
First, we parse either a plus or a minus sign.
@ -70,144 +109,262 @@ We can directly map the result of the sub-parsers to either `-1` or `1`, so we d
Next we parse the year, which consists of an optional sign and 4 digits (I know, I know, it is possible to extend this to more digits, but let's keep it simple for now).
~~~rust ~~~rust
named!(positive_year <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32)); use std::ops::{AddAssign, MulAssign};
named!(pub year <&[u8], i32>, chain!(
pref: opt!(sign) ~ fn buf_to_int<T>(s: &[u8]) -> T
y: positive_year where
, T: AddAssign + MulAssign + From<u8>,
|| { {
pref.unwrap_or(1) * y let mut sum = T::from(0);
})); for digit in s {
sum *= T::from(10);
sum += T::from(*digit - b'0');
}
sum
}
named!(positive_year <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
named!(pub year <&[u8], i32>, do_parse!(
pref: opt!(sign) >>
y: positive_year >>
(pref.unwrap_or(1) * y)
));
#[cfg(test)]
mod tests {
use positive_year;
use year;
#[test]
fn parse_positive_year() {
assert_eq!(positive_year(b"2018"), Ok((&[][..], 2018)));
}
#[test]
fn parse_year() {
assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
assert_eq!(year(b"+2018"), Ok((&[][..], 2018)));
assert_eq!(year(b"-2018"), Ok((&[][..], -2018)));
}
}
~~~ ~~~
There is a lot going on here, so let's take it apart.
~~~rust ~~~rust
named!(positive_year <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32)); named!(positive_year <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
~~~ ~~~
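The `buf_to_int` helper that `positive_year` maps over is introduced by this commit, but it is hard to read in the side-by-side rendering. Reassembled from the added lines, it should look roughly like this (only the formatting and the comments are new):

```rust
use std::ops::{AddAssign, MulAssign};

/// Converts a buffer of ASCII digits into any integer type that can be
/// built from a `u8` and supports `*=` and `+=`.
/// The caller is expected to pass digits only; nothing is validated here.
fn buf_to_int<T>(s: &[u8]) -> T
where
    T: AddAssign + MulAssign + From<u8>,
{
    let mut sum = T::from(0);
    for digit in s {
        // Shift the digits seen so far one decimal place to the left,
        // then add the numeric value of the current ASCII digit.
        sum *= T::from(10);
        sum += T::from(*digit - b'0');
    }
    sum
}
```

For example, `buf_to_int::<i32>(b"2018")` gives `2018` and `buf_to_int::<u8>(b"07")` gives `7`. Note that the helper does no overflow handling of its own, so it relies on the surrounding parser limiting the number of digits (2 digits always fit in a `u8`, 4 digits always fit in an `i32`).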
This creates a new named parser, that again returns the remaining input and a 32-bit integer.
To work, it first calls `take_4_digits` and then maps that result to the corresponding integer (using a [small helper function][buftoi32]). To work, it first applies `take_while_m_n!(4, 4, nom::is_digit)` and then maps that result to the corresponding integer using `buf_to_int`.
`take_4_digits` is another small helper parser. We also got one for 2 digits: `take_while_m_n!` is a macro provided by nom. We will also use it for 2 digits:
~~~rust ~~~rust
named!(pub take_4_digits, flat_map!(take!(4), check!(is_digit))); take_while_m_n!(4, 4, nom::is_digit)
named!(pub take_2_digits, flat_map!(take!(2), check!(is_digit))); take_while_m_n!(2, 2, nom::is_digit)
~~~ ~~~
This takes 4 (or 2) characters from the input and checks that each character is a digit.
`flat_map!` and `check!` are quite generic, so they are useful for a lot of cases.
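To make "take exactly n characters and check that each one is a digit" concrete, here is a minimal sketch in plain Rust, without nom. The function name `take_n_digits` and the `Option` return type are assumptions made for this illustration. nom's real parsers return an `IResult` and distinguish more failure cases.

```rust
/// Splits `n` leading ASCII digits off `input`.
/// Returns `(remaining, digits)`, or `None` if fewer than `n` bytes are
/// available or one of the first `n` bytes is not a digit.
/// (Hypothetical helper for illustration, not part of nom.)
fn take_n_digits(input: &[u8], n: usize) -> Option<(&[u8], &[u8])> {
    if input.len() < n {
        return None;
    }
    let (digits, rest) = input.split_at(n);
    if digits.iter().all(|b| b.is_ascii_digit()) {
        Some((rest, digits))
    } else {
        None
    }
}
```

Calling `take_n_digits(b"2015-07", 4)` yields the digits `2015` and leaves `-07` as remaining input, while `take_n_digits(b"20a5", 4)` yields `None`.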
~~~rust ~~~rust
named!(pub year <&[u8], i32>, chain!( named!(pub year <&[u8], i32>, do_parse!(
~~~ ~~~
The year is also returned as a 32-bit integer (there's a pattern!).
Using the `chain!` macro, we can chain together multiple parsers and work with the sub-results. Using the `do_parse!` macro, we can chain together multiple parsers and work with the sub-results.
~~~rust ~~~rust
pref: opt!(sign) ~ pref: opt!(sign) >>
y: positive_year y: positive_year >>
~~~ ~~~
Our sign is directly followed by 4 digits. It's optional though, that's why we use `opt!`.
`~` is the concatenation operator in the `chain!` macro. `>>` is the concatenation operator in the `do_parse!` macro.
We save the sub-results to variables (`pref` and `y`).
~~~rust ~~~rust
, (pref.unwrap_or(1) * y)
|| {
pref.unwrap_or(1) * y
}));
~~~ ~~~
To get the final result, we multiply the prefix (which comes back as either `1` or `-1`) by the year.
Don't forget the `,` (comma) right before the closure.
This is a small syntactic hint for the `chain!` macro that the mapping function will follow and no more parsers.
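The same sequence (optional sign, then four digits, then multiply) can be written by hand, which may help when reading the macro version. This is only a sketch under simplifying assumptions: `parse_year` is a made-up name, and failure is reported as `None` instead of nom's error types.

```rust
/// Hand-written stand-in for the `year` parser: an optional sign followed
/// by exactly four ASCII digits. Returns `(remaining, year)`.
/// (Hypothetical function for illustration, not generated by nom.)
fn parse_year(input: &[u8]) -> Option<(&[u8], i32)> {
    // Optional sign: fall back to +1 when neither `+` nor `-` is present,
    // which mirrors `pref.unwrap_or(1)`.
    let (sign, rest) = match input.first() {
        Some(b'-') => (-1, &input[1..]),
        Some(b'+') => (1, &input[1..]),
        _ => (1, input),
    };
    if rest.len() < 4 || !rest[..4].iter().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Fold the four digits into a number, most significant digit first.
    let magnitude = rest[..4]
        .iter()
        .fold(0i32, |acc, &d| acc * 10 + i32::from(d - b'0'));
    Some((&rest[4..], sign * magnitude))
}
```

With this, `parse_year(b"-0333")` returns the value `-333` with nothing left over, and `parse_year(b"2015-07-16")` returns `2015` with `-07-16` remaining for the next parser.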
We can now successfully parse a year: We can now successfully parse a year:
~~~rust ~~~rust
assert_eq!(Done(&[][..], 2015), year(b"2015")); assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
assert_eq!(Done(&[][..], -0333), year(b"-0333")); assert_eq!(year(b"-0333"), Ok((&[][..], -0333)));
~~~ ~~~
Our nom parser will return an `IResult`. If all went well, we get `Done(I,O)` with `I` and `O` being the appropriate types. Our nom parser will return an `IResult`.
~~~rust
type IResult<I, O, E = u32> = Result<(I, O), Err<I, E>>;
pub enum Err<I, E = u32> {
Incomplete(Needed),
Error(Context<I, E>),
Failure(Context<I, E>),
}
~~~
If all went well, we get `Ok((I, O))` with `I` and `O` being the appropriate types.
For our case `I` is the same as the input, a buffer slice (`&[u8]`), and `O` is the output of the parser itself, an integer (`i32`).
The return value could also be an `Error(Err)`, if something went completely wrong, or `Incomplete(u32)`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters input). The return value could also be `Err(Error(..))` if the input did not match, `Err(Failure(..))` if something went unrecoverably wrong, or `Err(Incomplete(Needed))`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters of input).
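A caller usually handles these outcomes with a `match`. The sketch below shows that shape with stand-in types so that it compiles without nom: `ParseError`, `ParseResult`, `two_digits` and `describe` are all invented for this example, and the unrecoverable `Failure` case is left out for brevity.

```rust
/// Simplified stand-in for nom's error type (assumption for illustration).
#[derive(Debug, PartialEq)]
enum ParseError {
    /// Not enough input yet; carries the number of missing bytes.
    Incomplete(usize),
    /// The input did not match what the parser expects.
    Error,
}

/// Same shape as `IResult`: remaining input plus output, or an error.
type ParseResult<'a, O> = Result<(&'a [u8], O), ParseError>;

/// Parses exactly two ASCII digits into a `u8`.
fn two_digits<'a>(input: &'a [u8]) -> ParseResult<'a, u8> {
    if input.len() < 2 {
        return Err(ParseError::Incomplete(2 - input.len()));
    }
    let (digits, rest) = input.split_at(2);
    if !digits.iter().all(|b| b.is_ascii_digit()) {
        return Err(ParseError::Error);
    }
    Ok((rest, (digits[0] - b'0') * 10 + (digits[1] - b'0')))
}

/// Turns each possible outcome into a human-readable message.
fn describe(input: &[u8]) -> String {
    match two_digits(input) {
        Ok((rest, value)) => format!("parsed {} with {} bytes left", value, rest.len()),
        Err(ParseError::Incomplete(n)) => format!("need {} more byte(s)", n),
        Err(ParseError::Error) => String::from("not a two-digit number"),
    }
}
```

The point is only the structure: success carries both the remaining input and the value, and "ran out of input" is reported separately from "input did not match".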
Parsing the month and day is a bit easier now: we simply take the digits and map them to an integer:
~~~rust ~~~rust
named!(pub month <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32)); named!(month <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(pub day <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32)); named!(day <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
#[cfg(test)]
mod tests {
use day;
use month;
#[test]
fn parse_month() {
assert_eq!(month(b"06"), Ok((&[][..], 06)));
}
#[test]
fn parse_day() {
assert_eq!(day(b"18"), Ok((&[][..], 18)));
}
}
~~~ ~~~
All that's left is combining these 3 parts to parse a full date.
Again we can chain the different parsers and map it to some useful value:
~~~rust ~~~rust
named!(pub date <&[u8], Date>, chain!( #[derive(Eq, PartialEq, Debug)]
y: year ~ pub struct Date {
tag!("-") ~ year: i32,
m: month ~ month: u8,
tag!("-") ~ day: u8,
d: day }
,
|| { Date{ year: y, month: m, day: d } named!(pub date <&[u8], Date>, do_parse!(
} year: year >>
)); tag!("-") >>
month: month >>
tag!("-") >>
day: day >>
(Date { year, month, day})
));
#[cfg(test)]
mod tests {
use date;
use Date;
#[test]
fn parse_date() {
assert_eq!(
Ok((
&[][..],
Date {
year: 2015,
month: 7,
day: 16
}
)),
date(b"2015-07-16")
);
assert_eq!(
Ok((
&[][..],
Date {
year: -333,
month: 6,
day: 11
}
)),
date(b"-0333-06-11")
);
}
}
~~~ ~~~
`Date` is a [small struct][datestruct], that can hold the necessary information, just as you would expect. And running the tests shows it already works!
And it already works:
~~~rust
assert_eq!(Done(&[][..], Date{ year: 2015, month: 7, day: 16 }), date(b"2015-07-16"));
assert_eq!(Done(&[][..], Date{ year: -333, month: 6, day: 11 }), date(b"-0333-06-11"));
~~~
### Parsing the time: 16:43:52
Next, we parse the time. The individual parts are really simple, just some digits:
~~~rust ~~~rust
named!(pub hour <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32)); named!(pub hour <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(pub minute <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32)); named!(pub minute <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(pub second <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32)); named!(pub second <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
~~~ ~~~
Putting them together becomes a bit more complex, as the `second` part is optional:
~~~rust ~~~rust
named!(pub time <&[u8], Time>, chain!( #[derive(Eq, PartialEq, Debug)]
h: hour ~ pub struct Time {
tag!(":") ~ hour: u8,
m: minute ~ minute: u8,
s: empty_or!(chain!(tag!(":") ~ s:second , || { s })) second: u8,
, tz_offset: i32,
|| { Time{ hour: h, }
minute: m,
second: s.unwrap_or(0), named!(pub time <&[u8], Time>, do_parse!(
tz_offset: 0 } hour: hour >>
} tag!(":") >>
)); minute: minute >>
second: opt!(complete!(do_parse!(
tag!(":") >>
second: second >>
(second)
))) >>
(Time {hour, minute, second: second.unwrap_or(0), tz_offset: 0})
));
#[cfg(test)]
mod tests {
use time;
use Time;
#[test]
fn parse_time() {
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 52,
tz_offset: 0
}
)),
time(b"16:43:52")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: 0
}
)),
time(b"16:43")
);
}
}
~~~ ~~~
As you can see, even `chain!` parsers can be nested. As you can see, even `do_parse!` parsers can be nested.
The sub-parts then must be mapped once for the inner parser and once into the final value of the outer parser.
`empty_or!` returns an `Option`. Either `None` if there is no input left or it applies the nested parser. If this parser doesn't fail, `Some(value)` is returned. `opt!` returns an `Option`: it applies the nested parser and yields `Some(value)` if that succeeds, or `None` if it fails. Wrapping the nested parser in `complete!` turns running out of input into such a failure, so a missing `second` part yields `None` instead of `Incomplete`.
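The idea of an optional sub-parser can be sketched as an ordinary function. This is not how nom implements `opt!`. It is a simplified illustration in which the names `optional` and `colon_then_two_digits` are invented, and every kind of failure, including running out of input, is collapsed into `None`.

```rust
/// Applies `parser` to `input`. On success the parsed value is wrapped in
/// `Some`; on failure nothing is consumed and `None` is returned.
/// (Hypothetical combinator for illustration, not nom's `opt!`.)
fn optional<'a, O, F>(input: &'a [u8], parser: F) -> (&'a [u8], Option<O>)
where
    F: Fn(&'a [u8]) -> Option<(&'a [u8], O)>,
{
    match parser(input) {
        Some((rest, value)) => (rest, Some(value)),
        None => (input, None),
    }
}

/// Inner parser: a colon followed by exactly two ASCII digits.
fn colon_then_two_digits(input: &[u8]) -> Option<(&[u8], u8)> {
    if input.len() < 3 || input[0] != b':' {
        return None;
    }
    let (a, b) = (input[1], input[2]);
    if !a.is_ascii_digit() || !b.is_ascii_digit() {
        return None;
    }
    Some((&input[3..], (a - b'0') * 10 + (b - b'0')))
}
```

Applied to `b":52"` this yields `Some(52)` with no input left. Applied to `b"Z"` it yields `None` and leaves the `Z` untouched, so a later parser can still consume it and the caller can fall back to a default with `unwrap_or(0)`.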
Our parser now works for simple time information:
~~~rust
assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second: 52, tz_offset: 0}), time(b"16:43:52"));
assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second: 0, tz_offset: 0}), time(b"16:43"));
~~~
Our parser now works for simple time information.
But it leaves out one important bit: the timezone.
### Parsing the timezone: +0100
@ -234,13 +391,14 @@ It's a simple `Z` character, which we map to `0`.
The other case is the sign-separated hour and minute offset. The other case is the sign-separated hour and minute offset.
~~~rust ~~~rust
named!(timezone_hour <&[u8], i32>, chain!( named!(timezone_hour <&[u8], i32>, do_parse!(
s: sign ~ sign: sign >>
h: hour ~ hour: hour >>
m: empty_or!(chain!(tag!(":")? ~ m: minute , || { m })) minute: opt!(complete!(do_parse!(
, opt!(tag!(":")) >> minute: minute >> (minute)
|| { (s * (h as i32) * 3600) + (m.unwrap_or(0) * 60) as i32 } ))) >>
)); ((sign * (hour as i32 * 3600 + minute.unwrap_or(0) as i32 * 60)))
));
~~~ ~~~
We can re-use our already existing parsers and once again chain them to get what we want.
@ -248,7 +406,7 @@ The minutes are optional (and might be separated using a colon).
Instead of keeping this as is, we're mapping it to the offset in seconds.
We will see why later.
We could also just map it to a tuple like <br>`(s, h, m.unwrap_or(0))` and handle conversion at a later point. We could also just map it to a tuple like <br>`(sign, hour, minute.unwrap_or(0))` and handle conversion at a later point.
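The conversion itself is plain arithmetic. As a small worked example, the following function (the name `offset_seconds` is made up for this sketch) computes the offset the same way the updated `timezone_hour` does, with the sign applied to hours and minutes together:

```rust
/// Converts a sign (`1` or `-1`), an hour and an optional minute into a
/// timezone offset in seconds. A missing minute counts as zero.
/// (Hypothetical helper mirroring the expression in `timezone_hour`.)
fn offset_seconds(sign: i32, hour: u8, minute: Option<u8>) -> i32 {
    sign * (i32::from(hour) * 3600 + i32::from(minute.unwrap_or(0)) * 60)
}
```

For `+05` this gives `5 * 3600 = 18000`, and for `-05:30` it gives `-(5 * 3600 + 30 * 60) = -19800`, matching the values expected by the `parse_time_with_offset` test.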
Combined we get Combined we get
@ -256,6 +414,82 @@ Combined we get
named!(timezone <&[u8], i32>, alt!(timezone_utc | timezone_hour)); named!(timezone <&[u8], i32>, alt!(timezone_utc | timezone_hour));
~~~ ~~~
Putting this back into time we get:
~~~rust
named!(pub time <&[u8], Time>, do_parse!(
hour: hour >>
tag!(":") >>
minute: minute >>
second: opt!(complete!(do_parse!(
tag!(":") >>
second: second >>
(second)
))) >>
tz_offset: opt!(complete!(timezone)) >>
(Time {hour, minute, second: second.unwrap_or(0), tz_offset: tz_offset.unwrap_or(0)})
));
#[cfg(test)]
mod tests {
use time;
use Time;
#[test]
fn parse_time_with_offset() {
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 52,
tz_offset: 0
}
)),
time(b"16:43:52Z")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: 5 * 3600
}
)),
time(b"16:43+05")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 15,
tz_offset: 5 * 3600
}
)),
time(b"16:43:15+0500")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: -(5 * 3600 + 30 * 60)
}
)),
time(b"16:43-05:30")
);
}
}
~~~
### Putting it all together
We now have individual parsers for the date, the time and the timezone offset.
@ -263,19 +497,51 @@ We now got individual parsers for the date, the time and the timezone offset.
Putting it all together, our final datetime parser looks quite small and easy to understand:
~~~rust ~~~rust
named!(pub datetime <&[u8], DateTime>, chain!( #[derive(Eq, PartialEq, Debug)]
d: date ~ pub struct DateTime {
tag!("T") ~ date: Date,
t: time ~ time: Time,
tzo: empty_or!(call!(timezone)) }
, named!(pub datetime <&[u8], DateTime>, do_parse!(
|| { date: date >>
DateTime{ tag!("T") >>
date: d, time: time >>
time: t.set_tz(tzo.unwrap_or(0)), (
} DateTime{
date,
time
} }
)); )
));
#[cfg(test)]
mod tests {
use datetime;
use DateTime;
#[test]
fn parse_datetime() {
assert_eq!(
Ok((
&[][..],
DateTime {
date: Date {
year: 2007,
month: 08,
day: 31
},
time: Time {
hour: 16,
minute: 47,
second: 22,
tz_offset: 5 * 3600
}
}
)),
datetime(b"2007-08-31T16:47:22+05:00")
);
}
}
~~~ ~~~
Nothing special anymore. We can now parse all kinds of date strings:
@ -296,7 +562,7 @@ But this is fine for now. We can handle the actual validation in a later step.
For example, we could use [chrono][], a time library, [to handle this for us][chrono-convert].
Using chrono it's obvious why we already multiplied our timezone offset to be in seconds: this time we can just hand it off to chrono as is.
The full code for this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing The full code for the previous version of this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing
(it checks that the time and date are reasonable values, but it does not check that it is a valid date for example)
### What's left?
@ -336,7 +602,6 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
[nom]: https://github.com/Geal/nom
[gcouprie]: https://twitter.com/gcouprie
[taken]: https://github.com/badboy/iso8601/blob/master/src/macros.rs#L20-L39
[datestruct]: https://github.com/badboy/iso8601/blob/master/src/lib.rs#L19-23
[rdb-rs]: http://rdb.fnordig.de/
[rsedis]: https://github.com/seppo0010/rsedis
[rdb-rs-nom]: https://github.com/badboy/rdb-rs/tree/nom-parser
@ -348,5 +613,4 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
[consumer]: https://github.com/Geal/nom#consumers
[machine]: https://github.com/Geal/machine
[microstate]: https://github.com/badboy/microstate
[buftoi32]: https://github.com/badboy/iso8601/blob/master/src/helper.rs#L8
[read]: http://doc.rust-lang.org/nightly/std/io/trait.Read.html