1
Fork 0

Update the nom parsing blog post

It used to be based on nom 3.
It uses macros that aren't available in nom 4.

Additionally, I believe I have made it more self-contained-
The previous version used macros defined externally to the blog post and
nom.

I have also added accompanying tests for many of the functions declared.

I believe it is worthwhile updating this.
Nom links to it as documentation for learning nom.
This commit is contained in:
Chris Couzens 2018-06-24 21:05:17 +01:00
parent 47d2d8faca
commit 782be09796

View file

@ -50,6 +50,27 @@ The date parts are separated by a dash (`-`) and the time parts by a colon (`:`)
We will built a small parser for each of these parts and at the end combine them to parse a full date time string.
### Boiler Plate
We will need to make a lib project.
~~~bash
cargo new --lib date_parse
~~~
Edit `Cargo.toml` and `src/lib.rs` so that our project depends on nom.
~~~toml
[dependencies]
nom = "^4.0"
~~~
~~~rust
#[macro_use]
extern crate nom;
~~~
### Parsing the date: 2015-07-16
Let's start with the sign. As we need it several times, we create its own parser for that.
@ -61,6 +82,24 @@ named!(sign <&[u8], i32>, alt!(
tag!("+") => { |_| 1 }
)
);
#[cfg(test)]
mod tests {
use nom::Context::Code;
use nom::Err::Error;
use nom::Err::Incomplete;
use nom::ErrorKind::Alt;
use nom::Needed::Size;
use sign;
#[test]
fn parse_sign() {
assert_eq!(sign(b"-"), Ok((&[][..], -1)));
assert_eq!(sign(b"+"), Ok((&[][..], 1)));
assert_eq!(sign(b""), Err(Incomplete(Size(1))));
assert_eq!(sign(b" "), Err(Error(Code(&b" "[..], Alt))));
}
}
~~~
First, we parse either a plus or a minus sign.
@ -70,144 +109,262 @@ We can directly map the result of the sub-parsers to either `-1` or `1`, so we d
Next we parse the year, which consists of an optional sign and 4 digits (I know, I know, it is possible to extend this to more digits, but let's keep it simple for now).
~~~rust
named!(positive_year <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32));
named!(pub year <&[u8], i32>, chain!(
pref: opt!(sign) ~
y: positive_year
,
|| {
pref.unwrap_or(1) * y
}));
use std::ops::{AddAssign, MulAssign};
fn buf_to_int<T>(s: &[u8]) -> T
where
T: AddAssign + MulAssign + From<u8>,
{
let mut sum = T::from(0);
for digit in s {
sum *= T::from(10);
sum += T::from(*digit - b'0');
}
sum
}
named!(positive_year <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
named!(pub year <&[u8], i32>, do_parse!(
pref: opt!(sign) >>
y: positive_year >>
(pref.unwrap_or(1) * y)
));
#[cfg(test)]
mod tests {
use positive_year;
use year;
#[test]
fn parse_positive_year() {
assert_eq!(positive_year(b"2018"), Ok((&[][..], 2018)));
}
#[test]
fn parse_year() {
assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
assert_eq!(year(b"+2018"), Ok((&[][..], 2018)));
assert_eq!(year(b"-2018"), Ok((&[][..], -2018)));
}
}
~~~
A lot of additional stuff here. So let's separate it.
~~~rust
named!(positive_year <&[u8], i32>, map!(call!(take_4_digits), buf_to_i32));
named!(positive_year <&[u8], i32>, map!(take_while_m_n!(4, 4, nom::is_digit), buf_to_int));
~~~
This creates a new named parser, that again returns the remaining input and an 32-bit integer.
To work, it first calls `take_4_digits` and then maps that result to the corresponding integer (using a [small helper function][buftoi32]).
To work, it first calls `take_4_digits` and then maps that result to the corresponding integer.
`take_4_digits` is another small helper parser. We also got one for 2 digits:
`take_while_m_n` is another small helper parser. We will also use one for 2 digits:
~~~rust
named!(pub take_4_digits, flat_map!(take!(4), check!(is_digit)));
named!(pub take_2_digits, flat_map!(take!(2), check!(is_digit)));
take_while_m_n!(4, 4, nom::is_digit)
take_while_m_n!(2, 2, nom::is_digit)
~~~
This takes 4 (or 2) characters from the input and checks that each character is a digit.
`flat_map!` and `check!` are quite generic, so they are useful for a lot of cases.
~~~rust
named!(pub year <&[u8], i32>, chain!(
named!(pub year <&[u8], i32>, do_parse!(
~~~
The year is also returned as a 32-bit integer (there's a pattern!).
Using the `chain!` macro, we can chain together multiple parsers and work with the sub-results.
Using the `do_parse!` macro, we can chain together multiple parsers and work with the sub-results.
~~~rust
pref: opt!(sign) ~
y: positive_year
pref: opt!(sign) >>
y: positive_year >>
~~~
Our sign is directly followed by 4 digits. It's optional though, that's why we use `opt!`.
`~` is the concatenation operator in the `chain!` macro.
`>>` is the concatenation operator in the `chain!` macro.
We save the sub-results to variables (`pref` and `y`).
~~~rust
,
|| {
pref.unwrap_or(1) * y
}));
(pref.unwrap_or(1) * y)
~~~
To get the final result, we multiply the prefix (which comes back as either `1` or `-1`) with the year.
Don't forget the `,` (comma) right before the closure.
This is a small syntactic hint for the `chain!` macro that the mapping function will follow and no more parsers.
We can now successfully parse a year:
~~~rust
assert_eq!(Done(&[][..], 2015), year(b"2015"));
assert_eq!(Done(&[][..], -0333), year(b"-0333"));
assert_eq!(year(b"2018"), Ok((&[][..], 2018)));
assert_eq!(year(b"-0333"), Ok((&[][..], -0333)));
~~~
Our nom parser will return an `IResult`. If all went well, we get `Done(I,O)` with `I` and `O` being the appropriate types.
Our nom parser will return an `IResult`.
~~~rust
type IResult<I, O, E = u32> = Result<(I, O), Err<I, E>>;
pub enum Err<I, E = u32> {
Incomplete(Needed),
Error(Context<I, E>),
Failure(Context<I, E>),
}
~~~
If all went well, we get `Ok(I,O)` with `I` and `O` being the appropriate types.
For our case `I` is the same as the input, a buffer slice (`&[u8]`), and `O` is the output of the parser itself, an integer (`i32`).
The return value could also be an `Error(Err)`, if something went completely wrong, or `Incomplete(u32)`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters input).
The return value could also be an `Err(Failure)`, if something went completely wrong, or `Err(Incomplete(Needed))`, requesting more data to be able to satisfy the parser (you can't parse a 4-digit year with only 3 characters input).
Parsing the month and day is a bit easier now: we simply take the digits and map them to an integer:
~~~rust
named!(pub month <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
named!(pub day <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
named!(month <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(day <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
#[cfg(test)]
mod tests {
use day;
use month;
#[test]
fn parse_month() {
assert_eq!(month(b"06"), Ok((&[][..], 06)));
}
#[test]
fn parse_day() {
assert_eq!(day(b"18"), Ok((&[][..], 18)));
}
}
~~~
All that's left is combining these 3 parts to parse a full date.
Again we can chain the different parsers and map it to some useful value:
~~~rust
named!(pub date <&[u8], Date>, chain!(
y: year ~
tag!("-") ~
m: month ~
tag!("-") ~
d: day
,
|| { Date{ year: y, month: m, day: d }
#[derive(Eq, PartialEq, Debug)]
pub struct Date {
year: i32,
month: u8,
day: u8,
}
named!(pub date <&[u8], Date>, do_parse!(
year: year >>
tag!("-") >>
month: month >>
tag!("-") >>
day: day >>
(Date { year, month, day})
));
#[cfg(test)]
mod tests {
use date;
use Date;
#[test]
fn parse_date() {
assert_eq!(
Ok((
&[][..],
Date {
year: 2015,
month: 7,
day: 16
}
)),
date(b"2015-07-16")
);
assert_eq!(
Ok((
&[][..],
Date {
year: -333,
month: 6,
day: 11
}
)),
date(b"-0333-06-11")
);
}
}
~~~
`Date` is a [small struct][datestruct], that can hold the necessary information, just as you would expect.
And it already works:
~~~rust
assert_eq!(Done(&[][..], Date{ year: 2015, month: 7, day: 16 }), date(b"2015-07-16"));
assert_eq!(Done(&[][..], Date{ year: -333, month: 6, day: 11 }), date(b"-0333-06-11"));
~~~
And running the tests shows it already works!
### Parsing the time: 16:43:52
Next, we parse the time. The individual parts are really simple, just some digits:
~~~rust
named!(pub hour <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
named!(pub minute <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
named!(pub second <&[u8], u32>, map!(call!(take_2_digits), buf_to_u32));
named!(pub hour <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(pub minute <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
named!(pub second <&[u8], u8>, map!(take_while_m_n!(2, 2, nom::is_digit), buf_to_int));
~~~
Putting them together becomes a bit more complex, as the `second` part is optional:
~~~rust
named!(pub time <&[u8], Time>, chain!(
h: hour ~
tag!(":") ~
m: minute ~
s: empty_or!(chain!(tag!(":") ~ s:second , || { s }))
,
|| { Time{ hour: h,
minute: m,
second: s.unwrap_or(0),
tz_offset: 0 }
#[derive(Eq, PartialEq, Debug)]
pub struct Time {
hour: u8,
minute: u8,
second: u8,
tz_offset: i32,
}
named!(pub time <&[u8], Time>, do_parse!(
hour: hour >>
tag!(":") >>
minute: minute >>
second: opt!(complete!(do_parse!(
tag!(":") >>
second: second >>
(second)
))) >>
(Time {hour, minute, second: second.unwrap_or(0), tz_offset: 0})
));
#[cfg(test)]
mod tests {
use time;
use Time;
#[test]
fn parse_time() {
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 52,
tz_offset: 0
}
)),
time(b"16:43:52")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: 0
}
)),
time(b"16:43")
);
}
}
~~~
As you can see, even `chain!` parsers can be nested.
As you can see, even `do_parse!` parsers can be nested.
The sub-parts then must be mapped once for the inner parser and once into the final value of the outer parser.
`empty_or!` returns an `Option`. Either `None` if there is no input left or it applies the nested parser. If this parser doesn't fail, `Some(value)` is returned.
Our parser now works for simple time information:
~~~rust
assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second: 52, tz_offset: 0}), time(b"16:43:52"));
assert_eq!(Done(&[][..], Time{ hour: 16, minute: 43, second: 0, tz_offset: 0}), time(b"16:43"));
~~~
`opt!` returns an `Option`. Either `None` if there is no input left or it applies the nested parser. If this parser doesn't fail, `Some(value)` is returned.
Our parser now works for simple time information.
But it leaves out one important bit: the timezone.
### Parsing the timezone: +0100
@ -234,12 +391,13 @@ It's a simple `Z` character, which we map to `0`.
The other case is the sign-separated hour and minute offset.
~~~rust
named!(timezone_hour <&[u8], i32>, chain!(
s: sign ~
h: hour ~
m: empty_or!(chain!(tag!(":")? ~ m: minute , || { m }))
,
|| { (s * (h as i32) * 3600) + (m.unwrap_or(0) * 60) as i32 }
named!(timezone_hour <&[u8], i32>, do_parse!(
sign: sign >>
hour: hour >>
minute: opt!(complete!(do_parse!(
opt!(tag!(":")) >> minute: minute >> (minute)
))) >>
((sign * (hour as i32 * 3600 + minute.unwrap_or(0) as i32 * 60)))
));
~~~
@ -248,7 +406,7 @@ The minutes are optional (and might be separated using a colon).
Instead of keeping this as is, we're mapping it to the offset in seconds.
We will see why later.
We could also just map it to a tuple like <br>`(s, h, m.unwrap_or(0))` and handle conversion at a later point.
We could also just map it to a tuple like <br>`(sign, hour, minute.unwrap_or(0))` and handle conversion at a later point.
Combined we get
@ -256,6 +414,82 @@ Combined we get
named!(timezone <&[u8], i32>, alt!(timezone_utc | timezone_hour));
~~~
Putting this back into time we get:
~~~rust
named!(pub time <&[u8], Time>, do_parse!(
hour: hour >>
tag!(":") >>
minute: minute >>
second: opt!(complete!(do_parse!(
tag!(":") >>
second: second >>
(second)
))) >>
tz_offset: opt!(complete!(timezone)) >>
(Time {hour, minute, second: second.unwrap_or(0), tz_offset: tz_offset.unwrap_or(0)})
));
#[cfg(test)]
mod tests {
use time;
use Time;
#[test]
fn parse_time_with_offset() {
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 52,
tz_offset: 0
}
)),
time(b"16:43:52Z")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: 5 * 3600
}
)),
time(b"16:43+05")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 15,
tz_offset: 5 * 3600
}
)),
time(b"16:43:15+0500")
);
assert_eq!(
Ok((
&[][..],
Time {
hour: 16,
minute: 43,
second: 0,
tz_offset: -(5 * 3600 + 30 * 60)
}
)),
time(b"16:43-05:30")
);
}
}
~~~
### Putting it all together
We now got individual parsers for the date, the time and the timezone offset.
@ -263,19 +497,51 @@ We now got individual parsers for the date, the time and the timezone offset.
Putting it all together, our final datetime parser looks quite small and easy to understand:
~~~rust
named!(pub datetime <&[u8], DateTime>, chain!(
d: date ~
tag!("T") ~
t: time ~
tzo: empty_or!(call!(timezone))
,
|| {
#[derive(Eq, PartialEq, Debug)]
pub struct DateTime {
date: Date,
time: Time,
}
named!(pub datetime <&[u8], DateTime>, do_parse!(
date: date >>
tag!("T") >>
time: time >>
(
DateTime{
date: d,
time: t.set_tz(tzo.unwrap_or(0)),
}
date,
time
}
)
));
#[cfg(test)]
mod tests {
use datetime;
use DateTime;
#[test]
fn parse_datetime() {
assert_eq!(
Ok((
&[][..],
DateTime {
date: Date {
year: 2007,
month: 08,
day: 31
},
time: Time {
hour: 16,
minute: 47,
second: 22,
tz_offset: 5 * 3600
}
}
)),
datetime(b"2007-08-31T16:47:22+05:00")
);
}
}
~~~
Nothing special anymore. We can now parse all kinds of date strings:
@ -296,7 +562,7 @@ But this is fine for now. We can handle the actual validation in a later step.
For example, we could use [chrono][], a time library, [to handle this for us][chrono-convert].
Using chrono it's obvious why we already multiplied our timezone offset to be in seconds: this time we can just hand it off to chrono as is.
The full code for this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing
The full code for the previous version of this ISO8601 parser is available in [easy.rs][easy.rs]. The repository also includes [a more complex parser][lib.rs], that does some validation while parsing
(it checks that the time and date are reasonable values, but it does not check that it is a valid date for example)
### What's left?
@ -336,7 +602,6 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
[nom]: https://github.com/Geal/nom
[gcouprie]: https://twitter.com/gcouprie
[taken]: https://github.com/badboy/iso8601/blob/master/src/macros.rs#L20-L39
[datestruct]: https://github.com/badboy/iso8601/blob/master/src/lib.rs#L19-23
[rdb-rs]: http://rdb.fnordig.de/
[rsedis]: https://github.com/seppo0010/rsedis
[rdb-rs-nom]: https://github.com/badboy/rdb-rs/tree/nom-parser
@ -348,5 +613,4 @@ Thanks to [Geoffroy][gcouprie] for the discussions, the help and for reading a d
[consumer]: https://github.com/Geal/nom#consumers
[machine]: https://github.com/Geal/machine
[microstate]: https://github.com/badboy/microstate
[buftoi32]: https://github.com/badboy/iso8601/blob/master/src/helper.rs#L8
[read]: http://doc.rust-lang.org/nightly/std/io/trait.Read.html