diff --git a/_posts/2024-12-30-a-lil-data-processing.md b/_posts/2024-12-30-a-lil-data-processing.md
new file mode 100644
index 0000000..59dbfa3
--- /dev/null
+++ b/_posts/2024-12-30-a-lil-data-processing.md
@@ -0,0 +1,198 @@
+---
+permalink: "/{{ year }}/{{ month }}/{{ day }}/a-lil-data-processing"
+title: "A Lil data processing"
+published_date: "2024-12-30 10:00:00 +0100"
+layout: post.liquid
+data:
+  route: blog
+excerpt: |
+  I used Lil to process some JSON data in a git repository and turn that into SQL statements.
+---
+
+As I [mentioned before](/2024/12/20/a-lil-advent-of-code/), I've been playing around with [Lil] and I like it so far.
+So much so that for a side project I wrote yet another small one-off script in Lil.
+
+[Lil]: https://beyondloom.com/decker/lil.html
+
+## The scenario
+
+I have a git repository with various files, each of which contains coordinates of a given trip.
+Early on in this project's lifetime a script fetched new data, converted it to JSON, updated the corresponding file and committed the changes.
+Only later did I extend it to actually save additional metadata, such as a timestamp per data point.
+
+So for these early tracking points I now want to restore _some_ sort of timeline, not an exact one, but as close as I can get.
+The best I can do is to associate every tracking point added by a commit with that commit's timestamp.
+Oh, and I will also end up storing that data in SQLite, not in JSON anymore.
+
+## The data
+
+Every commit diff looks something like this[^1]:
+
+```diff
+diff --git trip001.json trip001.json
+index 99af4c9..e0c1dea 100644
+--- trip001.json
++++ trip001.json
+@@ -42,4 +42,6 @@
+ ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01577,46.74132] }}
+ ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01572,46.74132] }}
+ ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01574,46.74126] }}
++,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}
++,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01571,46.74704] }}
+ ] }
+```
+
+I will look for each line starting with a `+` and read that line as JSON (after stripping the leading `+` and `,`).
+Then I parse out the coordinates[^2].
+
+## The script
+
+Let's start by getting all commits for a particular file:
+
+```lil
+commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
+commits:extract value orderby index desc from commits
+```
+
+The second line sorts the resulting list in reverse, ensuring I have the earliest commit first[^3].
+
+```lil
+valuefmt:"(%i, %f, %f, '%s', %i);"
+insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
+```
+
+Just some globals to use later: the SQL statement to generate and a format string for the values to put in.
+
+```lil
+cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
+id:"%i" parse shell[cmd].out
+```
+
+I need _some_ ID, and I decided to use the earliest commit's UNIX timestamp to start off with.
+This gets incremented to generate a new unique ID per data point.
+
+```lil
+each commit in commits
+```
+
+Now iterate over each commit in the list.
+
+```lil
+  cmd:" " fuse "git show --pretty=format:%aI",commit
+  diff:"\n" split shell[cmd].out
+  dateTime:diff[0]
+  unixTime:"%e" parse dateTime
+```
+
+To get the diff for every commit, invoke `git show`.
+The first line will be the commit's author date in strict ISO 8601 format (that's the `%aI` in that command).
+Lil's parsing functionality can turn this into a UNIX timestamp for me.
+
+```lil
+  each line in diff
+```
+
+The rest of the output contains the actual text diff.
+Go through it line by line.
+
+```lil
+    if line[0] = "+"
+      s:1 drop line
+      if s[0] = ","
+        s:1 drop s
+      end
+      j:"%j" parse s
+```
+
+Only lines that start with a `+` (added lines in the diff) are processed further,
+stripping the `,` at the start and parsing the rest as JSON.
+
+```lil
+      coord:j.geometry.coordinates
+      long:coord[0]
+      lat:coord[1]
+```
+
+`j` is already the parsed JSON, so I can access its fields.
+Should those fields not exist, the result coerces to `0`. No error is thrown.
+
+```lil
+      if long = 0
+      else
+        values:valuefmt format (id, lat, long, dateTime, unixTime)
+        stmt:" " fuse (insertstmt, values)
+        print[stmt]
+        id:id + 1
+      end
+```
+
+The coordinates might still be `0`, e.g. for the initial `+++ trip001.json`, which will never be valid JSON.
+No need to insert null coordinates.
+Last but not least, we print the final statement and increment the ID.
+
+```lil
+    end
+  end
+end
+```
+
+And it's done.
+Now for every data point it outputs an SQL statement:
+
+```shell
+; lilt import-trip001.lil
+INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
+INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
+INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709837, 46.74126, 28.01574, '2024-06-06T21:37:15Z', 1717709835);
+INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709838, 46.74132, 28.01568, '2024-06-06T21:37:15Z', 1717709835);
+INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709839, 46.74704, 28.01571, '2024-06-06T21:37:15Z', 1717709835);
+```
+
+Here's the full script once again:
+
+```lil
+commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
+commits:extract value orderby index desc from commits
+
+valuefmt:"(%i, %f, %f, '%s', %i);"
+insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
+
+cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
+id:"%i" parse shell[cmd].out
+
+each commit in commits
+  cmd:" " fuse "git show --pretty=format:%aI",commit
+  diff:"\n" split shell[cmd].out
+  dateTime:diff[0]
+  unixTime:"%e" parse dateTime
+
+  each line in diff
+    if line[0] = "+"
+      s:1 drop line
+      if s[0] = ","
+        s:1 drop s
+      end
+      j:"%j" parse s
+      coord:j.geometry.coordinates
+      long:coord[0]
+      lat:coord[1]
+
+      if long = 0
+      else
+        values:valuefmt format (id, lat, long, dateTime, unixTime)
+        stmt:" " fuse (insertstmt, values)
+        print[stmt]
+        id:id + 1
+      end
+    end
+  end
+end
+```
+
+---
+
+_Footnotes:_
+
+[^1]: The coordinates are intentionally obscured.
+[^2]: They are in `longitude, latitude` order. In case you were wondering. Like me every time I stare at them.
+[^3]: Sure, I could have used `--reverse`, but where's the fun in that?