
new post: A Lil data processing

This commit is contained in:
Jan-Erik Rediger 2024-12-30 09:57:03 +01:00
parent 7cc273a9a8
commit 2484d27cdb


@ -0,0 +1,198 @@
---
permalink: "/{{ year }}/{{ month }}/{{ day }}/a-lil-data-processing"
title: "A Lil data processing"
published_date: "2024-12-30 10:00:00 +0100"
layout: post.liquid
data:
route: blog
excerpt: |
I used Lil to process some JSON data in a git repository and turn that into SQL statements.
---
As I [mentioned before](/2024/12/20/a-lil-advent-of-code/), I've been playing around with [Lil] and I like it so far.
So much so that I wrote yet another small one-off script in Lil, this time for a side project.
[Lil]: https://beyondloom.com/decker/lil.html
## The scenario
I have a git repository with various files, each of which contains coordinates of a given trip.
Early on in this project's lifetime a script fetched new data, converted it to JSON, updated the corresponding file and committed the changes.
Only later did I extend it to save additional metadata, such as a timestamp per data point.
So for these early tracking points I now want to restore _some_ sort of timeline, not an exact one, but as close as I can get.
The best I can do is to associate every tracking point added by a commit with that commit's timestamp.
Oh, and also I will end up storing that data in SQLite, not in JSON anymore.
## The data
Every commit diff looks something like this[^1]:
```diff
diff --git trip001.json trip001.json
index 99af4c9..e0c1dea 100644
--- trip001.json
+++ trip001.json
@@ -42,4 +42,6 @@
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01577,46.74132] }}
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01572,46.74132] }}
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01574,46.74126] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01571,46.74704] }}
] }
```
I will look for each line starting with a `+` and read that line as JSON (after stripping its first `,`).
Then parse out the coordinates[^2].
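
For readers who don't know Lil, here is a rough Python sketch of that per-line step. It is only a comparison under assumptions, not the script from this post: the function name `parse_added_line` is made up for illustration, and returning `None` for unusable lines is a choice of this sketch (the Lil script relies on coercion to `0` instead).

```python
import json
from typing import Optional, Tuple


def parse_added_line(line: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for an added GeoJSON point line, else None.

    Hypothetical helper for illustration only.
    """
    if not line.startswith("+"):
        return None  # context line or removed line, nothing was added here
    payload = line[1:]  # drop the diff marker
    if payload.startswith(","):
        payload = payload[1:]  # drop the separator in front of each feature
    try:
        feature = json.loads(payload)
        lon, lat = feature["geometry"]["coordinates"]
    except (ValueError, KeyError, TypeError):
        # Not JSON (such as the "+++ trip001.json" header) or not a point feature.
        return None
    return float(lon), float(lat)
```

The tuple keeps the `longitude, latitude` order of the source data, so the swap to `latitude, longitude` only happens when the values are written out.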
## The script
Let's start by getting all commits for a particular file:
```lil
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits
```
The second line sorts the resulting list in reverse, ensuring I have the earliest commit first[^3].
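
As a comparison, the same two steps might look like this in Python. This is a sketch with assumptions: `oldest_first` and `commits_for` are hypothetical names, and `commits_for` assumes `git` is on the `PATH` and the current directory is inside the repository.

```python
import subprocess
from typing import List


def oldest_first(log_output: str) -> List[str]:
    """Turn newest-first `git log` output into an oldest-first list of hashes."""
    hashes = [h for h in log_output.splitlines() if h.strip()]
    return hashes[::-1]


def commits_for(path: str) -> List[str]:
    """List the commits touching `path`, earliest first (hypothetical helper)."""
    result = subprocess.run(
        ["git", "log", "--follow", "--pretty=format:%H", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return oldest_first(result.stdout)
```

Calling `commits_for("trip001.json")` would then play the role of the `commits` list above.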
```lil
valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
```
Just some globals to use later: The SQL statement to generate and a formatting string for the values to put in.
```lil
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out
```
I need _some_ ID and I decided to use the earliest commit's UNIX timestamp to start off with.
This gets incremented to generate a new unique ID per data point.
```lil
each commit in commits
```
Now iterate over each commit in the list.
```lil
cmd:" " fuse "git show --pretty=format:%aI",commit
diff:"\n" split shell[cmd].out
dateTime:diff[0]
unixTime:"%e" parse dateTime
```
To get the diff for every commit, invoke `git show`.
The first line will be the commit date in strict ISO 8601 format (that's the `%aI` in that command).
Lil's parsing functionality can turn this into a UNIX timestamp for me.
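
To make that conversion concrete, here is a small Python sketch of the same idea. It is not how Lil does it internally; `iso_to_unix` is a hypothetical name, and the `Z` handling is only there because older Python versions do not accept that suffix in `datetime.fromisoformat`.

```python
from datetime import datetime


def iso_to_unix(stamp: str) -> int:
    """Convert a strict ISO 8601 timestamp with a UTC offset to UNIX seconds."""
    text = stamp.strip()
    if text.endswith("Z"):
        # Spell out the UTC offset so older Python versions can parse it.
        text = text[:-1] + "+00:00"
    return int(datetime.fromisoformat(text).timestamp())


print(iso_to_unix("2024-06-06T21:37:15Z"))       # 1717709835
print(iso_to_unix("2024-06-06T23:37:15+02:00"))  # 1717709835, same instant
```

Both spellings describe the same instant, so they map to the same UNIX timestamp.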
```lil
each line in diff
```
The rest of the diff contains the actual text diff.
Go through it line by line.
```lil
if line[0] = "+"
s:1 drop line
if s[0] = ","
s:1 drop s
end
j:"%j" parse s
```
Only lines that start with a `+` (added lines in the diff) are processed further:
the `+` and the `,` at the start are stripped and the rest is parsed as JSON.
```lil
coord:j.geometry.coordinates
long:coord[0]
lat:coord[1]
```
`j` is already the parsed JSON, so I can access its fields.
Should those fields not exist, the value coerces to `0`. No error is thrown.
```lil
if long = 0
else
values:valuefmt format (id, lat, long, dateTime, unixTime)
stmt:" " fuse (insertstmt, values)
print[stmt]
id:id + 1
end
```
The coordinates might still be `0`, e.g. for the initial `+++ trip001.json`, which will never be valid JSON.
No need to insert null coordinates.
Last but not least, we print the final statement and increment the ID.
```lil
end
end
end
```
And it's done.
Now for every data point it outputs an SQL statement:
```shell
; lilt import-trip001.lil
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709837, 46.74126, 28.01574, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709838, 46.74132, 28.01568, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709839, 46.74704, 28.01571, '2024-06-06T21:37:15Z', 1717709835);
```
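
The statement generation can also be sketched in Python for comparison. Treat it as an illustration under assumptions: `insert_statements` and `INSERT_PREFIX` are hypothetical names, points are `(longitude, latitude)` tuples, and unusable lines are represented as `None` here rather than as a `0` longitude.

```python
from typing import Iterable, Iterator, Optional, Tuple

INSERT_PREFIX = (
    "INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
)


def insert_statements(
    points: Iterable[Optional[Tuple[float, float]]],
    first_id: int,
    date_time: str,
    unix_time: int,
) -> Iterator[str]:
    """Yield one INSERT statement per usable point, with an incrementing ID."""
    next_id = first_id
    for point in points:
        if point is None:
            continue  # nothing to insert for lines without coordinates
        lon, lat = point
        # The table stores latitude first, the source data longitude first.
        yield (
            f"{INSERT_PREFIX} "
            f"({next_id}, {lat}, {lon}, '{date_time}', {unix_time});"
        )
        next_id += 1
```

Pasting values into the SQL text is fine for a one-off import of data I generated myself; for anything that takes outside input, parameterized queries would be the safer choice.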
Here's the full script once again:
```lil
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits
valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out
each commit in commits
cmd:" " fuse "git show --pretty=format:%aI",commit
diff:"\n" split shell[cmd].out
dateTime:diff[0]
unixTime:"%e" parse dateTime
each line in diff
if line[0] = "+"
s:1 drop line
if s[0] = ","
s:1 drop s
end
j:"%j" parse s
coord:j.geometry.coordinates
long:coord[0]
lat:coord[1]
if long = 0
else
values:valuefmt format (id, lat, long, dateTime, unixTime)
stmt:" " fuse (insertstmt, values)
print[stmt]
id:id + 1
end
end
end
end
```
---
_Footnotes:_
[^1]: The coordinates are intentionally obscured.
[^2]: They are in `longitude, latitude` order. In case you were wondering. Like me every time I stare at them.
[^3]: Sure, I could have used `--reverse`, but where's the fun in that?