new post: A Lil data processing

commit 2484d27cdb (parent 7cc273a9a8)
_posts/2024-12-30-a-lil-data-processing.md (normal file, 198 lines added)

---
permalink: "/{{ year }}/{{ month }}/{{ day }}/a-lil-data-processing"
title: "A Lil data processing"
published_date: "2024-12-30 10:00:00 +0100"
layout: post.liquid
data:
  route: blog
excerpt: |
  I used Lil to process some JSON data in a git repository and turn that into SQL statements.
---

As I [mentioned before](/2024/12/20/a-lil-advent-of-code/), I've been playing around with [Lil], and I like it so far.
So much so that I wrote yet another small one-off script in Lil for a side project.

[Lil]: https://beyondloom.com/decker/lil.html

## The scenario

I have a git repository with various files, each of which contains the coordinates of a given trip.
Early on in this project's lifetime, a script fetched new data, converted it to JSON, updated the corresponding file and committed the changes.
Only later did I extend it to save additional metadata, such as a timestamp per data point.

So for these early tracking points I now want to restore _some_ sort of timeline: not an exact one, but as close as I can get.
The best I can do is to associate every tracking point added by a commit with that commit's timestamp.
Oh, and I will also end up storing that data in SQLite, not in JSON anymore.

## The data

Every commit diff looks something like this[^1]:

```diff
diff --git trip001.json trip001.json
index 99af4c9..e0c1dea 100644
--- trip001.json
+++ trip001.json
@@ -42,4 +42,6 @@
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01577,46.74132] }}
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01572,46.74132] }}
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01574,46.74126] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01571,46.74704] }}
 ] }
```

I will look for each line starting with a `+` and read the rest of that line as JSON (after also stripping its leading `,`).
Then I pull the coordinates out of the parsed object[^2].
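
For readers without Lil at hand, the same idea can be sketched in Python. This is only an illustration of the line handling described above, not the script from this post; the function name `parse_added_line` is made up for the example.

```python
import json


def parse_added_line(line):
    """Return (longitude, latitude) for an added feature line, else None."""
    # Only lines added by the commit are of interest.
    if not line.startswith("+"):
        return None
    payload = line[1:]
    # Feature lines carry a leading comma because they sit inside a JSON array.
    if payload.startswith(","):
        payload = payload[1:]
    try:
        feature = json.loads(payload)
        lon, lat = feature["geometry"]["coordinates"][:2]
    except (ValueError, KeyError, TypeError, IndexError):
        # Not JSON (e.g. the "+++ file" header) or not a point feature.
        return None
    return lon, lat


added = '+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}'
print(parse_added_line(added))
print(parse_added_line("+++ trip001.json"))
```

Unlike the Lil version, which quietly coerces missing fields to `0`, this sketch has to catch the parse and lookup errors explicitly.
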

## The script

Let's start by getting all commits for a particular file:

```lil
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits
```

`git log` lists the newest commit first, so the second line reverses the resulting list, ensuring I have the earliest commit first[^3].

```lil
valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
```

Just some globals to use later: the SQL statement to generate and a format string for the values to put in.

```lil
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out
```

I need _some_ ID, and I decided to use the earliest commit's UNIX timestamp to start off with.
This gets incremented to generate a new unique ID per data point.

```lil
each commit in commits
```

Now iterate over each commit in the list.

```lil
  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime
```

To get the diff for every commit, I invoke `git show`.
The first line will be the commit's author date in strict ISO 8601 format (that's the `%aI` in that command).
Lil's parsing functionality can turn this into a UNIX timestamp for me.
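
As a point of comparison, here is a rough Python sketch of the same conversion. It is not part of the Lil script; `iso_to_unix` is a hypothetical helper name, and the sample timestamp is the one that appears in the output further down.

```python
from datetime import datetime


def iso_to_unix(stamp):
    """Convert a strict ISO 8601 timestamp (offset or trailing Z) to Unix seconds."""
    # Older Python versions reject a trailing "Z", so normalize it to an explicit offset.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return int(datetime.fromisoformat(stamp).timestamp())


print(iso_to_unix("2024-06-06T21:37:15Z"))  # 1717709835
```

Because the timestamp carries its own UTC offset, the result does not depend on the local timezone of the machine running it.
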

```lil
  each line in diff
```

The rest of the output contains the actual text diff.
Go through it line by line.

```lil
    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s
```

Only lines that start with a `+` (added lines in the diff) are processed further:
the `+` and a leading `,` are stripped and the rest is parsed as JSON.

```lil
      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]
```

`j` is already the parsed JSON, so I can access its fields.
Should those fields not exist, the result will coerce to `0`. No error is thrown.

```lil
      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end
```

The coordinates might still be `0`, e.g. for the `+++ trip001.json` file header line, which will never be valid JSON.
No need to insert null coordinates.
Last but not least, we print the final statement and increment the ID.
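
The skip-and-format step can be sketched in Python as follows. This is an illustration only: `make_statement` is a made-up name, and plain string interpolation is acceptable here solely because the values come from my own repository rather than from untrusted input.

```python
INSERT_STMT = "INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"


def make_statement(row_id, lon, lat, date_time, unix_time):
    """Return one INSERT statement, or None when the longitude is the 0 sentinel."""
    # A longitude of 0 marks a line that did not parse into a point feature.
    if lon == 0:
        return None
    # Note the column order: latitude first, even though the JSON stores longitude first.
    return f"{INSERT_STMT} ({row_id}, {lat}, {lon}, '{date_time}', {unix_time});"


print(make_statement(1717709835, -28.01577, 46.74132, "2024-06-06T21:37:15Z", 1717709835))
```

One consequence of using `0` as the sentinel is that a genuine point at exactly longitude 0 would be skipped as well.
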

```lil
    end
  end
end
```

And it's done.
Now for every data point it outputs an SQL statement:

```shell
; lilt import-trip001.lil
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709837, 46.74126, 28.01574, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709838, 46.74132, 28.01568, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709839, 46.74704, 28.01571, '2024-06-06T21:37:15Z', 1717709835);
```
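
To check that statements of this shape load cleanly, they can be run against a throwaway in-memory SQLite database. The sketch below uses Python's `sqlite3` module; the table definition is an assumption made for the example, since the real schema is not shown in this post.

```python
import sqlite3

# Two of the generated statements, copied from the output above.
statements = """
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: column names match the INSERT statements, types are a guess.
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, latitude REAL, longitude REAL, "
    "dateTime TEXT, unixTime INTEGER)"
)
conn.executescript(statements)

count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(count)
```

With `id` as the primary key, a duplicate ID would make the import fail loudly, which is a cheap way to confirm that the incrementing scheme really produces unique values.
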

Here's the full script once again:

```lil
# All commits touching the file; git log lists newest first, so reverse the list.
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits

valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"

# Start IDs at the earliest commit's UNIX timestamp.
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out

each commit in commits
  # The first output line is the ISO 8601 author date, the rest is the diff.
  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime

  each line in diff
    # Only added lines carry new tracking points.
    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s
      # Longitude comes first, then latitude.
      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]

      # Skip lines that did not parse into a feature.
      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end
    end
  end
end
```

---

_Footnotes:_

[^1]: The coordinates are intentionally obscured.
[^2]: They are in `longitude, latitude` order. In case you were wondering. Like me every time I stare at them.
[^3]: Sure, I could have used `--reverse`, but where's the fun in that?