permalink | title | published_date | layout | data | excerpt
---|---|---|---|---|---
/{{ year }}/{{ month }}/{{ day }}/a-lil-data-processing | A Lil data processing | 2024-12-30 10:00:00 +0100 | post.liquid | | I used Lil to process some JSON data in a git repository and turn that into SQL statements.
As I mentioned before, I've been playing around with Lil, and I like it so far. So much so that for a side project I wrote yet another small one-off script in Lil.
The scenario
I have a git repository with various files, each of which contains coordinates of a given trip. Early on in this project's lifetime a script fetched new data, converted it to JSON, updated the corresponding file and committed the changes. Only later did I extend it to also save additional metadata, such as a timestamp per data point.
So for these early tracking points I now want to restore some sort of timeline, not an exact one, but as close as I can get. The best I can do is to associate every tracking point added by a commit with that commit's timestamp. Oh, and also I will end up storing that data in SQLite, not in JSON anymore.
The data
Every commit diff looks something like this [1]:
diff --git trip001.json trip001.json
index 99af4c9..e0c1dea 100644
--- trip001.json
+++ trip001.json
@@ -42,4 +42,6 @@
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01577,46.74132] }}
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01572,46.74132] }}
,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01574,46.74126] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01571,46.74704] }}
] }
I will look for each line starting with a `+` and read that line as JSON (after stripping its leading `,`). Then I parse out the coordinates [2].
The script
Let's start by getting all commits for a particular file:
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits
The second line reverses the resulting list, so that the earliest commit comes first [3].
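To see what that query does in isolation, here is a minimal sketch that uses a made-up list in place of the `git log` output. The names `c3`, `c2` and `c1` are placeholders, not real commit hashes.

```lil
# Placeholder for the `git log` output, which lists the newest commit first.
commits:"\n" split "c3\nc2\nc1"
# Order the values by descending position, i.e. reverse the list.
commits:extract value orderby index desc from commits
show[commits]
```

If the query behaves as described, the oldest entry (`c1`) ends up at the front of the list.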
valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
Just some globals to use later: the SQL statement to generate and a format string for the values to put in.
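As a quick illustration of how the format string gets filled in, the sketch below applies it to one hand-written set of values. The numbers are copied from the sample output shown later in this post; they are only there for demonstration.

```lil
valuefmt:"(%i, %f, %f, '%s', %i);"
# The list order has to match the placeholders: id, latitude, longitude, dateTime, unixTime.
values:valuefmt format (1717709835, 46.74132, 28.01577, "2024-06-06T21:37:15Z", 1717709835)
print[values]
```

Each placeholder consumes one element of the list, so a missing or reordered element would silently shift the remaining columns.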
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out
I need some ID and I decided to use the earliest commit's UNIX timestamp to start off with. This gets incremented to generate a new unique ID per data point.
each commit in commits
Now iterate over each commit in the list.
  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime
To get the diff for every commit, invoke `git show`. The first line will be the commit date in strict ISO 8601 format (that's the `%aI` in that command). Lil's parsing functionality can turn this into a UNIX timestamp for me.
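The timestamp conversion can be tried on its own. This is a small sketch that uses the date from the sample output of this post rather than a real `git show` call:

```lil
# A strict ISO 8601 timestamp in UTC, taken from the example output in this post.
dateTime:"2024-06-06T21:37:15Z"
# "%e" parses an ISO 8601 date-time into seconds since the UNIX epoch.
unixTime:"%e" parse dateTime
show[unixTime]
```

For this input the result should be 1717709835, the same value that appears in the `unixTime` column of the generated statements.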
  each line in diff
The rest of the diff contains the actual text diff. Go through it line by line.
    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s
Only lines that start with a `+` (added lines in the diff) are processed further, stripping the `,` at the start and parsing the rest as JSON.
      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]
`j` is already the parsed JSON, so I can access its fields. Should those fields not exist, the result will coerce to `0`. No error is thrown.
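To make the difference between a data line and a non-data line concrete, here is a sketch that runs the same strip-and-parse steps on two hand-written lines. Both sample strings are assumptions for illustration: the first is a shortened version of an added line from the diff above, the second is the `+++` file header.

```lil
# A shortened added line (only the fields the script reads) and the diff's file header.
good:"+,{\"geometry\":{\"coordinates\":[-28.01568,46.74132]}}"
bad:"+++ trip001.json"

each line in (good, bad)
  # Drop the leading "+" and, if present, the leading ",".
  s:1 drop line
  if s[0] = ","
    s:1 drop s
  end
  j:"%j" parse s
  # For the header line there is no such field, so this yields an empty value instead of an error.
  show[j.geometry.coordinates]
end
```

The first line should show the coordinate pair, while the second one shows whatever Lil substitutes for a missing field; the script relies on that value comparing equal to `0`.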
      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end
The coordinates might still be `0`, e.g. for the initial `+++ trip001.json` line, which will never be valid JSON. No need to insert null coordinates. Last but not least, we print the final statement and increment the ID.
    end
  end
end
And it's done. Now for every data point it outputs an SQL statement:
; lilt import-trip001.lil
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709837, 46.74126, 28.01574, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709838, 46.74132, 28.01568, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709839, 46.74704, 28.01571, '2024-06-06T21:37:15Z', 1717709835);
Here's the full script once again:
commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits
valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"
cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out
each commit in commits
  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime
  each line in diff
    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s
      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]
      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end
    end
  end
end
Footnotes: