blog/_posts/2024-12-30-a-lil-data-processing.md

permalink: /{{ year }}/{{ month }}/{{ day }}/a-lil-data-processing
title: A Lil data processing
published_date: 2024-12-30 10:00:00 +0100
layout: post.liquid
data:
  route: blog
excerpt: I used Lil to process some JSON data in a git repository and turn that into SQL statements.

As I mentioned before, I've been playing around with Lil and I like it so far. So much so that I wrote yet another small one-off script in Lil for a side project.

The scenario

I have a git repository with various files, each of which contains the coordinates of a given trip. Early on in this project's lifetime a script fetched new data, converted it to JSON, updated the corresponding file and committed the changes. Only later did I extend it to also save additional metadata, such as a timestamp per data point.

So for these early tracking points I now want to restore some sort of timeline, not an exact one, but as close as I can get. The best I can do is to associate every tracking point added by a commit with that commit's timestamp. Oh, and also I will end up storing that data in SQLite, not in JSON anymore.

The data

Every commit diff looks something like this[1]:

diff --git trip001.json trip001.json
index 99af4c9..e0c1dea 100644
--- trip001.json
+++ trip001.json
@@ -42,4 +42,6 @@
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01577,46.74132] }}
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01572,46.74132] }}
 ,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01574,46.74126] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01568,46.74132] }}
+,{"type":"Feature", "properties":{}, "geometry": { "type": "Point", "coordinates": [-28.01571,46.74704] }}
 ] }

I will look for every line starting with a +, strip that + and the leading , and read the rest as JSON. Then I pull the coordinates out of the parsed object[2].
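
For anyone who doesn't read Lil, that per-line step can be sketched in Python first. This is only an illustration: the function name `parse_added_point` is made up, and it assumes every data line looks like the ones in the diff above.

```python
import json

def parse_added_point(line):
    """Return (longitude, latitude) for an added GeoJSON line, else None."""
    if not line.startswith("+"):
        return None  # context line or removed line
    body = line[1:]  # drop the diff marker
    if body.startswith(","):
        body = body[1:]  # drop the leading comma of the array element
    try:
        feature = json.loads(body)
        lon, lat = feature["geometry"]["coordinates"]
    except (ValueError, KeyError, TypeError):
        return None  # e.g. the "+++ trip001.json" header is not JSON
    return lon, lat
```

Unlike the Lil script below, this sketch signals an unusable line with `None` instead of relying on missing fields coercing to 0.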

The script

Let's start by getting all commits for a particular file:

commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits

The second line sorts the resulting list in reverse, ensuring I have the earliest commit first[3].
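
The same two steps, sketched in Python for comparison. The helper names `commits_oldest_first` and `file_history` are hypothetical, and the sketch assumes git is on the PATH and the script runs inside the repository.

```python
import subprocess

def commits_oldest_first(log_output):
    """Split `git log` output into hashes and reverse it (oldest first)."""
    hashes = [h for h in log_output.splitlines() if h]
    return hashes[::-1]

def file_history(path):
    # git log prints the newest commit first; the reversal happens afterwards
    result = subprocess.run(
        ["git", "log", "--follow", "--pretty=format:%H", "--", path],
        capture_output=True, text=True, check=True,
    )
    return commits_oldest_first(result.stdout)
```

Keeping the reversal in its own function means it can be tried on a plain string without touching a repository.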

valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"

Just some globals to use later: the SQL statement to generate and a format string for the values to put in.

cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out

I need some ID, and I decided to use the earliest commit's UNIX timestamp to start off with. It gets incremented to generate a new unique ID per data point.

each commit in commits

Now iterate over each commit in the list.

  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime

To get the diff for a commit, invoke git show. The first line of its output will be the commit date in strict ISO 8601 format (that's the %aI in that command). Lil's parsing functionality can turn this into a UNIX timestamp for me.
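
As a cross-check of that conversion, here is a small Python sketch. The function name `iso_to_unix` is made up, and the sketch assumes the input always carries either a trailing Z or an explicit UTC offset, as strict ISO 8601 dates from git do.

```python
from datetime import datetime

def iso_to_unix(stamp):
    """Convert a strict ISO 8601 timestamp to whole UNIX seconds."""
    # Older Python versions reject a trailing "Z", so spell out the offset
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return int(datetime.fromisoformat(stamp).timestamp())

print(iso_to_unix("2024-06-06T21:37:15Z"))  # 1717709835
```

The same instant written with a different offset, such as 2024-06-06T23:37:15+02:00, gives the same number.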

  each line in diff

The rest of the diff contains the actual text diff. Go through it line by line.

    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s

Only lines that start with a + (added lines in the diff) are processed further: the + and a leading , are stripped and the rest is parsed as JSON.

      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]

j is already the parsed JSON, so I can access its fields. Should those fields not exist, the result coerces to 0 instead of throwing an error.

      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end

The coordinates might still be 0, e.g. for the diff header line +++ trip001.json, which starts with a + but will never be valid JSON. There is no need to insert such empty coordinates, so those lines are skipped. For everything else, print the final statement and increment the ID.

    end
  end
end
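
The skip-or-print logic at the end of the loop can also be sketched in Python. The names `insert_statement` and `statements` are hypothetical, and points that could not be parsed are assumed to be represented as `None` rather than as 0.

```python
INSERT_STMT = "INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"

def insert_statement(row_id, lat, lon, date_time, unix_time):
    # Plain string formatting is fine for trusted, self-generated values
    values = f"({row_id}, {lat}, {lon}, '{date_time}', {unix_time});"
    return f"{INSERT_STMT} {values}"

def statements(points, first_id, date_time, unix_time):
    """Yield one INSERT per valid point, numbering rows from first_id."""
    row_id = first_id
    for point in points:
        if point is None:
            continue  # nothing to insert for unparsable lines
        lon, lat = point  # stored as longitude, latitude
        yield insert_statement(row_id, lat, lon, date_time, unix_time)
        row_id += 1  # only consumed IDs are incremented
```

For data that doesn't come from my own repository, parameterized queries would be the safer choice over building SQL strings.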

And it's done. Now for every data point it outputs an SQL statement:

; lilt import-trip001.lil
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709836, 46.74132, 28.01572, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709837, 46.74126, 28.01574, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709838, 46.74132, 28.01568, '2024-06-06T21:37:15Z', 1717709835);
INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES (1717709839, 46.74704, 28.01571, '2024-06-06T21:37:15Z', 1717709835);
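
Statements like these can be dry-run against a throwaway in-memory SQLite database before touching the real one. The sketch below uses Python's sqlite3 module. The schema is an assumption (only the column names are known, the types are guessed), and `load_statements` is a made-up helper name.

```python
import sqlite3

# Assumed schema: only the column names are known, the types are guesses
SCHEMA = """
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    latitude REAL,
    longitude REAL,
    dateTime TEXT,
    unixTime INTEGER
);
"""

def load_statements(sql_text):
    """Run the generated statements against a throwaway database."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executescript(sql_text)
    return db

db = load_statements(
    "INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) "
    "VALUES (1717709835, 46.74132, 28.01577, '2024-06-06T21:37:15Z', 1717709835);"
)
print(db.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

With id declared as the primary key, a duplicate ID makes the import fail loudly instead of silently producing two rows.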

Here's the full script once again:

commits:"\n" split shell["git log --follow --pretty=format:'%H' -- trip001.json"].out
commits:extract value orderby index desc from commits

valuefmt:"(%i, %f, %f, '%s', %i);"
insertstmt:"INSERT INTO messages (id, latitude, longitude, dateTime, unixTime) VALUES"

cmd:" " fuse "git log -1 --pretty=format:%at",commits[0]
id:"%i" parse shell[cmd].out

each commit in commits
  cmd:" " fuse "git show --pretty=format:%aI",commit
  diff:"\n" split shell[cmd].out
  dateTime:diff[0]
  unixTime:"%e" parse dateTime

  each line in diff
    if line[0] = "+"
      s:1 drop line
      if s[0] = ","
        s:1 drop s
      end
      j:"%j" parse s
      coord:j.geometry.coordinates
      long:coord[0]
      lat:coord[1]

      if long = 0
      else
        values:valuefmt format (id, lat, long, dateTime, unixTime)
        stmt:" " fuse (insertstmt, values)
        print[stmt]
        id:id + 1
      end
    end
  end
end

Footnotes:

  1. The coordinates are intentionally obscured.

  2. They are in longitude, latitude order. In case you were wondering. Like me every time I stare at them.

  3. Sure, I could have used --reverse, but where's the fun in that?