adamwiggins@mas.to

What's the best tool for lightweight data science that makes it easy to pull together tabular data (usually CSVs) from multiple sources?

Doing some number crunching for @elicitorg but I'm reminded this is a perpetual pain point in every early-stage company I've been at.

adamwiggins@mas.to

I can export CSVs from all of these places, but importing then "joining" the data (by user ID or email address) is pretty clunky.

adamwiggins@mas.to

Example scenario here is that I want to do some ad-hoc queries to understand credit usage as broken down by user demographic.

• Billing data is in a SQL database (Postgres / @Metabase)
• Signup survey is in @Typeform
• Product metrics are in @MixPanel

adamwiggins@mas.to

The fastest method I've found to date is writing one-off Ruby scripts using CSV.read(), doing the query in memory.

Then output the results to the terminal, or print a summarized CSV for import into a spreadsheet to create a chart.

Is this really the best way?
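
For readers who want to see the shape of that in-memory approach, here is a minimal sketch in Python rather than Ruby, using only the standard library. The column names (`email`, `credits_used`, `role`) and the sample rows are hypothetical stand-ins for the billing and survey exports, not the real schemas.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for two exported files; real exports would be opened
# with open("billing.csv", newline="") instead of io.StringIO.
BILLING_CSV = """email,credits_used
ana@example.com,120
bo@example.com,30
ana@example.com,80
cy@example.com,50
"""

SURVEY_CSV = """email,role
ana@example.com,researcher
bo@example.com,student
cy@example.com,researcher
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def credits_by_role(billing_rows, survey_rows):
    """Join billing to survey answers on email, then sum credits per role."""
    role_by_email = {row["email"]: row["role"] for row in survey_rows}
    totals = defaultdict(int)
    for row in billing_rows:
        # Users with no survey response land in an explicit "unknown" bucket.
        role = role_by_email.get(row["email"], "unknown")
        totals[role] += int(row["credits_used"])
    return dict(totals)


totals = credits_by_role(read_rows(BILLING_CSV), read_rows(SURVEY_CSV))
# Print a summarized CSV that could be pasted into a spreadsheet for charting.
for role, credits in sorted(totals.items()):
    print(f"{role},{credits}")
```

The join here is just a dictionary lookup keyed on email, which is fine at this scale but has to be rewritten for every new question.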

adamwiggins@mas.to

Another approach is to export data from other sources and import into MixPanel via their User->Import from CSV. This... works, and MixPanel is good for ad-hoc queries, but the import itself is slow, manual, and error-prone.

adamwiggins@mas.to

Also kind of amazing to me that spreadsheets (including Excel, Google Sheets, and Numbers) get painfully slow or stop working completely with a relatively small number of rows e.g. 20,000.

adamwiggins@mas.to

Two I've used a lot are spreadsheets and various computational notebooks (e.g. @DeepnoteHQ with Python/pandas/numpy).

I always find I spend a huge amount of time just trying to get the CSV data imported, cleaned up, and joined before I can do any queries. Feels bad.
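
Part of that cleanup is usually getting the join keys to line up. A rough sketch of one such step, assuming the key is an email address and that trimming whitespace and lowercasing is an acceptable normalization (it may not be for every dataset); the helper names and sample values are made up for illustration:

```python
def normalize_email(raw):
    """Trim whitespace and lowercase so 'Ana@Example.com ' matches 'ana@example.com'."""
    return raw.strip().lower()


def match_report(left_emails, right_emails):
    """Return (matched, only_left, only_right) as sorted lists of normalized emails."""
    left = {normalize_email(e) for e in left_emails}
    right = {normalize_email(e) for e in right_emails}
    return sorted(left & right), sorted(left - right), sorted(right - left)


# Hypothetical sample values standing in for the email columns of two exports.
billing_emails = ["Ana@Example.com ", "bo@example.com", "dee@example.com"]
survey_emails = ["ana@example.com", " BO@example.com", "cy@example.com"]

matched, only_billing, only_survey = match_report(billing_emails, survey_emails)
print("matched:", matched)
print("billing only:", only_billing)
print("survey only:", only_survey)
```

Running a report like this before the join makes it visible how many rows will silently drop out.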

adamwiggins@mas.to

Feels like a classic case of end-user programming (e.g. lightly technical product managers want to do this regularly). Curious what's out there that I haven't tried yet.

KevinMarks@xoxo.zone

@adamwiggins have a look at observablehq.com - very flexible graphing based on pulling in data from multiple formats

lucapette@hachyderm.io

@adamwiggins datasette.io ?

lucapette@hachyderm.io

@adamwiggins I pressed enter by mistake... when you combine datasette with sqlite-utils.datasette.io/ (it imports csv very nicely) you get a pretty productive workflow!

adamwiggins@mas.to

@KevinMarks Yes I've used it, but getting in CSVs is annoying. I had better luck with Observable Framework since I can just drop the CSVs into the same folder and read that way.

adamwiggins@mas.to

@lucapette Interesting, and true: one toolchain I forgot is using the sqlite console to load in CSVs and turn them into tables for querying! Not very repeatable or shareable though.
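
One way the SQLite route might be made more repeatable is to script the load step instead of typing it into the console. The sketch below assumes Python's built-in `sqlite3` module, not sqlite-utils or Datasette, and uses hypothetical table and column names (`billing`, `survey`, `user_id`, `credits_used`, `role`) with inlined sample data.

```python
import csv
import io
import sqlite3

# Hypothetical CSV exports, inlined so the sketch runs on its own.
EXPORTS = {
    "billing": "user_id,credits_used\n1,120\n2,30\n1,80\n3,50\n",
    "survey": "user_id,role\n1,researcher\n2,student\n3,researcher\n",
}


def load_csv(conn, table, text):
    """Create a table named after the export and insert every CSV row as text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")  # use a file path to keep the database around
for table, text in EXPORTS.items():
    load_csv(conn, table, text)

# Values are stored as text, so cast before summing.
QUERY = """
SELECT survey.role, SUM(CAST(billing.credits_used AS INTEGER)) AS credits
FROM billing
JOIN survey ON survey.user_id = billing.user_id
GROUP BY survey.role
ORDER BY survey.role
"""
results = conn.execute(QUERY).fetchall()
for role, credits in results:
    print(role, credits)
```

Checking a script like this into a repo alongside the saved queries gets part of the way to shareable, though it is still far from a point-and-click tool.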

jack@berlin.social

@adamwiggins I would recommend @avi’s DabbleDB 😉

marianoguerra@hachyderm.io

@adamwiggins different points in the "latent space":

- rilldata.com/
- duckdb.org/
- pola.rs/
- malloydata.dev/ (maybe?)
- trino.io/

lucapette@hachyderm.io

@adamwiggins repeatable yeah that's fair. it's a pretty ad-hoc workflow. As for shareable, I think you can publish datasette results but I have never used the feature so I'm not sure it'd cover your use case.

I have never used this myself but I wonder how far you could get with evidence.dev/

It's funny how difficult this use case is to cover well (especially compared to how often you end up doing csv "joining")

billseitz@toolsforthought.social

@jack @adamwiggins @avi maybe Steampipe? cc @judell

adamwiggins@mas.to

@jack @avi haha yep, if only

avi@cosocial.ca

@adamwiggins @jack ❤️🥲

Avancee

@adamwiggins heh.. this was literally an issue I had last week… didn’t solve it (ran away to an AI expo and saw vendors who also wished they did a very common item). Weird how few “low-code” tools there are for this which aren’t a simple file comparison

bjtitus@mastodon.social

@adamwiggins I sometimes use VisiData for one off things.

It has a bit of a problem with undiscoverable commands so I usually just search for what I want and find a guide online.

judell@social.coop

@billseitz @jack @adamwiggins @avi Steampipe can indeed join across CSVs and plugin-wrapped APIs, and share resulting analysis. For cleanup, though, yeah DabbleDB was very nice.

So is openrefine.org/ I think, though I haven't touched it in a long while.

ikesau@mastodon.nz

@adamwiggins nushell might be interesting, especially if you want to learn a bit of polars

it's a shell based around piping structured data to a library of modern & nice common functions. makes it super easy to do frictionless data exploration on the fly.

and it has a polars integration for more advanced wrangling

simonmic@fosstodon.org

Hi @adamwiggins … I think github.com/multiprocessio/data might be in this space and not yet mentioned. Also ultorg.com is quite interesting.

raiderrobert@mastodon.social

@adamwiggins two suggestions:
mode.com - if you want to keep using your other tools
posthog.com - if you're willing to move all in on them for survey + product metrics
