What's the best tool for lightweight data science that makes it easy to pull together tabular data (usually CSVs) from multiple sources?

Doing some number crunching for @elicitorg but I'm reminded this is a perpetual pain point in every early-stage company I've been at.

2024-05-14 7:49 am


adamwiggins@mas.to

I can export CSVs from all of these places, but importing then "joining" the data (by user ID or email address) is pretty clunky.

2024-05-14 7:49 am


adamwiggins@mas.to

Example scenario here is that I want to do some ad-hoc queries to understand credit usage as broken down by user demographic.

• Billing data is in a SQL database (Postgres / @Metabase)
• Signup survey is in @Typeform
• Product metrics are in @MixPanel

2024-05-14 7:49 am


adamwiggins@mas.to

The fastest method I've found to date is writing one-off Ruby scripts using CSV.read(), doing the query in memory.

Then output the results to the terminal, or print a summarized CSV for import into a spreadsheet to create a chart.

Is this really the best way?

2024-05-14 7:49 am
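
For readers who want to see the shape of that one-off-script approach, here is a minimal sketch in Python rather than Ruby, using only the standard library. The sample rows, the column names (`email`, `credits_used`, `role`), and the helper names are all assumptions made up for illustration, not anything from the actual exports. It joins billing rows to survey rows by email in memory, sums credits per segment, and writes a summarized CSV.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample exports; column names are assumptions for illustration.
BILLING_CSV = """email,credits_used
ana@example.com,120
bo@example.com,30
ana@example.com,80
cy@example.com,50
"""

SURVEY_CSV = """email,role
ana@example.com,researcher
bo@example.com,student
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))

def credits_by_segment(billing_rows, survey_rows, key="email", segment="role"):
    """Join billing to survey rows on `key` and sum credits per segment."""
    # Normalize the join key so "Ana@Example.com " matches "ana@example.com".
    lookup = {r[key].strip().lower(): r[segment] for r in survey_rows}
    totals = defaultdict(int)
    for row in billing_rows:
        who = row[key].strip().lower()
        # Users with no survey response land in an explicit "unknown" bucket.
        totals[lookup.get(who, "unknown")] += int(row["credits_used"])
    return dict(totals)

totals = credits_by_segment(read_rows(BILLING_CSV), read_rows(SURVEY_CSV))

# Emit a summarized CSV suitable for importing into a spreadsheet.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["role", "credits_used"])
for role, credits in sorted(totals.items()):
    writer.writerow([role, credits])
summary_csv = out.getvalue()
print(summary_csv)
```

With real exports, `read_rows` would take file contents instead of inline strings. The explicit "unknown" bucket is a design choice: it keeps users who never answered the survey visible instead of silently dropping them from the totals.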


adamwiggins@mas.to

Another approach is to export data from other sources and import into MixPanel via their User->Import from CSV. This... works, and MixPanel is good for ad-hoc queries, but the import itself is slow, manual, and error-prone.

2024-05-14 7:49 am


adamwiggins@mas.to

Also kind of amazing to me that spreadsheets (including Excel, Google Sheets, and Numbers) get painfully slow or stop working completely with a relatively small number of rows e.g. 20,000.

2024-05-14 7:49 am


adamwiggins@mas.to

Two I've used a lot are spreadsheets and various computational notebooks (e.g. @DeepnoteHQ with Python/pandas/numpy).

I always find I spend a huge amount of time just trying to get the CSV data imported, cleaned up, and joined before I can do any queries. Feels bad.

2024-05-14 7:49 am


adamwiggins@mas.to

Feels like a classic case of end-user programming (e.g. lightly technical product managers want to do this regularly). Curious what's out there that I haven't tried yet.

2024-05-14 7:49 am


KevinMarks@xoxo.zone

@adamwiggins have a look at observablehq.com - very flexible graphing based on pulling in data from multiple formats

2024-05-14 8:01 am


lucapette@hachyderm.io

@adamwiggins datasette.io ?

2024-05-14 8:04 am


lucapette@hachyderm.io

@adamwiggins I pressed enter by mistake... when you combine datasette with https://sqlite-utils.datasette.io/ (it imports csv very nicely) you get a pretty productive workflow!

2024-05-14 8:11 am


adamwiggins@mas.to

@KevinMarks Yes I've used it, but getting in CSVs is annoying. I had better luck with Observable Framework since I can just drop the CSVs into the same folder and read that way.

2024-05-14 8:48 am


adamwiggins@mas.to

@lucapette Interesting, and true: one toolchain I forgot is using the sqlite console to load in CSVs and turn them into tables for querying! Not very repeatable or shareable though.

2024-05-14 8:49 am
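
One way the SQLite route might be made more repeatable is to script the load step instead of typing it into the console. The sketch below uses Python's built-in `sqlite3` and `csv` modules. The `load_csv` helper, the table names, and the sample columns are hypothetical, and every column is stored as TEXT for simplicity, so numeric columns need a `CAST` in the query.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, text):
    """Create `table` from CSV text (all columns as TEXT) and insert its rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    # The reader yields the remaining rows, one list of strings per row.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)

# Hypothetical sample data standing in for the real exports.
conn = sqlite3.connect(":memory:")
load_csv(conn, "billing",
         "email,credits_used\nana@example.com,120\nbo@example.com,30\nana@example.com,80\n")
load_csv(conn, "survey",
         "email,role\nana@example.com,researcher\nbo@example.com,student\n")

# Join the two tables on email and total the credits per survey role.
query = """
SELECT survey.role, SUM(CAST(billing.credits_used AS INTEGER)) AS credits
FROM billing
JOIN survey ON survey.email = billing.email
GROUP BY survey.role
ORDER BY survey.role
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Pointing `sqlite3.connect` at a file path instead of `":memory:"` would leave a database file that can be handed to someone else or opened in another SQLite tool.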


jack@berlin.social

@adamwiggins I would recommend @avi’s DabbleDB 😉

2024-05-14 9:29 am


marianoguerra@hachyderm.io

@adamwiggins different points in the "latent space":

- https://www.rilldata.com/
- https://duckdb.org/
- https://pola.rs/
- https://www.malloydata.dev/ (maybe?)
- https://trino.io/

2024-05-14 9:33 am


lucapette@hachyderm.io

@adamwiggins repeatable yeah that's fair. it's a pretty ad-hoc workflow. As for shareable, I think you can publish datasette results but I have never used the feature so I'm not sure it'd cover your use case.

I have never used this myself but I wonder how far you could get with https://evidence.dev/

It's funny how difficult this use case is to cover well (especially compared to how often you end up doing csv "joining")

2024-05-14 9:53 am


billseitz@toolsforthought.social

@jack @adamwiggins @avi maybe Steampipe? cc @judell

2024-05-14 12:30 pm


adamwiggins@mas.to

@jack @avi haha yep, if only

2024-05-14 1:17 pm


avi@cosocial.ca

@adamwiggins @jack ❤️🥲

2024-05-14 1:18 pm


Avancee

@adamwiggins heh.. this was literally an issue I had last week… didn’t solve it (ran away to an AI expo and saw vendors who also wished they did a very common item). Weird how few “low-code” tools there are for this which aren’t a simple file comparison

2024-05-14 1:56 pm


bjtitus@mastodon.social

@adamwiggins I sometimes use VisiData for one off things.

It has a bit of a problem with undiscoverable commands so I usually just search for what I want and find a guide online.

2024-05-14 2:11 pm


judell@social.coop

@billseitz @jack @adamwiggins @avi Steampipe can indeed join across CSVs and plugin-wrapped APIs, and share resulting analysis. For cleanup, though, yeah DabbleDB was very nice.

So is https://openrefine.org/ I think, though I haven't touched it in a long while.

2024-05-14 2:51 pm


ikesau@mastodon.nz

@adamwiggins nushell might be interesting, especially if you want to learn a bit of polars

it's a shell based around piping structured data to a library of modern & nice common functions. makes it super easy to do frictionless data exploration on the fly.

and it has a polars integration for more advanced wrangling

2024-05-15 2:10 am


simonmic@fosstodon.org

Hi @adamwiggins … I think https://github.com/multiprocessio/datastation might be in this space and not yet mentioned. Also https://www.ultorg.com is quite interesting.

2024-05-15 9:39 pm


raiderrobert@mastodon.social

@adamwiggins two suggestions:
mode.com - if you want to keep using your other tools
posthog.com - if you're willing to move all in on them for survey + product metrics

2024-05-16 12:17 am


Micro.blog
