Top Stories

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

May 07, 2026 06:08 AM

Netflix's Model Lifecycle Graph is a centralized Metadata Service (MDS) that connects fragmented ML assets (models, features, pipelines, datasets, and experiments) across the entire company into a single, queryable graph. By ingesting real-time events, normalizing them with a unified URI-based model, enriching relationships, and storing them in Datomic + Elasticsearch, Netflix enables easy discovery, lineage tracking, impact analysis, and cross-domain reuse of models.
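
As a rough illustration of the impact analysis such a graph makes possible, here is a minimal sketch that walks downstream edges from one asset. The `LineageGraph` class, the `mds://` URI scheme, and the asset names are all hypothetical and chosen for illustration; they are not Netflix's actual model or API.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy in-memory lineage graph keyed by asset URI (hypothetical scheme)."""

    def __init__(self):
        # Maps an upstream URI to the set of URIs that directly depend on it.
        self._downstream = defaultdict(set)

    def add_edge(self, upstream, downstream):
        self._downstream[upstream].add(downstream)

    def impacted_by(self, uri):
        """Return every asset reachable downstream of `uri` (breadth-first)."""
        seen = set()
        queue = deque([uri])
        while queue:
            current = queue.popleft()
            for child in self._downstream.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


graph = LineageGraph()
graph.add_edge("mds://dataset/viewing_events", "mds://feature/watch_time_7d")
graph.add_edge("mds://feature/watch_time_7d", "mds://model/ranker_v2")
graph.add_edge("mds://model/ranker_v2", "mds://experiment/ab_1234")

# Everything that could be affected if the dataset changes.
impacted = graph.impacted_by("mds://dataset/viewing_events")
```

A real service would persist the edges and index them for search; the traversal itself is the part that answers "what breaks if this dataset changes?".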

DuckDB Internals: Why is DuckDB Fast?

May 07, 2026 06:08 AM

DuckDB is fast because it runs in-process, avoids server/client data movement, and combines columnar storage, query optimization, predicate pushdown, vectorized execution, and row-group pruning to scan only the data it needs. This post explains how DuckDB turns SQL into an executable plan and why its storage and Parquet-reading model make analytics feel unusually fast on a single machine.
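
To make row-group pruning concrete, here is a simplified sketch in plain Python. It assumes each row group carries min/max statistics for a column, and it skips any group whose statistics rule out a match. The `RowGroup` class and `scan_greater_than` function are illustrative names, not DuckDB internals.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list


def make_group(values):
    """Build a row group and record its min/max statistics."""
    values = list(values)
    return RowGroup(min(values), max(values), values)


def scan_greater_than(row_groups, threshold):
    """Return values > threshold, reading only groups that can contain a match."""
    matches = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # Statistics prove no value here can satisfy the predicate.
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_group(range(0, 100)),
    make_group(range(100, 200)),
    make_group(range(200, 300)),
]
matches, groups_read = scan_greater_than(row_groups, 249)
```

In this toy data only the last of the three groups is read, which is the same idea applied at far larger scale when a filter is pushed down into a columnar scan.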

From SSH to REST: A Security-Driven Modernization of Slack's EMR Data Pipelines

May 07, 2026 06:08 AM

Slack modernized its data pipelines by migrating more than 700 SSH-based operators on AWS EMR to a secure REST-based architecture, with zero downtime across 8 regions. The team replaced direct SSH access with Quarry, its internal REST job submission gateway, and used YARN's Distributed Shell to run arbitrary commands with proper resource management, reliable tracking, clean cancellation, and server-side lifecycle handling.

Building Self-Healing Data Pipelines at Halodoc

May 07, 2026 06:08 AM

Build targeted self-healing layers for recurring pipeline failures: CDC auto-restarts with safe checkpoint rewind, source-vs-lake consistency checks, size-aware mini-batching, Spark retry memory scaling, warehouse lock cleanup using query watermarks, and dependency-aware backfills. The design pattern is: alert first, validate eligibility, recover safely, measure impact. Results included CDC recovery dropping from 45+ min to <5 min and backfill setup from 4-8 h to <15 min.
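
One of these layers, retry with memory scaling, might look roughly like the sketch below. It assumes a job runner that raises `MemoryError` when it runs out of memory; the function name, parameters, and the 12 GB threshold in the fake job are hypothetical, not Halodoc's implementation.

```python
def retry_with_memory_scaling(run_job, base_memory_gb, max_memory_gb,
                              max_attempts=3, scale=2):
    """Re-run a job with more memory after each out-of-memory failure.

    Only MemoryError is treated as eligible for recovery; any other
    exception propagates immediately so it can be alerted on.
    Returns (job result, list of memory sizes tried).
    """
    memory_gb = base_memory_gb
    tried = []
    for attempt in range(1, max_attempts + 1):
        tried.append(memory_gb)
        try:
            return run_job(memory_gb), tried
        except MemoryError:
            next_memory = min(memory_gb * scale, max_memory_gb)
            # Give up when attempts are exhausted or memory cannot grow further.
            if attempt == max_attempts or next_memory == memory_gb:
                raise
            memory_gb = next_memory


def fake_spark_job(memory_gb):
    """Stand-in for a real job: needs at least 12 GB to finish."""
    if memory_gb < 12:
        raise MemoryError("executor out of memory")
    return "ok"


result, tried = retry_with_memory_scaling(fake_spark_job, 4, 32)
```

Capping the memory and the attempt count keeps the recovery safe, and returning the sizes tried gives a simple record for measuring the impact of each recovery.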

Can Agents Replace the Search Stack?

May 07, 2026 06:08 AM

A lightweight LLM agent, given basic retrieval tools (BM25 and/or embeddings), can outperform complex search backends and reranking pipelines, simplifying the search architecture. In experiments on Amazon ESCI data, agentic setups delivered big gains (NDCG from ~0.29 baseline to 0.41-0.45), with agents intelligently rewriting queries, exploring, and evaluating results.
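
For readers unfamiliar with the metric, here is a minimal sketch of NDCG using linear gain and a log2 rank discount. The summary does not say which NDCG variant or cutoff the experiments used, so treat this as one common formulation rather than the article's exact setup.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    `relevances` lists graded relevance labels in ranked order.
    Returns 0.0 when no result is relevant.
    """
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])   # Already in the best possible order.
swapped = ndcg([0, 1])      # The only relevant item is ranked second.
```

A score of 1.0 means the ranking matches the ideal ordering, so moving from roughly 0.29 to 0.41-0.45 means relevant items appear noticeably closer to the top.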

Beyond the hype: The enterprise AI architecture we actually need

May 07, 2026 06:08 AM

Enterprise AI is moving toward a federated stack: native AI inside systems of record like SAP, Salesforce, Workday, and ServiceNow; sovereign private models hosted on internal infrastructure; curated data lakes; and AI analytics layers that can federate queries across domains. Agent orchestration sits on top, with full traceability, timestamps, and auditability to satisfy compliance demands such as the EU AI Act. Two missing capabilities: a trusted marketplace for external agents using verifiable identities, and an employee intelligence layer that embeds AI into workspaces so users can query operational data without switching tools.

We're Missing Data: The Other Half of AI Transformation

May 07, 2026 06:08 AM

AI in data and engineering orgs is overfocused on tools and underinvested in the operating model needed to absorb them. Technical gains from coding agents, eval infra, and internal assistants are real, but without redesigning management, career ladders, team composition, trust mechanics, and communication norms, productivity typically rises for about 6 months and then plateaus. AI transformation is multiplicative, not additive: fund both the technical stack and the operating stack, or the investment will underdeliver.

S3 is the perfect place to store data, until you try to search it

May 07, 2026 06:08 AM

Firn is an open-source API for fast vector and full-text search on S3-backed data, using Lance plus caching to make repeated queries extremely fast. It's useful for teams that want searchable object storage without the cost or complexity of running OpenSearch.

Integrating AI Into Apache Kafka Architectures: Patterns and Best Practices

May 07, 2026 06:08 AM

When integrating LLMs with Apache Kafka, use Kafka strictly as a durable event backbone and keep all model inference outside the broker. Use one of three main inference patterns (external RPC, embedded models like ONNX/TFLite, or sidecar), and follow best practices for topic design (raw-events → enriched-context → model-outputs), replayability, dead-letter queues, idempotency, and cost/latency/governance considerations.
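
The idempotency and dead-letter ideas can be sketched without a Kafka client. In the sketch below, plain Python lists stand in for the `model-outputs` and dead-letter topics, and `infer` stands in for a model call made outside the broker. All names and the event shape are assumptions for illustration.

```python
def process_events(events, infer, seen_ids, model_outputs, dead_letters):
    """Handle each event at most once, routing failures to a dead-letter list.

    `events` is an iterable of dicts with "id" and "payload" keys
    (a hypothetical shape). `infer` is any callable that runs the model.
    """
    for event in events:
        event_id = event["id"]
        if event_id in seen_ids:
            continue  # Idempotency: a replayed or duplicated event is skipped.
        try:
            result = infer(event["payload"])
        except Exception as exc:
            # Keep the original event so it can be inspected and replayed later.
            dead_letters.append({"event": event, "error": str(exc)})
        else:
            model_outputs.append({"id": event_id, "output": result})
        seen_ids.add(event_id)


def fake_infer(payload):
    """Stand-in for an external model call; rejects empty input."""
    if not payload:
        raise ValueError("empty payload")
    return payload.upper()


seen_ids, model_outputs, dead_letters = set(), [], []
events = [
    {"id": 1, "payload": "hello"},
    {"id": 1, "payload": "hello"},  # Duplicate delivery.
    {"id": 2, "payload": ""},       # Will fail inference.
]
process_events(events, fake_infer, seen_ids, model_outputs, dead_letters)
```

In a real deployment the set of seen IDs would live in a durable store, since an in-memory set is lost on restart.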

How We Accelerated Transpilation by Compiling SQLGlot with mypyc

May 07, 2026 06:08 AM

Fivetran dramatically accelerated SQLGlot (the popular pure-Python SQL parser, transpiler, and optimizer) by compiling it with mypyc, a tool that turns well-typed Python code into fast C extensions. They ship the compiled version as an optional package that delivers ~5x faster parsing, ~2.5x faster SQL generation, and 2-2.5x faster optimization, while keeping the original pure-Python version as the default for maximum compatibility.

Implementing Statistical Guardrails for Non-Deterministic Agents

May 07, 2026 06:08 AM

Statistical guardrails, like semantic drift detection using cosine-distance z-scores against a safe baseline embedding and confidence thresholding using Shannon entropy on token probabilities, add an automated safety layer for non-deterministic agents.
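
Both guardrails reduce to a few lines of arithmetic. The sketch below assumes embeddings are non-zero numeric vectors and that a sample of cosine distances from known-safe outputs is available as a reference. The function names and the sample values are illustrative, not taken from the article.

```python
import math
from statistics import mean, stdev


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_z_score(embedding, baseline, reference_distances):
    """How many standard deviations this output's distance from the
    baseline sits above the mean distance seen for safe outputs."""
    distance = cosine_distance(embedding, baseline)
    return (distance - mean(reference_distances)) / stdev(reference_distances)


def shannon_entropy(probs):
    """Entropy in bits of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


baseline = [1.0, 0.0]
reference_distances = [0.01, 0.02, 0.03]  # Distances observed on safe outputs.
z = drift_z_score([0.0, 1.0], baseline, reference_distances)
uncertain = shannon_entropy([0.5, 0.5])
confident = shannon_entropy([1.0])
```

A guardrail would then compare `z` and the entropy against thresholds chosen from historical data, for example flagging outputs whose z-score exceeds 3.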

Redis Array Type: Short Story of a Long Development

May 07, 2026 06:08 AM

Redis Array is a proposed new data type, currently under review in a pull request, that supports numerical indexing natively as part of its semantics. It combines efficient sparse and dense representations with automatic internal reshaping to balance memory usage and performance. Suggested use cases include ring buffers, large indexed collections, and stored documents or files that need fast access, scanning, and search.

SAP to acquire data lakehouse vendor Dremio

May 07, 2026 06:08 AM

SAP's Dremio acquisition is a pragmatic bet on AI-ready enterprise data, using Iceberg-native federated access to unify SAP and non-SAP data without major migration.

Validate Smarter at the Row Level: A Four-Layer Approach

May 07, 2026 06:08 AM

A practical blueprint for selectively enforcing schema, format, business, and metric-specific checks with Pydantic.
