Top Stories

I/O Observability for Uber's Massive Petabyte-Scale Data Lake
IMAP

November 17, 2025 06:07 AM

Uber engineered a unified, zero-touch I/O observability system to monitor read/write patterns across its petabyte-scale data lake, spanning on-premises and cloud storage. By intercepting file system operations, it monitors 400K daily Spark apps, 2M Presto queries, and 6.7M YARN containers with a delay of under 5 minutes, enabling precise network egress attribution, cross-zone congestion detection, and dataset heat-maps for tiering.
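
To make "intercepting file system operations" concrete, here is a minimal sketch of the general idea, not Uber's implementation: a wrapper around a file object that counts bytes read and written per dataset. The class name `MeteredFile`, the dataset label, and the `stats` counter are all hypothetical names chosen for illustration.

```python
import io
from collections import Counter


class MeteredFile:
    """Wrap a binary file object and record bytes read and written."""

    def __init__(self, raw, dataset, stats):
        self._raw = raw          # underlying file-like object
        self._dataset = dataset  # hypothetical label used for attribution
        self._stats = stats      # shared Counter keyed by (dataset, metric)

    def read(self, size=-1):
        data = self._raw.read(size)
        self._stats[(self._dataset, "read_bytes")] += len(data)
        return data

    def write(self, data):
        written = self._raw.write(data)
        self._stats[(self._dataset, "write_bytes")] += written
        return written


def demo():
    """Read and write through the wrapper, then return the counters."""
    stats = Counter()
    reader = MeteredFile(io.BytesIO(b"hello world"), "trips", stats)
    reader.read(5)
    reader.read()
    writer = MeteredFile(io.BytesIO(), "trips", stats)
    writer.write(b"abc")
    return stats
```

Because the counting lives in the wrapper, calling code needs no changes, which is one plausible reading of "zero-touch". Aggregating such counters per dataset is what would feed egress attribution or a heat-map.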

How Yelp modernized its data infrastructure with a streaming lakehouse on AWS
IMAP

November 17, 2025 06:07 AM

Yelp overhauled its legacy data pipelines by adopting a streaming lakehouse architecture built with Apache Flink, Apache Paimon, Amazon MSK, and S3, slashing analytics data latency from 18 hours to minutes and reducing storage costs by over 80%. The migration replaced bespoke CDC formats and complex Kafka chains with SQL-accessible, versioned tables and community-supported standards, enabling flexible scaling, real-time CDC, and built-in data management features like schema evolution.

How We Made Our Internal Data Warehouse AI-first
IMAP

November 17, 2025 06:07 AM

ClickHouse transformed its internal data warehouse from a traditional BI-focused system into an AI-first setup that lets users get insights through natural language queries instead of SQL. Powered by LLMs (e.g., Anthropic's Claude models) and the Model Context Protocol (MCP), it handles ~70% of use cases and relieves analytics bottlenecks. The result, DWAINE (Data Warehouse AI Natural Expert), democratizes data access, cutting query times from 30-45 minutes to near-instant and reducing analyst workload by 50-70%.

650GB of Data
IMAP

November 17, 2025 06:07 AM

Testing DuckDB, Polars, and Daft on a 32GB EC2 node against a 650GB Delta Lake table demonstrated that single-node engines can efficiently process large lakehouse datasets: Polars completed a full aggregation in 12 minutes, DuckDB in 16, Daft in 50, and PySpark in over an hour. Distributed clusters, while still relevant, are no longer essential for such workloads thanks to larger-than-memory support and simple integration. Single-node frameworks deliver substantial cost savings, reduced operational complexity, and straightforward code, challenging conventional lakehouse architecture assumptions.
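
The core idea behind larger-than-memory aggregation can be sketched in plain Python: consume the data in batches and keep only running totals. This is an illustration of the principle, not how Polars, DuckDB, or Daft are implemented. The function names and the `(key, value)` row shape are assumptions made for the example.

```python
from collections import defaultdict


def read_batches(rows, batch_size):
    """Yield fixed-size batches so only one batch is held at a time."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def streaming_group_sum(rows, batch_size=2):
    """Sum values per key while reading (key, value) rows batch by batch."""
    totals = defaultdict(int)
    for batch in read_batches(rows, batch_size):
        for key, value in batch:
            totals[key] += value
    return dict(totals)
```

For example, `streaming_group_sum(iter([("a", 1), ("b", 2), ("a", 3)]))` never materializes the full input. Memory use is bounded by the batch size plus the number of distinct keys, which is why a 32GB node can aggregate a much larger table when the result itself is small.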

Organizing Code, Experiments, and Research for Kaggle Competitions
IMAP

November 17, 2025 06:07 AM

Robust code organization, meticulous experiment tracking, and reproducible environments are non-negotiable for successful data science projects. Using modular repo structures, version control, and tools like wandb and Hydra streamlines experimentation and collaboration across platforms.

Batch Vs Real-Time Data Pipelines – Do We Still Need To Pick?
IMAP

November 17, 2025 06:07 AM

While the debate between batch and real-time/streaming data pipelines persists, modern tools like Estuary allow seamless toggling between modes, letting teams run real-time for high-stakes needs and batch/micro-batch elsewhere to cut costs. This enables dynamic pipeline adjustments based on business value and use cases, with hybrid or "right-time" approaches offering flexibility to balance cost, complexity, and timeliness.

Scaling Data Products Starts With Fixing the Foundation: Five Lessons We've Learned
IMAP

November 17, 2025 06:07 AM

Most failures in scaling data products stem from weak foundational pipelines rather than flawed models or analytics. By treating pipelines like products, with clear ownership, versioned changes, SLOs, and standardized plumbing, teams can build reliable, observable data foundations that scale.

Waiting for SQL:202y: GROUP BY ALL
IMAP

November 17, 2025 06:07 AM

GROUP BY ALL is a new SQL feature that automatically groups by all non-aggregate expressions in the SELECT list, removing the need to repeat columns manually. It works well for simple queries but intentionally avoids handling ambiguous cases where expressions mix aggregates and non-aggregates. Changes to the SELECT list implicitly change the grouping, so it should be used with care, especially in complex queries.
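
The grouping rule described above can be illustrated by expanding it by hand. The sketch below takes a select list whose items are already marked as aggregate or not, derives the explicit GROUP BY, and runs the result on an in-memory SQLite database. It does not rely on any engine's native GROUP BY ALL support, and the helper `expand_group_by_all` and the `orders` table are hypothetical.

```python
import sqlite3


def expand_group_by_all(select_items):
    """Build an explicit GROUP BY from (expression, is_aggregate) pairs."""
    select_list = ", ".join(expr for expr, _ in select_items)
    # Every non-aggregate expression in the select list becomes a grouping key.
    keys = [expr for expr, is_aggregate in select_items if not is_aggregate]
    group_by = f" GROUP BY {', '.join(keys)}" if keys else ""
    return select_list, group_by


def run_demo():
    """Run the expanded query against a small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("eu", "a", 10), ("eu", "a", 5), ("eu", "b", 7), ("us", "a", 3)],
    )
    items = [("region", False), ("product", False), ("SUM(amount)", True)]
    select_list, group_by = expand_group_by_all(items)
    sql = f"SELECT {select_list} FROM orders{group_by} ORDER BY region, product"
    rows = conn.execute(sql).fetchall()
    conn.close()
    return sql, rows
```

Dropping `("product", False)` from `items` silently changes the grouping to `region` alone, which is the implicit coupling between the SELECT list and the grouping that the article warns about. The helper also sidesteps the ambiguous case by requiring each item to be flagged as wholly aggregate or not.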

Announcing BigQuery-managed AI functions for better SQL
IMAP

November 17, 2025 06:07 AM

BigQuery now supports AI.IF, AI.CLASSIFY, and AI.SCORE functions, enabling direct semantic filtering, classification, and ranking of unstructured data in SQL without manual prompt tuning or model selection. These managed AI functions use Gemini LLMs and integrate with WHERE, JOIN, GROUP BY, and ORDER BY clauses.

Capital One at EMNLP 2025: Trust and efficiency in AI
IMAP

November 17, 2025 06:07 AM

Capital One showcased several peer-reviewed advances at EMNLP 2025, including a multi-agent LLM framework for complex financial workflows (MACAW) and a data augmentation framework (GRAID), which boosts guardrail model F1 scores by 12%. Other contributions include a merged-embedding approach delivering 47.5% greater RAG consistency, TruthTorchLM for multi-method truthfulness evaluation, and activation-based confidence estimation enabling fast and trustworthy LLM deployment.

Reflections from JupyterCon 2025 — A Week of Ideas, Code, and Community
IMAP

November 17, 2025 06:07 AM

The Jupyter ecosystem has evolved far beyond notebooks: it's powering enterprise AI pipelines, reproducible research, and data-science applications at scale. From extension workshops to community sprints, what stood out most at JupyterCon 2025 was the human side, a reminder that thriving open source isn't just about code, it's about people.

Effective context engineering for AI agents
IMAP

November 17, 2025 06:07 AM

Context engineering has become critical for building reliable, long-horizon LLM agents, shifting focus from prompt optimization to actively managing the set of information ("tokens") sent to models at inference time. Effective strategies, such as compaction, structured note-taking, dynamic just-in-time retrieval, and sub-agent architectures, address context window constraints, minimize context rot, and improve agent coherence across complex workflows.
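
Of the strategies listed, compaction is the easiest to sketch: keep the most recent messages that fit a token budget and replace everything older with a summary. The code below is a minimal illustration under stated assumptions, not the article's implementation. `count_tokens` uses a whitespace word count in place of a real tokenizer, and `summarize` is a hypothetical placeholder for a model-generated summary.

```python
def count_tokens(text):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def summarize(messages):
    # Hypothetical placeholder: a real agent would ask a model for a summary.
    return f"[summary of {len(messages)} earlier messages]"


def compact(messages, budget):
    """Keep the newest messages that fit the budget; summarize the rest."""
    kept = []
    used = 0
    # Walk backwards so the most recent messages are retained first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    older = messages[: len(messages) - len(kept)]
    if older:
        return [summarize(older)] + kept
    return kept
```

With `compact(["a b c", "d e", "f g h i", "j"], budget=5)`, the last two messages are kept verbatim and the first two collapse into one summary line. Note that the budget here covers only the verbatim messages; the summary's own tokens come on top, so a stricter version would reserve room for it.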

Beyond Quantization: Bringing Sparse Inference to PyTorch
IMAP

November 17, 2025 06:07 AM

Sparse inference is emerging as a transformative optimization for LLM deployment.

Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage
IMAP

November 17, 2025 06:07 AM

The prolonged AWS US-EAST-1 outage of October 19 was caused by a latent race condition in DynamoDB's automated DNS management system.

Apply here
IMAP

November 17, 2025 06:07 AM

Remi Turpaud
