Interviewee: Dongxu Huang, Co‑founder and CTO of PingCAP; veteran infrastructure software engineer and architect. He previously worked at Microsoft Research Asia, NetEase Youdao, and Wandoujia. He specializes in distributed systems and database development, with extensive experience and distinctive insights in distributed storage. A passionate open‑source enthusiast and creator, his notable projects include Codis, a distributed Redis caching solution, and TiDB, a distributed relational database.

Interviewer: Jiang, contributor to Social Layer


Transcript & Translation: Carrie

Visuals & Editor: dca5 & Shiyu


Jiang: This event is part of Papers We Love, a distributed community I first encountered a few years ago at events in the U.S. They organize meetups in various cities to share and discuss academic papers. Later, back when I was still in Beijing, I got in touch with them to bring Papers We Love to China. Our events cover not only computer science papers but also insights on engineering and architecture. This year, I'm particularly interested in rethinking the transformations facing the software industry in its new phase of development, so this year's talks will generally go a bit deeper.

Today, we are honored to have Dongxu Huang from PingCAP. He will be sharing his experiences from a decade of building the company, as well as his perspectives on the current industry landscape and broader technological shifts.

Dongxu Huang: I am Dongxu Huang, Co-founder and CTO of PingCAP. Our company develops TiDB, a distributed relational database that is also a prominent open-source project in China. We are now in our tenth year; people in the Chinese tech community are likely familiar with TiDB, and many are active users.

I still define myself as an engineer who writes code daily, though now I lean more toward research. Over the past decade, I've worked primarily on distributed systems engineering, applying it to database scalability so that enterprise customers can handle data at scale.

Over the last six months, my interests and role have shifted a bit. In an era where AI is set to change almost everything, I've been leading a new, small team within PingCAP to research and experiment with agents and the intersection of data and AI. We're exploring what a database or data platform should look like in the age of AI and what crazy new possibilities might emerge. 

Personally, I've always been a typical open-source hacker who enjoys the outdoors. I moved to my current location in 2022 and now travel between China and the US. Today's setting is indeed very special: sharing with everyone from deep in the mountains.

Jiang: Let's start with PingCAP. Can you tell us how it all began 10 years ago? And why did you choose to focus on building a technical community and disseminating technical knowledge back then?

Dongxu Huang: When we founded the company a decade ago, the three co-founders were all engineers by background. Our motivation was not financial gain; rather, we were drawn to the challenge of building a distributed database, which is often considered one of the "three great romances" of programming, alongside operating systems and compilers. It is a challenge that any programmer who is passionate about their craft aspires to tackle.

Ten years ago, I was at my previous company, Wandoujia, a mobile application distribution platform. My role was to manage the data infrastructure. To put it bluntly, the company didn't have a dedicated DBA, so we had to maintain the MySQL databases ourselves.

The advantage, however, was Wandoujia's rapid business growth, which meant we quickly encountered scalability issues. Our platform was not just an app store; it also provided an iCloud-like service for Android users, enabling data backups that could be restored from the cloud onto a new device. The storage system behind this was partly Hadoop and HBase, but all the structured data was in relational databases.

We used MySQL, and every few months we had to re-shard the cluster. Because MySQL is a standalone database, managing large data volumes requires sharding, a practice that database professionals know to be notoriously painful.

With no DBA, only our current CEO, Max, and I were responsible for MySQL's scalability and maintenance. The SLA (Service Level Agreement) set by our management was extremely demanding, as any system downtime during a period of rapid growth translated directly to revenue loss. Consequently, we frequently worked through the night on expansions and maintenance, managing data and re-sharding, which was an arduous process.

Simultaneously, we faced significant friction with the business teams. They would question why our infrastructure limited the database's ability to perform JOINs, GROUP BYs, and other flexible SQL queries. Since it was a SQL database, users expected full SQL functionality. However, post-sharding, approximately 99% of SQL features became unusable, and queries required an unsightly sharding key. Application developers could not comprehend these technical constraints and frequently expressed frustration. We were caught between two pressures: we felt we were impeding business development and were often held responsible for limitations, while also finding the maintenance itself to be incredibly demanding.
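Editor's note: a minimal sketch of the constraint described here, assuming a hypothetical hash-sharded backups table; the schema and routing rule are illustrative, not Wandoujia's actual setup.

```python
# Illustrative application-level sharding across 16 MySQL instances.
NUM_SHARDS = 16

def shard_for(user_id: int) -> int:
    # Every query must carry the sharding key so it can be routed to one shard.
    return user_id % NUM_SHARDS

def get_backups(cursors, user_id: int):
    # Works: the sharding key pins the query to exactly one MySQL instance.
    cur = cursors[shard_for(user_id)]
    cur.execute("SELECT id, created_at FROM backups WHERE user_id = %s", (user_id,))
    return cur.fetchall()

# What breaks after sharding: anything that spans shards. A cross-user JOIN or
# GROUP BY has to fan out to all 16 instances and be re-aggregated by hand in
# application code, so most of SQL is effectively lost:
#   SELECT device_model, COUNT(*) FROM backups GROUP BY device_model;
```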

A third catalyst arose from maintaining our distributed cache, Redis. The situation with Redis was more severe. We developed a middleware solution to address its scalability limitations, which we later open-sourced as a project named Codis. The project gained rapid popularity because, at the time, Redis 2.x lacked a native clustering mode. Ours was one of the few open-source distributed Redis solutions available, and it achieved widespread adoption.

This experience was revealing. By solving a critical problem and open-sourcing the solution, we saw nearly every major internet company in China adopt our software. And Codis had been developed by just the two of us, Max and me, in about two weeks.

We recognized this as a powerful form of leverage that genuinely helped others. This led us to wonder why we couldn't apply the same open-source approach to MySQL's scalability problem. By doing so, we could help other data infrastructure engineers facing the same challenges with their OLTP systems. At the time, we were confident in our abilities, believing we could build any system. Thus, we decided to take on the database challenge.

The second impetus was the publication of two papers by Google in 2012 and 2013. The first was on Google Spanner, a landmark paper familiar to anyone in the database field. The second, from 2013, was on F1, which detailed the construction of a SQL layer atop distributed storage.

Those two papers inspired us a lot. From a contemporary perspective, they do not delve into granular implementation details but rather provide a high-level architectural overview. However, for us ten years ago, the takeaways were profound: first, the problem had already been addressed; second, Google had solved it. At a time when we were manually sharding, Google Spanner was already employing distributed horizontal scaling and appeared to transcend the CAP theorem. The most crucial insight was that this feat was achievable. Our reasoning was that if it was achievable, then as capable engineers, we should be able to build it as well. We decided then to pursue this goal through an open-source model.

Fortuitously, it was a golden age for venture capital in China. In 2014, numerous Chinese tech companies had successfully listed in the U.S., establishing a clear path to market. This gave rise to a new wave of well-capitalized, globally-minded VCs, similar to those in Silicon Valley.

Looking back, I had no familiarity with the venture capital process. Remarkably, when meeting with angel investors, we had no strategy for fundraising. I just went to a whiteboard and drew the system architecture for Google Spanner, explaining our proposed implementation of algorithms like Raft, the rationale for a SQL interface, its inherent challenges, and why building from scratch was necessary. I later learned that the angel investors did not grasp the technical specifics. They were, however, impressed by the vision and our track record with Codis. As a result, we successfully raised over a million dollars to begin our work.

Regarding marketing, as three programmer-founders, we had no experience. Our initial reaction was to study how other companies promoted their products. I was dismayed to find that most of what I saw were merely advertorials.

On the Chinese internet a decade ago, many marketing teams lacked a technical background. From an engineer's perspective, their content was promotional and lacked substance. This convinced me that we should produce content that would resonate with an engineering audience. First, our expertise lay in building an open-source community and transparently documenting our development process.

Second, we perceived an unmet need in the market for a source of substantive, in-depth technical content. While this approach has become commonplace for tech projects today, it was quite distinctive ten years ago.

The third reason was personal preference. I have always valued the sense of community found in gatherings like this one—the human connection, shared interests, and exchange of ideas. In its early days, PingCAP hosted weekly offline meetups. The format was informal; we would share progress on system features or discuss interesting academic papers, much in the spirit of Papers We Love.

In summary, there was no overarching strategy. We pursued what we were passionate about, and it resonated with our target audience. As we were building a database, a tool for engineers, our community naturally comprised our potential customers. The process evolved organically.

Jiang: Why did you choose to build it from scratch instead of modifying MySQL's codebase, like forking their storage engine?

Dongxu Huang: Actually, my first instinct was the same as your question. In the very first week of PingCAP, I began by writing a custom MySQL storage engine. Both MySQL and PostgreSQL provide an interface for the underlying storage layer, so my initial thinking was to build a distributed storage layer and connect it to MySQL's interface to create a distributed MySQL.

However, I quickly encountered the first major obstacle: the fundamental limitation of standalone databases like MySQL and PostgreSQL lies in their SQL optimizers. The user interface is always SQL. We were never interested in building a NoSQL key-value store, as SQL-based systems remain the most commercially viable. I also had no interest in developing OLAP systems like Spark or ClickHouse; despite their prevalence, such systems have limited profitability. OLTP is where the real money is, as companies like Oracle demonstrate. But the problem is that once the data volume grows, the SQL execution engine and optimizer of an OLTP system are simply not designed for distributed storage.

For example, consider a SELECT COUNT(*) query. A MySQL optimizer treats a table scan as a monolithic, single-node operation. However, in a distributed storage system where you know the data range on each machine, the execution plan for this query should be inherently distributed. It should be distributed as an aggregation task to the underlying nodes, with the results returned for a final aggregation. This is a basic example, but it highlights a broader pattern. It became clear that modifying a two-decade-old system to accommodate such optimizations was infeasible. To achieve a higher ceiling for future development, we had to build it from scratch.
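Editor's note: a rough sketch of the contrast being drawn, with made-up storage nodes; a real system pushes far more than COUNT down to the storage layer, but the shape is the same: partial aggregation where the data lives, final aggregation at the coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

class StorageNode:
    """Hypothetical node owning one contiguous key range of the table."""
    def __init__(self, rows):
        self.rows = rows

    def partial_count(self) -> int:
        # Pushed-down aggregation: executed where the data lives.
        return len(self.rows)

def distributed_count(nodes) -> int:
    # Coordinator: fan out the COUNT, then merely sum the partial results.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda n: n.partial_count(), nodes))

# A single-node optimizer would instead plan one monolithic table scan,
# pulling every row to one machine before counting.
nodes = [StorageNode(rows=list(range(n))) for n in (1000, 2000, 3000)]
print(distributed_count(nodes))  # 6000
```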

This was an insight affirmed by the Spanner and F1 papers. In an interview years ago, Jeff Dean was asked why Spanner was built from scratch and offered a similar rationale: when modifying a legacy system, one is invariably constrained. We chose the more difficult path of building from scratch because we believed the potential return and architectural ceiling would be significantly higher. That is the official rationale. On a personal note, the MySQL codebase is notoriously ugly, and I have a strong aversion to C and C++. We wanted to build our company with a language we enjoyed, which led us to Go and Rust.

Jiang: Over these ten years of building a database, what do you consider the most important moments or the best decisions you made?

Dongxu Huang: The first pivotal decision was the one just discussed: not to build upon the MySQL codebase. I remain convinced this was the correct choice, as building from scratch afforded us far greater flexibility.

The second decision, which in retrospect I might reconsider, was our unwavering choice to adopt MySQL's wire protocol. I am averse to systems like MongoDB that introduce proprietary syntax. For many customers, migration costs present a formidable challenge. Our choice—to be compatible with MySQL without using its codebase—was correct, as it accelerated our time-to-market by at least two years by allowing us to leverage an existing ecosystem.

The third important decision, which followed a challenging period, was to abandon our initial HBase-based prototype. We had spent over six months building a transaction layer on HBase, but I made the irreversible decision to develop our own storage layer from the ground up. This became TiKV, our distributed transactional key-value database written in Rust, which is now used by many. This was another crucial turning point.

A more recent major decision was pivoting the company's entire strategy to a cloud-centric model, which involved redesigning the TiDB kernel for the cloud. This has proven to be an unequivocally correct move.

The most recent decision, whose outcome remains to be seen, is our investment in AI, which I mentioned earlier. I believe this will be another critical decision for the next three years.

Jiang: You mentioned the cloud, which is a really important point. From 2016 to 2018, cloud computing grew incredibly fast globally and in China, and it was also the period when Kubernetes and related cloud-native concepts exploded.

Dongxu Huang: Our journey to embrace the cloud involved a significant learning curve. When people say "cloud native", they often mean deploying with Kubernetes. My initial assumption was that embracing the cloud simply meant providing a Kubernetes operator and an automated cloud service. This proved to be a profound misconception.

I see two stages of cloud adoption. The first is automated deployment at the DevOps level, characterized by providing tools like a Kubernetes operator. But this is insufficient. The more critical aspect of embracing the cloud is whether the system's architecture is fundamentally designed for cloud infrastructure. When you build database software—be it TiDB, Redis, or MySQL—it is an application. Kubernetes is merely a tool for deploying that application in a distributed environment.

However, if the guiding assumption shifts to designing a database service built on cloud infrastructure, the architectural choices change entirely. For instance, I would not rely on local disks, because a cloud service would be architected around object storage and other native cloud primitives.

Second, the primary assumption for such a service is that it must be inherently multi-tenant and support virtualized tenants. It is infeasible to provision dedicated resources for, say, a million customers; the cost would be prohibitive for both the provider and the user. The system must be designed as a collection of microservices.

These are two fundamentally different design philosophies. The cloud-centric decision I mentioned pertains to this latter philosophy. The former, focusing merely on Kubernetes, was based on what I now recognize was a rudimentary understanding of cloud architecture at the time.

Jiang: What stage of development do you believe cloud computing is in today?

Dongxu Huang: Our perception of the cloud has evolved beyond viewing it as a mere collection of virtual machines. The initial cloud computing paradigm mirrored that of the traditional data center (IDC), where the cloud simply hosted servers. One was abstracted from the physical hardware but still interacted with a Linux box.

Today, the focus for many cloud users has shifted away from the VM itself. With a service like Vercel, for example, one is abstracted away from the physical location of the servers entirely. The service provides code hosting and seamless scaling on demand.

I believe cloud computing has matured from IaaS to providing robust PaaS layers, which are now sufficiently developed to support a new generation of third-party service providers like us. My working assumption about S3 is that its capacity is effectively infinite. I am not concerned with the underlying machine count; the critical factors I depend on are its SLA, durability, throughput, and consistency. We build our systems upon these abstractions. Both compute and storage are rapidly undergoing this kind of service-ization.

The next stage will likely be even more abstract. Currently, compute is still represented by code, and storage by databases or object stores. In the next era, I believe the paradigm will be agentic. The interface will become progressively simpler and more abstract, and natural language appears to be the logical endpoint for this abstraction.

Jiang: What is the current state of the database industry? Considering recent events like acquisitions by Snowflake and Databricks, and OpenAI's acquisition of Rockset, what trends are you seeing?

Dongxu Huang: I believe the past five years have been a period of relative stagnation for the database industry, with the primary focus on commercialization. Aside from the paradigm shifts introduced by cloud infrastructure, there has been little that I found genuinely exciting.

This year, however, is different. Before discussing trends, let's identify the core problem with databases today: their fundamental issue is their human-centric design. They are built for programmers using interfaces like SQL, Java, and Python. One must learn, code against, and use their APIs. The primary users are still engineers. Whether it is Snowflake, Databricks, TiDB, or OceanBase, the interface is programmatic, and the programmer is human. This is the central limitation. It may seem radical, but I predict that soon, AI agents will become the predominant users, potentially accounting for 90% or more of interactions. Our design focus must shift from creating interfaces for humans to creating interfaces for AI.

One might ask about the difference between human and agent database access. The distinction is substantial. Through my recent work on agentic systems, I have observed that agents interact with databases in a fundamentally different manner.

The most significant difference is that historical data interfaces were static. A RESTful API, for instance, would encapsulate predefined CRUD operations. In contrast, agents generate vast amounts of transient data and execute numerous ad-hoc queries. In a prototype I've built, I've integrated my personal data and allowed an AI agent to query it using various tools, including SQL, via MCP (Model Context Protocol). A natural language request like "Identify emails from the last 10 days requiring an immediate response" prompts the agent to generate ad-hoc SQL or even a Python script to retrieve the necessary data. This data is not necessarily vectorized. The term "AI database" is often mistakenly equated with vector databases; however, vector indexing constitutes only a minor component of a comprehensive AI data platform.
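Editor's note: to make the example concrete, here is the kind of ad-hoc SQL an agent might emit for that request; the mails table and its columns are hypothetical, not the schema of the prototype described above.

```python
# Generated on the fly by the agent rather than served from a predefined REST endpoint.
SQL = """
SELECT id, sender, subject, received_at
FROM   mails
WHERE  received_at >= NOW() - INTERVAL 10 DAY
  AND  replied = FALSE
ORDER BY received_at DESC;
"""
# The agent can then post-process the rows, or hand them to another tool
# (e.g. semantic search over the message bodies) to judge which ones truly
# need an immediate response.
```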

My system allows the agent to perform relational queries with SQL, full-text searches, semantic searches, and knowledge graph traversals. The result of a query is frequently synthesized from multiple data sources, representations, and types.

So, to answer your question about the future of databases: my perspective is as follows. First, SQL as we know it will become obsolete. The future database interface will resemble that of models like ChatGPT, accepting natural language as well as raw data formats like JSON, relational tables, and binary data. The system will intelligently determine the optimal storage method based on the ingested data.

The primary interface could be termed save or memorize—a flexible API for storing information. The second interface, chat or query, would allow for versatile questioning of the data via natural language, SQL, or other modalities. These two interfaces would be powered by a suite of back-end MCP tools for data manipulation, knowledge graph construction, vectorization, and full-text search. These underlying mechanisms will be abstracted away behind the agent's toolset. This is my high-level thesis on the evolution of data infrastructure interfaces, a direction that leads to many fascinating conclusions.
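Editor's note: a minimal sketch of the two interfaces described here; the class and method names are illustrative assumptions, not an existing PingCAP API.

```python
class AIDataPlatform:
    """Illustrative facade: two user-facing calls, many back-end tools behind them."""

    def save(self, data) -> None:
        # "memorize": accept JSON, rows, text, or binary; the platform decides
        # whether to store it relationally, vectorize it, index it for full-text
        # search, or link it into a knowledge graph.
        print(f"stored: {data!r}")

    def chat(self, question: str) -> str:
        # "query": natural language (or SQL) in; an orchestrating agent picks the
        # back-end tools and synthesizes an answer from whatever sources it touched.
        return f"(answer to: {question})"

db = AIDataPlatform()
db.save({"type": "email", "from": "alice@example.com", "subject": "Q3 report"})
print(db.chat("Which emails from the last 10 days still need a reply?"))
```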

Jiang: Will we see MCP-native databases?

Dongxu Huang: I believe MCP is more relevant at the tool layer. Exposing capabilities as MCP tools to the large model is sufficient. My approach is to provide a rich set of tools via MCP, ranging from simple insert and query interfaces to more complex ones for tasks like knowledge graph construction. These tools will be orchestrated by an LLM agent.
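Editor's note: as a concrete illustration, and assuming the FastMCP helper from the official MCP Python SDK, exposing such tools might look roughly like this; the tool names and bodies are placeholders.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-platform")

@mcp.tool()
def run_sql(query: str) -> str:
    """Run an ad-hoc SQL query against the relational store."""
    return "[]"  # placeholder: execute the query and return rows as JSON

@mcp.tool()
def build_knowledge_graph(document_id: str) -> str:
    """Extract entities and relations from a stored document into the graph."""
    return "ok"  # placeholder: a more complex, multi-step tool

if __name__ == "__main__":
    mcp.run()  # an LLM agent discovers and orchestrates these tools over MCP
```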

Jiang: So, the key may be enabling agents to query databases directly, constructing their own queries, while also potentially exposing that query capability, similar to GraphQL?

Dongxu Huang: It is crucial not to become fixated on specific protocols, as these will be determined by the agent or the LLM. The goal should be to provide maximum flexibility—access to file systems, databases, and a variety of tools—and allow the large model to determine their usage. This is why I do not define PingCAP solely as a database company. The agent platform I am currently developing, for instance, exposes only a file system interface at present.

Jiang: Regarding AI, what's the vibe like in Silicon Valley right now? I was there from late May to early June, and it felt very different from last year. AI is a dominant and fervent topic.

Dongxu Huang: Exactly. This wave represents a significant window of opportunity. From a non-technical standpoint, the U.S. is anticipating interest rate cuts, which is beneficial for entrepreneurs as it increases the flow of speculative capital. When interest rates were above 4%, capital could generate solid returns with minimal risk. Lower rates incentivize investment in higher-growth, hype-driven sectors, which is currently AI. While certain market signals, such as exorbitant recruitment offers, are not sustainable, they indicate a bullish phase for the technology sector. Building during such a period offers two advantages: increased visibility and higher valuation premiums.

For example, a database startup today would struggle to achieve the valuation multiples of 2021, even with comparable performance. In contrast, an AI company with $10 million in revenue might command a valuation exceeding $1 billion. Consequently, this is an opportune moment for fundraising, founding a new venture, or an exit. Given the current economic cycle, ideas must be pursued with urgency.

From a technical standpoint, however, I maintain a more cautious perspective. While I can afford to conduct research, I do not believe it is the time for maximal investment. I have two hypotheses.

First, although the industry is at a peak, I believe the cost of computation must decrease by another order of magnitude before truly transformative applications can emerge. While costs have already fallen significantly, I anticipate another order-of-magnitude reduction over the next two to three years. Therefore, building foundational infrastructure now may be premature.

Second, AI is clearly in a hype-driven phase technically, as it has yet to produce a killer consumer application—its "iPhone moment." The current focus remains predominantly on B2B applications.

Claims of rapid ARR growth from AI applications should be viewed with skepticism. While some may be true, such trajectories are unlikely to be sustainable. Even if these companies ultimately succeed, their future product offerings will be fundamentally different from their current forms.

From a commercial standpoint, the AI field is flush with capital and hype. Technically, however, it remains immature. This is the rationale for my current focus on research: the work we are doing today is building the infrastructure for the applications of three years from now.

Jiang: In the domain of AI engineering and products, apart from your work in databases, are there any other projects or sectors you find particularly interesting?

Dongxu Huang: The domain of LLM pre-training is outside my focus, and I am skeptical about its long-term prospects as an independent field. I do not consider training or inference infrastructure to be high-growth sectors. While heavily funded by large corporations like Meta, I believe this area will eventually converge on a few dominant models. The intelligence provided by foundation models will become a commodity, akin to electricity or water—universally available and inexpensive. Therefore, perfecting the means of production becomes irrelevant when the utility is commoditized.

The true value in the AI domain lies in data and context. Intelligence cannot be achieved through reasoning alone. Genuine intelligence emerges from interaction with the external world, learning through a feedback loop, which transforms raw intelligence into useful capability.

Therefore, the agentic paradigm is sound, and this is becoming an industry consensus. Foundation models like GPT-4o have reached a threshold of capability that makes agents practical. Furthermore, the industry recognizes that accomplishing complex tasks with AI requires leveraging user context and effective planning. Modern tools like Claude Code employ multi-agent, multi-LLM, multi-round orchestration, a departure from the one-shot conversations of early models.

The competitive advantage, or "moat," lies in organizing context, providing relevant data, maintaining state, and ensuring the AI allocates its token budget toward broader state exploration. These engineering challenges are not yet fully appreciated. Key future infrastructure will include: first, a search framework for exploring multiple state-spaces; second, efficient state management, including representation, compression, and storage. This is a database problem. Simple branching represents only a rudimentary form of state management. A general-purpose search framework, robust state representation, and sandboxing environments will be critical for future agent architectures.

Memory is just one small component of this system. Consider an AI playing chess. One approach is to select the single best move from the current state. A more intelligent agent would identify several promising paths and explore them further. Similarly, current coding assistants guide the user down a single path, which may not lead to the optimal outcome. A better approach would be for the AI to explore multiple solution branches for several steps and present the divergent outcomes to the user, who could then select a branch for further development. What is truly required is a tree search methodology. While this entails an exponential increase in token and storage consumption, under my prior assumption of commoditized compute and storage, this advanced search becomes not only feasible but necessary.
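Editor's note: a hedged sketch of the tree-search idea under the assumptions above; expand() and score() are placeholders for LLM calls and state evaluation, and a real agent would persist the frontier in a database rather than in memory.

```python
import heapq
import random
from dataclasses import dataclass, field

@dataclass(order=True)
class Branch:
    neg_score: float                   # heapq is a min-heap, so store -score
    state: str = field(compare=False)  # e.g. a code diff or a chess position

def expand(state: str) -> list[str]:
    # Placeholder: a real agent would ask the LLM for several candidate next
    # steps from this state, spending tokens on breadth instead of one path.
    return [f"{state}->{i}" for i in range(3)]

def score(state: str) -> float:
    # Placeholder: evaluate a state (tests passing, a position evaluation, ...).
    return random.random()

def tree_search(root: str, depth: int, beam: int) -> list[str]:
    frontier = [Branch(-score(root), root)]
    for _ in range(depth):
        children = [Branch(-score(s), s)
                    for b in frontier for s in expand(b.state)]
        # Keep only the most promising branches instead of committing to one path.
        frontier = heapq.nsmallest(beam, children)
    return [b.state for b in frontier]  # divergent outcomes to present to the user

print(tree_search("start", depth=3, beam=2))
```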

Jiang: What is your perspective on current AI coding tools like Cursor and the concept of "vibe coding"?

Dongxu Huang: I fully embrace these developments; AI now generates approximately 90% of my code. However, I believe we must go further. Programmers may still operate under the illusion that AI is merely an assistive tool and that they remain the primary authors. This, in my view, is the core issue with products like Cursor. To offer a contrarian viewpoint, I believe Cursor's strategic position is precarious because its product is designed from a human-centric perspective. It remains an IDE that assists programmers.

In contrast, tools like Claude Code and Devin are architected from an agent-centric perspective, where the AI is the primary actor. The distinction is critical. From an AI-centric viewpoint, AI's capabilities in many programming tasks will far surpass those of humans. Imposing human constraints and inputs often hinders the AI's potential.

The critical distinction is whether a product is designed from an AI-centric or a human-centric perspective. This is the most significant classification for AI programming tools. My conviction is to minimize human interference and maximize the scope of AI code generation. This is the direction of the future.

When code production becomes an order of magnitude more efficient, should our focus not shift to designing products that leverage this 10x productivity gain? A future database must be an order of magnitude better than current systems to be considered adequate. The goal is not to use AI to write legacy programs more efficiently but to raise the ceiling of what is possible. The exact form of this "10x programmer" is still unknown, but ambitious thinking is now more critical than ever.

Jiang: How do you handle the long-term maintainability of AI-generated programming projects?

Dongxu Huang: The fundamental question is, who is the agent of maintenance? I am largely aligned with the philosophy behind Devin: if code is generated by an AI, that AI should be responsible for its entire lifecycle. The premise that AI-generated code requires human maintenance is inherently flawed.

The current limitation is that AI has not yet closed the operational loop. AI generates code, which humans then deploy and debug. But we must consider whether we have given AI the opportunity to manage the entire lifecycle: hosting, debugging, maintenance, and releases. Current AI capabilities are insufficient, but this is a practical, not a theoretical, limitation.

My point concerns a fundamental shift in mindset. I am a strong proponent of the autonomous approach. I have more radical ideas that I have yet to introduce to my team. Currently, our research team is still human-led, with AI as a powerful assistant. The next stage, I believe, should involve AI writing all code within a closed environment. Every individual's role shifts to that of a "product manager" or "AI trainer." At some point in the future, I may even prohibit human-written code within this context.

Questioner: I have a thought on this. Software engineering, whether conducted by humans or AI, is fundamentally about controlling entropy. Both can produce flawed or convoluted code. The challenge, then, may be to design an engineering system that provides quality control for all agents, human or artificial. This cannot be solved simply by assuming AI will become hyper-intelligent, as even highly intelligent human teams can produce chaotic complex systems.

Dongxu Huang: Regarding complexity and entropy control, human capabilities are inherently limited compared to a well-designed AI system. Decades of software engineering have shown that the primary bottleneck in complex systems is often the human element; our capacity for managing complexity is finite. Conversely, I am highly optimistic about AI's potential to comprehend and manage the entropy of such systems far more effectively than humans ever could.