Conversation with RoboScience’s Shao Lin: The VLOA Large Model Breaks Through the Generalization Bottleneck in Embodied Intelligence

Release Date: 2025-08-25 16:28

Shao Lin’s WeChat profile picture is an old photograph.

In the photo, taken in Stanford University’s AI Lab, a Franka robotic arm fitted with a Schunk gripper steadily lifts an apple. He says the image instantly takes him back to his days as a PhD student: debugging, calibrating, and training on data over and over again until the robot’s hand learned “how to pick up.”

Ten years ago, Shao Lin’s research, known as “cross-embodiment grasping,” marked an important step in bringing robots from the laboratory into the real world.

At that time, robotics research was still far from breaking into the mainstream, yet Shao Lin and his classmate Tian Ye, who was working in Andrew Ng’s lab, had already begun tirelessly refining their ideas around one central question: when would robots truly make their way into ordinary homes? Tian Ye, a native of Sichuan, is a skilled cook of Sichuan cuisine and often treated Shao Lin to Sichuan dishes in the Bay Area. “Even now, recalling those days brings with it the aroma of Sichuan food,” Shao Lin said with a smile, reminiscing for “Jiazi Light-Year.”

Today, Shao Lin is an assistant professor at the National University of Singapore and also the co-founder and chief scientist of RoboScience, a company specializing in embodied intelligence. He studied under Jeannette Bohg, with Leonidas J. Guibas serving as his co‑advisor, and he is also the only chair of the IEEE Technical Committee on Robot Learning based in Asia.

Meanwhile, Tian Ye, who used to treat Shao Lin to Sichuan cuisine, is the co-founder and CEO of RoboScience, and previously served as the technical lead for Apple’s on-device machine learning platform team.

A group photo of Shao Lin (left) and Tian Ye (right).

After so many years of friendship, the two have developed a natural rapport: “sometimes, just one look is enough to know what the other person is thinking.” They have also repeatedly confirmed a shared direction: building technology with warmth, putting people at the center, and ensuring their products address real-world challenges rather than merely looking impressive on paper.

Their technology and products have also earned recognition from investment firms. On July 30 this year, RoboScience announced the completion of a nearly RMB 200 million angel-round financing, led by JD.com, with participation from China Merchants Capital and SenseTime Guoxiang Capital, while existing investor 01 Venture continued to invest.

But we have a question: since the two have known each other for ten years, why didn’t they set out to found RoboScience sooner?

Shao Lin’s answer was “the right timing, the right place, and the right people.” Rather than simply following what others are doing, they first rigorously validate the underlying technologies and approaches, develop a long-term roadmap, and assess feasibility from multiple perspectives.

The real turning point came in 2024: advances in large models brought “generalization” into sharp focus, prompting systematic discussions on how to design decision‑making systems that endow embodied intelligence with capabilities as broad as those of ChatGPT. The two shared a common vision—“human‑centered, technology with warmth”—and maintained frequent communication.

Beyond the hype, Shao Lin still regards “implementation” as a key word.

His criteria are straightforward: the existing technology must be able to operate reliably in the short term and deliver sufficient commercial returns. As for everything else, the two will stick to the pace they set years ago: bringing that apple from the lab bench into the real world, and turning those oft‑repeated words at the dinner table into concrete applications and everyday life.

In this article, “Jiazi Light-Year” interviews Dr. Lin Shao, Assistant Professor at the National University of Singapore and Co-founder and Chief Scientist of RoboScience.

1. On the model: The VLA should be regarded as a decision‑mapping function from input to output, rather than getting bogged down in conceptual debates.

Jiazi Light-Year: Let’s get straight to the point—competition in the embodied intelligence space is fierce. So, what exactly is RoboScience aiming to achieve?

Shao Lin: We primarily focus on developing embodied intelligent systems with general-purpose capabilities, aiming to bring robots into everyday households and enable them to perform diverse, complex tasks in real-world settings.

Jiazi Light-Year: In the field of LLMs (large language models), having ample, high-quality data typically leads to better model performance. However, in the realm of embodied intelligence, this doesn’t seem to hold true—why is that?

Shao Lin: The inescapable topic when it comes to next-generation AI models is data.

VLMs (vision-language models) and LLMs differ fundamentally in the data formats they process: CV works with pixels in images, while NLP relies on tokenization (breaking text into characters, words, or subwords), which gives it a more uniform format. This is what gives rise to the scaling-law phenomenon: as more data is mapped into the same coordinate system, larger models can be designed, yielding better performance and ultimately a coherent training framework.

Embodied intelligence, of course, aspires to replicate the success of large models in computer vision (CV) and natural language processing (NLP). However, the diversity of data in embodied intelligence far surpasses that of these two domains. If the issue of data‑format standardization is not addressed and the CV, VLM, or LLM paradigms are applied directly, numerous challenges will arise.

Jiazi Light-Year: In what ways does the data diversity of embodied intelligence manifest itself?

Shao Lin: First, there’s task diversity. As household robots enter real-world environments, we expect them to handle a wide range of activities—serving tea, pouring water, doing laundry, and preparing meals. These tasks vary significantly in nature. Second, there’s object diversity. In the home, robots must manage flexible objects—for example, folding clothes—requiring an understanding of their deformable properties; opening and closing doors involves hinge dynamics; rigid objects present yet another category; and deformable objects themselves come in 1D, 2D, and 3D forms, each with distinct physical characteristics. Moreover, objects differ in their geometric shapes. For robots to learn how to use tools and manipulate objects, they need to grasp these geometric distinctions. At the same time, there’s also diversity in the robot’s physical embodiment—what we call “cross-embodiment.” Today, hardware designs are highly varied: end-effectors can have two, three, or five fingers, and their actuation mechanisms and structural configurations differ as well.

These circumstances make the development of large-scale embodied‑intelligence models more complex. Since task execution ultimately depends on the robot’s physical body, the model must be compatible with diverse hardware platforms. Consequently, enabling robots to account for these characteristics renders data collection, preprocessing, and learning significantly more challenging.

If we are to develop a unified large model, it must encapsulate three key sources of diversity: tasks, objects, and the robot’s own body. We need to design a unified data format that projects disparate data into a common coordinate system, and then build a corresponding training paradigm on this foundation—only then can we achieve maximal generalization. This is the core challenge; it demands careful, in-depth consideration, rather than simply transplanting insights from computer vision or natural language processing.

Jiazi Light-Year: What is your solution?

Shao Lin: To continue the point about a unified data format: such standardization is a prerequisite for building large-scale embodied‑task models. Only by establishing a universal representation can we integrate vast amounts of data into a single framework and fully unlock the knowledge they contain.

A unified coordinate system is critical. An object’s trajectory can be represented in a relatively consistent data format. Particularly in manipulation scenarios, the core task is for the robot to manipulate or make contact with an object, apply forces to it, and alter its motion state, thereby transitioning the object from state 1 to state 2.

This state transition is, at its core, a change in the object’s trajectory. The most direct manifestation of embodied manipulation is the alteration of an object’s shape and position in three-dimensional space, which we refer to as the object trajectory.

The object trajectory can provide a relatively comprehensive description of various embodied manipulation tasks. Centered on the object trajectory, the framework can be progressively expanded: different robots apply distinct actions to different objects, yielding varied trajectory patterns that characterize distinct tasks. The diversity of tasks, objects, and agents can all be systematically articulated through object properties, ultimately giving rise to a network that encompasses the multifaceted variations across these three levels.
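One way to picture such a unified format is as a record keyed on the object rather than on the robot. The sketch below is illustrative only: the `ObjectTrajectory` class, its field names, and the source labels are assumptions made for this example, not RoboScience’s actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema: every name here is an illustrative assumption.
Point = Tuple[float, float, float]


@dataclass
class ObjectTrajectory:
    object_kind: str   # e.g. "rigid", "articulated", "deformable"
    source: str        # e.g. "robot_demo", "human_video", "simulation"
    # One entry per time step; each entry lists 3D keypoints on the object.
    frames: List[List[Point]]

    def displacement(self) -> List[float]:
        """Per-keypoint straight-line distance between first and last frame."""
        first, last = self.frames[0], self.frames[-1]
        return [
            sum((b - a) ** 2 for a, b in zip(p, q)) ** 0.5
            for p, q in zip(first, last)
        ]


# A rigid cup tracked by one keypoint and a cloth tracked by two corners
# end up in the same format even though the data sources differ.
cup = ObjectTrajectory(
    "rigid", "robot_demo",
    [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.3)]],
)
cloth = ObjectTrajectory(
    "deformable", "human_video",
    [[(0.0, 0.0, 0.0), (0.4, 0.0, 0.0)],
     [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]],
)
dataset = [cup, cloth]
```

Because the record never mentions a gripper or joint angles, a human video and a robot demonstration can sit side by side in `dataset`.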

Based on this line of thinking, we have developed the VLOA (vision-language-object-action) model. Its hallmark is a general-purpose task-planning model that takes vision and language as inputs to the planning layer and outputs object trajectories as an intermediate interface. Meanwhile, the execution layer’s general-purpose manipulation model interprets the state changes that objects should undergo, thereby generating the robot’s actions required to achieve those changes.
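The two layers described here can be sketched as two functions joined by an object trajectory. This is a toy stand-in under stated assumptions: the function names, the fixed “lift” trajectory, and the rigid-grasp shortcut are all hypothetical, and neither function is a learned model.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Trajectory = List[Point]               # planned object positions over time
Action = Tuple[float, float, float]    # hypothetical end-effector displacement


def plan_object_trajectory(image: object, instruction: str) -> Trajectory:
    """Planning layer (V, L -> O).

    Stand-in for a learned planner: it ignores the image and returns a
    fixed upward path whenever the instruction mentions lifting.
    """
    if "lift" in instruction:
        return [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.0, 0.0, 0.2)]
    return [(0.0, 0.0, 0.0)]


def actions_from_trajectory(traj: Trajectory) -> List[Action]:
    """Execution layer (O -> A).

    Stand-in for a learned controller: it assumes the object is already
    held rigidly, so the gripper moves by the same deltas as the object.
    """
    return [
        (b[0] - a[0], b[1] - a[1], b[2] - a[2])
        for a, b in zip(traj, traj[1:])
    ]


trajectory = plan_object_trajectory(image=None, instruction="lift the cup")
actions = actions_from_trajectory(trajectory)
```

The point of the sketch is the interface: `trajectory` can be inspected, logged, or checked before any action is produced.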

Jiazi Light-Year: VLOA achieves generalization along three dimensions: tasks, objects, and robot embodiments. Could you explain how this is accomplished?

Shao Lin: Generalization is both essential and foundational. What distinguishes VLOA is its ability to achieve task understanding and handle diversity by enabling robots to predict the motion trajectories of objects, with a stronger focus on changes in task-relevant states.

At the foundational level, from an object’s motion trajectory to the robot’s body and finally to its actuation output, we enable the robot to understand physical laws and use them as a guiding principle. Suppose we already know what state change we want the object to undergo; the robot must then determine which actions to apply in order to induce that change along the desired trajectory. This is, at its core, a physics‑based process that strengthens the foundations of generalization and brings it closer to the intrinsic nature of physical phenomena. After all, the essence of manipulation lies in the robot’s contact with the object, transmitting forces and torques to alter the object’s state.

Jiazi Light-Year: Compared with the VLA (Vision-Language-Action) model, what is the most significant difference in the design philosophy of VLOA?

Shao Lin: VLOA primarily focuses on the core aspect of embodied manipulation: altering the motion state of objects. We designed the architecture around this, which gives VLOA several advantages.

The first advantage lies in the representation of intermediate states. This hierarchical structure makes data collection and processing more systematic. The upper layer, from V to O, involves a robot or an embodied manipulation model mapping the task’s semantic information onto changes in the object’s state—specifically, determining how the object should change to signify task completion. At this level, we can fully interpret data from diverse sources and formats, as it does not directly concern the concrete execution strategy. This explicit intermediate-state representation simultaneously enhances both interpretability and safety.

In the lower layer, the process from O to A involves learning physical laws. The robot must operate objects in accordance with these laws to produce the desired changes in motion. In other words, the upper layer learns semantic information, while the lower layer learns physical principles. This decouples data collection from concrete execution: the upper layer can extract semantics from diverse data formats, whereas the lower layer, guided by physical laws, provides a stable foundation for generalization. This constitutes VLOA’s second advantage.

The third advantage is that VLOA aligns more closely with the essence of physical manipulation. Centered on changes in an object’s motion trajectory, VLOA’s inductive basis differs from that of traditional models, with a generalization foundation that better reflects real-world object manipulation and human–robot interaction. Consequently, its data efficiency is significantly higher.

Jiazi Light-Year: Recently, the industry has seen some rather critical commentary on VLA. What’s your take on VLA?

Shao Lin: I believe that a VLA is, in essence, not a specific model per se, but rather a decision‑making mapping mechanism that transforms inputs into outputs. It is a goal, actually. Our VLOA goes a step further than the VLA.

We aim to build a system that enables general-purpose embodied intelligence. The robot’s ultimate operating system must incorporate perception, with vision and language serving as the primary sources of input, while action constitutes the robot’s output.

Regardless of whether external evaluations are positive or negative, there is no need to get bogged down in conceptual debates; instead, we should focus on how to continuously enhance model capabilities through architectural innovation. From the perspective of decision‑making systems, such a framework is bound to exist. As for the current stage—whether we adopt a fully end-to-end model or a decoupled end-to-end model, and how challenging data collection proves to be—these considerations do not alter the core positioning of vision, language, and action as “input‑to‑output mappings.”

Jiazi Light-Year: When developing the VLOA model, did you take safety into account?

Shao Lin: Safety is of paramount importance. For instance, when a robot is deployed in the kitchen to use a knife for food preparation, we cannot allow it to operate entirely as a black box. We must have a clear understanding of its objectives, the rationale behind its actions, and its anticipated behavioral patterns before we can use it with confidence.

Currently, the industry often fails to prioritize safety in the design phase, but we have taken it into account from the very beginning.

Our approach is straightforward: if we are to develop a large-scale, embodied‑task‑oriented model in the future, it must adhere to certain guiding principles. To determine these, we need to work backward and identify the requisite design, architecture, and data‑processing strategies.

That is why we designed the object trajectory as the interface. The intermediate interface of VLOA includes explicit state prediction, allowing us to validate the planned actions using various models and methods before execution. For example, in a simulation environment, we can first verify whether the robot’s manipulation of an object might lead to unsafe outcomes. This effectively adds an extra safeguard, providing early warnings prior to actual execution.

We can explicitly understand and verify the object motion trajectories in the robot’s plan; should deviations arise during actual execution, the lower-level O-to-A “fast‑brain” system can promptly correct them. The advantage is twofold: risks are averted through predictive analysis before execution, and deviations are swiftly corrected once they occur during execution. This multi-layered safety framework effectively closes safety gaps.
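The two safeguards described above, checking a plan before execution and correcting deviations during it, can be sketched as follows. The workspace limits, step threshold, and proportional gain are made-up values for illustration; the interview does not say which checks RoboScience actually runs.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

# Hypothetical limits in metres; real limits would come from the scene.
SAFE_MIN = (-0.5, -0.5, 0.0)
SAFE_MAX = (0.5, 0.5, 0.8)
MAX_STEP = 0.15  # assumed largest allowed jump between waypoints


def validate_plan(traj: List[Point]) -> List[str]:
    """Return human-readable problems; an empty list means the plan passed."""
    problems = []
    for i, p in enumerate(traj):
        if any(c < lo or c > hi for c, lo, hi in zip(p, SAFE_MIN, SAFE_MAX)):
            problems.append(f"waypoint {i} leaves the safe workspace")
    for i, (a, b) in enumerate(zip(traj, traj[1:])):
        step = sum((y - x) ** 2 for x, y in zip(a, b)) ** 0.5
        if step > MAX_STEP:
            problems.append(f"step {i}->{i + 1} is too large ({step:.2f} m)")
    return problems


def correction(planned: Point, observed: Point, gain: float = 0.5) -> Point:
    """Proportional nudge that moves the object back toward the plan."""
    return tuple(gain * (p - o) for p, o in zip(planned, observed))


good_plan = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.0, 0.0, 0.2)]
bad_plan = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.9)]
```

Here `validate_plan(bad_plan)` reports both an out-of-bounds waypoint and an oversized step, so the plan would be rejected before the robot moves.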

Jiazi Light-Year: Does the intermediate state refer to the process by which an action unfolds?

Shao Lin: Here, the intermediate state refers to the model’s ability to predict or infer the motion trajectory of the manipulated object. For example, when lifting a cup from the tabletop, the process can be understood as the cup’s position transitioning from the table to mid-air, at which point the task is considered complete. The change in an object’s position within three-dimensional space constitutes its motion trajectory, and this is precisely what the intermediate state captures.

2. On Paradigms: Layered vs. End-to-End—The Two Are Not Mutually Exclusive

Jiazi Light-Year: We understand that RoboScience initially adopted a “fast–slow brain” hierarchical model. What is the relationship between that model and the current VLOA?

Shao Lin: You can think of it this way: there’s a planning layer and an execution layer. The planning layer corresponds to the slow brain, while the execution layer corresponds to the fast brain. The process from V to O constitutes the planning pipeline and is handled by the slow brain; the process from O to A forms the execution pipeline and is managed by the fast brain. Although our model is divided into upper and lower layers, it remains a decoupled end-to-end architecture: the planning and execution layers are both general-purpose and can be trained end-to-end independently. Moreover, thanks to the intermediate interfaces that mediate between them, the entire system can also be trained as a unified end-to-end model.
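A toy example may help show what “trained independently” and “trained as a unified end-to-end model” mean for two layers joined by an intermediate interface. Everything below is an assumption for illustration: each layer is reduced to a single scalar weight, and the data are synthetic.

```python
# Synthetic data: input x, object change o = 2x, action a = 3o.
xs = [1.0, 2.0, 3.0]
obj_true = [2.0 * x for x in xs]
act_true = [3.0 * o for o in obj_true]


def fit_scale(inputs, targets):
    """Least-squares fit of one scalar weight mapping inputs to targets."""
    return sum(i * t for i, t in zip(inputs, targets)) / sum(i * i for i in inputs)


# Independent training: each layer is fitted on its own supervision.
w_plan = fit_scale(xs, obj_true)        # planning layer, V,L -> O
w_exec = fit_scale(obj_true, act_true)  # execution layer, O -> A


def joint_grads(w_plan, w_exec):
    """Gradients of the end-to-end squared error.

    The chain rule passes through the intermediate object prediction,
    so the action error also updates the planning weight.
    """
    g_plan = g_exec = 0.0
    for x, a in zip(xs, act_true):
        o_hat = w_plan * x
        err = w_exec * o_hat - a
        g_exec += 2.0 * err * o_hat
        g_plan += 2.0 * err * w_exec * x
    return g_plan, g_exec
```

At the independently fitted weights both joint gradients are zero, while a perturbed planning weight produces a nonzero gradient that a joint training step would use to pull it back.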

Jiazi Light-Year: Assuming there is sufficient data, will the end-to-end VLA of a single system ultimately outperform the hierarchical end-to-end approach in terms of generalization?

Shao Lin: To be honest, this is neither provable nor falsifiable. After all, such a scenario has never actually occurred, and we don’t even know how much data would constitute “sufficient.” If the answer to this question won’t be available for another two hundred years, then discussing it now has already lost its practical significance. Nevertheless, it remains a crucial issue worthy of reflection.

Under the current architecture and data scale, we have conducted numerous experiments and made the latest results publicly available online; one of these efforts is called VLA-OS (paper link: https://arxiv.org/pdf/2506.17561). Experimental results show that, under the current conditions, hierarchical end-to-end learning indeed exhibits superior generalization performance compared to a single end-to-end approach. This is not my personal opinion, but rather a conclusion drawn from experiments. The experimental findings clearly indicate that such a result does indeed occur; however, the underlying reasons for this phenomenon still require further analysis.

Jiazi Light-Year: Do you think the end-to-end paradigm is a viable path toward AGI?

Shao Lin: I believe that the outside world harbors certain preconceived notions and excessive expectations regarding end-to-end systems.

In fact, the concept of “end-to-end” itself is inherently ambiguous, particularly with regard to the definition of the “end.” Without clear delineation, discussing this paradigm in vague terms can easily lead to misinterpretations. The way the end is defined determines the specific design trajectory and implementation approach—this point must be clarified. At the same time, end-to-end remains a pivotal technology in modern artificial intelligence. Its defining characteristic is that it takes sensor or observational data as input and maps it directly to the output, enabling joint optimization across all parameters. This approach allows gradients to flow seamlessly between input and output, facilitating global optimization and substantially reducing the need for engineering‑driven interventions at intermediate stages.

Data-driven approaches can significantly reduce the amount of manual effort required in engineering workflows, as they encompass the entire process from input to output. From this perspective, end-to-end is indeed a highly valuable technological paradigm. However, it is important to emphasize that the notion of opposing end-to-end design to layered architecture is untenable. An end-to-end system can readily incorporate a layered design, and a layered implementation can equally adopt an end-to-end approach; the two are not mutually exclusive.

Jiazi Light-Year: Different companies adopt varying approaches to hierarchical architecture. We’ve noticed that you’ve opted for an explicit information‑passing paradigm—what were your considerations in making this choice?

Shao Lin: We chose to pass explicit information primarily because it effectively conveys the core content. As you noted, each company adopts a distinct hierarchical architecture; even among those that pass explicit information, which information they choose reflects their understanding of intelligent‑system design and the depth of their strategic thinking.

By choosing the object trajectory, we ensure that it possesses sufficiently strong representational power. Trajectories can capture the state changes of a wide variety of objects—ranging from deformable bodies and articulated structures to region-based entities—expressing these transformations in a compact, unified form. Not only is the trajectory information rich and informative, but it also aligns closely with the outcomes of manipulation tasks: after all, manipulation fundamentally involves altering an object’s state, and trajectories provide a direct, explicit characterization of such changes. Moreover, trajectories can effectively filter out irrelevant factors—such as background lighting—that bear no relation to the task at hand. This dual advantage ensures both adequate representational capacity and noise suppression, bringing the representation closer to the core of the task.

Secondly, it offers significant advantages in data utilization. Trajectory prediction is fundamentally about learning semantic information; it can be trained using only relevant semantic data, without being limited to actions the robot has previously performed. For example, changes in an object’s state during human manipulation can also be learned by the robot. This enables cross-platform data reuse without constraints. Moreover, since trajectories themselves obey physical laws, we can generate vast amounts of trajectory data through extensive simulations, allowing robots to rapidly acquire an understanding of these laws at low cost.

Third, there are also considerations regarding safety and deployment frequency. As I mentioned earlier, it offers interpretability and controllability. The layered architecture ensures that the lower layers operate at a higher frequency than the upper layers, enabling faster responses—essentially serving as a “safety valve” to maintain system stability.
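The frequency difference between the layers can be pictured as a two-rate loop. The rates below are assumptions chosen for the example; the interview gives no figures.

```python
PLAN_HZ = 2        # assumed slow-brain (planning) rate
CONTROL_HZ = 50    # assumed fast-brain (execution) rate
STEPS_PER_PLAN = CONTROL_HZ // PLAN_HZ


def run(seconds: int) -> dict:
    """Count how often each layer runs in a fixed-rate loop."""
    counts = {"plan": 0, "control": 0}
    for tick in range(seconds * CONTROL_HZ):
        if tick % STEPS_PER_PLAN == 0:
            counts["plan"] += 1     # slow layer refreshes the target trajectory
        counts["control"] += 1      # fast layer tracks it on every tick
    return counts


counts = run(seconds=2)
```

With these numbers the execution layer gets 25 chances to react for every planning update, which is the sense in which it can act as a “safety valve.”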

Jiazi Light-Year: The simulation engine serves as the “training ground” for embodied intelligence research—whether in end-to-end or hierarchical architectures, it remains indispensable. So, is your simulation engine developed in-house?

Shao Lin: Yes, we place great emphasis on simulation development, as simulations can provide rich supervisory signals for large-scale operational models. This is precisely why we are committed to developing our own simulation tools. At the same time, our application requirements are unique and cannot yet be met by off-the-shelf simulators, necessitating in-house research and development. We have focused our efforts on two key areas.

First is physical accuracy. We aim to make our simulator more precise in modeling collisions and contacts. At their core, all physics engines are numerical optimization problems, so we have invested heavily in research on numerical optimization and solver algorithms. The result is more accurate collision simulation and force computation, effectively preventing penetration artifacts. For example, when a robot picks up a cup, if the cup wall is thin, a conventional simulator might allow the robot’s fingers to pass right through it—clearly violating physical laws. Our engine was designed from the ground up to eliminate this issue. It also supports the simulation of deformable objects. Furthermore, we were the first team worldwide to enable a robot to tie a necktie. During the tying process, complex entanglements and deformations arise; our simulator prevents the tie from penetrating itself and delivers highly accurate force and collision calculations, ensuring that the robot can reliably perform this intricate task.
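The thin-wall example can be reproduced in a few lines. This is a textbook illustration of why penetration happens when collisions are tested only at discrete time steps; it is not the method used in RoboScience’s engine, which the interview describes only in terms of numerical optimization. All numbers are made up.

```python
def discrete_hit(x0: float, x1: float, wall_lo: float, wall_hi: float) -> bool:
    """Look only at the end-of-step position, as a naive per-step test would."""
    return wall_lo <= x1 <= wall_hi


def swept_hit(x0: float, x1: float, wall_lo: float, wall_hi: float) -> bool:
    """Check whether the segment travelled during the step overlaps the wall."""
    lo, hi = min(x0, x1), max(x0, x1)
    return lo <= wall_hi and hi >= wall_lo


# A fingertip moves 5 cm in one step across a 2 mm cup wall.
start, end = 0.00, 0.05
wall = (0.020, 0.022)

missed = not discrete_hit(start, end, *wall)   # naive test sees no contact
caught = swept_hit(start, end, *wall)          # swept test detects the crossing
```

The naive test lets the fingertip end up on the far side of the wall without ever registering contact, which is the penetration artifact described above.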

Second is differentiability. Traditional physics simulators typically perform only forward prediction: given an input, they predict the future state. By contrast, our simulator also supports backward computation: if we want to alter the future state, how should the inputs be adjusted? This effectively embeds a differentiable computational graph within the physics simulation, analogous to the backpropagation mechanism in neural networks. As a result, tuning system parameters becomes significantly more efficient.
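A minimal sketch of the forward/backward idea, assuming a one-dimensional point mass under gravity integrated with explicit Euler steps. The hand-written backward pass stands in for what an automatic-differentiation framework would do; none of this is RoboScience’s simulator.

```python
DT, STEPS, G = 0.01, 100, -9.81


def simulate(v0: float) -> float:
    """Forward pass: final height of a point mass launched upward at v0."""
    x, v = 0.0, v0
    for _ in range(STEPS):
        x, v = x + v * DT, v + G * DT
    return x


def grad_final_height_wrt_v0() -> float:
    """Backward pass: carry d(final x)/d(state) from the last step to the first."""
    gx, gv = 1.0, 0.0   # sensitivity of the final height to the final (x, v)
    for _ in range(STEPS):
        # Reverse of the update x' = x + v*DT, v' = v + G*DT.
        gx, gv = gx, gv + gx * DT
    return gv


def solve_v0(target: float, lr: float = 0.5, iters: int = 50) -> float:
    """Adjust the input (launch speed) so the future state hits the target."""
    v0 = 0.0
    g = grad_final_height_wrt_v0()
    for _ in range(iters):
        err = simulate(v0) - target
        v0 -= lr * 2.0 * err * g   # gradient step on the squared error
    return v0


v0 = solve_v0(target=0.5)
```

The forward pass answers “where does the mass end up?”, and the backward pass answers “how should the launch speed change to end up somewhere else?”, which is the question the interview poses.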

Jiazi Light-Year: You just mentioned that the penetration issue arises because much of the data doesn’t conform to physical laws. So what components make up your dataset? Is it primarily derived from your physics simulator?

Shao Lin: Our understanding of semantic information does not rely solely on physics‑based simulation engines. These engines primarily simulate the fundamental laws of motion and often do not incorporate semantic data. Building extensive semantic information within a simulated environment is quite challenging, particularly when it comes to constructing and populating scenes, as the associated costs are very high.

So at the foundational level, the physics engine first lets us learn physical laws. It primarily provides object‑manipulation data under purely non‑semantic conditions. For example, if I have an object and want it to transition to another state, the physics engine tells me how to achieve that; its focus is solely on the physical process, and whether the target state carries any semantic meaning is irrelevant. We decouple this semantic aspect from the underlying mechanics. The upper layers, by contrast, are where semantic information is learned.

At the semantic level, we can leverage vast amounts of internet‑derived data, including extensive video datasets, to train our models. Since semantic information does not directly pertain to physical execution, we can extract it from videos and infer the underlying semantics of the actions being performed. In addition, we also incorporate instructional‑manual‑type data; although such materials may lack rich descriptions, they still capture key procedural aspects of object manipulation, which our model can likewise process and learn from.

With this hierarchical structure, data from diverse sources can be seamlessly integrated into a unified framework—much like “the sea embraces all rivers.” The model can extract and learn valuable insights from vast amounts of data, capturing underlying patterns and semantic relationships, thereby enabling the development of more robust and effective models.

Jiazi Light-Year: The “one brain, multiple forms” approach to embodied intelligence represents a development goal, and its relationship with end-to-end systems, hierarchical architectures, and simulation engines is one of methodology and tooling. What’s your take on the “one brain, multiple forms” paradigm?

Shao Lin: The logic behind “one brain, multiple forms” is to enable the same software or model to adapt to different hardware platforms. Robot design spaces are vast and highly diverse, and the control model must understand the distribution of this design space, then tailor its output strategies to each specific configuration. In this way, it can run on a wide range of hardware, much like an operating system that abstracts away the diversity of underlying hardware.

Its significance is also quite straightforward. For instance, when we develop hardware ourselves, design variations arise across different release cycles, and the hardware requirements of various use cases differ as well. However, these differences can all be accommodated by the model, allowing it to leverage the strengths of diverse configurations and thereby delivering substantial value to the hardware.

Another advantage of the “one brain, multiple forms” architecture is its ability to migrate data. Data collected by one configuration can be transferred to another, enabling knowledge sharing across different embodiments. Moreover, it proves highly useful during rapid iteration and deployment: no matter how the configuration evolves, only a single operational model needs to be adapted. As more hardware becomes compatible, the model grows increasingly robust—much like the Trisolarans in science fiction, where all their diverse bodies can share knowledge seamlessly.
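The operating-system analogy can be sketched as one planned object trajectory being turned into commands for different hardware. The `Embodiment` fields and the per-command step limits are hypothetical; a real model would condition on far richer hardware descriptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Embodiment:
    name: str
    max_step_m: float   # assumed largest displacement per command


def commands_for(traj: List[Point], body: Embodiment) -> List[Point]:
    """Split each object displacement into commands this hardware can execute."""
    commands = []
    for a, b in zip(traj, traj[1:]):
        delta = [y - x for x, y in zip(a, b)]
        length = math.sqrt(sum(d * d for d in delta))
        # Small tolerance so exact multiples are not rounded up by float noise.
        pieces = max(1, math.ceil(length / body.max_step_m - 1e-9))
        commands.extend([tuple(d / pieces for d in delta)] * pieces)
    return commands


object_traj = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.2)]   # shared plan
gripper = Embodiment("two_finger_gripper", 0.10)
hand = Embodiment("five_finger_hand", 0.05)

gripper_cmds = commands_for(object_traj, gripper)
hand_cmds = commands_for(object_traj, hand)
```

Both command lists move the object by the same total amount; only the hardware-specific breakdown differs, so the plan itself never has to change.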

3. Addressing the challenges: The goal is to enable the model to genuinely integrate diverse technologies and operations.

Jiazi Light-Year: We’ve noticed that the videos you’ve posted feature furniture assembly. What are the main operational challenges you encounter in this process? And is furniture assembly relatively less challenging when it comes to handling flexible objects?

Shao Lin: Many operations in the broad category of furniture‑assembly tasks involve flexible‑object manipulation, though the particular chair we selected did not require handling flexible materials. For this chair, the main challenges lie in several key aspects. First, it demands bimanual (dual‑arm) control, which requires solving the problem of object reorientation. During real‑time reorientation, in‑hand manipulation becomes critical, and the robot must also fully exploit extrinsic dexterity: perceiving and leveraging the constraints and conditions of its surrounding environment to execute the task more effectively.

In addition, it involves a series of highly precise maneuvers, such as peg insertion, which are integral to robotic assembly processes. These tasks demand stringent force control and sensing capabilities, as well as multimodal fusion. In essence, assembling furniture encapsulates nearly all the challenging aspects of robotic manipulation.

But the key lies not in these specific challenges, but in how to enable the model to truly integrate diverse technologies and operations.

Because in practice, no one bothers to distinguish whether the task involves in-hand manipulation, extrinsic dexterity, object reorientation, or peg insertion. What really matters is whether the task can be completed.

Jiazi Light-Year: This year, your team won the ICRA Best Paper Award in the Robotics Manipulation and Locomotion category, primarily for a novel method that improves dexterous grasping by introducing the D(R,O) representation. Could you please elaborate on this?

Shao Lin: D(R,O) simultaneously characterizes the relative pose between the robotic arm and the object, enabling it to output both the arm’s state and configuration during prediction. As a result, grasping speed is significantly accelerated. Previous methods might take tens of seconds or even over a minute, whereas our approach can directly generate a high‑degree‑of‑freedom grasping plan in just 0.65 seconds.

We have also enhanced our perception capabilities, particularly in handling partially occluded objects, resulting in greater algorithmic robustness.
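One common way to encode the relative configuration of two shapes is a matrix of pairwise distances between points sampled on each. The sketch below assumes that reading of D(R,O) purely for illustration; the interview does not spell out the definition, and the point sets are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def pairwise_distances(robot_pts: List[Point],
                       object_pts: List[Point]) -> List[List[float]]:
    """Entry [i][j] is the distance from robot point i to object point j."""
    return [
        [sum((a - b) ** 2 for a, b in zip(r, o)) ** 0.5 for o in object_pts]
        for r in robot_pts
    ]


def shifted(pts: List[Point]) -> List[Point]:
    """Translate every point by the same arbitrary offset."""
    return [(x + 1.0, y - 2.0, z + 0.5) for x, y, z in pts]


# Hypothetical samples: two points on the hand, three on the object.
robot_pts = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)]
object_pts = [(0.3, 0.0, 0.0), (0.0, 0.4, 0.0), (0.0, 0.0, 0.1)]

D = pairwise_distances(robot_pts, object_pts)
D_moved = pairwise_distances(shifted(robot_pts), shifted(object_pts))
```

Moving the hand and the object together leaves the matrix unchanged, so a representation of this kind depends only on how the two are placed relative to each other.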

Jiazi Light-Year: What is the core contribution cited by the reviewers?

Shao Lin: The original text reads, “for contributions to learning-based representations for generalizable dexterous grasping across diverse objects and robots.”

Jiazi Light-Year: Looking back, what areas of embodied intelligence have your research covered so far?

Shao Lin: Robot manipulation itself is a complex system. It encompasses not only robot learning but also hardware design, tactile sensing and simulation, machine-learning algorithms, and more. I have been working in this field for over a decade, accumulating substantial expertise across these areas, and have also collaborated with others on the development of dexterous robotic hands.

Jiazi Light-Year: You’ve long been deeply immersed in the field of manipulation. In your view, what is the fundamental bottleneck currently hindering further advances in embodied manipulation research?

Shao Lin: I believe the core issue is that we have not truly reasoned about the problem from its fundamental logic: what our ultimate goal is, and what design path will get us there.

Jiazi Light-Year: Could you elaborate? Do you think people haven’t thought this through yet?

Shao Lin: It’s not that people haven’t thought about it at all; rather, the issue has yet to be thoroughly examined or systematically addressed, at least as I see it. As for the “root cause,” opinions naturally vary from person to person. In my view, the greatest challenge lies in designing and building concrete embodied large models within the specific context of the embodied intelligence industry. This encompasses hardware design, the construction of perception systems such as visual and tactile sensing, diversity in data sources, and how to architect AI models tailored to different types of data. From the model itself to perception, and from hardware to data, every aspect demands a re‑evaluation from a more fundamental perspective: what kind of design logic can effectively integrate these components and truly move us toward our intended goals?

Jiazi Light-Year: You just mentioned the dexterous hand. At the 2025 World Robot Conference, we observed that many companies still rely on grippers to perform tasks like folding clothes or handling simple household chores. By contrast, Figure recently demonstrated its Figure 02 robot using a dexterous hand to fold clothes and load them into a washing machine. What are the advantages and disadvantages of grippers versus dexterous hands?

Shao Lin: The two-finger gripper is, in itself, a type of end effector. Its advantage lies in its simple structure, which makes it well suited to basic tasks like pick-and-place. If you use a dexterous hand solely for straightforward pick-and-place operations, you are wasting many of its degrees of freedom. More complex manipulation, on the other hand, requires greater flexibility. However, more degrees of freedom aren’t always better; there’s an optimal balance to strike. From a broader perspective, dexterous hands offer greater design potential. After all, robots are meant to integrate into human society, and most industrial and household products are designed with human hands in mind, taking both their shape and their strength into account. When an end effector closely mimics the human hand, it faces fewer constraints when interacting with such objects.

Jiazi Light-Year: This year, the concept of embodied intelligence has exploded, with numerous companies in this space securing substantial funding and seeing their valuations soar. Yet the pace of fundraising and valuation growth appears to outstrip the actual rate at which embodied intelligence is being deployed in real-world applications. What’s your take on this phenomenon?

Shao Lin: For a company, the ability to truly achieve practical implementation is of paramount importance. Real-world deployment reveals a host of underlying issues—whether the company’s technology stack is robust, whether it can align with actual market needs, and whether it can scale broadly while enabling rapid rollout. These factors reflect the company’s technical depth, R&D progress, operational efficiency, and team collaboration. The industry must devote more time to refining real‑world use cases. At the same time, it’s crucial to recognize that embodied intelligence is not a fast‑track endeavor; it demands sustained patience. Companies need to strike a balance between near‑term deployment and long‑term growth—neither fixating on whether they can deliver within a month nor dragging things out for a decade without tangible results. The key lies in demonstrating clear commitment and substantial investment.

Jiazi Light-Year: Today, the deployment of embodied intelligence is beginning to take shape in several niche domains, such as retail, industry, and healthcare. In your view, which of these hold the greatest promise?

Shao Lin: I can’t offer an absolute answer, but for a use case to be viable, it must meet three criteria: first, the embodied intelligence technology must differ significantly from conventional automation in that specific context; second, the technology must be able to reach reliable, stable operation within a short timeframe; and third, it must generate sufficient commercial profit. These three factors are key to assessing its potential for real-world deployment.

Jiazi Light-Year: As both a professor and an entrepreneur, what advice would you offer to young people who aspire to enter the robotics field in the AI era?

Shao Lin: My advice is this: young people should strive to broaden and deepen their knowledge base as much as possible. Embodied intelligence serves as a crucial bridge between the virtual and the real, and it represents an exceptionally pivotal direction for the future of artificial intelligence. Once the relevant technologies mature, they will profoundly reshape societal structures and individuals’ ways of life. At the same time, the skills and expertise required in this field must be both deep and broad. Robotic systems are inherently complex, integrating electronic hardware, sensors, AI, large-scale models, and human–robot interaction, among other components. To stand out in this domain, one must cultivate a holistic, system‑level mindset. You don’t need to become an expert in every single subfield, but you should at least grasp the fundamental principles across different disciplines and gain substantial hands‑on experience. Only then can you make well‑rounded, multidimensional decisions and develop a systematic understanding. That’s why I emphasize that young people should aim to become “full‑stack roboticists”—individuals who possess both breadth and depth of knowledge.

Chief Scientist Shao Lin has been awarded the Best Paper Award on Robot Manipulation and Locomotion at ICRA 2025.

RoboScience has closed a nearly RMB 200 million angel-round financing, led by JD.com, with participation from China Merchants Capital and SenseTime Guoxiang Capital. Existing investor 01 Venture also increased its investment.

北京机科未来科技有限公司

1. On the model: The VLA should be regarded as a decision‑mapping function from input to output, rather than getting bogged down in conceptual debates.

2. On Paradigms: Layered vs. End-to-End—The Two Are Not Mutually Exclusive