| Developers: | T-Bank AI Research, ITMO (Scientific and Educational Corporation), Sberbank |
| System premiere date: | 2025/10/27 |
| Technology: | Video Analytics Systems |
Main article: Video analytics (terms, applications, technologies)
2025: Development of Accurate Visual Localization Method
On October 27, 2025, researchers from T-Bank AI Research, together with the BE2R Laboratory at ITMO and the Sberbank Robotics Center, presented GSplatLoc, a visual localization method that determines the position of a camera (on a smartphone or a robot) to within centimeters and its orientation to within about a degree from a single RGB frame.
As reported, the method combines classical keypoint matching with visual (photometric) optimization based on 3D Gaussian Splatting (3DGS), a "fast" three-dimensional scene representation, and runs in real time in three quality modes.
GSplatLoc reduces hardware requirements: a conventional RGB camera is enough for reliable localization. This makes it possible to partially forgo lidars and depth sensors (RGB-D, Time-of-Flight (ToF)), reducing the cost of robots and AR devices. Examples:
- Robotics (shopping centers and warehouses): instead of a "lidar + camera + IMU" bundle, a camera and an IMU (accelerometer/gyroscope) are enough in many cases
- AR navigation inside buildings: a smartphone camera and a pre-built 3DGS map are enough, without special markers or RGB-D sensors
- Semantics and agent navigation: features embedded in the 3D representation provide a basis for further integration with semantic and language modules, which is useful for autonomous agents and intelligent assistants
The method consists of two stages:
- Scene modeling (preparation, performed once). A 3DGS representation is built from a set of images of the space with known camera poses: the scene is described by a set of three-dimensional "blobs" (Gaussians), which makes images fast to render. For each source frame, a pre-trained model finds keypoints and extracts their descriptors, compact numerical "fingerprints" of distinctive places. During training, these descriptors are embedded (distilled) into the parameters of the 3D Gaussians: the representation is optimized so that, when rendered from the corresponding poses, the synthesized image matches the source frames as closely as possible, not only in color and geometry but also in these "fingerprints." In other words, the 3D representation gains a built-in "memory" for finding matches. (In ordinary 3DGS, Gaussians carry only RGB color by default.) A toy sketch of this distillation step follows the list below.
- Pose estimation for a new image (use, in real time). For each input frame, the problem of single-frame absolute relocalization against the pre-built 3DGS map is solved (this is not SLAM, which builds a map and tracks the pose along a video stream). The process includes two sub-steps:
- a coarse pose from matching 2D keypoints to the 3D model using the "built-in" descriptors, followed by standard camera pose computation (see the matching + PnP sketch below)
- pose refinement via visual adjustment (photometric optimization): the real image is compared with a synthetic image rendered from the same 3DGS scene representation
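Below is a minimal, hypothetical sketch of the distillation idea: each Gaussian carries a learnable descriptor vector, a toy 2D splatting function stands in for the real differentiable 3DGS rasterizer, and the descriptors are optimized so that the rendered feature image matches a target descriptor map. All names, sizes, and the renderer itself are illustrative, not the authors' implementation.

```python
# Hypothetical toy sketch: distilling descriptors into per-Gaussian parameters.
# A simple 2D point-splatting function stands in for the 3DGS rasterizer.
import torch

N, D, H, W = 512, 16, 32, 32                  # Gaussians, descriptor dim, image size
means2d = torch.rand(N, 2) * torch.tensor([float(W), float(H)])  # projected centers (fixed pose)
feats = torch.randn(N, D, requires_grad=True)                    # learnable per-Gaussian descriptors
sigma = 2.0                                                      # isotropic splat radius (toy value)

ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)               # pixel coordinates, (H*W, 2)

def render_features():
    # Weight of every splat at every pixel, normalized per pixel, then blended.
    d2 = ((pix[:, None, :] - means2d[None, :, :]) ** 2).sum(-1)  # (H*W, N)
    w = torch.exp(-d2 / (2 * sigma ** 2))
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)
    return (w @ feats).T.reshape(D, H, W)                        # rendered feature image

target = torch.randn(D, H, W)   # stand-in for the frame's 2D keypoint descriptor map
opt = torch.optim.Adam([feats], lr=1e-2)
for step in range(200):
    loss = (render_features() - target).abs().mean()             # L1 distillation loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the real method, this feature loss would be combined with the usual color and geometry objectives, so renders match the source frames in both RGB and descriptor space.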
This separation (prepare once → use repeatedly) enables real-time operation and stable centimeter-level accuracy in practice.
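To make the coarse step concrete, here is a minimal, hypothetical sketch on synthetic data: the frame's 2D keypoint descriptors are matched by nearest neighbour to the descriptors distilled into the Gaussians, and the resulting 2D-3D matches are fed to OpenCV's standard PnP + RANSAC pose solver. The intrinsics, noise levels, and all variable names are illustrative assumptions.

```python
# Hypothetical toy sketch: descriptor matching + PnP/RANSAC on synthetic data.
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])                    # assumed pinhole intrinsics

# Stand-ins for the map: Gaussian centers and their distilled descriptors.
gauss_xyz = rng.uniform(-1.0, 1.0, (200, 3)) + np.array([0.0, 0.0, 5.0])
gauss_desc = rng.normal(size=(200, 16))

# Stand-ins for the query frame: keypoints observed under an unknown pose,
# with descriptors that are noisy copies of the first 100 Gaussians'.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.2, 0.1, 0.3])
proj, _ = cv2.projectPoints(gauss_xyz[:100], rvec_gt, tvec_gt, K, None)
kp_2d = proj.reshape(-1, 2) + rng.normal(0.0, 0.5, (100, 2))
kp_desc = gauss_desc[:100] + 0.05 * rng.normal(size=(100, 16))

# 2D-3D correspondences via nearest-neighbour descriptor matching.
dists = np.linalg.norm(kp_desc[:, None, :] - gauss_desc[None, :, :], axis=-1)
nn = dists.argmin(axis=1)                       # matched Gaussian index per keypoint

# Coarse absolute pose from the matches: standard PnP + RANSAC.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    gauss_xyz[nn], kp_2d, K, None, reprojectionError=3.0)
print("recovered tvec:", tvec.ravel())          # should approach tvec_gt
```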
GSplatLoc adds two key design decisions: first, distillation of keypoint descriptors into the 3D Gaussian parameters at the scene-representation stage; second, the use of 3DGS as the basis for fast visual optimization. Unlike classical structure-based approaches (SIFT/ORB + PnP/RANSAC), which depend heavily on match quality and scene texture, and unlike neural pose or scene-coordinate regressors, which scale less well to large outdoor locations, GSplatLoc combines 2D-3D matching with real-time photometric pose refinement on the "fast" differentiable 3DGS renderer (a refinement sketch follows the list below). This provides several practical capabilities:
- Distilling features into the 3D Gaussians turns the 3D representation into a search index for correspondences between 2D keypoints in the image and 3D Gaussians
- Using 3DGS simplifies training and inference compared to implicit neural representations (NeRF), easing scaling to large and dynamic outdoor scenes
- Three operating modes ("coarse," "basic," and "accurate") allow the speed-accuracy balance to be tuned for the specific hardware and task
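The refinement step relies on the fact that a 3DGS render is differentiable with respect to the camera pose. In the minimal, hypothetical sketch below, a toy 2D "renderer" (Gaussian splats shifted by a 2-DoF "pose") stands in for the real rasterizer, and the pose is recovered by gradient descent on a photometric L1 loss; the actual method optimizes the full 6-DoF pose with the real 3DGS renderer.

```python
# Hypothetical toy sketch: photometric pose refinement by gradient descent.
import torch

H, W, N = 32, 32, 64
centers = torch.rand(N, 2) * torch.tensor([float(W), float(H)])  # splat centers from the map
colors = torch.rand(N)                                           # per-splat intensity
ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

def render(shift):
    # Toy differentiable render: shift all splats by the 2-DoF "pose" and blend.
    d2 = ((pix[:, None, :] - (centers + shift)[None, :, :]) ** 2).sum(-1)
    w = torch.exp(-d2 / 8.0)
    return (w @ colors).reshape(H, W)

true_shift = torch.tensor([1.5, -0.8])
observed = render(true_shift).detach()          # stands in for the real camera frame

pose = torch.zeros(2, requires_grad=True)       # would be initialized from the coarse PnP pose
opt = torch.optim.Adam([pose], lr=0.1)
for _ in range(300):
    loss = (render(pose) - observed).abs().mean()   # photometric L1 between render and frame
    opt.zero_grad()
    loss.backward()
    opt.step()
print(pose.detach())                            # should approach true_shift
```

Because the renderer is fast and the loss is differentiable, a short run of such gradient steps suffices per frame; this is what makes the per-frame time budgets below practical.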
Benchmark results confirm the method's effectiveness: indoors (7-Scenes), it achieved state-of-the-art quality among approaches based on neural rendering, with a mean position error within a few centimeters and an orientation error of about 1°; outdoors (Cambridge Landmarks), it achieved the best quality among the compared methods, with a mean position error of about tens of centimeters and about 1° in orientation. GSplatLoc is robust to challenging dynamic conditions (moving people, glass surfaces, mirrors). Per-frame processing time in the three modes: "coarse" ≈ 0.2 s, "basic" ≈ 0.8 s, "accurate" ≈ 2.0 s.
| Imagine a courier robot delivering food in a large shopping mall. Ordinary navigation systems such as GPS do not work inside buildings or give errors of several meters: the robot may get lost in the corridors or fail to find the right store. The GSplatLoc method lets the robot "see" its surroundings and determine its position to within a centimeter. It compares the camera image with its 3D map and instantly refines its position. The robot quickly finds a route even in halls with moving people, glass doors, and mirrors. The technology can also be used in AR glasses, for example, to accurately anchor virtual pointers or navigation cues to real space, said Ruslan Rakhimov, head of the CV Research group at T-Bank AI Research |
GSplatLoc demonstrates that fast 3D representations combined with classical 2D-3D matching provide practical, fast, and accurate localization from a single RGB frame. This lowers sensor requirements (to camera-only or camera + IMU configurations) and makes such solutions scalable for robotics and AR services in real-world conditions.
