Projects at Huawei Technologies, Munich Research Center, Germany

Matthias Gries, principal engineer, since 2015

Since 2020: Advanced Computing, HiSilicon branch: ecosystem enabling for Arm AArch64 and AI processing

Technical lead: strengthening the ecosystem for HiSilicon Kunpeng AArch64 CPUs and HiSilicon Ascend AI/ML processors. The work covers scientific computing software enabling, including Arm SVE, for HPC applications (e.g., GPAW, GROMACS) and frameworks/libraries (e.g., BLIS, Ginkgo, BeeGFS); performance characterization (profiling, benchmarking, analysis); architecture evaluation on real hardware (AArch64 and x86-64 variants); and SoC performance modeling and analysis by simulation (gem5 and internal microarchitecture simulators).

  • HiSilicon Kunpeng CPU: HiSilicon Kunpeng product page (external link)
  • HiSilicon Ascend AI processors: HiSilicon Ascend product page (external link)

2019: Automotive Engineering Lab, HiSilicon branch: distributed SoC solutions for autonomous driving

Determining the compute and communication demands of computer vision (sensor-based perception) and vehicle dynamics (path planning, trajectory control) methods, subject to requirements for reliability, cost, and functional safety, based on IP blocks from the HiSilicon Kunpeng and Ascend chip series.

2017 - 2018: Central Hardware Dept.: DIMM-NDP: near-data processing using memory modules in main memory

Technical lead: evaluation of the hardware feasibility, programming effort, and performance of Near-Data Processing (NDP) on memory modules for server applications (DIMM-NDP). Building on standard IP, NDP units extend the media controller (MedC) on a memory module. The MedC is a discrete buffer chip placed alongside the DRAM devices on the module and is required for forthcoming interface standards such as JEDEC NVDIMM-P and Gen-Z (now CXL). DIMM-NDP uses unmodified standard DRAM chips and exploits otherwise unused rank-level bandwidth on the DIMM, so it benefits from the economies of scale of standard DRAM manufacturing (e.g., DDR4/DDR5). With NDP switched off, the memory module appears as a normal Load-Reduced DIMM (LRDIMM).

Simulation results show up to 6.3x higher performance for bandwidth-limited applications, which corresponds to 79% of the theoretical peak of the evaluated configuration. We complement the evaluation with feasibility checks for DIMM-like form factors offering 32 GB to 128 GB capacity per DIMM, hardware overhead costs (below 20%), and power envelopes for standard (13 W) and custom (40 W) DIMMs.
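As a quick sanity check on these figures, the short sketch below derives the theoretical peak speedup implied by the two quoted numbers. It assumes that the 79% refers to the peak speedup over the same baseline as the 6.3x figure; the variable names are illustrative only, and the derived value is not stated on this page.

```python
# Figures quoted in the text above.
measured_speedup = 6.3    # best-case speedup from simulation
fraction_of_peak = 0.79   # share of the theoretical peak that was reached

# Implied theoretical peak speedup (derived here, assuming both figures
# refer to the same baseline; not a number reported by the project).
implied_peak_speedup = measured_speedup / fraction_of_peak

print(f"implied peak speedup: {implied_peak_speedup:.2f}x")
```

Under that assumption, the implied peak is roughly 8x for the evaluated configuration.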

2015 - 2016: Central Hardware Dept.: ARMv8-A ecosystem enabling for server systems

Benchmarking and microarchitecture analysis of the HiSilicon Hi1612/1610 generation of ARM-based multicore server chipsets, using elementary tests (e.g., STREAM, lmbench) and application-level benchmarks (SPEC CPU, SPEC OMP, SPECjbb2015), to identify microarchitectural improvements in fairness, latency, and utilization of the uncore (DDR3/DDR4 memory subsystem and interconnect) for future HiSilicon Kunpeng products. I also assessed the feasibility of integrating forthcoming interface standards (Gen-Z, JEDEC NVDIMM-P, DDR5) for use with near and far memory, as well as hybrid memory solutions (DRAM plus NVM).
