llama.cpp SYCL Backend
Summary
Developed the SYCL/DPC++ backend for llama.cpp, ported from the CUDA backend, achieving >10x performance gains on Intel GPUs (Max, Flex, Arc) compared with the OpenCL implementation. Collaborated with the community maintainers and responded to Intel-related issues. Published blog: Run LLMs on Intel GPUs Using llama.cpp, https://medium.com/intel-analytics-software/run-llm-on-intel-gpus-using-llama-cpp-579d3595954e
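As a rough illustration of how this backend is used, the following is a minimal build-and-run sketch based on the llama.cpp SYCL documentation. Flag names and binary names here reflect recent llama.cpp releases and may differ in older ones (earlier builds used `LLAMA_SYCL` instead of `GGML_SYCL`); the model path is a placeholder, not from the original text.

```shell
# Load the Intel oneAPI environment (provides the icx/icpx DPC++ compilers).
source /opt/intel/oneapi/setvars.sh

# Configure and build llama.cpp with the SYCL backend enabled.
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# List SYCL devices visible to the backend, then run inference with all
# model layers offloaded to the Intel GPU (-ngl); model path is a placeholder.
./build/bin/llama-ls-sycl-device
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Hello" -ngl 33
```

With `-ngl` set high enough to cover every layer, the prompt and generation phases both execute on the GPU, which is where the reported speedup over the OpenCL path comes from.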