4-bit NormalFloat Quantization



Agentic Intelligence Framework



Alternating Local-Global Attention


Asynchronous Reinforcement Learning Infrastructure



Auxiliary-Loss-Free Load Balancing



Binary Cross-Entropy Policy Optimization


Block-Diagonal Attention Masking
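Block-diagonal masking is the standard trick for packing several documents into one training sequence while keeping attention confined to each document. A minimal illustrative sketch (pure Python; the function name is invented here):

```python
def block_diagonal_mask(seq_lens):
    """Boolean attention mask for packed sequences.

    seq_lens: lengths of the documents packed into one sequence.
    mask[i][j] is True iff tokens i and j belong to the same
    document AND j <= i (causal within each block).
    """
    total = sum(seq_lens)
    # Map each position to the id of the document it came from.
    doc_id = []
    for d, n in enumerate(seq_lens):
        doc_id.extend([d] * n)
    return [[doc_id[i] == doc_id[j] and j <= i for j in range(total)]
            for i in range(total)]
```

For `seq_lens=[2, 2]`, tokens 0 and 1 form one causal block and tokens 2 and 3 another, with no attention across the boundary.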


Block-wise k-bit Quantization
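The idea behind block-wise k-bit quantization is that each small block of values gets its own absmax scale, limiting the damage any single outlier can do. A hedged sketch of the absmax variant (not any particular library's implementation):

```python
def quantize_blockwise(values, block_size=64, bits=8):
    """Block-wise absmax quantization: each block of `block_size`
    values is scaled by its own absolute maximum and rounded to a
    signed `bits`-bit grid. Returns (int codes, per-block scales)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # guard all-zero blocks
        scales.append(scale)
        codes.append([round(v / scale * qmax) for v in block])
    return codes, scales

def dequantize_blockwise(codes, scales, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return [c / qmax * s for block, s in zip(codes, scales) for c in block]
```

Smaller blocks give tighter scales (better precision) at the cost of storing more scale constants.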



Break Tokens for Image Tokenization


Closed-Form Optimal Policy Extraction


Code Training Benefits Mathematical Reasoning



CommonCrawl Quality Filtering


Computation-Communication Overlap


Computational Efficiency in Large Language Models



Contextual Flexibility Retrieval


Cross-Modal Reasoning Capabilities


DeepSeekMoE Architecture



Dependency-Aware Tree Traversal


Distillation of Reasoning Capability


Double Quantization
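Double Quantization (from QLoRA) quantizes the per-block quantization constants themselves, since storing one fp32 scale per small block is itself a memory cost. A loose sketch of the idea with invented names, not the QLoRA code:

```python
def double_quantize_scales(scales, bits=8):
    """Second-level quantization of per-block absmax scales: subtract
    the mean, then absmax-quantize the centered scales to `bits` bits,
    so each scale is stored as a small int plus two shared floats."""
    qmax = 2 ** (bits - 1) - 1
    mean = sum(scales) / len(scales)
    centered = [s - mean for s in scales]
    outer = max(abs(c) for c in centered) or 1.0
    codes = [round(c / outer * qmax) for c in centered]
    return codes, mean, outer

def dequantize_scales(codes, mean, outer, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return [c / qmax * outer + mean for c in codes]
```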


Dynamic Expert Routing


Dynamic Mode Switching


Dynamic Per-Example Importance Weighting


Efficient Cross-Node All-to-All Communication



Efficient Inference with Reduced Active Parameters



Efficient Mixture-of-Experts Architecture



Efficient Transformer Architecture


Elo Rating Tournament-Style Evaluation


Entity Knowledge Graph Extraction



Expert Selection Locality Analysis


FP8 Mixed-Precision Training


Flexible Vision Encoder Architecture


Frozen Quantized Pre-trained Weights with Trainable Adapters


Gaussian Mixture Model Clustering



Global-Batch Load-Balancing Loss


Hierarchical Summarization



Instruction Fine-Tuning with Direct Preference Optimization


Interleaved Sequence Processing


Interleaving Local-Global Attention



Joint Multimodal Pre-Training



Knowledge Distillation for Small Language Models


Language Model Scaling Laws


Large-Scale Agentic Data Synthesis


Large-Scale Reinforcement Learning on Base Model


Length-Normalized Preference Optimization



Long Chain-of-Thought Cold Start


Long-Context Retrieval Optimization


Low-Rank Adaptation of Quantized Models



MM-MT-Bench Benchmark



Memory-Efficient Attention


Memory-Efficient Large Language Model Fine-Tuning



Mixture-of-Experts Sparsity Scaling Law


Model Merging through Weight Averaging



Multi-Dimensional Scaling Laws



Multi-Level Abstraction Retrieval


Multi-Stage Reinforcement Learning with Self-Critique


Multi-Step Learning Rate Scheduler
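A multi-step scheduler keeps the learning rate piecewise constant and multiplies it by a decay factor once at each fixed milestone (DeepSeek-style pre-training schedules are a prominent example). A minimal sketch; the milestone values below are arbitrary:

```python
def multi_step_lr(step, base_lr, milestones, gamma):
    """Piecewise-constant schedule: multiply base_lr by `gamma`
    once for every milestone the current step has passed."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed
```

Usage: `multi_step_lr(step, 1e-3, milestones=[100, 200], gamma=0.1)` yields 1e-3 before step 100, 1e-4 from 100, and 1e-5 from 200.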


Multi-Token Prediction


Multi-Turn Instruction Tuning


Multilingual Multimodal Understanding


Multilingual Performance Scaling


Multimodal Knowledge Distillation


Multimodal Reasoning with Uncertainty Routing


Multimodal Safety Evaluation Framework



MuonClip Optimizer



Natively Multimodal Transformer Architecture


Node-Limited Routing


Non-Embedding FLOPs per Token


Open Foundation and Fine Tuned Chat Models


Open Language Model Evaluation System


Open and Efficient Foundation Language Models


Optimal Model Data Scaling Allocation


Paged Optimizers


Pan & Scan Image Processing


Parameter-Efficient Fine-Tuning with Quantization


Parameter-Efficient Language Model Scaling


Performance Training Inference Tradeoff


Persona-Driven Data Synthesis



Pre-fill and Chunking



Prompt Decontamination


Proximal Policy Optimization


Public Dataset Only Training


QK-Clip Attention Stabilization


Quantization-Aware Training


RMSNorm Pre-Normalization
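RMSNorm normalizes activations by their root mean square and applies a learned per-dimension gain, skipping LayerNorm's mean subtraction. A minimal sketch of the standard formulation:

```python
def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square (plus eps for
    stability), then apply a learned gain. Unlike LayerNorm,
    the mean is not subtracted."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [w * v / rms for w, v in zip(weight, x)]
```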



Reasoning Oriented Reinforcement Learning


Reasoning Reinforcement Learning


Recursive Abstractive Processing



Red Team Safety Testing



Reinforcement Learning with Cold Start


Reinforcement Learning with Human Feedback


Reinforcement Learning with Verifiable Rewards



Rejection Sampling and Supervised Fine-Tuning


Responsible Multimodal Model Development


Responsible Open Model Development



RoPE-2D Positional Encoding


RoPE Positional Embedding Extension


Rolling Buffer Cache
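A rolling buffer cache (as in Mistral 7B's sliding-window setup) bounds KV-cache memory by storing the entry for position i at slot i mod W. An illustrative sketch; class and method names are invented:

```python
class RollingBufferCache:
    """Fixed-size KV cache: the entry for position i lives at
    slot i % window, so memory stays bounded at `window` entries
    while older positions are overwritten."""
    def __init__(self, window):
        self.window = window
        self.slots = [None] * window

    def put(self, pos, kv):
        self.slots[pos % self.window] = kv

    def visible(self, pos):
        """KV entries a query at `pos` may attend to: the last
        `window` positions, returned in sequence order."""
        lo = max(0, pos - self.window + 1)
        return [self.slots[p % self.window] for p in range(lo, pos + 1)]
```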


Rotary Positional Embeddings
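Rotary positional embeddings rotate each consecutive feature pair by a position-dependent angle, so attention scores come to depend on relative positions. A minimal sketch of the standard formulation:

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by the angle
    pos * base**(-i/d), encoding absolute position as a rotation.
    Norms of each pair are preserved."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[i], x[i + 1]
        out += [x1 * c - x2 * s, x1 * s + x2 * c]
    return out
```

At position 0 every angle is zero, so the embedding is the identity.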


Routing Network Token Selection
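MoE routers typically score each token against every expert and keep the top-k, normalizing the softmax weights over the selected experts. A hedged sketch of top-k gating for one token (not any specific model's router):

```python
import math

def route_token(logits, k=2):
    """Top-k gating: pick the k experts with the highest router
    logits for this token; gate weights are the softmax of the
    selected logits, renormalized over the chosen experts."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Subtract the max logit before exponentiating, for stability.
    exps = [math.exp(logits[e] - max(logits)) for e in top]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]
```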



Safety Context Distillation


Scaling Laws for Large Language Models


Scaling Open Source Language Models with Longtermism


Self-Reflection Content Moderation


Semantic Similarity Clustering


Sequence Packing Optimization



Skill-Specific Synthetic Data Generation


Skill-Targeted Model Training


Sliding Window Attention
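Sliding window attention lets position i attend only to the previous W positions, making per-token attention cost independent of total sequence length. A minimal mask sketch:

```python
def sliding_window_mask(n, window):
    """Causal sliding-window mask: position i may attend to
    positions j with i - window < j <= i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]
```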


Sliding Window Attention Optimization


Speculative Decoding
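In speculative decoding, a cheap draft model proposes several tokens and the target model verifies them; accepted runs amortize the target's cost across multiple tokens. A greedy toy sketch (real implementations verify all draft positions in a single batched forward pass and use probabilistic acceptance; `target` and `draft` here are any callables mapping a token sequence to the next token):

```python
def speculative_decode(target, draft, prefix, n_draft=4, n_tokens=8):
    """Greedy speculative decoding sketch: draft proposes `n_draft`
    tokens, target checks them position by position; the agreeing
    prefix is kept, plus one target token on rejection or full accept."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        proposal = list(out)
        for _ in range(n_draft):
            proposal.append(draft(proposal))
        for i in range(len(out), len(proposal)):
            t = target(proposal[:i])
            out.append(t)              # target's token either confirms
            if t != proposal[i]:       # the draft or replaces it
                break
        else:
            out.append(target(proposal))  # bonus token after full accept
    return out[:len(prefix) + n_tokens]
```

With a perfect draft (here, both toy models just predict last+1), every proposal is accepted and each target "pass" yields several tokens.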


Standardized Multimodal Evaluation


Strong-to-Weak Distillation



SwiGLU Activation Function
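SwiGLU replaces the single nonlinear branch of a standard FFN with a SiLU-gated product of two linear projections, FFN(x) = W2 (SiLU(Wx) * Vx). A minimal sketch with plain-Python matrix helpers:

```python
import math

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward: the SiLU-gated elementwise product of two
    projections of x, followed by a down-projection W2."""
    def matvec(M, v):
        return [sum(m * a for m, a in zip(row, v)) for row in M]
    def silu(v):
        return v / (1.0 + math.exp(-v))
    gate = [silu(h) for h in matvec(W, x)]   # SiLU(Wx)
    up = matvec(V, x)                        # Vx
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)
```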


Synthetic Data Generation for Mathematics


Synthetic Data Rephrasing for Token Efficiency


System Message for Multi-Turn Consistency


System Prompt Guardrails


Thinking Budget Mechanism



Tile-Wise Fine-Grained Quantization



Token-Efficient Knowledge Compression


Tool Use Emergence


Two-Expert Token Processing



Unified Multimodal Generative Model


Unified Paradigm for Reinforcement Learning


Variable Image Resolution Processing



Vision Encoder Token Condensation