Agentic Intelligence Framework


Alternating Local Global Attention


Asynchronous Reinforcement Learning Infrastructure


Auxiliary Loss Free Load Balancing


Code Training Benefits Mathematical Reasoning


CommonCrawl Quality Filtering


Computation Communication Overlap


Computational Efficiency in Large Language Models


Contextual Flexibility Retrieval


DeepSeekMoE Architecture


Dependency Aware Tree Traversal


Distillation of Reasoning Capability


Dynamic Expert Routing


Efficient Cross Node All to All Communication


Efficient Inference with Reduced Active Parameters


Efficient Long Context Attention Mechanism


Efficient Mixture of Experts Architecture


Efficient Transformer Architecture


Entity Knowledge Graph Extraction


Expert Selection Locality Analysis


FP8 Mixed Precision Training


Gaussian Mixture Model Clustering


Hierarchical Summarization


Instruction Fine Tuning with Direct Preference Optimization


Interleaving Local Global Attention


Knowledge Distillation for Small Language Models


Language Model Scaling Laws


Large Scale Agentic Data Synthesis


Large Scale Reinforcement Learning on Base Model


Length Normalized Preference Optimization


Long Context Retrieval Optimization


Memory Efficient Attention


Mixture of Experts Sparsity Scaling Law


Model Merging through Weight Averaging


Multi Dimensional Scaling Laws


Multi Level Abstraction Retrieval


Multi Stage Reinforcement Learning with Self Critique


Multi Step Learning Rate Scheduler


Multi Token Prediction


Multilingual Multimodal Understanding


Multilingual Performance Scaling


Multimodal Knowledge Distillation


MuonClip Optimizer


Node Limited Routing


Non Embedding FLOPs per Token


Open Foundation and Fine Tuned Chat Models


Open Language Model Evaluation System


Open and Efficient Foundation Language Models


Optimal Model Data Scaling Allocation


Pan & Scan Image Processing


Parameter Efficient Language Model Scaling


Performance Training Inference Tradeoff


Persona Driven Data Synthesis


Pre fill and Chunking


Prompt Decontamination


Proximal Policy Optimization


Public Dataset Only Training


QK Clip Attention Stabilization


Quantization Aware Training


RMSNorm Pre Normalization


Reasoning Oriented Reinforcement Learning


Recursive Abstractive Processing


Red Team Safety Testing


Reinforcement Learning with Cold Start


Reinforcement Learning with Human Feedback


Reinforcement Learning with Verifiable Rewards


Rejection Sampling and Supervised Fine Tuning


Responsible Open Model Development


RoPE Positional Embedding Extension


Rolling Buffer Cache


Rotary Positional Embeddings


Routing Network Token Selection


Safety Context Distillation


Scaling Laws for Large Language Models


Scaling Open Source Language Models with Longtermism


Self Reflection Content Moderation


Semantic Similarity Clustering


Skill Specific Synthetic Data Generation


Skill Targeted Model Training


Sliding Window Attention


Sliding Window Attention Optimization


Speculative Decoding


SwiGLU Activation Function


Synthetic Data Generation for Mathematics


Synthetic Data Rephrasing for Token Efficiency


System Message for Multi Turn Consistency


System Prompt Guardrails


Tile Wise Fine Grained Quantization


Token Efficient Knowledge Compression


Tool Use Emergence


Two Expert Token Processing


Unified Paradigm for Reinforcement Learning


Vision Encoder Token Condensation