Accuracy Recovery Adapters


Auxiliary Loss Free Load Balancing


Block Diagonal Attention Masking


Code Training Benefits Mathematical Reasoning


CommonCrawl Quality Filtering


Computation Communication Overlap


Contextual Flexibility Retrieval


Cross Modal Reasoning Capabilities


Dependency Aware Tree Traversal


Distillation of Reasoning Capability


Dynamic Expert Routing


Efficient Cross Node All to All Communication


Efficient Inference with Reduced Active Parameters


Efficient Long Context Attention Mechanism


Efficient Transformer Architecture


Enhanced Safety Alignment


Entity Knowledge Graph Extraction


Expert Selection Locality Analysis


Explicit Prompt Engineering


FP8 Training


Flexible Image Processing


Flexible Vision Encoder Architecture


Gaussian Mixture Model Clustering


Hierarchical Summarization


Instruction Fine Tuning with Direct Preference Optimization


Interleaved Sequence Processing


Interleaving Local Global Attention


Iterative Teaching Committee


Joint Multimodal Pre Training


Knowledge Distillation for Small Language Models


Language Model Scaling Laws


Large Scale Reinforcement Learning on Base Model


Long Context Retrieval Optimization


MM MT Bench Benchmark


Memory Efficient Attention


Mirror Descent with Leave One Out Estimation


Mixed Precision Quantization


Model Merging through Weight Averaging


Multi Dimensional Scaling Laws


Multi Image Instruction Following


Multi Level Abstraction Retrieval


Multi Step Learning Rate Scheduler


Multi Token Prediction


Multi Turn Instruction Tuning


Multilingual Performance Scaling


Multimodal Instruction Tuning


Multimodal Knowledge Distillation


Multimodal Safety Evaluation Framework


Natively Multimodal Transformer Architecture


Non Embedding FLOPs per Token


Open Foundation Language Models


Open Foundation and Fine Tuned Chat Models


Open and Efficient Foundation Language Models


Optimal Model Data Scaling Allocation


Optimal Model/Data Scaling Up Allocation


Pan & Scan Image Processing


Parameter Efficient Language Model Scaling


Performance Training Inference Tradeoff


Post Training Multimodal Alignment


Pre fill and Chunking


Proximal Policy Optimization


Public Dataset Only Training


Quantization Aware Training


RMSNorm Pre Normalization


Reasoning Oriented Reinforcement Learning


Recursive Abstractive Processing


Red Team Safety Testing


Reinforcement Learning with Cold Start


Reinforcement Learning with Human Feedback


Reinforcement Learning with Verifiable Rewards


Rejection Sampling and Supervised Fine Tuning


Responsible AI Evaluation


Responsible AI Principles


Responsible Multimodal Model Development


Responsible Open Model Development


Retrieval Augmented Generation


RoPE 2D Positional Encoding


Rolling Buffer Cache


Rotary Positional Embeddings


Routing Network Token Selection


Runtime Swappable Model Adapters


Safety Context Distillation


Scaling Laws for Large Language Models


Scaling Open Source Language Models with Longtermism


Self Reflection Content Moderation


Semantic Similarity Clustering


Sliding Window Attention


Sliding Window Attention Optimization


Soft Label Reward Modeling


Standardized Multimodal Evaluation


SwiGLU Activation Function


Synthetic Data Generation for Mathematics


System Message for Multi Turn Consistency


System Prompt Guardrails


Token Efficient Knowledge Compression


Two Expert Token Processing


Uncertainty Routed Multimodal Reasoning


Variable Image Resolution Processing


Vision Encoder Token Condensation


Vision Encoder with Break Tokens