MiniMax-Text-01 on Hugging Face: A Summary

shinetheworld 2025. 2. 21. 14:10

MiniMax recently released MiniMax-Text-01 on Hugging Face, the platform that builds tools and resources for machine learning models.

https://huggingface.co/MiniMaxAI/MiniMax-Text-01

MiniMax-Text-01 is a powerful language model with 456 billion parameters in total. Only 45.9 billion of them are activated for each token.
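Only a fraction of the parameters is active per token because of Mixture-of-Experts routing. Each token is sent to just 2 of the 32 expert MLPs; attention and embedding weights are still always used. Below is a minimal NumPy sketch of top-2 routing at toy sizes. The gating rule (a softmax over the two selected router scores), the function name `top2_moe` and the random ReLU experts are illustrative assumptions, not MiniMax's actual implementation.

```python
import numpy as np

def top2_moe(x, gate_w, experts):
    """Toy top-2 MoE layer (illustrative sketch, not MiniMax's code).

    x: (num_tokens, d_model), gate_w: (d_model, num_experts),
    experts: list of callables mapping (n, d_model) -> (n, d_model).
    """
    logits = x @ gate_w                                   # router score for every expert
    top2 = np.argsort(logits, axis=-1)[:, -2:]            # 2 highest-scoring experts per token
    top2_logits = np.take_along_axis(logits, top2, axis=-1)
    # assumption: softmax over the two selected scores gives mixing weights summing to 1
    w = np.exp(top2_logits - top2_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for slot in range(2):
        for e in range(len(experts)):
            mask = top2[:, slot] == e                     # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += w[mask, slot][:, None] * experts[e](x[mask])
    return out, top2

rng = np.random.default_rng(0)
d_model, num_experts, d_hidden = 16, 32, 64   # toy sizes; the real model uses 6144 / 32 / 9216
experts = []
for _ in range(num_experts):
    w1 = rng.standard_normal((d_model, d_hidden)) * 0.1
    w2 = rng.standard_normal((d_hidden, d_model)) * 0.1
    experts.append(lambda h, w1=w1, w2=w2: np.maximum(h @ w1, 0) @ w2)  # tiny ReLU MLP expert

x = rng.standard_normal((5, d_model))
gate_w = rng.standard_normal((d_model, num_experts))
y, chosen = top2_moe(x, gate_w, experts)
print(y.shape, chosen.shape)  # (5, 16) (5, 2)
```

Each token runs 2 of the 32 expert MLPs, so most expert weights stay idle for any given token.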

The model's standout feature is its ability to handle very long contexts. To get there, it uses a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE).
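Lightning Attention is described as an efficient, tiled implementation of linear attention. Linear attention scales to long contexts because it can be computed recurrently with a fixed-size state. The NumPy sketch below shows plain causal linear attention only: no decay, normalization or tiling, so it is not MiniMax's kernel. It computes the same output two ways, once in the quadratic masked form and once recurrently with a d_k × d_v state. The function names are hypothetical.

```python
import numpy as np

def linear_attention_parallel(q, k, v):
    """Causal linear attention, quadratic form: (lower-triangular mask * Q K^T) V, no softmax."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return (mask * (q @ k.T)) @ v

def linear_attention_recurrent(q, k, v):
    """Same result, token by token, with a state whose size does not depend on sequence length."""
    state = np.zeros((q.shape[1], v.shape[1]))     # running sum of outer(k_s, v_s)
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state += np.outer(k[t], v[t])
        out[t] = q[t] @ state                      # = sum over s <= t of (q_t . k_s) v_s
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
print(np.allclose(linear_attention_parallel(q, k, v), linear_attention_recurrent(q, k, v)))  # True
```

The recurrent state never grows, so the cost per new token stays constant. Softmax attention, by contrast, must look back over every previous key. The periodic softmax layers in the hybrid stack keep full attention over all tokens.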

It also relies on advanced parallelism strategies and new computation-communication overlap methods, including:

Linear Attention Sequence Parallelism Plus (LASP+)
Varlen Ring Attention
Expert Tensor Parallel (ETP)

With these techniques, MiniMax-Text-01 reaches a context length of 1 million tokens during training and handles up to 4 million tokens at inference time. It also delivers top-tier performance on a range of academic benchmarks.

Architecture of MiniMax-Text-01

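Total parameters: 456B
Activated parameters per token: 45.9B
Number of layers: 80
Hybrid attention: one softmax attention layer is placed after every 7 lightning attention layers
- Number of attention heads: 64
- Attention head dimension: 128
Mixture of Experts:
- Number of experts: 32
- Expert hidden dimension: 9216
- Top-2 routing strategy
Positional encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension, with a base frequency of 10,000,000
Hidden size: 6144
Vocabulary size: 200,064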
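Two quantities follow from the spec above: the layer layout and the rotary frequencies. Over 80 layers, the "7 lightning layers, then 1 softmax layer" pattern gives 10 softmax and 70 lightning attention layers. RoPE on half of a 128-dim head leaves 64 rotary dimensions, or 32 frequencies. The sketch below assumes the softmax layer closes each block of 8 (0-based indices 7, 15, ...). It also assumes the standard RoPE inverse-frequency formula base^(-2i/d) over the rotary dimensions. The constant names are illustrative.

```python
NUM_LAYERS = 80
PERIOD = 8            # 7 lightning-attention layers followed by 1 softmax-attention layer
HEAD_DIM = 128
ROPE_BASE = 10_000_000

def attention_type(layer_idx: int) -> str:
    # assumption: the softmax layer is the last layer of each block of 8
    return "softmax" if (layer_idx + 1) % PERIOD == 0 else "lightning"

layout = [attention_type(i) for i in range(NUM_LAYERS)]
softmax_layers = [i for i, t in enumerate(layout) if t == "softmax"]
print(len(softmax_layers), layout.count("lightning"))  # 10 70
print(softmax_layers)  # [7, 15, 23, 31, 39, 47, 55, 63, 71, 79]

# RoPE is applied to only half of each head's dimensions
rotary_dim = HEAD_DIM // 2
# standard RoPE inverse frequencies, one per pair of rotated dimensions
inv_freq = [ROPE_BASE ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]
print(rotary_dim, len(inv_freq))  # 64 32
```

A large base frequency such as 10,000,000 makes the lowest frequencies rotate very slowly. This is commonly used to keep positional signals distinguishable over long contexts.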

Core Academic Benchmarks

Tasks	GPT-4o (11-20)	Claude-3.5-Sonnet (10-22)	Gemini-1.5-Pro (002)	Gemini-2.0-Flash (exp)	Qwen2.5-72B-Inst.	DeepSeek-V3	Llama-3.1-405B-Inst.	MiniMax-Text-01
General
MMLU*	85.7	88.3	86.8	86.5	86.1	88.5	88.6	88.5
MMLU-Pro*	74.4	78.0	75.8	76.4	71.1	75.9	73.3	75.7
SimpleQA	39.0	28.1	23.4	26.6	10.3	24.9	23.2	23.7
C-SimpleQA	64.6	56.8	59.4	63.3	52.2	64.8	54.7	67.4
IFEval (avg)	84.1	90.1	89.4	88.4	87.2	87.3	86.4	89.1
Arena-Hard	92.4	87.6	85.3	72.7	81.2	91.4	63.5	89.1
Reasoning
GPQA* (diamond)	46.0	65.0	59.1	62.1	49.0	59.1	50.7	54.4
DROP* (F1)	89.2	88.8	89.2	89.3	85.0	91.0	92.5	87.8
Mathematics
GSM8k*	95.6	96.9	95.2	95.4	95.8	96.7	96.7	94.8
MATH*	76.6	74.1	84.6	83.9	81.8	84.6	73.8	77.4
Coding
MBPP +	76.2	75.1	75.4	75.9	77.0	78.8	73.0	71.7
HumanEval	90.2	93.7	86.6	89.6	86.6	92.1	89.0	86.9

* Evaluated following a 0-shot CoT setting.

Long Benchmarks

[Figure: 4M Needle In A Haystack Test]

Ruler

Model	4k	8k	16k	32k	64k	128k	256k	512k	1M
GPT-4o (11-20)	0.970	0.921	0.890	0.888	0.884	-	-	-	-
Claude-3.5-Sonnet (10-22)	0.965	0.960	0.957	0.950	0.952	0.938	-	-	-
Gemini-1.5-Pro (002)	0.962	0.960	0.960	0.958	0.938	0.917	0.916	0.861	0.850
Gemini-2.0-Flash (exp)	0.960	0.960	0.951	0.957	0.937	0.860	0.797	0.709	-
MiniMax-Text-01	0.963	0.961	0.953	0.954	0.943	0.947	0.945	0.928	0.910

LongBench v2

Model	overall	easy	hard	short	medium	long
Human	53.7	100.0	25.1	47.2	59.1	53.7
w/ CoT
GPT-4o (11-20)	51.4	54.2	49.7	59.6	48.6	43.5
Claude-3.5-Sonnet (10-22)	46.7	55.2	41.5	53.9	41.9	44.4
Deepseek-V3	-	-	-	-	-	-
Qwen2.5-72B-Inst.	43.5	47.9	40.8	48.9	40.9	39.8
MiniMax-Text-01	56.5	66.1	50.5	61.7	56.7	47.2
w/o CoT
GPT-4o (11-20)	50.1	57.4	45.6	53.3	52.4	40.2
Claude-3.5-Sonnet (10-22)	41.0	46.9	37.3	46.1	38.6	37.0
Deepseek-V3	48.7	-	-	-	-	-
Qwen2.5-72B-Inst.	42.1	42.7	41.8	45.6	38.1	44.4
MiniMax-Text-01	52.9	60.9	47.9	58.9	52.6	43.5

MTOB

Context Type	no context	half book	full book	Δ half book	Δ full book
eng → kalam (ChrF)
GPT-4o (11-20)	9.90	54.30	-	44.40	-
Claude-3.5-Sonnet (10-22)	20.22	53.62	55.65	33.39	35.42
Gemini-1.5-Pro (002)	16.79	53.68	57.90	36.89	41.11
Gemini-2.0-Flash (exp)	12.20	49.50	53.30	37.30	41.10
Qwen-Long	16.55	48.48	45.94	31.92	29.39
MiniMax-Text-01	6.0	51.74	51.60	45.7	45.6
kalam → eng (BLEURT)
GPT-4o (11-20)	33.20	58.30	-	25.10	-
Claude-3.5-Sonnet (10-22)	31.42	59.70	62.30	28.28	30.88
Gemini-1.5-Pro (002)	32.02	61.52	63.09	29.50	31.07
Gemini-2.0-Flash (exp)	33.80	57.50	57.00	23.70	23.20
Qwen-Long	30.13	53.14	32.15	23.01	2.02
MiniMax-Text-01	33.65	57.10	58.00	23.45	24.35
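The Δ columns match the no-context score subtracted from the corresponding with-context score. For MiniMax-Text-01 (eng → kalam), for example, 51.74 − 6.0 ≈ 45.7. The quick Python check below uses a few rows copied from the eng → kalam (ChrF) table. The dictionary and the `context_gain` helper are illustrative names.

```python
# (no context, half book, full book) ChrF scores copied from the eng -> kalam rows above
eng_to_kalam = {
    "Gemini-1.5-Pro (002)": (16.79, 53.68, 57.90),
    "Gemini-2.0-Flash (exp)": (12.20, 49.50, 53.30),
    "MiniMax-Text-01": (6.0, 51.74, 51.60),
}

def context_gain(no_context: float, with_context: float) -> float:
    """Improvement over the no-context baseline, rounded to 2 decimals."""
    return round(with_context - no_context, 2)

for model, (none, half, full) in eng_to_kalam.items():
    print(f"{model}: Δ half book = {context_gain(none, half)}, Δ full book = {context_gain(none, full)}")
```

Read this way, the Δ columns measure how much each model gains from the grammar book in its context, rather than its raw translation quality.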

Hands-on

# requires transformers, accelerate (for device_map) and optimum-quanto (for QuantoConfig)
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# quantization config, int8 is recommended
# lm_head, embed_tokens, the per-layer coefficient modules and the MoE router gates are left unquantized
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# assume 8 GPUs
world_size = 8
layers_per_device = hf_config.num_hidden_layers // world_size
# set device map
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
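A rough weight-only estimate shows why the example recommends int8 and spreads the layers over 8 GPUs. With 456B parameters, bfloat16 weights alone take about 912 GB, or about 114 GB per GPU across 8 cards; that is more than a typical 80 GB accelerator holds. int8 halves this to about 456 GB, or about 57 GB per GPU. This back-of-envelope sketch ignores activations, the KV cache and the modules kept in bfloat16, so real usage is higher. The helper name is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

TOTAL_PARAMS = 456e9   # total parameters from the model card
NUM_GPUS = 8           # matches world_size in the example above

for dtype, nbytes in [("bfloat16", 2), ("int8", 1)]:
    total = weight_memory_gb(TOTAL_PARAMS, nbytes)
    print(f"{dtype}: ~{total:.0f} GB of weights, ~{total / NUM_GPUS:.0f} GB per GPU")
```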
