LLM Multiplication | Peval Competition

Description

Given two random integers, each between 1 and 13 digits long, write a prompt that gets the LLM to multiply them correctly. The LLM must compute the product directly, without using any tools or code.

Evaluation

Each submission will be tested against 100 randomly generated test cases. The first integer grows by 2 digits with each group of 20 cases, and within each group the second integer ranges from 1 digit up to the length of the first:

import random

def generate_test_case(index):
    # a gains 2 digits every 20 cases: 3, 5, 7, 9, 11
    a_digits = 2 * (index // 20) + 3
    # b ramps from 1 digit up to a_digits within each group of 20
    b_digits = int(a_digits * (index % 20) / 20) + 1
    a_min, a_max = 10**(a_digits - 1), 10**a_digits - 1
    b_min, b_max = 10**(b_digits - 1), 10**b_digits - 1
    return f"{random.randint(a_min, a_max)} {random.randint(b_min, b_max)}"
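
To see how operand sizes progress across the 100 cases, the digit formulas can be evaluated without generating random numbers. This is a minimal sketch; digit_lengths is a hypothetical helper name that copies the two formulas from generate_test_case.

```python
def digit_lengths(index):
    """Return (a_digits, b_digits) for a test index, using the generator's formulas."""
    a_digits = 2 * (index // 20) + 3
    b_digits = int(a_digits * (index % 20) / 20) + 1
    return a_digits, b_digits

# Print the first and last case of each group of 20.
for group in range(5):
    first, last = 20 * group, 20 * group + 19
    print(first, digit_lengths(first), last, digit_lengths(last))
```

For indices 0 to 99, these formulas give a first operand of 3 to 11 digits and a second operand of 1 to 11 digits.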

For each of the 100 test cases, the grader prompts the selected LLM with the submission's prompt followed by two newline characters and the test case:

<submission-prompt>
{{USER_PROMPT}}
</submission-prompt>

<question>
{{TEST_CASE}}
</question>
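
The template above could be filled in roughly as follows. This is a sketch under assumptions: build_grader_input is a hypothetical name, and the grader's actual whitespace handling is not shown in the source.

```python
def build_grader_input(user_prompt: str, test_case: str) -> str:
    """Fill the template: submission prompt, a blank line, then the question."""
    return (
        "<submission-prompt>\n"
        f"{user_prompt}\n"
        "</submission-prompt>\n"
        "\n"
        "<question>\n"
        f"{test_case}\n"
        "</question>"
    )

print(build_grader_input("Multiply the two integers.", "555555555 555555555"))
```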

Then, each response will be scored for exact match (i.e. the multiplied value should match the expected answer exactly). The final score is the percentage of correct test cases:

def evaluate(test_cases, outputs):
    correct = 0
    for test_case, output in zip(test_cases, outputs):
        a, b = map(int, test_case.split())
        if output.strip() == str(a * b):
            correct += 1
    return correct / len(test_cases)
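
Since answers are expected inside <answer> tags, the outputs passed to evaluate are presumably extracted from the raw responses first. The extraction step is not shown in the source, so the following is only a sketch: extract_answer and score_response are hypothetical helpers, and taking the last <answer> pair is an assumption.

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside the last <answer>...</answer> pair, or "" if none is found."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else ""

def score_response(test_case: str, response: str) -> bool:
    """Exact-match check, mirroring the comparison used in evaluate."""
    a, b = map(int, test_case.split())
    return extract_answer(response) == str(a * b)
```

Because the comparison is an exact string match, an answer such as 36.0 or 1,000 would be scored as incorrect even when it is numerically right.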

Example

input.txt

555555555 555555555

output.txt

308641974691358025
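
The example product can be reproduced digit by digit with schoolbook multiplication. The sketch below is illustrative only; long_multiply is a hypothetical helper and is not a method the competition prescribes for prompts.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook multiplication on non-negative decimal digit strings."""
    result = [0] * (len(a) + len(b))  # least significant digit first
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry  # leftover carry of this row
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"

print(long_multiply("555555555", "555555555"))
```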

Submission Requirements

Standard rules apply.
The final output should be just the result, with no commas or units, enclosed in <answer> tags.
Maximum input length of 1024 characters.
Maximum output of 4096 tokens.
No tool-calling.
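
A quick pre-submission check of the input limit might look like this. It is a sketch that assumes the 1024-character limit applies to the submission prompt alone and counts Python string characters; fits_input_limit and MAX_PROMPT_CHARS are hypothetical names.

```python
MAX_PROMPT_CHARS = 1024  # limit stated in the requirements

def fits_input_limit(prompt: str) -> bool:
    """Return True if the prompt is within the stated character limit."""
    return len(prompt) <= MAX_PROMPT_CHARS
```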