S3 Ingestion

S3 Ingestion lets you pull files from Amazon S3 or S3-compatible storage (MinIO, NCP Object Storage, etc.) directly into your AI Data Foundry project.

After connecting a source, use Import now to ingest files immediately, or set up an automatic schedule to periodically collect newly added files.


Prerequisites

You will need the following information to connect an S3 source.

Item              | Description
Bucket name       | The name of the S3 bucket where files are stored (e.g., my-documents-bucket)
Region            | The AWS region where the bucket is located (e.g., ap-northeast-2). The form defaults to ap-northeast-2; change it to match your bucket's region.
Access Key ID     | Access Key ID of an IAM user with access to the S3 bucket
Secret Access Key | The Secret Access Key that corresponds to the Access Key ID above

IAM Permissions

The IAM user used for the connection needs at least the following permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:HeadBucket",
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-documents-bucket",
        "arn:aws:s3:::my-documents-bucket/*"
      ]
    }
  ]
}

Replace my-documents-bucket with your actual bucket name.

  • s3:HeadBucket: Verifies that the bucket exists during the connection test. On AWS itself, the HeadBucket API call is authorized by s3:ListBucket.
  • s3:ListBucket: Lists files and folders in the bucket
  • s3:GetObject: Downloads files during ingestion
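
If you manage several buckets, the policy above can be generated rather than edited by hand. The following is a minimal sketch: the function name build_ingestion_policy is hypothetical, and the action list is copied as-is from the policy shown above.

```python
import json


def build_ingestion_policy(bucket: str) -> dict:
    """Return the read-only policy from this page for one bucket.

    Hypothetical helper: the name and shape are illustrative only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Action list copied from the policy shown above.
                "Action": ["s3:HeadBucket", "s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (listing)
                    f"arn:aws:s3:::{bucket}/*",  # objects in the bucket (download)
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ingestion_policy("my-documents-bucket"), indent=2))
```

The two Resource entries matter: bucket-level actions apply to the bucket ARN, while s3:GetObject applies to the object ARN ending in /*.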

Step 1: Connect an S3 Source

  1. Click the Ingestion tab in the left sidebar.

  2. Click Connect a new source.

  3. Select Amazon S3 as the source type and fill in the fields.

Field             | Required | Description
Name              | Yes      | A friendly name for this source (e.g., "Production log bucket")
Bucket            | Yes      | S3 bucket name
Region            | Yes      | AWS region (form default: ap-northeast-2; change it to match your bucket)
Prefix            | No       | Start browsing from a specific folder (e.g., reports/2026/). A trailing / is appended automatically.
Access Key ID     | Yes      | IAM Access Key ID
Secret Access Key | Yes      | IAM Secret Access Key
Endpoint          | No       | Only needed for S3-compatible storage

  4. Click Test and save. The connection is tested automatically to confirm that the bucket is accessible.

  5. If the test passes, the source is registered and a source card appears.


Credentials are encrypted before storage and are never included in API responses.
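
The field rules in the table above (which fields are required, the default region, the trailing / on Prefix) can be sketched as a small pre-check to run before filling in the form. This is only an illustration: S3SourceConfig, normalize_prefix, and missing_required are hypothetical names, not part of the product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class S3SourceConfig:
    """Hypothetical mirror of the connection form fields."""
    name: str
    bucket: str
    access_key_id: str
    secret_access_key: str
    region: str = "ap-northeast-2"  # form default, per the table above
    prefix: str = ""                # optional
    endpoint: Optional[str] = None  # only for S3-compatible storage


def normalize_prefix(prefix: str) -> str:
    """Append a trailing '/' to a non-empty prefix, as the form is described to do."""
    if prefix and not prefix.endswith("/"):
        return prefix + "/"
    return prefix


def missing_required(cfg: S3SourceConfig) -> List[str]:
    """Return the names of required fields that are empty."""
    required = ("name", "bucket", "region", "access_key_id", "secret_access_key")
    return [field for field in required if not getattr(cfg, field).strip()]
```

For example, normalize_prefix("reports/2026") yields "reports/2026/", and a config with an empty bucket is reported as missing "bucket".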


Step 2: Ingest Files

Once the source is connected, you can pull files from the bucket into your project.

Click Import now on the source card to scan the bucket (or only the Prefix scope, if one is set) and automatically ingest eligible files.

How Ingestion Works

  • First run: Scans and ingests all files within the bucket (or Prefix scope).
  • Subsequent runs: Only files added or modified since the last ingestion are collected (incremental ingestion).
  • Deduplication: Files already ingested from the same source are not imported again.
  • Per-run limit: Scheduled (automatic) runs process up to 200 files per run; manual (Import now) runs process up to 50 files per run. Any remaining files will be picked up in the next run.
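
The rules above can be sketched as a single selection function. This is a rough model, not the product's implementation. It assumes that the "since the last ingestion" cutoff is a watermark set to the modification time of the newest file actually processed, so that files left over by the per-run limit stay eligible, and that deduplication is keyed on the file path. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set

SCHEDULED_LIMIT = 200  # per-run limit for scheduled runs
MANUAL_LIMIT = 50      # per-run limit for Import now


@dataclass(frozen=True)
class RemoteObject:
    key: str
    last_modified: datetime


def select_for_run(
    objects: Iterable[RemoteObject],
    watermark: Optional[datetime],  # None on the first run (assumed model)
    already_ingested: Set[str],     # paths ingested earlier from this source
    manual: bool,
) -> List[RemoteObject]:
    """Pick the files one run would process, oldest first."""
    limit = MANUAL_LIMIT if manual else SCHEDULED_LIMIT
    eligible = [
        obj
        for obj in objects
        # Incremental: only files newer than the watermark (all on first run).
        if (watermark is None or obj.last_modified > watermark)
        # Deduplication: skip paths already ingested (assumed dedup key).
        and obj.key not in already_ingested
    ]
    eligible.sort(key=lambda obj: (obj.last_modified, obj.key))
    return eligible[:limit]
```

With 60 new files, a manual run under this model returns the 50 oldest; passing the last returned file's timestamp as the next watermark makes the following run return the remaining 10.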

Progress is shown in real time in the Recent ingestion jobs section at the bottom of the page.

Supported Formats and Size Limits

  • Formats: All file formats supported by the project (see Supported Formats)
  • Size: Up to 100 MB per file
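
To predict which files a run would skip, the two limits above can be checked locally. A minimal sketch, assuming that "100 MB" means 100 × 1024 × 1024 bytes and that formats are matched by file extension; the set of supported extensions is passed in by the caller because this page does not list them.

```python
import os
from typing import Optional, Set

# Assumption: "100 MB" is interpreted as 100 MiB.
MAX_FILE_BYTES = 100 * 1024 * 1024


def skip_reason(key: str, size_bytes: int, supported_exts: Set[str]) -> Optional[str]:
    """Return why a file would be skipped, or None if it looks eligible.

    Hypothetical helper; supported_exts holds lowercase extensions such as ".pdf".
    """
    ext = os.path.splitext(key)[1].lower()
    if ext not in supported_exts:
        return "unsupported format"
    if size_bytes > MAX_FILE_BYTES:
        return "size exceeded"
    return None
```

The two reason strings match the examples given for the Partial status in the Status Reference.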

Step 3: Set Up Automatic Schedule (Optional)

To automatically ingest new files on a recurring basis, enable a schedule.

  1. Click the Run schedule dropdown on the source card.
  2. Choose your preferred interval:
Interval       | Description
Manual only    | No automatic ingestion (default)
Every 6 hours  | Automatically ingest new files every 6 hours
Every 12 hours | Automatically ingest new files every 12 hours
Every day      | Automatically ingest new files every 24 hours
Every week     | Automatically ingest new files every 7 days

  3. The schedule is activated as soon as you select an interval.

Once the schedule is active, the source card displays the Last run time and Next run estimate.
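
The Next run estimate can be reasoned about as the last run time plus the chosen interval. The sketch below assumes exactly that; how the scheduler actually computes the estimate is not documented here, and the names INTERVALS and next_run_estimate are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Interval labels and durations as listed in the table above.
INTERVALS: Dict[str, Optional[timedelta]] = {
    "Manual only": None,
    "Every 6 hours": timedelta(hours=6),
    "Every 12 hours": timedelta(hours=12),
    "Every day": timedelta(hours=24),
    "Every week": timedelta(days=7),
}


def next_run_estimate(last_run: datetime, interval: str) -> Optional[datetime]:
    """Return last_run + interval, or None when no schedule is active.

    Assumption: the estimate is measured from the last run time.
    """
    delta = INTERVALS[interval]
    return None if delta is None else last_run + delta
```

For instance, a source last run at 09:00 on an Every 6 hours schedule would, under this assumption, show an estimate of 15:00 the same day.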

Run Now

Click Import now at any time to trigger an immediate ingestion without waiting for the next scheduled cycle.


Monitoring Ingestion Jobs

The Recent ingestion jobs table at the bottom of the Ingestion page shows the history of all ingestion jobs.

Column              | Description
Source              | Which source the files were ingested from
Mode                | Manual or Scheduled
Status              | Pending / Running / Completed / Partial / Failed
Success / Fail      | Number of successfully ingested files and number of failed files
Started / Completed | Job start and completion timestamps

Click a job row to view the detailed list of ingested files.

Status Reference

Status    | Meaning
Pending   | The job is queued and waiting to be processed
Running   | Files are being downloaded and imported into the project
Completed | All files were successfully ingested
Partial   | Some files succeeded, some failed (unsupported format, size exceeded, etc.)
Failed    | The entire job failed (connection error, etc.)
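
The relationship between the Success / Fail counts and the final status can be sketched as follows. This is an interpretation of the table, not confirmed behavior: it assumes that a job in which every file failed is reported as Failed, and that a job with nothing to ingest is reported as Completed. The function name is hypothetical.

```python
def final_status(success: int, failed: int, job_error: bool = False) -> str:
    """Map per-file counts to a terminal job status (illustrative only)."""
    if job_error:
        # Job-level problem such as a connection error.
        return "Failed"
    if failed == 0:
        # Every file succeeded (assumption: also covers zero files found).
        return "Completed"
    if success > 0:
        # Mixed outcome, e.g. some files had an unsupported format.
        return "Partial"
    # Assumption: no successes at all counts as the entire job failing.
    return "Failed"
```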

Connecting S3-Compatible Storage

In addition to Amazon S3, any storage that provides an S3-compatible API can be connected in the same way.

How to Connect

Enter the storage endpoint URL in the Endpoint field when connecting a source.

Storage            | Endpoint Example
MinIO (local)      | http://localhost:9000
NCP Object Storage | https://kr.object.ncloudstorage.com
Cloudflare R2      | https://{account_id}.r2.cloudflarestorage.com

Fill in all other fields (Bucket, Region, Access Key ID, Secret Access Key) with the values provided by your storage service.

When an Endpoint is provided, path-style access is enabled automatically. No additional configuration is needed.
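
Path-style access changes where the bucket name appears in the request URL: in the path rather than in the host name. The sketch below illustrates the difference; object_url is a hypothetical helper, and the virtual-hosted form shown for Amazon S3 is the standard regional URL pattern, not something taken from this product.

```python
from typing import Optional
from urllib.parse import quote


def object_url(bucket: str, key: str, region: str, endpoint: Optional[str] = None) -> str:
    """Build the URL an S3 client would address for one object (illustrative)."""
    encoded_key = quote(key, safe="/")  # keep folder separators, escape the rest
    if endpoint:
        # Path style: bucket is the first path segment under the custom endpoint.
        return f"{endpoint.rstrip('/')}/{bucket}/{encoded_key}"
    # Virtual-hosted style: bucket is part of the host name.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{encoded_key}"
```

Path style is the safer choice for custom endpoints such as http://localhost:9000, where a bucket-specific host name would typically not resolve.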


FAQ

Connection to the source fails

  • Verify that the IAM user has the s3:HeadBucket permission.
  • Double-check that the bucket name and region are correct.
  • For S3-compatible storage, make sure the Endpoint URL is valid.

Ingested files do not appear in the project

  • Check that the ingestion job status is Completed.
  • If you specified a target folder, look for the files in that folder.
  • Verify that the file format is included in Supported Formats.

Files are missing from ingestion

  • After the first run, ingestion only fetches files added or modified since the last run. Files that existed before and have not been modified are not fetched again.
  • There is a per-run limit on the number of files processed: up to 200 for scheduled (automatic) runs and up to 50 for manual (Import now) runs. Excess files will be picked up in the next run.
  • Files already ingested (same source + same file path) are skipped by deduplication.

Will deleting a source also delete already ingested files?

No. Deleting a source only removes the connection configuration and schedule settings. Files already imported into your project remain untouched.