s3-concat
s3-concat is a library for concatenating multiple files stored in AWS S3 into a single file using multipart upload, which is particularly useful when handling large datasets. Files larger than 5 MiB are uploaded using multipart upload, while files smaller than 5 MiB are concatenated via streaming. The order in which the source files are concatenated can also be controlled.
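The 5 MiB threshold mentioned above matches the S3 rule that every part of a multipart upload except the last must be at least 5 MiB. The following is only a rough sketch of how files could be sorted into the two paths. The `SourceFile` type and the `partitionBySize` helper are hypothetical names used for illustration and are not part of the s3-concat API, and treating a file of exactly 5 MiB as multipart-eligible is an assumption.

```typescript
// Hypothetical shape for a listed S3 object; not the library's actual type.
interface SourceFile {
  key: string;
  size: number; // size in bytes
}

// S3 requires every multipart part except the last to be at least 5 MiB.
const MIN_PART_SIZE = 5 * 1024 * 1024;

// Separate files that are large enough to be a multipart part on their own
// from files that would need to be streamed and combined first.
// Assumption: a file of exactly 5 MiB counts as multipart-eligible.
function partitionBySize(files: SourceFile[]): {
  multipart: SourceFile[];
  streamed: SourceFile[];
} {
  const multipart: SourceFile[] = [];
  const streamed: SourceFile[] = [];
  for (const file of files) {
    if (file.size >= MIN_PART_SIZE) {
      multipart.push(file);
    } else {
      streamed.push(file);
    }
  }
  return { multipart, streamed };
}

const plan = partitionBySize([
  { key: 'tmp/a.json', size: 12 * 1024 * 1024 },
  { key: 'tmp/b.json', size: 512 * 1024 },
  { key: 'tmp/c.json', size: 5 * 1024 * 1024 },
]);

console.log(plan.multipart.map((f) => f.key)); // [ 'tmp/a.json', 'tmp/c.json' ]
console.log(plan.streamed.map((f) => f.key)); // [ 'tmp/b.json' ]
```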
Inspired by the s3-concat project on PyPI.
npm install s3-concat
This example shows how to concatenate all files into a single file without using the minSize option.
import { S3Client } from '@aws-sdk/client-s3';
import { S3Concat } from 's3-concat';
const s3Client = new S3Client({});
const srcBucketName = process.env.srcBucketName!;
const dstBucketName = process.env.dstBucketName!;
const dstPrefix = 'output';
const main = async () => {
  const s3Concat = new S3Concat({
    s3Client,
    srcBucketName: srcBucketName,
    dstBucketName: dstBucketName,
    dstPrefix,
    concatFileName: 'final_concat.json',
  });
  await s3Concat.addFiles('tmp/1gb');
  await s3Concat.concat();
};
main().then(() => console.log('success'));
In this example, all files from the tmp/1gb prefix in the source bucket will be concatenated into a single file named final_concat.json.
This example shows how to use the minSize option to split the concatenated files if the total size exceeds the specified limit.
import { S3Client } from '@aws-sdk/client-s3';
import { S3Concat } from 's3-concat';
const s3Client = new S3Client({});
const srcBucketName = process.env.srcBucketName!;
const dstBucketName = process.env.dstBucketName!;
const dstPrefix = 'output';
const main = async () => {
  const s3Concat = new S3Concat({
    s3Client,
    srcBucketName: srcBucketName,
    dstBucketName: dstBucketName,
    dstPrefix,
    concatFileNameCallback: (i) => `concat_${i}.json`,
    minSize: '5GiB',
  });
  await s3Concat.addFiles('tmp/1gb');
  await s3Concat.concat();
};
main().then(() => console.log('success'));
In this example, files from the tmp/1gb prefix in the source bucket will be concatenated and split into multiple files if the total size exceeds 5GiB. The concatenated files will be named using the callback function, resulting in names like concat_1.json, concat_2.json, etc.
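To make the splitting behavior more concrete, here is a minimal sketch of one plausible reading of it: files are accumulated until the running total reaches the threshold, then a new output file is started. This is not taken from the library's source. `parseSize` and `groupByMinSize` are hypothetical helpers, and the 1-based index passed to the naming callback is inferred only from the names `concat_1.json`, `concat_2.json` shown above.

```typescript
// Hypothetical shape for a source object; not the library's actual type.
interface SourceFile {
  key: string;
  size: number; // size in bytes
}

// Binary unit multipliers for size strings such as '5GiB'.
const UNITS: Record<string, number> = {
  B: 1,
  KiB: 1024,
  MiB: 1024 ** 2,
  GiB: 1024 ** 3,
  TiB: 1024 ** 4,
};

// Hypothetical helper: convert a string like '5GiB' into a byte count.
function parseSize(text: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)$/.exec(text.trim());
  if (!match) {
    throw new Error(`Unrecognized size: ${text}`);
  }
  const [, amount = '', unit = ''] = match;
  const factor = UNITS[unit];
  if (factor === undefined) {
    throw new Error(`Unrecognized unit: ${unit}`);
  }
  return Math.round(Number(amount) * factor);
}

// Assumed splitting rule: close a group as soon as its total reaches minSize.
function groupByMinSize(files: SourceFile[], minSize: number): SourceFile[][] {
  const groups: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let total = 0;
  for (const file of files) {
    current.push(file);
    total += file.size;
    if (total >= minSize) {
      groups.push(current);
      current = [];
      total = 0;
    }
  }
  if (current.length > 0) {
    groups.push(current); // leftover files form a final, smaller group
  }
  return groups;
}

// Twelve 1 GiB files with a 5 GiB threshold.
const oneGiB = parseSize('1GiB');
const files: SourceFile[] = Array.from({ length: 12 }, (_, i) => ({
  key: `tmp/1gb/part_${i}.json`,
  size: oneGiB,
}));

const groups = groupByMinSize(files, parseSize('5GiB'));
const concatFileNameCallback = (i: number) => `concat_${i}.json`;
const outputNames = groups.map((_, index) => concatFileNameCallback(index + 1));

console.log(groups.map((g) => g.length)); // [ 5, 5, 2 ]
console.log(outputNames); // [ 'concat_1.json', 'concat_2.json', 'concat_3.json' ]
```

Under this reading, every output except possibly the last is at least `minSize` bytes, which would also explain the option's name.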
The join order can be specified with the joinOrder option. The presets keyNameAsc and keyNameDsc are built in, and you can also supply your own comparison function that conforms to the type JoinOrderCompareFn (e.g., JoinOrderCompareFn<{ key: string; size: number; lastModified: Date }>).
// Descending order by keyName
const s3Concat = new S3Concat({
  s3Client,
  srcBucketName: srcBucketName,
  dstBucketName: dstBucketName,
  dstPrefix,
  concatFileNameCallback: (i) => `concat_${i}.json`,
  joinOrder: 'keyNameDsc', // use builtin keyword
});
// Descending order by lastModified
const s3Concat = new S3Concat({
  s3Client,
  srcBucketName: srcBucketName,
  dstBucketName: dstBucketName,
  dstPrefix,
  concatFileNameCallback: (i) => `concat_${i}.json`,
  // b minus a puts the most recently modified file first
  joinOrder: (a, b) => b.lastModified.getTime() - a.lastModified.getTime(),
});
// Descending order by size
const s3Concat = new S3Concat({
  s3Client,
  srcBucketName: srcBucketName,
  dstBucketName: dstBucketName,
  dstPrefix,
  concatFileNameCallback: (i) => `concat_${i}.json`,
  joinOrder: (a, b) => b.size - a.size,
});
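A custom comparator can also combine several fields. The sketch below sorts newest first and breaks ties by key in ascending order. It assumes the comparator follows the same convention as `Array.prototype.sort` (negative means `a` comes first), which the `b.size - a.size` example suggests but which is not confirmed here. The `FileInfo` interface and the local `JoinOrderCompareFn` alias are stand-ins written for this example; the library's real type definition may differ.

```typescript
// Assumed shape of the objects passed to the comparator, based on the
// type parameter quoted in the joinOrder description.
interface FileInfo {
  key: string;
  size: number;
  lastModified: Date;
}

// Local stand-in for the library's JoinOrderCompareFn type (an assumption).
type JoinOrderCompareFn<T> = (a: T, b: T) => number;

// Newest first; files with the same timestamp are ordered by key, ascending.
const newestFirstThenKey: JoinOrderCompareFn<FileInfo> = (a, b) => {
  const byTime = b.lastModified.getTime() - a.lastModified.getTime();
  if (byTime !== 0) {
    return byTime;
  }
  if (a.key < b.key) {
    return -1;
  }
  return a.key > b.key ? 1 : 0;
};

const sample: FileInfo[] = [
  { key: 'logs/b.json', size: 10, lastModified: new Date('2024-01-02T00:00:00Z') },
  { key: 'logs/a.json', size: 30, lastModified: new Date('2024-01-02T00:00:00Z') },
  { key: 'logs/c.json', size: 20, lastModified: new Date('2024-01-03T00:00:00Z') },
];

// Sort a copy so the input array is left untouched.
const ordered = [...sample].sort(newestFirstThenKey).map((f) => f.key);
console.log(ordered); // [ 'logs/c.json', 'logs/a.json', 'logs/b.json' ]
```

Trying a comparator against a small in-memory array like this is a cheap way to confirm the resulting order before passing it as `joinOrder`.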
This project is licensed under the MIT License.
Contributions are welcome! Please open an issue or submit a pull request with any changes or improvements.