fix: prevent memory exhaustion on large files (#25) #36
vyagh wants to merge 2 commits into blackboxaicode:main from
Conversation
Pull request overview
This PR addresses memory exhaustion issues when reading large data files by implementing streaming for CSV and TXT files, and adding a file size limit for JSON files (which cannot be easily streamed).
Key Changes:
- Replaced synchronous file reading with streaming using readline.createInterface() for CSV and TXT files
- Added a 100MB file size limit for JSON files with appropriate error handling
- Extracted CSV line parsing into a reusable parseCSVLine() method (see the sketch just below)
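The reusable parseCSVLine() helper itself is not shown in the hunks below. As a rough illustration only, here is a minimal sketch of what such a helper might look like, assuming basic quoted-field handling; the PR's actual implementation may differ.

```ts
// Hypothetical sketch of a parseCSVLine() helper; not the PR's actual code.
function parseCSVLine(line: string): string[] {
  const values: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (char === '"') {
      if (inQuotes && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes; // toggle quoted state
      }
    } else if (char === ',' && !inQuotes) {
      values.push(current); // field boundary outside quotes
      current = '';
    } else {
      current += char;
    }
  }
  values.push(current); // last field
  return values;
}
```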
```ts
// JSON cannot be streamed easily, so enforce a file size limit
if (stats.size > MAX_JSON_FILE_SIZE_BYTES) {
  const fileSizeMB = (stats.size / (1024 * 1024)).toFixed(2);
  return {
    llmContent: `JSON file is too large (${fileSizeMB} MB). Maximum supported size for JSON files is ${MAX_JSON_FILE_SIZE_MB} MB. For large JSON files, write a Python script using the 'json' module with streaming (ijson) or load in chunks.`,
    returnDisplay: `JSON file too large (${fileSizeMB} MB, max ${MAX_JSON_FILE_SIZE_MB} MB)`,
    error: {
      message: `JSON file size (${fileSizeMB} MB) exceeds ${MAX_JSON_FILE_SIZE_MB} MB limit`,
      type: ToolErrorType.FILE_TOO_LARGE,
    },
  };
}
```
The JSON file size validation logic lacks test coverage. This is a critical new feature that prevents memory exhaustion, and it should be validated with tests to ensure the size limit is enforced correctly and appropriate error messages are returned. Other tools in this directory have comprehensive test files.
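For reference, a minimal sketch of what such a test might look like, assuming vitest and a hypothetical validateJsonFileSize() helper that mirrors the size check above; the real tool's API and test setup may differ.

```ts
import { describe, it, expect } from 'vitest';

// Hypothetical constants and helper mirroring the hunk above; the 100 MB
// value comes from the PR description.
const MAX_JSON_FILE_SIZE_MB = 100;
const MAX_JSON_FILE_SIZE_BYTES = MAX_JSON_FILE_SIZE_MB * 1024 * 1024;

function validateJsonFileSize(sizeBytes: number): string | null {
  if (sizeBytes <= MAX_JSON_FILE_SIZE_BYTES) return null;
  const fileSizeMB = (sizeBytes / (1024 * 1024)).toFixed(2);
  return `JSON file size (${fileSizeMB} MB) exceeds ${MAX_JSON_FILE_SIZE_MB} MB limit`;
}

describe('JSON file size limit', () => {
  it('rejects files over the 100 MB limit', () => {
    expect(validateJsonFileSize(101 * 1024 * 1024)).toContain('exceeds 100 MB limit');
  });

  it('accepts files at or below the limit', () => {
    expect(validateJsonFileSize(100 * 1024 * 1024)).toBeNull();
  });
});
```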
The core implementation has been tested manually. I can add tests in a follow-up PR if the maintainers request it.
```ts
for await (const line of rl) {
  const trimmedLine = line.trim();
  if (!trimmedLine) continue;

  if (isFirstLine) {
    headers = this.parseCSVLine(trimmedLine);
    isFirstLine = false;
    continue;
  }

  totalRows++;

  // Only store rows up to displayMaxRows for the sample
  if (sampleData.length < displayMaxRows) {
    const values = this.parseCSVLine(trimmedLine);
    const row: Record<string, string> = {};
    headers.forEach((header, index) => {
      row[header] = values[index] || '';
    });
    sampleData.push(row);
  }
}
```
The streaming implementation continues reading the entire file even after collecting the maximum number of sample rows. This defeats the purpose of preventing memory exhaustion for very large files, as the entire file is still processed line by line. Consider closing the stream early once totalRows reaches a reasonable limit (e.g., after counting enough rows for accurate statistics, or implementing a configurable max count limit).
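For illustration, a sketch of the early-termination idea, assuming the same rl, displayMaxRows, and parseCSVLine names from the hunk above plus a hypothetical MAX_COUNTED_ROWS cap; this is not the approach the PR takes.

```ts
import type { Interface } from 'node:readline';

// Hypothetical cap on how many rows to count before giving up on an exact total.
const MAX_COUNTED_ROWS = 1_000_000;

async function sampleCsvRows(
  rl: Interface,
  displayMaxRows: number,
  parseCSVLine: (line: string) => string[],
): Promise<{ sample: string[][]; totalRows: number }> {
  const sample: string[][] = [];
  let totalRows = 0;
  for await (const line of rl) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    totalRows++;
    if (sample.length < displayMaxRows) {
      sample.push(parseCSVLine(trimmed));
    } else if (totalRows >= MAX_COUNTED_ROWS) {
      rl.close(); // stop pulling lines from disk once the cap is reached
      break;
    }
  }
  return { sample, totalRows };
}
```

The trade-off is that totalRows becomes a lower bound rather than an exact count once the cap is hit.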
Intentional design choice. The tool provides accurate row counts for analysis. Memory usage stays constant since only sample rows are stored in memory.
```ts
for await (const line of rl) {
  totalLines++;
  if (sampleLines.length < maxRows) {
    sampleLines.push(line);
  }
}
```
The streaming implementation continues reading the entire file even after collecting the maximum number of sample lines. For very large text files (e.g., multi-GB log files), this means the function will still process every line in the file, which can take a significant amount of time. Consider closing the stream early once enough sample lines have been collected, if the total line count is not critical for the use case.
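A sketch of that alternative, assuming the same rl, sampleLines, and maxRows names from the hunk above and accepting an inexact total line count:

```ts
import type { Interface } from 'node:readline';

// Sketch only: stop once the sample is full; the total line count is not kept.
async function sampleTxtLines(rl: Interface, maxRows: number): Promise<string[]> {
  const sampleLines: string[] = [];
  for await (const line of rl) {
    sampleLines.push(line);
    if (sampleLines.length >= maxRows) {
      rl.close(); // stop streaming; remaining lines are never read
      break;
    }
  }
  return sampleLines;
}
```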
Accurate line counts are useful, and memory stays constant regardless of file size.
Fixes #25
Problem
The read-data-file tool loads entire files into memory before parsing, causing OOM crashes for files larger than ~500MB.
Solution
- Stream files line by line (readline.createInterface()) for CSV and TXT files (see the sketch below)
Changes
- Added streaming parse methods (including parseTXTStream()), plus JSON size validation
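A minimal sketch of the streaming setup described above; names are illustrative, and the tool's actual wiring may differ.

```ts
import { createReadStream } from 'node:fs';
import * as readline from 'node:readline';

// Stream a file line by line instead of loading it into memory all at once.
async function* streamLines(filePath: string): AsyncGenerator<string> {
  const rl = readline.createInterface({
    input: createReadStream(filePath, { encoding: 'utf-8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    yield line;
  }
}
```

Because only the current line (plus any stored sample rows) is held in memory, usage stays roughly constant regardless of file size.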