CSV and JSONL support #371
Merged
Changes from all commits (2 commits)
@@ -28,6 +28,8 @@ package-lock.json
 *.tgz
 .vscode
 *.parquet
+*.csv
+*.jsonl
 /coverage/

 /lib/
File renamed without changes.
@@ -0,0 +1,60 @@
/**
 * Parse CSV text into nested array of rows and columns.
 */
export function parseCsv(text: string): string[][] {
  const rows = []
  let row = []
  let field = ''
  let inQuotes = false
  let previousCharWasQuote = false

  for (const char of text) {

    if (inQuotes && char === '"' && !previousCharWasQuote) {
      // first quote, wait to see if it's escaped or end of field
      previousCharWasQuote = true
    } else if (inQuotes && char === '"' && previousCharWasQuote) {
      // csv escaped quote ""
      field += char
      previousCharWasQuote = false
    } else if (inQuotes && !previousCharWasQuote) {
      // append quoted character to field
      field += char
    } else {
      // not in quotes
      inQuotes = false
      previousCharWasQuote = false
      switch (char) {
        case ',':
          // emit column
          row.push(field)
          field = ''
          break
        case '\n':
          // emit row
          row.push(field)
          rows.push(row)
          row = []
          field = ''
          break
        case '"':
          inQuotes = true
          break
        default:
          field += char
      }
    }
  }

  if (inQuotes && !previousCharWasQuote) {
    console.error('csv unterminated quote')
  }

  // handle last field and row, but skip empty last line
  if (field || row.length) {
    row.push(field)
    rows.push(row)
  }

  return rows
}
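For orientation only (not part of the diff): a minimal sketch of how the new parser treats quoting, assuming `parseCsv` is imported from the module above. The sample string is made up.

```ts
import { parseCsv } from './csv.js'

// Quoted fields may contain commas, escaped quotes ("") and newlines.
const rows = parseCsv('name,quote\nAlice,"She said ""hi"", then left"\n')
// rows === [['name', 'quote'], ['Alice', 'She said "hi", then left']]
```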
@@ -1,6 +1,7 @@
 export { appendSearchParams, replaceSearchParams } from './routes.js'
 export * from './sources/index.js'
-export { parquetDataFrame } from './tableProvider.js'
+export { parseCsv } from './csv.js'
+export { csvDataFrame, jsonLinesDataFrame, parquetDataFrame, tableProvider } from './tableProvider.js'
 export { asyncBufferFrom, cn, contentTypes, formatFileSize, getFileDate, getFileDateShort, imageTypes, parseFileSize } from './utils.js'
 export { parquetQueryWorker, parquetReadObjectsWorker, parquetReadWorker } from './workers/parquetWorkerClient.js'
 export type { AsyncBufferFrom } from './workers/types.js'
@@ -1,9 +1,38 @@
-import { DataFrame, DataFrameEvents, ResolvedValue, checkSignal, createEventTarget, validateFetchParams, validateGetCellParams, validateGetRowNumberParams } from 'hightable'
+import { DataFrame, DataFrameEvents, ResolvedValue, arrayDataFrame, checkSignal, createEventTarget, sortableDataFrame, validateFetchParams, validateGetCellParams, validateGetRowNumberParams } from 'hightable'
 import type { ColumnData } from 'hyparquet'
-import { FileMetaData, ParquetReadOptions, parquetSchema } from 'hyparquet'
+import { FileMetaData, ParquetReadOptions, asyncBufferFromUrl, parquetMetadataAsync, parquetSchema } from 'hyparquet'
+import { parseCsv } from './csv.js'
 import { parquetReadWorker } from './workers/parquetWorkerClient.js'
 import type { AsyncBufferFrom } from './workers/types.d.ts'
+
+interface TableProviderOptions {
+  url: string
+  fileName: string
+  requestInit?: RequestInit
+}
+
+/**
+ * Create a dataframe from a file URL, automatically detecting the file type.
|
Contributor: maybe
Suggested change
Also (nit), should we factor the call to sortableDataFrame to make it clear that all of them are sortable?
+ * Supports parquet, CSV, and JSONL files.
+ */
+export async function tableProvider({ url, fileName, requestInit }: TableProviderOptions): Promise<DataFrame> {
+  const asyncBuffer = await asyncBufferFromUrl({ url, requestInit })
+  const from = { url, byteLength: asyncBuffer.byteLength, requestInit }
+
+  const baseName = fileName.toLowerCase()
+  if (baseName.endsWith('.csv')) {
+    return csvDataFrame(from)
+  }
+
+  if (baseName.endsWith('.jsonl')) {
+    return jsonLinesDataFrame(from)
+  }
+
+  // Default to parquet
+  const metadata = await parquetMetadataAsync(asyncBuffer)
+  return sortableDataFrame(parquetDataFrame(from, metadata))
+}
+
 type GroupStatus = {
   kind: 'unfetched'
 } | {

@@ -130,3 +159,49 @@ export function parquetDataFrame(from: AsyncBufferFrom, metadata: FileMetaData,

   return unsortableDataFrame
 }
+
+/**
+ * Convert a CSV file into a sortable dataframe.
+ *
+ * Parses the entire file and creates a sortable dataframe.
+ * The first row is treated as the header.
+ */
+export async function csvDataFrame(from: AsyncBufferFrom): Promise<DataFrame> {
+  let buffer: ArrayBuffer
+  if ('file' in from) {
+    buffer = await from.file.arrayBuffer()
+  } else {
+    const response = await fetch(from.url, from.requestInit)
+    buffer = await response.arrayBuffer()
+  }
+
+  const text = new TextDecoder().decode(buffer)
+  const lines = parseCsv(text)
+  const header = lines[0] ?? []
+  const rows = lines.slice(1).map(row => {
+    return Object.fromEntries(header.map((key, i) => [key, row[i]]))
+  })
+  return sortableDataFrame(arrayDataFrame(rows))
+}
+
+/**
+ * Convert a JSONL file into a sortable dataframe.
+ *
+ * Parses each line as a JSON object and creates a sortable dataframe.
+ */
+export async function jsonLinesDataFrame(from: AsyncBufferFrom): Promise<DataFrame> {
+  let buffer: ArrayBuffer
+  if ('file' in from) {
+    buffer = await from.file.arrayBuffer()
+  } else {
+    const response = await fetch(from.url, from.requestInit)
+    buffer = await response.arrayBuffer()
+  }
+
+  const text = new TextDecoder().decode(buffer).trimEnd()
+  const lines = text.split('\n').filter(line => line.trim())
+  const rows: Record<string, unknown>[] = lines.map(line => {
+    return line ? JSON.parse(line) as Record<string, unknown> : {}
+  })
+  return sortableDataFrame(arrayDataFrame(rows))
+}
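A usage sketch of the new entry point (not from the PR; the URL and file name below are hypothetical). It shows the extension-based dispatch: `.csv` and `.jsonl` go to the new dataframe builders, anything else falls through to parquet.

```ts
// Hypothetical call site: fileName drives format detection, url is fetched.
const df = await tableProvider({
  url: 'https://example.com/data/users.csv',
  fileName: 'users.csv',
})
// .csv   -> csvDataFrame(from)
// .jsonl -> jsonLinesDataFrame(from)
// else   -> parquetMetadataAsync + sortableDataFrame(parquetDataFrame(from, metadata))
```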
@@ -0,0 +1,63 @@
import { describe, expect, it, vi } from 'vitest'
import { parseCsv } from '../../src/index.js'

describe('parseCsv', () => {
  it('parses simple CSV', () => {
    const csv = 'Name,Age,Occupation\nAlice,30,Engineer\nBob,25,Designer'
    const expected = [
      ['Name', 'Age', 'Occupation'],
      ['Alice', '30', 'Engineer'],
      ['Bob', '25', 'Designer'],
    ]
    expect(parseCsv(csv)).toEqual(expected)
  })

  it('ignores empty last line', () => {
    const csv = 'Name,Age,Occupation\nAlice,30,Engineer\n'
    const expected = [
      ['Name', 'Age', 'Occupation'],
      ['Alice', '30', 'Engineer'],
    ]
    expect(parseCsv(csv)).toEqual(expected)
  })

  it('handles quoted fields', () => {
    const csv = 'Name,Age,Occupation\n"Alice, PhD",30,Engineer\nBob,25,"Designer, Senior"'
    const expected = [
      ['Name', 'Age', 'Occupation'],
      ['Alice, PhD', '30', 'Engineer'],
      ['Bob', '25', 'Designer, Senior'],
    ]
    expect(parseCsv(csv)).toEqual(expected)
  })

  it('handles escaped quotes', () => {
    const csv = 'Name,Quote\nAlice,"She said, ""Hello world"""\nBob,"This is ""an example"" of quotes"'
    const expected = [
      ['Name', 'Quote'],
      ['Alice', 'She said, "Hello world"'],
      ['Bob', 'This is "an example" of quotes'],
    ]
    expect(parseCsv(csv)).toEqual(expected)
  })

  it('handles newlines within quoted fields', () => {
    const csv = 'Name,Address\nAlice,"123 Main St.\nAnytown, USA"'
    const expected = [
      ['Name', 'Address'],
      ['Alice', '123 Main St.\nAnytown, USA'],
    ]
    expect(parseCsv(csv)).toEqual(expected)
  })

  it('handles unterminated quotes', () => {
    const csv = 'Name,Quote\nAlice,"This is an unterminated quote\n'
    const expected = [
      ['Name', 'Quote'],
      ['Alice', 'This is an unterminated quote\n'],
    ]
    vi.spyOn(console, 'error')
    expect(parseCsv(csv)).toEqual(expected)
    expect(console.error).toHaveBeenCalledWith('csv unterminated quote')
  })
})
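One case the suite does not cover yet, sketched here against the parser's behavior as written above (not part of the PR): empty fields around and between commas come back as empty strings.

```ts
it('handles empty fields', () => {
  const csv = 'a,,c\n,b,'
  const expected = [
    ['a', '', 'c'],
    ['', 'b', ''],
  ]
  expect(parseCsv(csv)).toEqual(expected)
})
```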
see also https://github.com/severo/cosovo 😉
I'm all for it :-) Right now I just wanted a quick fix for csv viewing, but I've hit issues recently with csvs that might benefit from cosovo. The streaming is cool!
Nice :) Feel free to first try if cosovo can parse them correctly