Refresh 'Import' documentation #114
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,29 +1,40 @@ | ||
| (cluster-import)= | ||
| # Import | ||
|
|
||
| The first thing you see in the "Import" tab is the history of your | ||
| import jobs. You can see whether you imported from a URL or from a file, | ||
| the source file name and the target table name, and other metadata | ||
| like date and status. | ||
| By navigating to "Show details", you can display details of a particular | ||
| import job. | ||
| You can import data into your CrateDB cluster directly from various sources, including: | ||
| - Local files | ||
| - URLs | ||
| - AWS S3 buckets | ||
| - Azure storage | ||
| - MongoDB database | ||
|
|
||
| Currently, the following data formats are supported: | ||
| - CSV | ||
| - JSON (JSON-Lines, JSON Arrays, and JSON Documents) | ||
| - Parquet | ||
| - MongoDB collection | ||
|
|
||
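For orientation only, a one-off import of one of these formats can also be sketched in SQL using CrateDB's `COPY FROM` statement. The table name, columns, and URL below are placeholders, not values from this page:

```sql
-- Minimal sketch: create a placeholder target table, then load a gzipped CSV
-- from a hypothetical public URL. The options correspond to the formats listed above.
CREATE TABLE IF NOT EXISTS nyc_taxi (
    pickup_datetime TIMESTAMP,
    fare_amount DOUBLE PRECISION
);

COPY nyc_taxi
FROM 'https://example.org/data/yellow_tripdata_2019-07.csv.gz'
WITH (format = 'csv', compression = 'gzip');
```

The Console import flow described on this page does the equivalent work without requiring SQL.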
| Clicking the "Import new data" button will bring up the page | ||
| where you can select the source of your data. | ||
| :::{note} | ||
| If you don't have a dataset prepared, we also provide sample data to let | ||
| you discover CrateDB. After importing those examples, feel free to go to | ||
| the tutorial page to learn how to use them. | ||
| ::: | ||
|
|
||
| You can access the history of previous imports in the | ||
| "Import history" tab. | ||
| By navigating to "View detail", you can display details of a particular | ||
| import job (e.g. the number of successful and failed records per file). | ||
|
|
||
| If you don't have a dataset prepared, we also provide an example in the | ||
| URL import section. It's the New York City taxi trip dataset for July | ||
| of 2019 (about 6.3M records). | ||
|  | ||
|
|
||
| (cluster-import-url)= | ||
| ## URL | ||
| (cluster-import-file-import)= | ||
| ## File Import | ||
|
|
||
| To import data, fill out the URL, name of the table which will be | ||
| created and populated with your data, data format, and whether it is | ||
| compressed. | ||
| To import data, select the file format, the source, and the name of the table | ||
| that will be created and populated with your data. | ||
|
|
||
| If a table with the chosen name doesn't exist, it will be automatically | ||
| created. | ||
| You can deactivate the "Allow schema evolution" checkbox if you don't want | ||
| the destination table to be automatically created or its schema to be modified. | ||
|
|
||
| The following data formats are supported: | ||
|
|
||
|
|
@@ -33,21 +44,21 @@ The following data formats are supported: | |
|
|
||
| Gzip compressed files are also supported. | ||
|
|
||
|  | ||
|  | ||
|
|
||
| (cluster-import-s3)= | ||
| ## S3 bucket | ||
| (cluster-import-file-import-s3)= | ||
| ### AWS S3 bucket | ||
|
|
||
| CrateDB Cloud allows convenient imports directly from S3-compatible | ||
| storage. To import a file form bucket, provide the name of your bucket, | ||
| storage. To import a file from a bucket, provide the name of your bucket, | ||
| and path to the file. The S3 Access Key ID and S3 Secret Access Key are | ||
| also needed. You can also specify the endpoint for non-AWS S3 buckets. | ||
| Keep in mind that you may be charged for egress traffic, depending on | ||
| your provider. There is also a volume limit of 10 GiB per file for S3 | ||
| imports. The usual file formats are supported - CSV (all variants), JSON | ||
| (JSON-Lines, JSON Arrays and JSON Documents), and Parquet. | ||
| imports. | ||
|
|
||
|  | ||
| Importing multiple files is also supported by using wildcard | ||
| notation: `/folder/*.parquet`. | ||
|
|
||
| :::{note} | ||
| It is important to make sure that you have the right permissions to | ||
|
|
@@ -72,8 +83,8 @@ have a policy that allows GetObject access, for example: | |
| ``` | ||
| ::: | ||
|
|
||
| (cluster-import-azure)= | ||
| ## Azure Blob Storage | ||
| (cluster-import-file-import-azure)= | ||
| ### Azure Blob Storage | ||
|
|
||
| Importing data from private Azure Blob Storage containers is possible | ||
| using a stored secret, which includes a secret name and either an Azure | ||
|
|
@@ -83,60 +94,16 @@ the organization level can add this secret. | |
| You can specify a secret, a container, a table and a path in the form | ||
| `/folder/my_file.parquet`. | ||
|
|
||
| As with other imports, Parquet, CSV, and JSON files are supported. File | ||
| size limitation for imports is 10 GiB per file. | ||
|
|
||
|  | ||
|
|
||
| (cluster-import-globbing)= | ||
| ## Globbing | ||
| Importing multiple files is also supported by using wildcard | ||
| notation: `/folder/*.parquet`. | ||
|
|
||
| Importing multiple files, also known as import globbing, is supported in | ||
| any S3-compatible blob storage. The steps are the same as when importing | ||
| from S3, i.e. bucket name, path to the file, and S3 ID/Secret. | ||
|
|
||
| Importing multiple files from Azure Container/Blob Storage is also | ||
| supported: `/folder/*.parquet` | ||
|
|
||
| Files to be imported are specified by using the well-known | ||
| [wildcard](https://en.wikipedia.org/wiki/Wildcard_character) notation, | ||
| also known as "globbing". In computer programming, | ||
| [glob](https://en.wikipedia.org/wiki/Glob_(programming)) patterns | ||
| specify sets of filenames with wildcard characters. The following | ||
| example would import all the files from the single specified day. | ||
|
|
||
| :::{code} console | ||
| /somepath/AWSLogs/123456678899/CloudTrail/us-east-1/2023/11/12/*.json.gz | ||
| ::: | ||
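For comparison, roughly the same glob can be used in a SQL `COPY FROM` statement against an S3 URI; the bucket name, credentials, and table name below are placeholders:

```sql
-- Sketch only: load all gzipped JSON files for the single day selected by the
-- wildcard, mirroring the path in the example above.
COPY cloudtrail_events
FROM 's3://ACCESS_KEY:SECRET_KEY@my-bucket/somepath/AWSLogs/123456678899/CloudTrail/us-east-1/2023/11/12/*.json.gz'
WITH (format = 'json', compression = 'gzip');
```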
| File size limitation for imports is 10 GiB per file. | ||
|
|
||
|  | ||
| (cluster-import-integration)= | ||
| ## Integration | ||
|
|
||
| As with other imports, the supported file types are CSV, JSON, and | ||
| Parquet. | ||
|
|
||
| (cluster-import-file)= | ||
| ## File | ||
|
|
||
| Uploading directly from your computer offers more control over your | ||
| data. From the security point of view, you don't have to share the data | ||
| on the internet just to be able to import it to your cluster. You also | ||
| have more control over who has access to your data. Your files are | ||
| temporarily uploaded to a secure location managed by Crate (an S3 bucket | ||
| in AWS) which is not publicly accessible. The files are automatically | ||
| deleted after 3 days. You may re-import the same file into multiple | ||
| tables without having to re-upload it within those 3 days. Up to 5 files | ||
| may be uploaded at the same time, with the oldest ones being | ||
| automatically deleted if you upload more. | ||
|
|
||
|  | ||
|
|
||
| As with other imports, the supported file formats are: | ||
|
|
||
| - CSV (all variants) | ||
| - JSON (JSON-Lines, JSON Arrays and JSON Documents) | ||
| - Parquet | ||
| {ref}`More info about data integration. <cluster-integrations>` | ||
|
Comment on lines +102 to +105
🧩 Analysis chain

🏁 Scripts executed against repository crate/cloud-docs:

- `wc -l docs/cluster/import.md` — check if the file exists and get its size (length of output: 86)
- `cat -n docs/cluster/import.md` — read the entire file to understand structure and content (length of output: 5804)
- `rg -n "MongoDB|mongodb" docs/cluster/import.md` — search for MongoDB references in the file (length of output: 104)
- `rg -n "S3|Azure|azure" docs/cluster/import.md` — search for S3 and Azure sections to compare with MongoDB (length of output: 769)
- `fd -e md -e rst --search-path docs/cluster | head -20` — search for MongoDB in related documentation (length of output: 277)
- `rg -n "cluster-integrations" docs/cluster/import.md docs/cluster/integrations/` — search for the cluster-integrations reference target (length of output: 211)
- `cat -n docs/cluster/integrations/index.md` — check the integrations index file (length of output: 2214)
- `head -50 docs/cluster/integrations/mongo-cdc.md` — check the mongo-cdc file to see if it documents MongoDB import (length of output: 1864)

Integration section is minimalist and should provide more context.

The Integration section (lines 102-105) contains only a single cross-reference. While the reference points to the appropriate integrations documentation, the section itself reads as a placeholder and lacks any explanatory context about what integrations are or how they differ from File Import.

Additionally, there is a terminology confusion: the intro mentions "MongoDB database" and line 15 lists "MongoDB collection" as a supported format, but no guidance exists in the File Import section for importing from MongoDB as a data source (unlike S3 and Azure, which have dedicated subsections). The "MongoDB collection" format reference relates to data format support in other imports, not MongoDB-as-source capability. MongoDB import/sync guidance exists only in the separate Integrations section (MongoDB CDC), which describes continuous real-time synchronization rather than one-time imports. Consider either:
||
|
|
||
| There is also a file size limit, currently 1 GB. | ||
|
|
||
| (overview-cluster-import-schema-evolution)= | ||
| ## Schema evolution | ||
|
|
@@ -145,7 +112,7 @@ Schema Evolution, available for all import types, enables automatic | |
| addition of new columns to existing tables during data import, | ||
| eliminating the need to pre-define table schemas. This feature is | ||
| applicable to both pre-existing tables and those created during the | ||
| import process. It can be toggled via the 'Schema Evolution' checkbox | ||
| import process. It can be toggled via the 'Allow schema evolution' checkbox | ||
| on the import page. | ||
|
|
||
| Note that Schema Evolution is limited to adding new columns; it does not | ||
|
|
||
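As a rough illustration of the effect (assumed table and column names, not the literal mechanism of the import jobs): with schema evolution enabled, records carrying a previously unseen field behave much as if the column had been added before loading them.

```sql
-- Table as created by an earlier import with two columns.
CREATE TABLE IF NOT EXISTS readings (
    sensor_id TEXT,
    temperature DOUBLE PRECISION
);

-- A later import whose records also contain a "humidity" field effectively
-- results in one additional column, similar to:
ALTER TABLE readings ADD COLUMN humidity DOUBLE PRECISION;
-- ...after which the new records are inserted as usual; existing rows keep NULL there.
```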
General style question: This page has two very different styles of addressing the user/capabilities: `You can import` vs `CrateDB Cloud allows` further down. Do we have a general recommended style/guideline?
cc @bmunkholm
I think we should use active voice across the board, as we already do on the canonical repositories. Do you have any objections?
Let me program CodeRabbit on this repository, like we also did on cratedb-guide.
@coderabbitai: When reviewing documentation, use active voice instead of first-person or passive voice, and emit corresponding suggestions. About tone and style, please reduce marketing phrasing; keep concise and use a neutral guide tone.
@amotl, I've configured the style guidelines for documentation reviews in this repository. I'll now review documentation with these principles in mind, suggesting active voice over passive or first-person constructions, and promoting concise, neutral technical guidance over marketing-style phrasing.
✏️ Learnings added
I see, we are also using those even more specific learnings for the `cloud-docs` repository: