CyrusDioun/docker-setup

Data Science Docker Container

An all-purpose Docker container for data science, machine learning, and NLP work with large datasets.

Features

  • Python 3.11 with optimized libraries for large datasets
  • ML/DL Frameworks: TensorFlow, PyTorch, scikit-learn
  • NLP Tools: spaCy, NLTK, fuzzywuzzy
  • Data Processing: pandas, numpy, with performance optimizations
  • Jupyter Lab for interactive development
  • Claude Code CLI for AI-assisted coding
  • Memory optimized for datasets up to 7GB
  • Stata support (requires license)

Quick Start

  1. Build the container:

    ./run.sh build

  2. Start Jupyter Lab:

    ./run.sh jupyter

    Then open http://localhost:8888 in your browser.

  3. Run Python scripts:

    ./run.sh run my_script.py

  4. Open an interactive session:

    ./run.sh python  # Python shell
    ./run.sh bash    # Bash shell
    ./run.sh claude  # Claude Code CLI

Directory Structure

  • data/ - Mount point for your datasets
  • notebooks/ - Jupyter notebooks
  • code/ - Python scripts

Memory Configuration

The container is configured with the following limits:

  • 16GB memory limit
  • 32GB swap limit
  • 2GB shared memory

All three are set in docker-compose.yml; adjust them to match the resources available on your system.
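
To confirm which limit is actually in effect, you can read it from inside the container. The sketch below assumes a Linux container exposing the standard cgroup files (v2 first, then v1); the helper names are illustrative and not part of this repository. Tools that read /proc/meminfo may show the host's memory rather than the container's limit, which is why the cgroup file is checked directly.

```python
from pathlib import Path

# Standard cgroup locations for the memory limit (v2 first, then v1).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_limit(raw):
    """Return the limit in bytes, or None when no limit is set."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling of "unlimited"
        return None
    value = int(raw)
    # cgroup v1 reports a very large number when no limit is set.
    if value >= 2**60:
        return None
    return value


def container_memory_limit(paths=CGROUP_LIMIT_FILES):
    """Return the first readable cgroup memory limit, or None."""
    for path in paths:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


limit = container_memory_limit()
if limit is None:
    print("No cgroup memory limit detected")
else:
    print(f"Container memory limit: {limit / 1024**3:.1f} GiB")
```

Run it with ./run.sh run after saving it under code/, or paste it into a notebook cell.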

Adding Datasets

Edit docker-compose.yml to mount your existing data directories:

volumes:
  - ~/path/to/your/datasets:/workspace/external_data:ro
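
The :ro suffix mounts the directory read-only. As a quick sanity check from inside the container, something like the following can confirm that the mount is visible and not writable. This is a minimal sketch; describe_mount is a hypothetical helper, and the default path is the container-side path from the example above.

```python
import os
from pathlib import Path


def describe_mount(path="/workspace/external_data"):
    """Report whether a mount point exists, is writable, and what it contains."""
    root = Path(path)
    if not root.is_dir():
        return {"exists": False, "writable": False, "entries": []}
    return {
        "exists": True,
        # Expected to be False for a volume mounted with :ro.
        "writable": os.access(root, os.W_OK),
        # Show at most the first ten entries, sorted by name.
        "entries": sorted(p.name for p in root.iterdir())[:10],
    }


print(describe_mount())
```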

Installing Additional Packages

./run.sh install package_name

To include a package permanently, add it to the Dockerfile and rebuild the image.

Performance Tips

  1. For large datasets, use chunked reading:

    import pandas as pd

    for chunk in pd.read_csv('large_file.csv', chunksize=10000):
        process(chunk)  # process() is a placeholder for your own per-chunk logic

  2. Monitor memory usage:

    import psutil
    print(f"Memory usage: {psutil.virtual_memory().percent}%")

  3. Use appropriate data types to reduce memory:

    import pandas as pd
    df = pd.read_csv('file.csv', dtype={'id': 'int32', 'category': 'category'})
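
Tips 1 and 3 combine naturally: read in chunks with compact dtypes and keep only a running aggregate in memory. The following is a minimal sketch of that pattern. The column names and the small in-memory CSV are made up for illustration; in practice you would pass a file path instead of the StringIO object.

```python
import io

import pandas as pd

# Small in-memory stand-in for a large CSV file (hypothetical columns).
csv_text = "id,category,value\n" + "\n".join(
    f"{i},{'ab'[i % 2]},{i * 0.5}" for i in range(1000)
)

# Compact dtypes keep each chunk small.
dtypes = {"id": "int32", "category": "category", "value": "float32"}

totals = {}  # running sum of "value" per category
rows = 0     # running row count

for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=250):
    rows += len(chunk)
    # Aggregate within the chunk, then fold into the running totals.
    sums = chunk.groupby("category", observed=True)["value"].sum()
    for key, val in sums.items():
        totals[key] = totals.get(key, 0.0) + float(val)

print(rows, totals)
```

Only one chunk and the small totals dictionary are held in memory at a time, so peak usage depends on chunksize rather than on the size of the file.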

Stata Integration

To add Stata support:

  1. Place your Stata installation files in this directory
  2. Uncomment the Stata installation lines in Dockerfile
  3. Rebuild the image

Claude Code Setup

  1. Copy the environment template:

    cp .env.example .env

  2. Add your Anthropic API key to .env:

    ANTHROPIC_API_KEY=your_api_key_here

  3. Start Claude Code:

    ./run.sh claude

Note: Get your API key from https://console.anthropic.com/
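
If Claude Code fails to authenticate, it can help to check whether the key is visible inside the container. The sketch below assumes that the variables in .env are passed into the container's environment, which this README does not state explicitly; api_key_status is a hypothetical helper.

```python
import os

PLACEHOLDER = "your_api_key_here"  # the value used in the template above


def api_key_status(env=os.environ):
    """Classify ANTHROPIC_API_KEY as 'missing', 'placeholder', or 'set'."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        return "missing"
    if key == PLACEHOLDER:
        return "placeholder"
    return "set"


# Print only the status, never the key itself.
print(f"ANTHROPIC_API_KEY: {api_key_status()}")
```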

GPU Support

For GPU support, uncomment the nvidia runtime lines in docker-compose.yml and ensure nvidia-docker (now provided by the NVIDIA Container Toolkit) is installed on the host.
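
After restarting the container, you can check whether the frameworks see a GPU. This is a minimal sketch using PyTorch's torch.cuda.is_available() and TensorFlow's tf.config.list_physical_devices(); gpu_report is a hypothetical helper, and a framework that is not installed is reported as None.

```python
def gpu_report():
    """Return GPU visibility per framework: True, False, or None if not installed."""
    report = {}

    try:
        import torch
        report["pytorch"] = bool(torch.cuda.is_available())
    except ImportError:
        report["pytorch"] = None

    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        report["tensorflow"] = None

    return report


print(gpu_report())
```

A result of False for an installed framework usually means the container was started without GPU access, so the docker-compose.yml changes are the first thing to recheck.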

About

Script for creating a Docker environment for data processing and analysis.
