Tons of garbage on opensnp #559

@chaplin89

Hey, not sure if you're aware, but there's really a lot of garbage on OpenSNP, probably because it isn't checking what users upload.

Here's a normalized list of file types I've found in your db:

  • 7-zip
  • Apple binary property list
  • ASCII text
  • bgzip
  • bzip2
  • Composite-documents
  • CSV
  • data?
  • empty
  • Excel
  • EXE (???)
  • gzip
  • JPEG
  • PDF
  • PNG
  • RAR
  • RSID sidtune (?!)
  • Unicode Text
  • VCF
  • Word
  • XML
  • Zip
  • zlib

I was curious about the EXEs; at least they don't seem to contain viruses. One of them is from a tool called "MyHeritage Family Builder Genealogy Software" and all the rest are called "23andme to FASTA".
It shouldn't be too hard to clean this up and to add some checks when people upload something. I did this analysis using the Linux file utility; I think it could probably be done on the server side as well? Watch out for command injection if you do. A neat improvement would be to have all the files in the same format.
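
For what it's worth, here is a rough sketch of how a server-side check along those lines could look in Python. It is not how OpenSNP does anything today. The function names and the allowlist are made up for illustration, and the exact MIME strings that file prints vary between versions, so treat the set below as a placeholder. The main point is that the path is passed as a single argument in a list, with no shell involved, which is what keeps a hostile file name from turning into a command.

```python
import subprocess

# Hypothetical allowlist; which types to accept is a project decision,
# and the exact MIME strings depend on the installed version of `file`.
ALLOWED_MIME_TYPES = {
    "text/plain",
    "text/csv",
    "application/gzip",
    "application/zip",
}


def build_file_command(path):
    """Build the argv for `file`; the path is one list element, never shell text."""
    # "--" ends option parsing, so a name starting with "-" is not read as a flag.
    return ["file", "--brief", "--mime-type", "--", str(path)]


def detect_mime_type(path):
    """Ask the `file` utility for the MIME type of an uploaded file."""
    result = subprocess.run(
        build_file_command(path),  # list form: no shell is spawned
        capture_output=True,
        text=True,
        check=True,   # raise if `file` exits with an error
        timeout=10,   # do not let one odd upload hang the request
    )
    return result.stdout.strip()


def is_allowed(mime_type):
    """Return True when the detected type is on the allowlist."""
    return mime_type in ALLOWED_MIME_TYPES
```

An upload handler could call detect_mime_type on the stored temporary file and reject the upload when is_allowed returns False. Checking against an allowlist instead of a blocklist means new kinds of junk are rejected by default.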

I'm attaching a list of files with their format: file_type.csv

Also, the phenotype section doesn't seem very well monitored, as someone created a "naked body phenotype" and used it to share a naked picture of himself. Not sure about the scientific value of that lol
