Make sure that you don't have different-case email duplicates in `src/cncf-config/email-map`: `cd src`, `./lower_unique.sh cncf-config/email-map`.
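The case-duplicate check can be sketched as follows (a minimal sketch using a hypothetical sample file; the real `lower_unique.sh` may work differently):

```shell
# Hypothetical sample mapping with two entries differing only in letter case.
printf 'John.Doe@example.com John Doe\njohn.doe@example.com John Doe\n' > /tmp/email-map-sample
# Lowercase every line, then count lines that now collide.
dups=$(tr 'A-Z' 'a-z' < /tmp/email-map-sample | sort | uniq -d | wc -l | tr -d ' ')
echo "case-duplicate lines: $dups"
```

If the count is non-zero, the map contains entries that differ only in case and should be merged.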
- If you generated a new email map using `./import_affs.sh`, then: `mv email-map cncf-config/email-map`.
- To generate the `git.log` file, make sure it includes all repos used by `devstats`. Use the final command line it generates. Make it `uniq`:
  - On the DevStats test master: `helm install devstats-test-debug ./devstats-helm --set skipSecrets=1,skipPVs=1,skipBackupsPV=1,skipBackups=1,skipProvisions=1,skipCrons=1,skipAffiliations=1,skipGrafanas=1,skipServices=1,skipIngress=1,skipStatic=1,skipNamespaces=1,skipPostgres=1,projectsOverride='+cncf\,+opencontainers\,+istio\,+knative\,+zephyr\,+linux\,+rkt\,+sam\,+azf\,+riff\,+fn\,+openwhisk\,+openfaas\,+cii',bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={360000s}`, then `../devstats-k8s-lf/util/pod_shell.sh debug`, `GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_LOCAL=1 get_repos`, `helm delete devstats-test-debug`, `kubectl delete pod debug`.
  - To get LF repos use: `AWS_PROFILE=... KUBECONFIG=... helm2 install --name devstats-debug ./devstats-helm --set skipSecrets=1,skipPVs=1,skipProvisions=1,skipCrons=1,skipAffiliations=1,skipGrafanas=1,skipServices=1,skipNamespace=1,bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={36000s}`, then `AWS_PROFILE=... KUBECONFIG=... ../devstats-k8s-lf/util/pod_shell.sh debug`, `ONLY='iovisor mininet opennetworkinglab opensecuritycontroller openswitch p4lang openbmp tungstenfabric cord' GHA2DB_PROPAGATE_ONLY_VAR=1 GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_LOCAL=1 get_repos`, `AWS_PROFILE=... KUBECONFIG=... helm2 delete --purge devstats-debug`, `AWS_PROFILE=... KUBECONFIG=... kubectl delete po debug`.
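"Make it `uniq`" can be sketched like this (a minimal sketch on a hypothetical repo list; the file contents and path are illustrative):

```shell
# Hypothetical git.log-style repo list containing a duplicate entry.
printf 'kubernetes/kubernetes\ncncf/devstats\nkubernetes/kubernetes\n' > /tmp/git.log
# Order-preserving uniq: keep only the first occurrence of each line.
awk '!seen[$0]++' /tmp/git.log > /tmp/git.log.uniq
wc -l < /tmp/git.log.uniq | tr -d ' '
```

Unlike `sort | uniq`, the `awk` one-liner preserves the original line order, which matters if downstream tooling expects it.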
- Update `repos.txt` to contain all repositories returned by the above commands. Update `all_repos.sh` to include data from CNCF, CDF, LF and GraphQL. Run `./all_repos.sh`.
- To run `cncf/gitdm` on a generated `git.log` file run: `cd src/; cp all_affs.csv all_affs.old; ~/dev/alt/gitdm/src/cncfdm.py -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./ -t -z -d -D -A -U -u -o all.txt -x all.csv -a all_affs.csv > all.out`. The new approach is `./mtp`, but it doesn't (yet) have a way to deal with the same emails being mapped to different user names from different per-thread buckets.
- Run: `./enchance_all_affs.sh`.
- If updating via `ghusers.sh` or `ghusers_cached.sh` (step 8), run `generate_actors.sh` too:
  - LF actors: `AWS_PROFILE=... KUBECONFIG=... ./generate_actors_lf.sh`.
  - CNCF, CDF and GraphQL actors: `KUBECONFIG=... ./generate_actors_nonlf.sh`.
  - Concat: `./generate_actors_all.sh`, `./generate_actors_cncf.sh`.
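Concatenating per-source actor lists can be sketched as follows (a minimal sketch; the file names and the one-login-per-line format are assumptions, and the real `generate_actors_*.sh` scripts may differ):

```shell
# Hypothetical per-source actor lists, one login per line.
printf 'alice\nbob\n' > /tmp/actors_lf.txt
printf 'bob\ncarol\n' > /tmp/actors_nonlf.txt
# Merge the lists and deduplicate into one combined, sorted list.
cat /tmp/actors_lf.txt /tmp/actors_nonlf.txt | sort -u > /tmp/actors_all.txt
wc -l < /tmp/actors_all.txt | tr -d ' '
```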
- Consider `./ghusers_cached.sh` or `./ghusers.sh` (if you run the latter, copy the result JSON somewhere and get 0-committers from the previous version to save GH API points). Sometimes you should just run `./ghusers.sh` without a cache.
- Recommended: `./ghusers_partially_cached.sh 2> errors.txt` will refetch repo metadata and commits since the last fetch and get users' data from `github_users.json`, so you can save a lot of API points. You can prepend with `NCPUS=N` to override autodetecting the number of CPU cores available.
- To copy the source type from a previous JSON version do `./copy_source.sh`, `./compare_sources.sh`.
- Run `./company_names_mapping.sh` to fix typical company-name spelling errors, lower/upper case differences etc. Update `company-names-mapping` before running this (with new typos/correlations data from the last 3 steps).
- To update (enhance) `github_users.json` with new affiliations: `[SHUFFLE=1] ./enhance_json.sh`. If you ran `ghusers` you may need to update `skip_github_logins.txt` with new broken GitHub logins found. This is optional if you already have an enhanced JSON. You can prepend with `NCPUS=N` to override autodetecting the number of CPU cores available.
- To merge with a previous JSON use: `./merge_jsons.sh`.
- To merge multiple GitHub logins' data (for example, to propagate a known affiliation to unknown or not-found entries sharing the same GitHub login) run: `./merge_github_logins.sh`.
- Because this can find new affiliations, you can now use `./import_from_github_users.sh` to import back from `github_users.json`, then run `./lower_unique.sh cncf-config/email-map` and restart from step 5. This uses the `company-names-mapping` file to import from the GitHub `company` field.
- Run `./correlations.sh` and examine its output `correlations.txt` to try to normalize company names and remove common suffixes like Ltd., Corp. and downcase/upcase differences.
- Run `./check_spell` for a fuzziness/spell-check error finder (it uses Levenshtein distance to find bugs).
- Run `./lookup_json.sh` and examine its output JSONs - those GitHub profiles have some useful data directly available - this will save you some manual research work.
- ALWAYS before any commit to GitHub run: `./handle_forbidden_data.sh` to remove any forbidden affiliations; please also see `FORBIDDEN_DATA.md`.
- You can use `./clear_affiliations_in_json.sh` to clear all affiliations in a generated `github_users.json`.
- To make the JSON unique, call `./unique_json.rb github_users.json`. To sort the JSON by commits, login, email use: `./sort_json.rb github_users.json`.
- You should run genderize/geousers/localize/agify (if needed) before the next step.
- To generate human-readable text affiliation files, first run: `./gen_aff_files.sh`.
- You can create a smaller final JSON for `cncf/devstats` using `./delete_json_fields.sh github_users.json; ./check_source.rb github_users.json; ./strip_json.sh github_users.json stripped.json; ONLY_AFF=1 ./strip_json.sh github_users.json affiliated.json; cp affiliated.json ~/dev/go/src/github.com/cncf/devstats/github_users.json`.
- To generate the final `unknowns.csv` manual research task file run: `./gen_aff_task.rb unknowns.txt`. You can also generate all actors: `./gen_aff_task.rb alldevs.txt`. You can prepend with `ONLY_GH=1` to skip entries without GitHub. You can prepend with `ONLY_EMP=1` to skip entries with any affiliation already set. You can filter only specific entries, for example: `./filter_task.rb unknowns.txt unknown_with_linkedin.json unknowns_with_linkedin.txt`.
- To manually edit all affiliations-related files, edit: `cncf-config/email-map all.txt all.csv all_affs.csv github_users.json stripped.json affiliated.json ../developers_affiliations.txt ../company_developers.txt affiliations.csv`.
- To add all possible entries from `github_users.json` to `cncf-config/email-map` use: `github_users_to_map.sh`. This is optional.
- Finally copy `github_users.json` to `github_users.old`. You can check if JSON fields are correct via `./check_json_fields.sh github_users.json`, `./check_json_fields.sh stripped.json small`, `./check_json_fields.sh affiliated.json small`.
- If any file displays an 'Invalid UTF-8' encoding error, scrub it using the Ruby tool: `./scrub.rb filename`.
- To add a user with the 'xyz' GitHub id, use: `PG_PASS=... ./gh.rb xyz` - this will generate a JSON entry that can be added to `github_users.json` after tweaking `email`, `source`, `affiliation` and possibly some more fields.
- To generate unknown CII committers, create a devstats-reports pod (see `cncf/devstats-helm:test/README.md`, search for `Create reports pod`), then run inside the reports pod: `PG_DB=cii ./affs/unknown_committers.sh`, or `./affs/all_tasks.sh`.
- Get the result CSV: `wget https://teststats.cncf.io/backups/argo_unknown_contributors.csv`.
- Obsolete way to get unknown committers on a local database: `PG_PASS=... ./sh/unknown_committers.sh`.
- Use `[KEYW=1] [FREQ=10000] [API_KEY=...] [SKIP_GDPR=1] PG_PASS=... ./unknown_committers.rb cii_unknown_committers.csv` to generate a `task.csv` file to research CII committers. After this step you can also use `./top_to_task.rb` to generate `top_task.csv` (this converts Top N CSV output into the task.csv file, optional).
- Use `./csv_merge.rb commits task.csv *_task.csv` to merge tasks generated for different projects, creating a file containing all those projects' data sorted by contributions/commits descending.
- Use `[SHUFFLE=1] ./ensure_emails.rb github_users.json` to ensure that the most up-to-date GitHub users' emails are present (this will query all GitHub logins, so it can take even a day to finish on a 300k+ entry JSON).
- Use `OUT=fn.csv ./merge_affs_csvs.rb csvfile1.csv csvfile2.csv ...` to merge multiple CSVs to import.
- Use `[SKIP_JSON=1] ./affs_analysis.rb filename.csv` to analyse committers/commits affiliated/independent/unknown stats.
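The kind of affiliated/unknown breakdown the analysis produces can be sketched as follows (a minimal sketch; the column layout and the `(Unknown)` label are assumptions, not the actual `affs_analysis.rb` format):

```shell
# Hypothetical task-style CSV with a header row and four committers.
printf 'login,affiliation\na,ACME\nb,(Unknown)\nc,Initech\nd,(Unknown)\n' > /tmp/task.csv
# Count rows (skipping the header) and how many have an unknown affiliation.
awk -F, 'NR>1 { n++; if ($2=="(Unknown)") u++ } END { printf "%d/%d unknown\n", u, n }' /tmp/task.csv
```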
```
./all_repos_log.sh /root/devstats_repos/Azure/* /root/devstats_repos/BuoyantIO/* /root/devstats_repos/GoogleCloudPlatform/* /root/devstats_repos/OpenBMP/* /root/devstats_repos/OpenObservability/* /root/devstats_repos/RichiH/* /root/devstats_repos/Virtual-Kubelet/* /root/devstats_repos/alibaba/* /root/devstats_repos/apcera/* /root/devstats_repos/appc/* /root/devstats_repos/brigadecore/* /root/devstats_repos/buildpack/* /root/devstats_repos/cdfoundation/* /root/devstats_repos/cloudevents/* /root/devstats_repos/cncf/* /root/devstats_repos/containerd/* /root/devstats_repos/containernetworking/* /root/devstats_repos/coredns/* /root/devstats_repos/coreos/* /root/devstats_repos/cortexproject/* /root/devstats_repos/cri-o/* /root/devstats_repos/crosscloudci/* /root/devstats_repos/datawire/* /root/devstats_repos/docker/* /root/devstats_repos/dragonflyoss/* /root/devstats_repos/draios/* /root/devstats_repos/envoyproxy/* /root/devstats_repos/etcd-io/* /root/devstats_repos/facebook/* /root/devstats_repos/falcosecurity/* /root/devstats_repos/fluent/* /root/devstats_repos/goharbor/* /root/devstats_repos/graphql/* /root/devstats_repos/grpc/* /root/devstats_repos/helm/* /root/devstats_repos/iovisor/* /root/devstats_repos/istio/* /root/devstats_repos/jaegertracing/* /root/devstats_repos/jenkins-x/* /root/devstats_repos/jenkinsci/* /root/devstats_repos/knative/* /root/devstats_repos/kubeedge/* /root/devstats_repos/kubernetes-client/* /root/devstats_repos/kubernetes-csi/* /root/devstats_repos/kubernetes-graveyard/* /root/devstats_repos/kubernetes-helm/* /root/devstats_repos/kubernetes-incubator-retired/* /root/devstats_repos/kubernetes-incubator/* /root/devstats_repos/kubernetes-retired/* /root/devstats_repos/kubernetes-security/* /root/devstats_repos/kubernetes-sig-testing/* /root/devstats_repos/kubernetes-sigs/* /root/devstats_repos/kubernetes/* /root/devstats_repos/ligato/* /root/devstats_repos/linkerd/* /root/devstats_repos/lyft/* /root/devstats_repos/miekg/* /root/devstats_repos/mininet/* /root/devstats_repos/nats-io/* /root/devstats_repos/networkservicemesh/* /root/devstats_repos/open-policy-agent/* /root/devstats_repos/open-switch/* /root/devstats_repos/open-telemetry/* /root/devstats_repos/opencontainers/* /root/devstats_repos/opencord/* /root/devstats_repos/openebs/* /root/devstats_repos/openeventing/* /root/devstats_repos/opennetworkinglab/* /root/devstats_repos/opensecuritycontroller/* /root/devstats_repos/opentracing/* /root/devstats_repos/p4lang/* /root/devstats_repos/pingcap/* /root/devstats_repos/prometheus/* /root/devstats_repos/rkt/* /root/devstats_repos/rktproject/* /root/devstats_repos/rook/* /root/devstats_repos/spiffe/* /root/devstats_repos/spinnaker/* /root/devstats_repos/tektoncd/* /root/devstats_repos/telepresenceio/* /root/devstats_repos/theupdateframework/* /root/devstats_repos/tikv/* /root/devstats_repos/torvalds/* /root/devstats_repos/tungstenfabric/* /root/devstats_repos/uber/* /root/devstats_repos/virtual-kubelet/* /root/devstats_repos/vitessio/* /root/devstats_repos/vmware/* /root/devstats_repos/weaveworks/* /root/devstats_repos/youtube/* /root/devstats_repos/zephyrproject-rtos/*
```
- Open the CNCF projects maintainers list.
- Save the "Name", "Company", "GitHub name" columns to a new sheet and download it as "maintainers.csv".
- Add a "name,company,login" CSV header.
- Example file.
- Run the `[DBG=1] [ONLYNEW=1] ./maintainers.sh` script. Follow its instructions.
- Run `[DBG=1] ./check_maintainers.sh`. Follow its instructions.
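Adding the required header to the downloaded CSV can be sketched like this (a minimal sketch; the sample row and file paths are hypothetical):

```shell
# Hypothetical row exported from the maintainers sheet (no header yet).
printf 'Jane Doe,ACME,janedoe\n' > /tmp/maintainers.csv
# Prepend the "name,company,login" header expected by maintainers.sh.
printf 'name,company,login\n' | cat - /tmp/maintainers.csv > /tmp/maintainers_with_header.csv
head -1 /tmp/maintainers_with_header.csv
```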
Please follow the instructions from `ADD_PROJECT.md`.
To add geo data (`country_id`, `tz`) and gender data (`sex`, `sex_prob`), do the following:
- Download the `allCountries.zip` file from the geonames server.
- Create the `geonames` database via: `sudo -u postgres createdb geonames`, `sudo -u postgres psql geonames -f geonames.sql`. Table details are in `geonames.info`.
- Unzip `allCountries.zip` and run `PG_PASS=... ./geodata.sh allCountries.tsv` - this will populate the DB.
- Create indices on columns to speed up localization: `sudo -u postgres psql geonames -f geonames_idx.sql`.
- Make sure that you don't have any `nil`, `null` or `false` values saved in any `*_cache.json` file (those files are also saved when you `CTRL^C` a running enhancement).
- The regexp to search for is `/ \(null\|nil\|false\)\(\n\|,\)`, but `agify_cache.json` and `genderize_cache.json` can legitimately contain `null`, so there search only for `false` and `nil`: `/ \(nil\|false\)\(\n\|,\)`.
- If this is the first geousers run, create `geousers_cache.json` via `cp empty.json geousers_cache.json`.
- To use the cache it is best to have `stripped.json` from the previous run. See step 24.
- Enhance `github_users.json` via `SHUFFLE=1 PG_PASS=... ./geousers.sh github_users.json stripped.json geousers_cache.json 20000`. It will add the `country_id` and `tz` fields.
- Go to store.genderize.io and get your `API_KEY`; the basic subscription ($9) allows 100,000 monthly gender lookups.
- If this is the first genderize run, create `genderize_cache.json` via `cp empty.json genderize_cache.json`.
- Enhance `github_users.json` via `SHUFFLE=1 PG_PASS=... API_KEY=... ./nationalize.sh github_users.json stripped.json nationalize_cache.json 20000`. It will eventually fill missing `country_id` and `tz` fields.
- Enhance `github_users.json` via `SHUFFLE=1 API_KEY=... ./genderize.sh github_users.json stripped.json genderize_cache.json 20000`. It will add the `sex` and `sex_prob` fields.
- Enhance `github_users.json` via `SHUFFLE=1 API_KEY=... ./agify.sh github_users.json stripped.json agify_cache.json 20000`. It will add the `age` field.
- You can skip `API_KEY=...`, but then only 1000 gender lookups/day are allowed.
- Copy the enhanced JSON to devstats: `ONLY_AFF=1 ./strip_json.sh github_users.json affiliated.json; cp affiliated.json ~/dev/go/src/github.com/cncf/devstats/github_users.json`.
- Import the new JSON on devstats using the `./import_affs` tool.
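The stray-value check on `*_cache.json` files mentioned above can be sketched with `grep` (a minimal sketch on a hypothetical cache file; the editor regexps in the step above translate roughly to the extended regexp used here):

```shell
# Hypothetical cache file with one stray false and one stray nil value.
printf '{"k1": "US",\n"k2": false,\n"k3": nil,\n"k4": null\n}\n' > /tmp/geousers_cache.json
# Report line numbers holding stray nil/false values (null is allowed in some caches).
grep -nE ' (nil|false)(,|$)' /tmp/geousers_cache.json
```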
- To import manual affiliations from a Google sheet, save the sheet as `affiliations.csv` and then use the `./affiliations.sh` script.
- Prepend with `UPDATE=1` to only import those marked as changed: column `changes='x'`.
- Prepend with `RECHECK=1` to always ask for the operation and allow updating found -> not found.
- Prepend with `DBG=1` to enable verbose output.
- After finishing the import, add a status line to the `affiliations_import.txt` file and update the online spreadsheet.
- Update `company-names-mapping` if needed and then run `./company_names_mapping.sh`.
- Run: `./sort_config.sh` and `./lower_unique.sh cncf-config/email-map`.
- Run: `./enchance_all_affs.sh`, then follow its suggestions about search and check, then (remove the CSV header): `cat new_affs.csv >> all_affs.csv`, `./lower_unique.sh all_affs.csv`.
- Finally: `cp all_affs.csv all_affs.old`.
- After importing new data, run `./src/burndown.sh 2018-08-22` (from src's parent directory). Do this after processing all data mentioned here, not just after importing a new CSV.
- Import the generated `csv/burndown.csv` data into https://docs.google.com/spreadsheets/d/1RxEbZNefBKkgo3sJ2UQz0OCA91LDOopacQjfFBRRqhQ/edit?usp=sharing.
- To calculate the CNCF/LF ratio, take the number of CNCF found in the last commit minus the number of CNCF found in some previous commit, divided by the same delta for all actors.
- You can apply a company acquisition on a specific date via something like:
  `ruby acquire_company.rb 'Old Company' 'YYYY-MM-DD' 'New Company'`
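Conceptually, an acquisition appends a date-bounded affiliation segment to each affected mapping line. A minimal sketch of that rewrite on one hypothetical line (the `email: Company < date, Company` layout and the `!`-for-`@` email form are assumptions about the map format, and `acquire_company.rb` may behave differently):

```shell
# One hypothetical mapping line: open-ended "Old Company" affiliation.
printf 'dev!example.com: Old Company < 2020-01-01, Old Company\n' > /tmp/map-sample
# Close the open-ended segment at the acquisition date and append the new company.
sed 's/, Old Company$/, Old Company < 2020-06-01, New Company/' /tmp/map-sample > /tmp/map-after
cat /tmp/map-after
```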
For complex merges that modify the `developers_affiliationsN.txt` file(s), do the following:
- Copy the PR modifications and save them in `pr_data.txt`.
- Run `pr_data_to_csv.sh; cat new_affs.csv >> all_affs.csv; ./sort_configs.sh`.
- Run `PG_PASS=... ./unknown_committers.rb pr_unknowns.csv`.
- Run `mv pr_data.csv affiliations.csv; ./affiliations.sh`.
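The "append new_affs.csv, dropping its header" step can be sketched like this (a minimal sketch; the column names are hypothetical):

```shell
# Hypothetical new and existing affiliation CSVs, both with a header row.
printf 'email,affiliation\nnew@dev.io,ACME\n' > /tmp/new_affs.csv
printf 'email,affiliation\nold@dev.io,Initech\n' > /tmp/all_affs.csv
# Append everything except the header row of the new file.
tail -n +2 /tmp/new_affs.csv >> /tmp/all_affs.csv
wc -l < /tmp/all_affs.csv | tr -d ' '
```

`tail -n +2` starts output at line 2, which is why the duplicate header never reaches `all_affs.csv`.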
Alternative way using diff (for simple PRs that only add new users):
- Merge the PR from the GitHub UI, then `git pull`.
- `git diff HEAD^ ../*.txt > input.diff`.
- `PG_PASS=... ./update_from_pr_diff.rb ./input.diff github_users.json cncf-config/email-map`.
- `./post_manual_checks.sh && ./post_manual_updates.sh`.