Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor import and auto-commit Base images #19

Closed
josecelano opened this issue Jan 4, 2022 · 9 comments · Fixed by #23
Closed

Refactor import and auto-commit Base images #19

josecelano opened this issue Jan 4, 2022 · 9 comments · Fixed by #23

Comments

@josecelano
Copy link
Member

josecelano commented Jan 4, 2022

The workflow import-base-images.yml allows us to import new images from the library into this repo.

We have to change the workflow to start using the Librarian.

There are two big questions:

  1. What should be moved to the Librarian and what should be part only of this website repo.
  2. We detected some problems with the current solution. Upgrading the library (submodule) could imply more than one secondary task. Right now we are only processing changes related to Base images, but the library update could imply other changes in the site that might require new workflows. Besides, working directly with the main branch leads to inconsistencies.

And some known problems:

There was also at least one accepted refactor. We have to use a PR instead of pushing directly to the main branch after every auto-commit. For example, if we create a commit adding a base image and we run again the workflow the commit is going to be created again because the library has not been updated yet in the main branch. In order to do that we have to split the workflow in two: one workflow is going to receive the event (the library has been updated) and it will create the new PR (I'm not sure if a PR created by the GitHub API triggers the workflows). The entry workflow would be called something like create-pr-for-library-update.yml. That workflow is going to create the new PR that will trigger the update-library.yml workflow.

Proposal 1: nested pull requests

On solution we proposed was:

We create a new workflow called: update-library.yml. This workflow will update the library submodule and will trigger a new workflow (import-base-images.yml) in a nested branch. That nested PR will import the Base image into this repo. The nested PR have to be merged manually into the parent PR.

Pros:

  • We can easily add new tasks that we want to trigger when the library is updated.

Cons:

  • We have to merge two PRs manually. Merge them automatically will require some kind of notification to the parent workflow.
  • We have to implement a step to automatically generate the child PR from the parent one. That is going to couple us more to GitHub.

Proposal 2: keep only one workflow for main and secondary tasks

We can try to keep it simple only with one workflow: update-library.yml. Secondary tasks could be steps in this workflow. For example, we can have a step to import the Base images. If we need to add more action we can add more steps.

name: Import Base images from remote media library

on:
  push:
    branches: [ issue-library-update* ]

jobs:
  import-base-images:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:  
          submodules: 'true'

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Update library submodule and auto-commit library update
        id: update-library

        # In the future we can add a filter to pull only Base images
        # This action could return a json list of imported images
      - name: Import changed dvc files
        run: nautilus-librarian dvc pull --from-diff
        env:
          nl_dvc_diff: ${{ steps.update-library.outputs.dvc_diff }}
          AZURE_STORAGE_ACCOUNT: ${{ secrets.AZURE_STORAGE_ACCOUNT }}
          AZURE_STORAGE_SAS_TOKEN: ${{ secrets.AZURE_STORAGE_SAS_TOKEN }}

        # This action is a specific action only used in this repo.
        # Filter Base images from previous output.
      - name: Resize Base images into a different tmp dir
        id: resize-image
        uses: ./.github/actions/resize-image
        with:
          dvc_diff: ${{ steps.update-library.outputs.dvc_diff }}
          dest_dir: /tmp/whatever/resized
          width: 512
          height: 512

      - name: Change file format of synched images (and copy to final dir)
        id: change-image-file-format
        uses: ./.github/actions/change-image-file-format
        with:
          dest_dir: public/images
          format: jpg

      - name: Auto-commit files
        id: auto-commit
        run: nautilus-librarian git auto-commit --file-list {...} --message"new large drawing {filename}"

      - name: Build website if Base images have been added
        if: ${{ steps.auto-commit.outputs.changes_detected == 'true' }}
        run: |
          npm install
          npm run build

      - name: Deploy website if Base images have been added
        if: ${{ steps.auto-commit.outputs.changes_detected == 'true' }}
        uses: JamesIves/[email protected]
        with:
          branch: gh-pages
          folder: public

Specific actions for this repo (resize-image and change-image-file-format) could be merged into a single embedded action (process-base-images).

NOTES:

  • Notice that we auto-commit the library update at the beginning. That should not be a problem we use a PR. In the current solution, we are pushing directly to the main branch and that can create inconsistencies.

I would start with proposal 2 and move to proposal 1 if needed.

@da2ce7 @yeraydavidrodriguez could you add your thoughts?

@josecelano
Copy link
Member Author

josecelano commented Jan 4, 2022

I have created a proof of concept in an independent repo: auto-update-submodule-poc

This is the workflow: https://github.com/josecelano/auto-update-submodule-poc/blob/main/.github/workflows/library-update.yml

And this is the PR created automatically: josecelano/auto-update-submodule-poc#1

It seems the PR could be even auto merged: https://github.com/marketplace/actions/create-pull-request#auto-merge

This solution has at least two advantages:

  • We do not have to write the code to create the auto-PRs (and maybe even merge them).
  • We only need one workflow.

@josecelano
Copy link
Member Author

For the auto-merge maybe we could create another scheduled workflow (executed hourly) that gets a list of automatic PR and tries to find out if the PR workflow has finished. If the workflow for the PR has finished successfully the PR can be merged automatically. We need a way to know if a workflow for a PR has finished successfully. We have also described that need here.

@josecelano
Copy link
Member Author

hi @da2ce7 @yeraydavidrodriguez I'm trying different approaches in this repo, but I can't find an easy solution.

NOTES:

  • I'm only using text files (no DVC, no signed commits for the time being). I'm just trying to solve concurrency issues.
  • I can create automatic PR but the PR workflows are not executed for those PRs.
  • You can't even auto-merge the PR if it was created automatically because it has not passed any check.
  • Implementing a kind of message queue with git seems to be too complex (instead of using a message queue service provider. There is at least one attempt. I mean not too complex to implement but to use. I would like to find a simpler solution. I do not even like the idea of mixing app data with git (although images are also app data).
  • I have found this service to manage auto-merges.
  • GitHub provides a mechanism to control workflow and job concurrency. You can force running only one workflow at the same time. That could be the solution if we do all the things we need to do in the same workflow without PR. This is the sample workflow. That could lead to very big workflows if we accumulate library changes or we need to do too much stuff for a library update. I can try to apply that solution to this repo and review again the concurrency issues and our requirements.

@josecelano
Copy link
Member Author

We have a discussion today. The starting point for the discussion was this document:

update-git-submodule.pdf

with 2 proposals and notes.

We agreed on trying a simpler solution:

  • We will not use the GitHub lock mechanism.
  • We will force linear history for the main branch.
  • There is going to be only one workflow update-library.yml.
  • We can run only one workflow at a time. The way to achieve this is at the beginning of the workflow it tries to claim a lock. That means we push an empty comment with a lock message. If we do not get any merge conflicts we continue processing the library update. If we get a conflict we retry later.
  • We also have to check if other workflow has already the lock, if there is already a lock we d not even try to create a new one.
  • At the end of the workflow, we have to add another empty commit releasing the lock.

If the git history you will see something like:

| * e230184 - refactor: [#31] ... <Jose Celano>
| * 839456b - refactor: [#31] ... (2021-12-22 08:03:32 +0000) <Jose Celano>
| * 8c59199 - refactor: [#31] ... <Jose Celano>
| * acf4e01 - lock: release update-library <Bot>
| * b157bb0 - lock: claim update-library <Bot>
| * 1eea4c7 - fix: [#31] ... <Jose Celano>

The commit message can contain a json with the job description.

The workflow could be something like this:

name: Update library

n:
  schedule:
    - cron:  '0 * * * *'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:  
          submodules: 'true'

        # for all commits find the latest "lock" commit
        #    if it's a release
        #         we create a new "lock" commit and push it to the main
        #    else
        #         exit workflow
      - name: Claim lock
        # after this point we have a new commit in the main branch

      - name: Update library submodule and auto-commit library update
        id: update-library

      - name: Import changed dvc files

      - name: Resize Base images into a different tmp dir

      - name: Change file format of synched images (and copy to final dir)

      - name: Auto-commit files

        # create a commit releasing the lock
      - name: Release lock

Pros:

  • We don't need external services

Cons:

  • Other PRs have also to wait to be merged when this workflow it's being executed. For example, if you are applying a change to the website style have to wait until the lock is released to merge your changes.

We also decided to refactor the workflow before applying these changes.

@josecelano
Copy link
Member Author

hi, @da2ce7 @yeraydavidrodriguez I have implemented the first version here.

git log output:

* cfdbd6d - (HEAD -> main, origin/main) release lock 1694170210 (2022-01-13 18:13:06 +0000) <github-actions[bot]>
* 375c7dd - update library aaa to commit ac456cbfd91f3ad79480a4d119fcd6566a73c5cc (2022-01-13 18:12:50 +0000) <github-actions[bot]>
* eb1afbe - claim lock 1694170210 (2022-01-13 18:12:32 +0000) <github-actions[bot]>

I'm still using the concurrency option:

image

but I'm planning to add a clone of the workflow with a different concurrency group, so I can run them manually and reproduce any problem I want.

NOTE: please ignore the Python code, I have a "Typer" app scaffolding from a previous test.

The commit full message is still empty (no json data).

@josecelano
Copy link
Member Author

josecelano commented Jan 14, 2022

hi @da2ce7 @yeraydavidrodriguez I have finished the example:

https://github.com/josecelano/library-consumer

It seems to work, at least with two treads. I wrote some notes on the README file.

I have added a cronjob on the library to generate random content every hour.

@josecelano
Copy link
Member Author

josecelano commented Jan 17, 2022

hi @da2ce7 I think we can split this problem into 2 steps:

1.- We refactor the workflow without implementing the lock system using git.
2.- In the meantime, we can only work on the other repo sample repo (library consumer) until we get an implementation we like.

Regarding point 1 (this issue) I have been discussing it with @yeraydavidrodriguez and we think there are a lot of alternatives. It depends on how we want to use the Librarian in other projects. We have identified these options:

  1. This website only uses some Librarian small actions (get dvc diff, pull dvc, auto-commit image). We continue using embedded actions for the website. We can end up having what we did not like too much for the librarian: a lot of small GitHub actions that we need to interconnect with a lot of inputs/outputs processing.
  2. This website has its own Typer app and it can call methods from the Librarian Python API. We have to define a Python API for the Librarian.
  3. This website has its own Typer app but it only uses the Librarian from the console. The only API available is the public Librarian documentation for the commands. We have to create small commands for all the actions we need on this repo: resize-image, auto-commit a file, ...
  4. The librarian is a framework. You can clone it and update it, and build your own console app on top of it.
  5. The librarian allows you to load your custom commands. We can create a Typer command to import, resize and commit Base images in this repo and dynamically load it into the Librarian. I think this forces us to define a Python API because the custom command should be able to use other mods or core methods.
  6. We implement this workflow inside the Librarian as a new command, even if 20% of the command is going to be specific only for this site. After finishing the implementation we can decide which of the previous options it's better.

I like option 6 because maybe it's too early to find the best way to use the Librarian without implementing even one case. And it should not be a big effort to extract the code once it's implemented. I mean, we can focus on solving our problem right now, and think later about how this solution can be used to solve other problems. There are still a lot of things pending to fix from the previous POC version.

@josecelano
Copy link
Member Author

Regarding point 1 we (@da2ce7 , @yeraydavidrodriguez , @cgbosse ) were discussing today that the first option should be approach we must follow:

1. This website only uses some Librarian small actions (get dvc diff, pull dvc, auto-commit image). We continue using embedded actions for the website. We can end up having what we did not like too much for the librarian: a lot of small GitHub actions that we need to interconnect with a lot of inputs/outputs processing.

Since we have already defined the lock system here we will work on a proposal to replace the current import-base-images.yml by the work-allocator and worker solution using the git-queue action and defining which steps are going to be implemented in the LIbrarian and which of them are exclusive for this repo.

As a starting point, you can see my previous proposal in this issue description. At that time did not have the git-queue action implementation, so it does not include the lock system and it's only one workflow (worker).

@josecelano
Copy link
Member Author

I'm going to start working again on this issue now that:

I'm going to replace the import-base-images.yml workflow with the new two workflows work-allocator.yml and worker.

@josecelano josecelano linked a pull request Feb 25, 2022 that will close this issue
josecelano added a commit that referenced this issue Feb 25, 2022
We decided to use only the local branch and push at the en don the
workflow:

See: #19
josecelano added a commit that referenced this issue Jun 13, 2022
We decided to use only the local branch and push at the en don the
workflow:

See: #19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant