Wayback machine image URLs still loading images from original Amazon S3 URL #1379

jywarren · 2023-03-06T01:00:44Z

I found a strange issue when I pointed at a collection of JSON files which have had images routed to the Internet Archive's Wayback Machine caches.

As you can see, the image links are routed to Wayback URLs: https://ia601603.us.archive.org/20/items/mapknitter-wayback/ceres--2.json :

i.e.: https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/305268/PuglisiTerrazzeHaghiaTriadaCretaAntica2007-28.jpg

However, when I actually load a page like this, somehow it still loads images directly from Amazon s3, not the Internet Archive:

https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json

I inspected in the console and still can't figure it out.

@segun-codes @7malikk I was curious, if you had an interest in this, what do you think is happening here? Could any application logic we've written be causing this?

For example, the page at https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json

still loads https://s3.amazonaws.com/grassrootsmapping/warpables/306187/DJI_1207.JPG

segun-codes · 2023-03-06T06:56:39Z

Hi @jywarren, I am happy to check this out.

segun-codes · 2023-03-12T18:47:59Z

Hi @jywarren, I checked the code. The transformation in the function below (in archive.js) is responsible for the behaviour you describe. If my memory serves me right, we designed it this way at the time because of issues with accessing the images programmatically via IA. I also observed that the Wayback Machine itself simply loads the images from S3. What do you think?

// where imageSrc is in format: https://web.archive.org/web/20220803171120/https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg
// returns https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg or
// returns same url unchanged (no transformation required)
function extractImageSource(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('https'), imageSrc.length);
  }
  return imageSrc;
}
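One detail worth noting about the function above: `lastIndexOf('https')` only finds the wrapped URL when the original is itself `https://`. For a Wayback URL that wraps an `http://` original (such as the `http://s3.amazonaws.com/...` example later in this thread), the last `https` is at index 0, so the string comes back unchanged. A minimal sketch of a variant that handles both schemes follows; the name `extractOriginalUrl` and the regular expression are hypothetical and are not taken from archive.js.

```javascript
// Hypothetical variant of extractImageSource(); not the code in archive.js.
// Matches https://web.archive.org/web/<timestamp-or-modifier>/<original URL>,
// where the segment after /web/ may look like 20220803171120, 0id_, or 20200506081918id_.
const WAYBACK_WRAPPER = /^https?:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/;

function extractOriginalUrl(imageSrc) {
  const match = WAYBACK_WRAPPER.exec(imageSrc);
  // Return the wrapped URL if this is a Wayback link, otherwise the input unchanged.
  return match ? match[1] : imageSrc;
}
```

With this sketch, both the `https://` and the `http://` wrapped forms resolve to the underlying S3 URL, and non-Wayback URLs pass through untouched.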

Illustration 1: [image: img] <https://user-images.githubusercontent.com/1612359/224565688-4ebdb4cc-6b7b-4ba1-919b-18e1fa965c06.PNG>

jywarren · 2023-03-14T12:15:24Z

Hmm, did this apply only to JSON, maybe? Would you mind trying to remove that so that it loads directly from the Wayback Machine? Thanks for finding that!

segun-codes · 2023-03-30T10:30:01Z

Okay @jywarren, I'll look into this. Many thanks!

jywarren · 2023-04-02T22:25:44Z

Ah yes. I see - we get this error if we don't do that --

Access to image at 'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG' from origin 'http://localhost:8082' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

I'm not sure... is there another way to access https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg without CORS issues? Otherwise, we could... upload that entire directory into an Archive collection, and serve it from there.

That is, Wayback URLs have CORS limitations, but images served from regular archive.org/download/_____ URLs do not.
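To compare the two kinds of URL from the browser console, a small probe like the one below may help. It is only a sketch: `isCorsReadable` is a hypothetical helper, not part of Leaflet.DistortableImage, and it relies on the standard behaviour that a cross-origin `fetch` in `cors` mode rejects when the response carries no `Access-Control-Allow-Origin` header. Outside a browser (for example in Node) CORS is not enforced, so the result only means something on a page.

```javascript
// Hypothetical helper for checking whether a URL can be read cross-origin.
// fetchImpl is injectable so the function can be exercised without a network.
async function isCorsReadable(url, fetchImpl = globalThis.fetch) {
  try {
    // In a browser, a response without CORS headers makes this promise reject.
    const response = await fetchImpl(url, { method: 'HEAD', mode: 'cors' });
    return response.ok;
  } catch (err) {
    // Blocked by CORS policy, or a network failure.
    return false;
  }
}
```

Running it from a page against a web.archive.org/web/... image URL and against an archive.org/download/... URL would show whether the two behave differently, as described above.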

segun-codes · 2023-04-03T06:33:13Z

Yes, I pointed out the CORS limitation in my previous message. It was the reason I fetched from S3 directly.

Okay, but is there something wrong with fetching from S3, given that the legacy JSON files all have image sources pointing to S3 either directly or indirectly? For instance, https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg simply points to S3 indirectly, nothing more.

jywarren · 2023-04-03T13:23:17Z

Yes, sorry, just agreeing and confirming from my test. Thank you!

The only issue with s3 is that it costs Public Lab money to host -- it's not forever storage. I think perhaps the best choice is to create an archive.org collection and add to this logic in extractImageSource(), where we replace http://s3.amazonaws.com/grassrootsmapping with https://archive.org/download/mapknitter-wayback

I'm working on uploading all the files, but it'll be a while. We can check in here again once it's complete!
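The prefix replacement proposed above could look roughly like the sketch below. It assumes the archive.org collection will mirror the S3 bucket's path layout (for example `warpables/417/img_0135.jpg` directly under `mapknitter-wayback`), which has not been confirmed since the upload is still in progress. The function name `toArchiveUrl` and both patterns are hypothetical.

```javascript
// Hypothetical rewrite step; assumes the archive.org item keeps the bucket's paths.
const WAYBACK_WRAPPER = /^https?:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/;
const S3_PREFIX = /^https?:\/\/s3\.amazonaws\.com\/grassrootsmapping\//;
const ARCHIVE_PREFIX = 'https://archive.org/download/mapknitter-wayback/';

function toArchiveUrl(imageSrc) {
  // Unwrap a Wayback link first, if there is one.
  const wrapped = WAYBACK_WRAPPER.exec(imageSrc);
  const original = wrapped ? wrapped[1] : imageSrc;
  // Swap the S3 bucket prefix for the archive.org download prefix;
  // URLs that do not point at the bucket are returned as they are.
  return original.replace(S3_PREFIX, ARCHIVE_PREFIX);
}

// Example:
// toArchiveUrl('https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg')
// returns 'https://archive.org/download/mapknitter-wayback/warpables/417/img_0135.jpg'
```

Whether the final collection uses exactly this layout would need to be checked once the upload finishes.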

segun-codes · 2023-04-03T13:31:54Z

Ha! Okay, I understand now. So the archive.org option is definitely the route to take. I will check back then.

jywarren · 2023-04-03T13:33:16Z

Gosh, it's going to take a while! It's 631,813 files, and I've only downloaded 3,875 so far...

I may try another way at a remote server that's faster... we'll see!

segun-codes · 2023-04-03T13:35:19Z

Yeah... this has to take a while

Mustafa-Hersi · 2023-08-07T12:28:22Z

Is this issue being worked on?

jywarren · 2023-08-07T12:52:46Z

Hi, we are still working on uploading the archive.org collection, apologies!

jywarren added the bug label Mar 6, 2023

jywarren mentioned this issue Apr 2, 2023

Publishing 0.21.10 on npm #1380

Open
